[ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438838#comment-13438838 ]

Eli Reisman commented on GIRAPH-301:
------------------------------------

After more runs (and some thinking) I realized why I'm seeing some of these 
symptoms and why this speeds up the INPUT_SUPERSTEP so much. Please 
advise/correct if I'm wrong here. The idea is:

The reason the locality and InputSplit reads are so fast is that the master writes 
those znodes once, and all other transactions are reads by workers. This is where 
ZK shines.
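
Just to make that pattern concrete, here's a minimal sketch of what I mean (the 
znode path, class, and method names are made up for illustration, not the actual 
Giraph layout):

    import java.util.List;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SplitListSketch {
      /** Assumed parent path; the real Giraph znode layout differs. */
      private static final String SPLITS_PATH = "/_inputSplits";

      /** Master side: one write per split, done exactly once up front. */
      static void writeSplits(ZooKeeper zk, List<byte[]> serializedSplits)
          throws KeeperException, InterruptedException {
        zk.create(SPLITS_PATH, new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        for (int i = 0; i < serializedSplits.size(); i++) {
          zk.create(SPLITS_PATH + "/" + i, serializedSplits.get(i),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
      }

      /** Worker side: reads only, which ZK serves without a quorum write. */
      static List<String> readSplitList(ZooKeeper zk)
          throws KeeperException, InterruptedException {
        return zk.getChildren(SPLITS_PATH, false);
      }
    }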

The reason the reads on the RESERVED and FINISHED lists are so slow is that those 
lists are also being written concurrently by workers all the time. That means the 
quorum must re-sync constantly, and every reader (i.e., every worker) drags down 
the list as it reads it, every iteration.

Not having the workers read FINISHED nodes to decide when to bail out of the 
reserveInputSplit() cycles, and instead having them simply await the next 
superstep, has two advantages. One, workers bail out of the read cycles as soon 
as all splits are RESERVED, so they get out of the loop sooner. Two, with less 
excessive cycling, fewer readers crawl the list, and the speed of that crawl 
improves for the workers still reserving splits. In our current code, if a worker 
fails while reading a split, the whole job goes down anyway, so nothing is lost 
by not waiting on FINISHED. Even in a scenario where workers could restart, only 
the last one to scan the RESERVED list could fail without some other worker 
finding that split re-opened and claiming it.
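
The reservation pass I'm describing would look roughly like this -- a sketch 
with assumed znode names, not the actual patch code -- where the worker claims 
what it can and, once everything is RESERVED, heads for the barrier without ever 
touching FINISHED:

    /**
     * Sketch only -- znode names are assumptions, not the real Giraph layout.
     * One pass over the split list: claim the first unreserved split, or
     * return null when every split is already RESERVED (go wait at the barrier).
     */
    static String reserveInputSplit(ZooKeeper zk, List<String> splitPaths,
        int startIndex) throws KeeperException, InterruptedException {
      int n = splitPaths.size();
      for (int i = 0; i < n; i++) {
        String path = splitPaths.get((startIndex + i) % n);
        if (zk.exists(path + "/reserved", false) != null) {
          continue;                      // someone else already holds this split
        }
        try {
          // Ephemeral: a dead worker's claim vanishes with its ZK session.
          zk.create(path + "/reserved", new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          return path;                   // claimed it; go read this split
        } catch (KeeperException.NodeExistsException e) {
          // Lost the race to another worker; keep scanning.
        }
      }
      return null;  // everything RESERVED: no FINISHED reads, just hit the barrier
    }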

My logs show many workers looping on the RESERVED list, doing extra loops until 
all the RESERVED splits are also marked on the FINISHED list (often several more 
loops, and they are slooooow), and finally dropping off when they find everything 
FINISHED. The mystery: they then STILL wait longer, sometimes much longer, while 
other workers continue to iterate the FINISHED list to be sure themselves. So why 
is this happening? All the splits are FINISHED, the master should signal the 
barrier and off we go, right?

No. The master is ALSO still iterating, and it is slow because most workers are 
still jostling in line to iterate the FINISHED list too. If the workers never 
read this list, you still have a lot of syncing going on as each worker marks the 
splits it has read FINISHED, but only ONE reader ever sees it -- the master! 
THIS is where I think a lot of the "end of the INPUT_SUPERSTEP" speedup is 
happening.
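
So the only FINISHED reads that matter are the master's, something like this 
(again just a sketch, path assumed):

    /**
     * Master-only check, with an assumed znode path: workers write FINISHED
     * markers but never read them, so this is the lone reader of the list.
     */
    static boolean allSplitsFinished(ZooKeeper zk, int totalSplits)
        throws KeeperException, InterruptedException {
      List<String> finished = zk.getChildren("/_inputSplitsFinished", false);
      return finished.size() == totalSplits;  // when true, release the barrier
    }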

The "beginning of the INPUT_SUPERSTEP" speedup is mostly due to the code in 
301-5, which is also in this 301-6 patch, that simply places each worker at a 
different index as GIRAPH-250 did, but with locality also maintained. This has 
been logged to ensure that on the first split claimed by any worker, if you 
choose at command-line a 1-to-1 ratio of workers to splits, everyone gets their 
split claimed with only 1-2 reads to ZK, and then each worker does one last 
unavoidable loop on RESERVED list to see no more splits are available, and 
simply sleeps at the barrier. Combine that with eliminating the extra loops to 
check the FINISHED list before dropping out (from this patch), and you have the 
full speed up.
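
For reference, the staggered start I mean is roughly the following (the exact 
formula in 301-5 may differ, and local splits would still be tried first; this 
just spreads each worker's first probe):

    /**
     * Sketch of the staggered starting offset -- the exact formula in the
     * patch may differ. Spreads each worker's first probe across the split
     * list so they don't all collide on index 0.
     */
    static int startIndex(int workerTaskIndex, int workerCount, int splitCount) {
      return (int) (((long) workerTaskIndex * splitCount) / workerCount);
    }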

Does this sound reasonable? It certainly explains what I see in my logs across 
trunk, 301-5, and 301-6 runs, and the speed increases on the same data load and 
worker counts over many runs on these three versions of Giraph. Any other ideas? 

More importantly: am I missing anything critical that I just didn't tease out in 
testing, as far as barrier dangers with this modification? I have searched the 
code for everywhere the FINISHED and RESERVED znodes are messed with, and it 
looks good to me.

                
> InputSplit Reservations are clumping, leaving many workers asleep while others 
> process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch, 
> GIRAPH-301-3.patch, GIRAPH-301-4.patch, GIRAPH-301-5.patch, GIRAPH-301-6.patch
>
>
> With recent additions to the codebase, users here have noticed many workers 
> are able to load input splits extremely quickly, and this has altered the 
> behavior of Giraph during INPUT_SUPERSTEP when using the current algorithm 
> for split reservations. A few workers process multiple splits (often 
> overwhelming Netty and getting GC errors as they attempt to offload too much 
> data too quickly) while many (often most) of the others just sleep through the 
> superstep, never successfully participating at all.
> Essentially, the current algo is:
> 1. scan the input split list, skipping nodes that are marked "Finished"
> 2. grab the first unfinished node in the list (reserved or not) and check its 
> reserved status.
> 3. if not reserved, attempt to reserve & return it if successful.
> 4. if the first one you check is already taken, sleep for way too long and 
> only wake up if another worker finishes a split, then contend with that 
> worker for another split, while the majority of the split list might sit 
> idle, not actually checked or claimed by anyone yet.
> This does not work. By making a few simple changes (and acknowledging that ZK 
> reads are cheap, only writes are not) this patch is able to get every worker 
> involved, and keep them in the game, ensuring that the INPUT_SUPERSTEP passes 
> quickly and painlessly, and without overwhelming Netty by spreading the 
> memory load the split readers bear more evenly. If the giraph.splitmb and -w 
> options are set correctly, behavior is now exactly as one would expect it to 
> be.
> This also means the INPUT_SUPERSTEP passes more quickly, and that a job can 
> survive the INPUT_SUPERSTEP for a given data load on fewer Hadoop memory slots.
>  

