[ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438838#comment-13438838 ]
Eli Reisman commented on GIRAPH-301:
------------------------------------

After more runs (and some thinking) I realized why I'm seeing some of these symptoms and why this speeds up the INPUT_SUPERSTEP so much. Please advise/correct if I'm wrong here. The idea is:

Locality and InputSplit reads are fast because the master writes those znodes once, and all other transactions are reads by workers. This is where ZK shines. Reads on the RESERVED and FINISHED lists are slow because those lists are also being written concurrently by workers all the time. This means the quorum must re-sync constantly, and all the readers (every worker) are dragging down the list as they read, every iteration.

Having workers skip reading FINISHED nodes to decide when to bail out of the reserveInputSplit() cycle, and simply await the next superstep instead, has two advantages. First, workers bail out of their read cycles as soon as all splits are RESERVED, so they get out of the loop sooner. In our current code, if a worker fails while reading a split, the whole job goes down anyway, so by avoiding excessive cycling, fewer readers crawl the list and the speed of that crawl improves. Even in a scenario where workers could restart, only the last one to scan the RESERVED list could fail without another worker finding that split re-opened and claiming it.

My logs show many workers looping on the RESERVED list, doing extra loops until in fact all RESERVED splits are also marked on the FINISHED list (often several more loops, which are slooooow), and finally dropping off when they find all FINISHED. The mystery: then they STILL wait longer, sometimes much longer, while other workers continue to iterate the FINISHED list to be sure themselves. So why is this happening? All nodes are FINISHED, so the master should signal the barrier and off we go, right? No. The master is ALSO still iterating, and is slow since most workers are still jostling in line to iterate the FINISHED list too.
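The modified flow described above -- workers bail out once every split is RESERVED, and only the master ever reads the FINISHED state -- can be sketched as a single-process simulation. All names here are illustrative, not Giraph's actual API; the CAS on a flag stands in for the znode-creation race, and the counter stands in for the FINISHED list that only the master consults:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch (hypothetical names): workers reserve splits via compare-and-set
// and exit as soon as every split is RESERVED, never scanning a FINISHED
// list; only the master checks completion to release the barrier.
public class SplitReservationSketch {
    static AtomicBoolean[] reserved;                        // one flag per split
    static AtomicInteger finishedCount = new AtomicInteger(0);

    // Worker loop: claim what you can, bail when nothing is left to reserve.
    static int workerRun(int startIndex) {
        int claimed = 0;
        int n = reserved.length;
        boolean sawContention = true;
        while (sawContention) {
            sawContention = false;
            for (int i = 0; i < n; i++) {
                int idx = (startIndex + i) % n;             // staggered scan order
                if (!reserved[idx].get()) {
                    if (reserved[idx].compareAndSet(false, true)) {
                        claimed++;                          // load the split, then...
                        finishedCount.incrementAndGet();    // ...mark it FINISHED (a write)
                    } else {
                        sawContention = true;               // lost the race; rescan
                    }
                }
            }
        }
        return claimed;  // worker now sleeps at the barrier; no FINISHED reads
    }

    // Only the master reads the FINISHED count to signal the barrier.
    static boolean masterBarrierReady() {
        return finishedCount.get() == reserved.length;
    }

    public static void main(String[] args) {
        int numSplits = 8, numWorkers = 4;
        reserved = new AtomicBoolean[numSplits];
        for (int i = 0; i < numSplits; i++) {
            reserved[i] = new AtomicBoolean(false);
        }
        int total = 0;
        for (int w = 0; w < numWorkers; w++) {
            total += workerRun(w * numSplits / numWorkers);
        }
        System.out.println(total + " splits claimed, barrier ready: "
            + masterBarrierReady());
    }
}
```

The key point the sketch illustrates is the exit condition: a worker leaves the loop as soon as it sees no unreserved split, rather than continuing to poll until everything is also FINISHED.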
By having workers never read this list, you still have a lot of syncing going on as each worker marks its read splits FINISHED, but only ONE reader ever sees it -- the master! THIS is where I think a lot of the "end of the INPUT_SUPERSTEP" speedup is happening.

The "beginning of the INPUT_SUPERSTEP" speedup is mostly due to the code from 301-5 (also included in this 301-6 patch) that simply places each worker at a different starting index, as GIRAPH-250 did, but with locality also maintained. Logging confirms that if you choose a 1-to-1 ratio of workers to splits at the command line, every worker claims its first split with only 1-2 reads to ZK, then does one last unavoidable loop over the RESERVED list to see that no more splits are available, and simply sleeps at the barrier. Combine that with eliminating the extra loops over the FINISHED list before dropping out (from this patch), and you have the full speedup.

Does this sound reasonable? It certainly explains what I see in my logs from trunk, 301-5, and 301-6 runs, and the speed increases over many runs on the same data load and worker counts across these three versions of Giraph. Any other ideas? More important: am I missing anything critical I just didn't tease out in testing, as far as barrier dangers with this modification? I have searched the code for everywhere the FINISHED and RESERVED znodes are touched, and it looks good to me.

> InputSplit Reservations are clumping, leaving many workers asleep while others
> process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch, GIRAPH-301-3.patch,
>                      GIRAPH-301-4.patch, GIRAPH-301-5.patch, GIRAPH-301-6.patch
>
> With recent additions to the codebase, users here have noticed that many workers
> are able to load input splits extremely quickly, and this has altered the
> behavior of Giraph during INPUT_SUPERSTEP under the current algorithm
> for split reservations. A few workers process multiple splits (often
> overwhelming Netty and hitting GC errors as they attempt to offload too much
> data too quickly) while many (often most) of the others just sleep through the
> superstep, never successfully participating at all.
>
> Essentially, the current algorithm is:
> 1. Scan the input split list, skipping nodes that are marked "Finished".
> 2. Grab the first unfinished node in the list (reserved or not) and check its
> reserved status.
> 3. If not reserved, attempt to reserve it and return it if successful.
> 4. If the first one you check is already taken, sleep for way too long and
> only wake up if another worker finishes a split; then contend with that
> worker for another split, while the majority of the split list might sit
> idle, not actually checked or claimed by anyone yet.
>
> This does not work. By making a few simple changes (and acknowledging that ZK
> reads are cheap; only writes are not) this patch is able to get every worker
> involved and keep them in the game, ensuring that the INPUT_SUPERSTEP passes
> quickly and painlessly, without overwhelming Netty, by spreading the
> memory load the split readers bear more evenly.
> If the giraph.splitmb and -w options are set correctly, behavior is now
> exactly as one would expect it to be.
> This also means INPUT_SUPERSTEP passes more quickly, and a given data load
> can survive the INPUT_SUPERSTEP on fewer Hadoop memory slots.
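The staggered starting-index idea from the discussion above (each worker begins its scan at a different offset so they don't all contend for split 0) can be sketched as follows; the names are illustrative, not Giraph's actual code:

```java
// Sketch of the GIRAPH-250-style staggered start: offset each worker into
// the split list proportionally to its task index, so that with a 1-to-1
// worker-to-split ratio every worker's first candidate split is distinct
// and can typically be claimed with only 1-2 ZK reads.
public class StaggeredStartSketch {
    // Hypothetical helper: where worker `workerIndex` begins scanning.
    static int startIndex(int workerIndex, int numWorkers, int numSplits) {
        return (workerIndex * numSplits) / numWorkers;
    }

    public static void main(String[] args) {
        int numWorkers = 4, numSplits = 4;
        // With a 1-to-1 ratio, no two workers start at the same split:
        for (int w = 0; w < numWorkers; w++) {
            System.out.println("worker " + w + " starts at split "
                + startIndex(w, numWorkers, numSplits));
        }
    }
}
```

When there are more splits than workers, the same formula spreads the workers evenly across the list, so the early contention that caused the clumping never forms.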