[ 
https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431218#comment-13431218
 ] 

Eli Reisman commented on GIRAPH-246:
------------------------------------

Yeah the marriage patch is crapping out on me so far :(. 

Jaeho, don't apologize, it's not your fault or your problem; this is Hadoop and 
Giraph not getting along. I am in no way trying to assume I can guess why this 
is happening or why the timeouts still occur. Giraph's points of contact with 
Hadoop are pain points sometimes, and there's lots to do around interfacing 
better with the Hadoop infrastructure. I think you're going to be very 
satisfied when you apply this sort of thinking to other parts of Giraph and the 
code will run beautifully. It's not all like this, I swear!

So 246-8 and 246-9 are probably suspect. I think 246-7 is the last rebase of 
the revert code that I got to run, but I need to verify it. My goal at this 
point is getting the timeouts to disappear while we open a window to solve this 
problem without racing the clock. I'm willing to attempt tests on any/all 
solution patches people have today, so I'm taking numbers now, speak up!

I should not have tried to figure out the predicate lock solution myself at the 
last minute, but everyone wants to keep that code in and I want to stop the 
timeouts, and I was hoping we could have our cake and eat it too. If the 
solution is leaving this alone while I patch in the old patch and keep rebasing 
it for a while for the users here, that's perfectly fine with me. I'm just glad 
we can start testing application code (and extending the scale out!). Why 
don't you guys decide how to move forward with this, and I'll work around it as 
I need to. If you decide to patch in some part of this code and need me to 
clean it up, I'm happy to do that too.

I wish everyone on this project had the opportunity I've had to really ramp 
this thing up the last few months on a big cluster and see what it can do. If I 
told you, I'd have to kill you, but you'd die smiling. :)

As I told Jakob recently, most of the "bottlenecks" I have discovered while 
attempting scale out have been bug fixes, not overhauls. I think you would be 
really proud to know how close this thing is to being a powerful bulk 
processing tool right now, today. It hasn't been hard for me to evangelize about 
this project around here; people are ready for a solution like this. It's very 
exciting stuff.

In short, this sort of messy frustration with Giraph is the exception, not the 
norm, in my mind, and I hope Jaeho or anyone new getting involved will 
recognize that. It's no accident we are out of the incubator; this thing is no 
toy. Kudos to all of you.

                
> Periodic worker calls to context.progress() will prevent timeout on some 
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-1.patch, GIRAPH-246-2.patch, 
> GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, 
> GIRAPH-246-6.patch, GIRAPH-246-7.patch, GIRAPH-246-8.patch, GIRAPH-246-9.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to 
> control the time between calls to context.progress(), which allows workers to 
> avoid timeouts during long data load-ins in which some workers complete their 
> input split reads much faster than others, or finish a superstep faster. I 
> found this allowed jobs that were large-scale but with low memory overhead to 
> complete even when they would previously time out during runs on a Hadoop 
> cluster. Timeout is still possible when the worker crashes, runs out of 
> memory, or has other legitimate GC or RPC trouble, but the change prevents 
> unintentional crashes when the worker is actually still healthy.
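
For anyone skimming the thread, the idea in the description above can be 
sketched roughly as follows. This is not the code from any of the attached 
patches. The class name, the ProgressReporter interface (a local stand-in for 
the progress() call on the Hadoop task context), and the use of a 
CountDownLatch as the "barrier" are all assumptions made so the example is 
self-contained. The only point it shows is waiting in short slices and 
reporting progress between slices, instead of blocking in one long wait.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch only; names here are illustrative, not Giraph's.
final class BarrierWaitSketch {

    /** Local stand-in for the progress() call on a Hadoop task context. */
    interface ProgressReporter {
        void progress();
    }

    /**
     * Blocks until the barrier opens. Each time a wait slice of
     * intervalMillis elapses with the barrier still closed, calls
     * reporter.progress() so the framework sees the task as alive.
     * Returns how many times progress() was called.
     */
    static int waitOnBarrier(CountDownLatch barrier,
                             long intervalMillis,
                             ProgressReporter reporter)
            throws InterruptedException {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("intervalMillis must be > 0");
        }
        int calls = 0;
        // await() returns false when the slice times out, true once open.
        while (!barrier.await(intervalMillis, TimeUnit.MILLISECONDS)) {
            reporter.progress();
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch barrier = new CountDownLatch(1);
        AtomicInteger reported = new AtomicInteger();

        // Simulates a slower worker that reaches the barrier late.
        Thread slowWorker = new Thread(() -> {
            try {
                Thread.sleep(300);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            barrier.countDown();
        });
        slowWorker.start();

        int calls = waitOnBarrier(barrier, 50, reported::incrementAndGet);
        slowWorker.join();
        System.out.println("progress() calls while waiting: " + calls);
    }
}
```

The interval would need to stay well under the cluster's task timeout. A 
worker that has really crashed or hung never reaches this loop, so it still 
times out as the description says it should.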

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
