[ 
https://issues.apache.org/jira/browse/GIRAPH-267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425016#comment-13425016
 ] 

Eli Reisman commented on GIRAPH-267:
------------------------------------

Still can't get this to run without timing out. Sorry I was not clear above: I 
have been using the same testing parameters I always do. What I meant is that 
you can trick Giraph into taking a bit more data before timing out by setting a 
small giraph.splitmb (which makes HDFS unhappy) and running many more workers, 
so that each one gets a small input chunk to read and finishes before the 
timeout occurs. This is not a good trick for realistic scale-out, so I tend to 
avoid it, but I was trying to see how far this patch would carry load-wise 
given the 600-second timeout barrier. 
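
To make that concrete, the kind of settings I mean look roughly like this (a 
sketch only: giraph.splitmb is the option discussed in this thread, the 
worker-count keys are my best guess at what GiraphJob calls them, and all 
values are arbitrary):

    import org.apache.hadoop.conf.Configuration;

    public class SmallSplitTrick {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Tiny input splits so each worker only has a small chunk to read
        // before the 600-second task timeout hits (value is arbitrary).
        conf.setInt("giraph.splitmb", 8);
        // Far more workers than the data really needs, so each per-worker
        // read finishes quickly (key names assumed, not verified).
        conf.setInt("giraph.minWorkers", 200);
        conf.setInt("giraph.maxWorkers", 200);
      }
    }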

This trick let me get to 40% of my previous loads before blowing up; it would 
be considerably less if I ran this patch with a more realistic ratio of workers 
to splitmb size. At 600 seconds, workers still drop off and the job dies. Has 
anyone else run this at scale on real data, and what is your recipe for getting 
past the 600-second timeout? With normal settings I can't run more than 25% of 
my old data loads and still finish a job.
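
For reference, the 600-second barrier is Hadoop's mapred.task.timeout (600000 
ms by default on MRv1 clusters), so one blunt workaround while testing is to 
raise it. A minimal sketch, with an arbitrary 30-minute value:

    import org.apache.hadoop.conf.Configuration;

    public class RaiseTaskTimeout {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Push the task timeout past the stock 600 s while testing the patch.
        // mapred.task.timeout is in milliseconds; 30 minutes is arbitrary.
        conf.setLong("mapred.task.timeout", 30L * 60 * 1000);
      }
    }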
                
> Jobs can get killed for not reporting status during INPUT SUPERSTEP
> -------------------------------------------------------------------
>
>                 Key: GIRAPH-267
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-267
>             Project: Giraph
>          Issue Type: Bug
>          Components: graph
>    Affects Versions: 0.2.0
>         Environment: Facebook Hadoop
>            Reporter: Jaeho Shin
>            Assignee: Jaeho Shin
>             Fix For: 0.2.0
>
>         Attachments: 
> 0001-Made-PredicateLock-report-progress-and-removed-Conte.patch, 
> GIRAPH-267.patch, GIRAPH-267.patch
>
>
> A job with a skewed and long (>600 secs in my case) INPUT_SUPERSTEP fails 
> because some tasks do not report their status.  From BspServiceWorker#setup(), 
> I could tell that while some workers were still loading inputSplits, others 
> finished theirs early, hung on PredicateLock#waitForever(), and got killed 
> after the timeout.
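
For anyone following along, the idea in the attached 
0001-Made-PredicateLock-report-progress-and-removed-Conte.patch is, roughly, 
to have the blocking wait wake up periodically and report progress so Hadoop 
does not kill the task. A minimal sketch of that pattern, not the actual patch 
(the class name, field names, and the 15-second interval are made up; only 
org.apache.hadoop.util.Progressable is real Hadoop API):

    import org.apache.hadoop.util.Progressable;

    // Sketch: a lock whose waitForever() reports progress between wakeups,
    // so a worker that finished its input splits early is not timed out
    // while it waits for slower workers during the INPUT_SUPERSTEP.
    public class ProgressReportingLock {
      private static final long WAKE_UP_MSECS = 15 * 1000; // assumed interval
      private final Object lock = new Object();
      private boolean eventOccurred = false;

      // Signal all waiters that the event has occurred.
      public void signal() {
        synchronized (lock) {
          eventOccurred = true;
          lock.notifyAll();
        }
      }

      // Wait until signal() is called, reporting progress to Hadoop
      // between wakeups so the task is not killed for inactivity.
      public void waitForever(Progressable progressable) {
        synchronized (lock) {
          while (!eventOccurred) {
            try {
              lock.wait(WAKE_UP_MSECS);
            } catch (InterruptedException e) {
              throw new IllegalStateException("waitForever: interrupted", e);
            }
            progressable.progress();
          }
        }
      }
    }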
