[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Jessica J (JIRA) Fri, 27 Jul 2012 10:07:37 -0700

    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423976#comment-13423976 ]


Jessica J commented on MESOS-206:
---------------------------------

OK, so now that the leap second issue is out of the way (which makes the 
excessive resource usage irrelevant), my Hadoop jobs are still failing to 
complete. (Essentially, everything except my last four comments still applies.) 
When I first set up a cluster, I can run through my large job a couple of 
times successfully. (I'm doing benchmarks, so I'm running each job multiple 
times.) On my third try, however, the job fails and the issues described 
above come into play: the framework disconnects and task numbers are no 
longer recognized. Is Mesos not completely releasing resources (e.g., file 
handles) that it should be?
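
A rough way to test the file-handle hypothesis is to count the open file
descriptors of the Mesos and JobTracker processes after each run and see
whether the count keeps growing. The sketch below is only an illustration
under stated assumptions: it assumes a Linux host with /proc mounted, and the
function names (count_open_fds, fd_growth) and the way the PID is obtained
are hypothetical, not part of Mesos or Hadoop.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of open file descriptors for a process.

    Assumption: Linux, where each open descriptor of a process appears as
    an entry under <proc_root>/<pid>/fd. Reading another user's process
    usually requires matching ownership or root.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    return len(os.listdir(fd_dir))


def fd_growth(samples):
    """Given fd counts taken after each run, return the change between
    consecutive runs. Steadily positive values would hint at a leak."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]
```

To use it, record count_open_fds(pid) for the same process after each
benchmark run and pass the list to fd_growth. A count that stays flat across
runs would suggest that file handles are not the resource being exhausted.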
                
> Long-running jobs on Hadoop framework do not run to completion
> --------------------------------------------------------------
>
>                 Key: MESOS-206
>                 URL: https://issues.apache.org/jira/browse/MESOS-206
>             Project: Mesos
>          Issue Type: Bug
>          Components: framework
>            Reporter: Jessica J
>            Priority: Blocker
>
> When I run the MPI and Hadoop frameworks simultaneously with long-running 
> jobs, the Hadoop jobs fail to complete. The MPI job, which is shorter, 
> completes normally, and the Hadoop framework continues for a while, but 
> eventually, although it appears to still be running, it stops making progress 
> on the jobs. The jobtracker keeps running, but each line of output indicates 
> no map or reduce tasks are actually being executed:
> 12/06/08 10:55:41 INFO mapred.FrameworkScheduler: Assigning tasks for 
> [slavehost] with 0 map slots and 0 reduce slots
> I've examined the master's log and noticed this:
> I0608 10:40:43.106740  6317 master.cpp:681] Deactivating framework 
> 201206080825-36284608-5050-6311-0000 as requested by 
> scheduler(1)@[my-ip]:59317
> The framework ID is that of the Hadoop framework. This message is followed by 
> messages indicating the slaves "couldn't lookup task [#]" and "couldn't 
> lookup framework 201206080825-36284608-5050-6311-0000."
> I thought the first time that this error was a fluke since it does not happen 
> with shorter running jobs or with the Hadoop framework running independently 
> (i.e., no MPI), but I have now consistently reproduced it 4 times.
> UPDATE: I just had the same issue occur when running Hadoop + Mesos without 
> the MPI framework running simultaneously.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
