Jessica J created MESOS-206:
-------------------------------

             Summary: Long-running jobs on Hadoop framework do not run to completion
                 Key: MESOS-206
                 URL: https://issues.apache.org/jira/browse/MESOS-206
             Project: Mesos
          Issue Type: Bug
          Components: framework
            Reporter: Jessica J
            Priority: Blocker


When I run the MPI and Hadoop frameworks simultaneously with long-running jobs, 
the Hadoop jobs fail to complete. The MPI job, which is shorter, completes 
normally. The Hadoop framework continues for a while, but eventually it stops 
making progress on its jobs even though it still appears to be running. The 
jobtracker keeps running, but each line of its output indicates that no map or 
reduce tasks are actually being executed:

12/06/08 10:55:41 INFO mapred.FrameworkScheduler: Assigning tasks for 
[slavehost] with 0 map slots and 0 reduce slots

I've examined the master's log and noticed this:

I0608 10:40:43.106740  6317 master.cpp:681] Deactivating framework 
201206080825-36284608-5050-6311-0000 as requested by scheduler(1)@[my-ip]:59317

The framework ID is that of the Hadoop framework. This message is followed by 
messages indicating the slaves "couldn't lookup task [#]" and "couldn't lookup 
framework 201206080825-36284608-5050-6311-0000."
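
To check whether the same deactivation shows up on every reproduction, the 
master log can be scanned with a small script. The following is only a minimal 
sketch and not part of Mesos: the function name find_deactivations and the 
regular expression are assumptions modeled on the single log line quoted above, 
and other Mesos versions may word the message differently.

```python
import re

# Pattern modeled on the one master log line quoted in this report (assumption:
# the message wording is stable across the log being scanned).
DEACTIVATION = re.compile(
    r"Deactivating framework\s+(?P<framework_id>\S+)"
    r"\s+as requested by\s+(?P<requester>\S+)"
)


def find_deactivations(log_text):
    """Return (framework_id, requester) pairs for each deactivation message."""
    # Long log lines may be wrapped (as in this email), so collapse all
    # whitespace into single spaces before matching.
    flattened = " ".join(log_text.split())
    return [
        (m.group("framework_id"), m.group("requester"))
        for m in DEACTIVATION.finditer(flattened)
    ]


# The log excerpt from this report, including its line wrap.
sample = (
    "I0608 10:40:43.106740  6317 master.cpp:681] Deactivating framework \n"
    "201206080825-36284608-5050-6311-0000 as requested by "
    "scheduler(1)@[my-ip]:59317\n"
)

for framework_id, requester in find_deactivations(sample):
    print(framework_id, "deactivated at the request of", requester)
```

Comparing the reported framework ID against the Hadoop framework's ID on each 
run would show whether the scheduler itself requests the deactivation every 
time the jobs stall.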

At first I thought this error was a fluke, since it does not happen with 
shorter-running jobs or when the Hadoop framework runs on its own (i.e., 
without MPI), but I have now reproduced it consistently, 4 times.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
