[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Jessica J (JIRA) Thu, 02 Aug 2012 08:40:04 -0700

    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427393#comment-13427393 ]


Jessica J commented on MESOS-206:
---------------------------------

Some new messages in the JobTracker log I'm seeing with the latest code:

12/08/02 10:05:20 INFO mapred.TaskInProgress: Error from attempt_201208020749_0150_r_000047_0: java.lang.IllegalArgumentException: Null user
    at org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:762)
    at org.apache.hadoop.mapred.Child.main(Child.java:241)

...
(4 minutes later)

12/08/02 10:09:47 WARN mapred.FrameworkScheduler: SchedulerDriver returned irregular status: DRIVER_ABORTED

All other log activity and error messages are the same as in my previous 
comments. The "irregular status" message comes from 
resourceOffers(SchedulerDriver d, List<Offer> offers) in 
FrameworkScheduler.java (line 245), after it attempts to launch tasks. If I've 
followed the code path correctly, this call leads to line 172 
(JNIExecutor::launchTask) in org_apache_mesos_MesosExecutorDriver.cpp, where a 
Java exception appears to be causing the driver to abort. The exception that 
occurs closest to the time of DRIVER_ABORTED is the first status update for an 
unknown task:

Exception in thread "Thread-150424" java.lang.RuntimeException: Received status update for unknown task value: "56357"
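
To make the suspected chain concrete, here is a minimal, self-contained Java 
sketch of it. It is not the Hadoop-on-Mesos code and uses no Mesos API: the 
class StatusUpdateSketch and the methods register, strictStatusUpdate, 
tolerantStatusUpdate, and dispatch are all hypothetical names, and the rule 
"an exception escaping a callback aborts the driver" is an assumption taken 
from the observation above, not confirmed behavior.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a scheduler; none of these names are Mesos or Hadoop APIs.
public class StatusUpdateSketch {
    private final Set<String> knownTasks = new HashSet<>();

    // Record a task ID the scheduler believes it launched.
    public void register(String taskId) {
        knownTasks.add(taskId);
    }

    // Mirrors the observed behavior: an unknown task ID raises an unchecked exception.
    public void strictStatusUpdate(String taskId) {
        if (!knownTasks.contains(taskId)) {
            throw new RuntimeException(
                "Received status update for unknown task " + taskId);
        }
    }

    // Alternative: report the unknown task ID and carry on.
    public boolean tolerantStatusUpdate(String taskId) {
        if (!knownTasks.contains(taskId)) {
            System.err.println("Ignoring status update for unknown task " + taskId);
            return false;
        }
        return true;
    }

    // Simulated driver (assumption): any exception escaping a callback aborts it.
    public static String dispatch(Runnable callback) {
        try {
            callback.run();
            return "DRIVER_RUNNING";
        } catch (RuntimeException e) {
            return "DRIVER_ABORTED";
        }
    }

    public static void main(String[] args) {
        StatusUpdateSketch scheduler = new StatusUpdateSketch();
        scheduler.register("1");
        // Task 56357 was never registered, as in the log above.
        System.out.println(dispatch(() -> scheduler.strictStatusUpdate("56357")));
        System.out.println(dispatch(() -> scheduler.tolerantStatusUpdate("56357")));
    }
}
```

If the assumption holds, the strict handler is what turns a single stray 
status update into DRIVER_ABORTED, while the tolerant one would leave the 
driver running. Whether ignoring unknown tasks is actually safe for the 
framework is a separate question this sketch does not answer.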

Any ideas on how all of this is interrelated, and what's causing it?
                
> Long-running jobs on Hadoop framework do not run to completion
> --------------------------------------------------------------
>
>                 Key: MESOS-206
>                 URL: https://issues.apache.org/jira/browse/MESOS-206
>             Project: Mesos
>          Issue Type: Bug
>          Components: framework
>            Reporter: Jessica J
>            Priority: Blocker
>
> When I run the MPI and Hadoop frameworks simultaneously with long-running 
> jobs, the Hadoop jobs fail to complete. The MPI job, which is shorter, 
> completes normally, and the Hadoop framework continues for a while, but 
> eventually, although it appears to still be running, it stops making progress 
> on the jobs. The jobtracker keeps running, but each line of output indicates 
> no map or reduce tasks are actually being executed:
> 12/06/08 10:55:41 INFO mapred.FrameworkScheduler: Assigning tasks for [slavehost] with 0 map slots and 0 reduce slots
> I've examined the master's log and noticed this:
> I0608 10:40:43.106740  6317 master.cpp:681] Deactivating framework 201206080825-36284608-5050-6311-0000 as requested by scheduler(1)@[my-ip]:59317
> The framework ID is that of the Hadoop framework. This message is followed by 
> messages indicating the slaves "couldn't lookup task [#]" and "couldn't 
> lookup framework 201206080825-36284608-5050-6311-0000."
> I thought the first time that this error was a fluke since it does not happen 
> with shorter running jobs or with the Hadoop framework running independently 
> (i.e., no MPI), but I have now consistently reproduced it 4 times.
> UPDATE: I just had the same issue occur when running Hadoop + Mesos without 
> the MPI framework running simultaneously.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
