[ https://issues.apache.org/jira/browse/MAPREDUCE-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sam liu updated MAPREDUCE-4490:
-------------------------------
    Attachment: MAPREDUCE-4490.patch

New patch based on the latest branch origin/branch-1.2

> JVM reuse is incompatible with LinuxTaskController (and therefore incompatible with Security)
> ---------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4490
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4490
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task-controller, tasktracker
>    Affects Versions: 0.20.205.0, 1.0.3, 1.2.1
>            Reporter: George Datskos
>            Assignee: sam liu
>            Priority: Critical
>              Labels: patch
>             Fix For: 1.2.1
>
>         Attachments: MAPREDUCE-4490.patch, MAPREDUCE-4490.patch, MAPREDUCE-4490.patch
>
>
> When using LinuxTaskController, JVM reuse (mapred.job.reuse.jvm.num.tasks > 1)
> with more map tasks in a job than there are map slots in the cluster will
> result in immediate task failures for the second task in each JVM (and then
> the JVM exits). We have investigated this bug and the root cause is as
> follows. When using LinuxTaskController, the userlog directory for a task
> attempt (../userlogs/job/task-attempt) is created only on the first
> invocation (when the JVM is launched) because userlogs directories are
> created by the task-controller binary, which only runs *once* per JVM.
> Therefore, attempting to create log.index for any later task is guaranteed
> to fail with ENOENT, leading to immediate task failure and child JVM exit.
> {quote}
> 2012-07-24 14:29:11,914 INFO org.apache.hadoop.mapred.TaskLog: Starting logging for a new task attempt_201207241401_0013_m_000027_0 in the same JVM as that of the first task /var/log/hadoop/mapred/userlogs/job_201207241401_0013/attempt_201207241401_0013_m_000006_0
> 2012-07-24 14:29:11,915 WARN org.apache.hadoop.mapred.Child: Error running child
> ENOENT: No such file or directory
>         at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
>         at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:161)
>         at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:296)
>         at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:369)
>         at org.apache.hadoop.mapred.Child.main(Child.java:229)
> {quote}
> The above error occurs in a JVM which runs tasks 6 and 27. Task 6 goes
> smoothly. Then task 27 starts. The directory
> /var/log/hadoop/mapred/userlogs/job_201207241401_0013/attempt_201207241401_0013_m_000027_0
> is never created, so when mapred.Child tries to write the log.index file for
> task 27, it fails with ENOENT because the
> attempt_201207241401_0013_m_000027_0 directory does not exist. Therefore,
> the second task in each JVM is guaranteed to fail (and then the JVM exits)
> every time when using LinuxTaskController. Note that this problem does not
> occur when using the DefaultTaskController because the userlogs directories
> are created for each task (not just for each JVM as with LinuxTaskController).
> For each task, the TaskRunner calls the TaskController's createLogDir method
> before attempting to write out an index file.
> * DefaultTaskController#createLogDir: creates log directory for each task
> * LinuxTaskController#createLogDir: does nothing
> ** task-controller binary creates log directory [create_attempt_directories] (but only for the first task)
> Possible Solution: add a new command to task-controller, *initialize task*,
> to create attempt directories.
> Call that command, with ShellCommandExecutor, in the
> LinuxTaskController#createLogDir method.
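
The failure mode described in the issue can be reproduced outside Hadoop with plain java.nio. The sketch below is only an illustration under stated assumptions: the class and method names (AttemptLogDirSketch, createLogDir, writeIndexFile, tryWriteIndex) are hypothetical and are not Hadoop APIs, the file content is a placeholder, and a temporary directory stands in for the userlogs job directory. It shows that writing log.index into an attempt directory that was never created fails, and that creating the directory per task (rather than once per JVM) avoids the failure.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical illustration only; none of these names exist in Hadoop.
class AttemptLogDirSketch {

    // Stand-in for a per-task createLogDir: creates the attempt directory
    // if it is missing and does nothing if it already exists.
    static void createLogDir(Path jobLogDir, String attemptId) throws IOException {
        Files.createDirectories(jobLogDir.resolve(attemptId));
    }

    // Stand-in for writing log.index. It does NOT create the parent
    // directory, so it fails when the attempt directory is absent.
    static void writeIndexFile(Path jobLogDir, String attemptId) throws IOException {
        Path index = jobLogDir.resolve(attemptId).resolve("log.index");
        Files.write(index, "placeholder\n".getBytes(StandardCharsets.UTF_8));
    }

    // Returns true if the index file was written, false if the write
    // failed because the attempt directory does not exist.
    static boolean tryWriteIndex(Path jobLogDir, String attemptId) throws IOException {
        try {
            writeIndexFile(jobLogDir, attemptId);
            return true;
        } catch (NoSuchFileException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path jobLogDir = Files.createTempDirectory("userlogs-job");
        String first = "attempt_201207241401_0013_m_000006_0";
        String second = "attempt_201207241401_0013_m_000027_0";

        // Directory created once, when the JVM is launched.
        createLogDir(jobLogDir, first);
        System.out.println("first task wrote index: " + tryWriteIndex(jobLogDir, first));

        // Reused JVM, second task: nobody created its directory.
        System.out.println("second task wrote index: " + tryWriteIndex(jobLogDir, second));

        // Per-task directory creation, as DefaultTaskController is described to do.
        createLogDir(jobLogDir, second);
        System.out.println("second task after createLogDir: " + tryWriteIndex(jobLogDir, second));
    }
}
```

In the real fix the directory would presumably have to be created by the setuid task-controller binary so that ownership and permissions are correct, which is why the issue proposes a new task-controller command rather than a plain mkdir in Java; the sketch ignores that aspect.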