[jira] [Commented] (MAPREDUCE-3859) CapacityScheduler incorrectly utilizes extra-resources of queue for high-memory jobs
[ https://issues.apache.org/jira/browse/MAPREDUCE-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207611#comment-13207611 ]

Sergey Tryuber commented on MAPREDUCE-3859:
---

Sorry, I don't know exactly which version of Hadoop is used in the cdh3u1 distribution. There we have the following lines in *CapacitySchedulerQueue.java*, in the *assignSlotsToJob* method:

{code}
int queueSlotsOccupied = getNumSlotsOccupied(taskType);
int currentCapacity;
if (queueSlotsOccupied < queueCapacity) {
  currentCapacity = queueCapacity;
} else {
  currentCapacity = queueSlotsOccupied + numSlotsRequested;
}
{code}

Imagine we have a job with 1 slot per task. If we have a queue with a configured capacity of 10 and 9 occupied slots (and assume a large maximum capacity and a lot of free slots on the cluster), then _currentCapacity=10_ and the task will be scheduled properly. Later, when we have 10 occupied slots, _currentCapacity=11_ and all will be fine too. And so on...

Now imagine we have a job with 3 slots per task. If we have a queue with a configured capacity of 10 and 9 occupied slots, then _currentCapacity=10_, but that is not enough to schedule this new task, which would bring the queue to 9 + 3 = 12 slots. So this job will never use more than 9 slots!

I've fixed this problem by changing:

{code}
if (queueSlotsOccupied < queueCapacity) {
{code}

to

{code}
if (queueSlotsOccupied + numSlotsRequested <= queueCapacity) {
{code}

I've rebuilt cdh3u1 from sources and deployed the jar on the cluster, and CapacityScheduler now works well for me.

I've also checked out the current Hadoop trunk. Unfortunately, the sources of CapacityScheduler have changed dramatically. But I've found similar lines in *LeafQueue.java*, in the *computeUserLimit* method:

{code}
final int currentCapacity = (consumed < queueCapacity) ?
    queueCapacity : (consumed + required.getMemory());
{code}

So it seems to me that this bug also affects the latest CapacityScheduler.

CapacityScheduler incorrectly utilizes extra-resources of queue for high-memory jobs

Key: MAPREDUCE-3859
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3859
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: contrib/capacity-sched
Environment: CDH3u1
Reporter: Sergey Tryuber

Imagine we have a queue A with a capacity of 10 slots and 20 as extra capacity. Jobs which use 3 map slots per task will never consume more than 9 slots, regardless of how many free slots there are on the cluster.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
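The arithmetic in the comment above can be replayed with a small standalone sketch. It is a toy model, not the Hadoop source: it assumes the comparison operators missing from the quoted snippets are `<` and `<=` (the worked numbers in the comment only fit those), and the class name, the method names, and the `fits` admission check are hypothetical. The real scheduler also enforces a maximum capacity, which is ignored here.

```java
/** Toy model of the queue-limit check discussed in the comment; not Hadoop code. */
public class CapacityLimitSketch {

    /** Limit as in the quoted snippet: compares only the occupied slots to the capacity. */
    public static int originalLimit(int occupied, int capacity, int requested) {
        return (occupied < capacity) ? capacity : occupied + requested;
    }

    /** Limit with the proposed change: adds the request before comparing. */
    public static int proposedLimit(int occupied, int capacity, int requested) {
        return (occupied + requested <= capacity) ? capacity : occupied + requested;
    }

    /** Hypothetical admission check: the task fits if the queue stays within the limit. */
    public static boolean fits(int occupied, int requested, int limit) {
        return occupied + requested <= limit;
    }

    public static void main(String[] args) {
        int capacity = 10;
        int occupied = 9;
        // One task needing 1 slot, then one task needing 3 slots, as in the comment.
        for (int requested : new int[] {1, 3}) {
            int before = originalLimit(occupied, capacity, requested);
            int after = proposedLimit(occupied, capacity, requested);
            System.out.printf(
                "requested=%d  original: limit=%d fits=%b  proposed: limit=%d fits=%b%n",
                requested, before, fits(occupied, requested, before),
                after, fits(occupied, requested, after));
        }
    }
}
```

With 9 of 10 slots occupied, the 1-slot task fits under both variants, while the 3-slot task is rejected by the original check (limit 10) and admitted by the proposed one (limit 12).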
[jira] [Created] (MAPREDUCE-3859) CapacityScheduler incorrectly utilizes extra-resources of queue for high-memory jobs
CapacityScheduler incorrectly utilizes extra-resources of queue for high-memory jobs

Key: MAPREDUCE-3859
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3859
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: contrib/capacity-sched
Environment: CDH3u1
Reporter: Sergey Tryuber

Imagine we have a queue A with a capacity of 10 slots and 20 as extra capacity. Jobs which use 3 map slots per task will never consume more than 9 slots, regardless of how many free slots there are on the cluster.
[jira] [Commented] (MAPREDUCE-3837) Hadoop 22 Job tracker is not able to recover job in case of crash and after that no user can submit job.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207663#comment-13207663 ]

Hudson commented on MAPREDUCE-3837:
---

Integrated in Hadoop-Hdfs-trunk #955 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/955/])
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)

Result = FAILURE
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java

Hadoop 22 Job tracker is not able to recover job in case of crash and after that no user can submit job.

Key: MAPREDUCE-3837
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3837
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 0.22.0
Reporter: Mayank Bansal
Assignee: Mayank Bansal
Fix For: 0.24.0, 0.22.1, 0.23.2
Attachments: PATCH-MAPREDUCE-3837.patch, PATCH-TRUNK-MAPREDUCE-3837.patch

If the job tracker crashes while jobs are running, and the job tracker's property mapreduce.jobtracker.restart.recover is true, then it should recover those jobs. However, the current behavior is as follows: the jobtracker tries to restore the jobs but cannot. After that, the jobtracker closes its handle to HDFS and nobody else can submit a job.

Thanks,
Mayank
[jira] [Commented] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases
[ https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207664#comment-13207664 ]

Hudson commented on MAPREDUCE-3846:
---

Integrated in Hadoop-Hdfs-trunk #955 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/955/])
MAPREDUCE-3846. Addressed MR AM hanging issues during AM restart and then the recovery. (vinodkv) (Revision 1243752)

Result = FAILURE
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243752
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryEventHandler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/MapTaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/ReduceTaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/Recovery.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapreduce/TypeConverter.java

Restarted+Recovered AM hangs in some corner cases
-

Key: MAPREDUCE-3846
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: mrv2
Affects Versions: 0.23.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Critical
Fix For: 0.23.1
Attachments: MAPREDUCE-3846-20120210.txt, MAPREDUCE-3846-20120210.txt, MAPREDUCE-3846-20120213.txt

[~karams] found this while testing AM restart/recovery feature. After the first generation AM crashes (manually killed by kill -9), the second generation AM starts, but hangs after a while.
[jira] [Commented] (MAPREDUCE-3837) Hadoop 22 Job tracker is not able to recover job in case of crash and after that no user can submit job.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207670#comment-13207670 ]

Hudson commented on MAPREDUCE-3837:
---

Integrated in Hadoop-Hdfs-0.23-Build #168 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/168/])
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243698)

Result = FAILURE
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243698
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java

Hadoop 22 Job tracker is not able to recover job in case of crash and after that no user can submit job.

Key: MAPREDUCE-3837
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3837
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 0.22.0
Reporter: Mayank Bansal
Assignee: Mayank Bansal
Fix For: 0.24.0, 0.22.1, 0.23.2
Attachments: PATCH-MAPREDUCE-3837.patch, PATCH-TRUNK-MAPREDUCE-3837.patch

If the job tracker crashes while jobs are running, and the job tracker's property mapreduce.jobtracker.restart.recover is true, then it should recover those jobs. However, the current behavior is as follows: the jobtracker tries to restore the jobs but cannot. After that, the jobtracker closes its handle to HDFS and nobody else can submit a job.

Thanks,
Mayank
[jira] [Commented] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases
[ https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207671#comment-13207671 ]

Hudson commented on MAPREDUCE-3846:
---

Integrated in Hadoop-Hdfs-0.23-Build #168 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/168/])
MAPREDUCE-3846. Addressed MR AM hanging issues during AM restart and then the recovery. (vinodkv)
svn merge --ignore-ancestry -c 1243752 ../../trunk/ (Revision 1243755)

Result = FAILURE
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243755
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryEventHandler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/MapTaskImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/ReduceTaskImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/Recovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapreduce/TypeConverter.java

Restarted+Recovered AM hangs in some corner cases
-

Key: MAPREDUCE-3846
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: mrv2
Affects Versions: 0.23.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Critical
Fix For: 0.23.1
Attachments: MAPREDUCE-3846-20120210.txt, MAPREDUCE-3846-20120210.txt, MAPREDUCE-3846-20120213.txt

[~karams] found this while testing AM restart/recovery feature. After the first generation AM crashes (manually killed by kill -9), the second generation AM starts, but hangs after a while.
[jira] [Updated] (MAPREDUCE-3859) CapacityScheduler incorrectly utilizes extra-resources of queue for high-memory jobs
[ https://issues.apache.org/jira/browse/MAPREDUCE-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harsh J updated MAPREDUCE-3859:
---

Target Version/s: 1.1.0
Affects Version/s: 1.0.0

This affects 1.x as well, judging by a comparison of the commits to the CapacityScheduler (CS) between CDH3's CS and Apache 1.x.

CapacityScheduler incorrectly utilizes extra-resources of queue for high-memory jobs

Key: MAPREDUCE-3859
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3859
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: contrib/capacity-sched
Affects Versions: 1.0.0
Environment: CDH3u1
Reporter: Sergey Tryuber

Imagine we have a queue A with a capacity of 10 slots and 20 as extra capacity. Jobs which use 3 map slots per task will never consume more than 9 slots, regardless of how many free slots there are on the cluster.
[jira] [Commented] (MAPREDUCE-3859) CapacityScheduler incorrectly utilizes extra-resources of queue for high-memory jobs
[ https://issues.apache.org/jira/browse/MAPREDUCE-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207680#comment-13207680 ]

Harsh J commented on MAPREDUCE-3859:

Sergey,

Thanks for digging into the code and coming up with a fix! Would you be interested in posting a patch for this as well? We'd also require a test case that fails without the fix. Let us know if that's possible, thanks again!

CapacityScheduler incorrectly utilizes extra-resources of queue for high-memory jobs

Key: MAPREDUCE-3859
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3859
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: contrib/capacity-sched
Affects Versions: 1.0.0
Environment: CDH3u1
Reporter: Sergey Tryuber

Imagine we have a queue A with a capacity of 10 slots and 20 as extra capacity. Jobs which use 3 map slots per task will never consume more than 9 slots, regardless of how many free slots there are on the cluster.
[jira] [Commented] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases
[ https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207711#comment-13207711 ]

Hudson commented on MAPREDUCE-3846:
---

Integrated in Hadoop-Mapreduce-0.23-Build #196 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/196/])
MAPREDUCE-3846. Addressed MR AM hanging issues during AM restart and then the recovery. (vinodkv)
svn merge --ignore-ancestry -c 1243752 ../../trunk/ (Revision 1243755)

Result = FAILURE
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243755
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryEventHandler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/MapTaskImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/ReduceTaskImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/Recovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapreduce/TypeConverter.java

Restarted+Recovered AM hangs in some corner cases
-

Key: MAPREDUCE-3846
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: mrv2
Affects Versions: 0.23.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Critical
Fix For: 0.23.1
Attachments: MAPREDUCE-3846-20120210.txt, MAPREDUCE-3846-20120210.txt, MAPREDUCE-3846-20120213.txt

[~karams] found this while testing AM restart/recovery feature. After the first generation AM crashes (manually killed by kill -9), the second generation AM starts, but hangs after a while.
[jira] [Commented] (MAPREDUCE-3837) Hadoop 22 Job tracker is not able to recover job in case of crash and after that no user can submit job.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207716#comment-13207716 ]

Hudson commented on MAPREDUCE-3837:
---

Integrated in Hadoop-Mapreduce-trunk #990 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/990/])
MAPREDUCE-3837. Job tracker is not able to recover jobs after crash. Contributed by Mayank Bansal. (Revision 1243695)

Result = SUCCESS
shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243695
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/JobTracker.java

Hadoop 22 Job tracker is not able to recover job in case of crash and after that no user can submit job.

Key: MAPREDUCE-3837
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3837
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 0.22.0
Reporter: Mayank Bansal
Assignee: Mayank Bansal
Fix For: 0.24.0, 0.22.1, 0.23.2
Attachments: PATCH-MAPREDUCE-3837.patch, PATCH-TRUNK-MAPREDUCE-3837.patch

If the job tracker crashes while jobs are running, and the job tracker's property mapreduce.jobtracker.restart.recover is true, then it should recover those jobs. However, the current behavior is as follows: the jobtracker tries to restore the jobs but cannot. After that, the jobtracker closes its handle to HDFS and nobody else can submit a job.

Thanks,
Mayank
[jira] [Commented] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases
[ https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207717#comment-13207717 ]

Hudson commented on MAPREDUCE-3846:
---

Integrated in Hadoop-Mapreduce-trunk #990 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/990/])
MAPREDUCE-3846. Addressed MR AM hanging issues during AM restart and then the recovery. (vinodkv) (Revision 1243752)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1243752
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryEventHandler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/MapTaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/ReduceTaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/Recovery.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapreduce/TypeConverter.java

Restarted+Recovered AM hangs in some corner cases
-

Key: MAPREDUCE-3846
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: mrv2
Affects Versions: 0.23.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Critical
Fix For: 0.23.1
Attachments: MAPREDUCE-3846-20120210.txt, MAPREDUCE-3846-20120210.txt, MAPREDUCE-3846-20120213.txt

[~karams] found this while testing AM restart/recovery feature. After the first generation AM crashes (manually killed by kill -9), the second generation AM starts, but hangs after a while.
[jira] [Commented] (MAPREDUCE-3825) MR should not be getting duplicate tokens for a MR Job.
[ https://issues.apache.org/jira/browse/MAPREDUCE-3825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207767#comment-13207767 ]

Daryn Sharp commented on MAPREDUCE-3825:

If there are external multi-token filesystems that currently work, then they have implemented {{getDelegationTokens(renewer, creds)}}. Those filesystems will continue to work so long as {{getDelegationTokens(renewer, creds)}} isn't marked {{final}}, as was also proposed. Without {{final}}, the proposal is completely backwards compatible.

MR should not be getting duplicate tokens for a MR Job.
---

Key: MAPREDUCE-3825
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3825
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: security
Affects Versions: 0.23.1, 0.24.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Attachments: MAPREDUCE-3825.patch, TokenCache.pdf

This is the counterpart to HADOOP-7967. MR gets tokens for all input, output and the default filesystem when a MR job is submitted. The APIs in FileSystem make it challenging to avoid duplicate tokens when there are file systems that have embedded filesystems.

Here is the original description that Daryn wrote:

The token cache currently tries to assume a filesystem's token service key. The assumption generally worked while there was a one-to-one mapping of filesystem to token. With the advent of multi-token filesystems like viewfs, the token cache will try to use a service key (i.e. for viewfs) that will never exist (because it really gets the mounted fs tokens). The descriop
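The service-key mismatch described in the issue can be pictured with a toy cache. This is only an illustrative sketch under stated assumptions: it does not use Hadoop's `TokenCache` or `FileSystem` classes, and every name in it (`TokenCacheSketch`, `ToyFs`, `obtainByAssumedKey`, `obtainByActualServices`, the `nn1:8020`-style service strings) is hypothetical. It only shows why looking up a multi-token filesystem under its own key never hits, so the underlying tokens get fetched again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy token cache; a hypothetical illustration, not Hadoop's TokenCache. */
public class TokenCacheSketch {

    /** Hypothetical filesystem: its own key plus the services that actually issue tokens. */
    public static final class ToyFs {
        final String serviceKey;
        final List<String> tokenServices;

        public ToyFs(String serviceKey, List<String> tokenServices) {
            this.serviceKey = serviceKey;
            this.tokenServices = tokenServices;
        }
    }

    private final Map<String, String> tokens = new HashMap<>();
    private int fetches = 0;

    public int fetchCount() {
        return fetches;
    }

    /** Assumes the filesystem's own key names its token; never hits for a multi-token fs. */
    public void obtainByAssumedKey(ToyFs fs) {
        if (tokens.containsKey(fs.serviceKey)) {
            return;
        }
        fetch(fs.tokenServices);
    }

    /** Checks each underlying service, so tokens already held are not fetched again. */
    public void obtainByActualServices(ToyFs fs) {
        List<String> missing = new ArrayList<>();
        for (String service : fs.tokenServices) {
            if (!tokens.containsKey(service)) {
                missing.add(service);
            }
        }
        fetch(missing);
    }

    /** Stores one token per service and counts each fetch. */
    private void fetch(List<String> services) {
        for (String service : services) {
            tokens.put(service, "token-for-" + service);
            fetches++;
        }
    }

    public static void main(String[] args) {
        ToyFs view = new ToyFs("viewfs://cluster", Arrays.asList("nn1:8020", "nn2:8020"));

        TokenCacheSketch assumed = new TokenCacheSketch();
        assumed.obtainByAssumedKey(view);
        assumed.obtainByAssumedKey(view);
        System.out.println("assumed-key fetches: " + assumed.fetchCount());

        TokenCacheSketch actual = new TokenCacheSketch();
        actual.obtainByActualServices(view);
        actual.obtainByActualServices(view);
        System.out.println("per-service fetches: " + actual.fetchCount());
    }
}
```

In this toy, asking twice for the same two-mount filesystem costs four fetches when the cache is keyed by the assumed filesystem key and two when it is keyed by the services that issued the tokens.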
[jira] [Commented] (MAPREDUCE-3802) If an MR AM dies twice it looks like the process freezes
[ https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207811#comment-13207811 ]

Robert Joseph Evans commented on MAPREDUCE-3802:

+1, you are correct. I manually verified that 0.23.2 no longer shows this problem, so I assume that it is MAPREDUCE-3846 that fixed it.

If an MR AM dies twice it looks like the process freezes
-

Key: MAPREDUCE-3802
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: applicationmaster, mrv2
Affects Versions: 0.23.1, 0.24.0
Reporter: Robert Joseph Evans
Assignee: Vinod Kumar Vavilapalli
Priority: Critical
Fix For: 0.23.1
Attachments: MAPREDUCE-3802-20120213.txt, MAPREDUCE-3802-20120213.txt, syslog

It looks like recovering from an MR AM dying works very well on a single failure. But if it fails multiple times we appear to get into a livelock situation.

{noformat}
yarn jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*-SNAPSHOT.jar wordcount -Dyarn.app.mapreduce.am.log.level=DEBUG -Dmapreduce.job.reduces=30 input output
12/02/03 21:06:57 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
12/02/03 21:06:57 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
12/02/03 21:06:57 INFO input.FileInputFormat: Total input paths to process : 17
12/02/03 21:06:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/02/03 21:06:57 WARN snappy.LoadSnappy: Snappy native library not loaded
12/02/03 21:06:57 INFO mapreduce.JobSubmitter: number of splits:17
12/02/03 21:06:57 INFO mapred.ResourceMgrDelegate: Submitted application application_1328302034486_0003 to ResourceManager at HOST/IP:8040
12/02/03 21:06:57 INFO mapreduce.Job: The url to track the job: http://HOST:8088/proxy/application_1328302034486_0003/
12/02/03 21:06:57 INFO mapreduce.Job: Running job: job_1328302034486_0003
12/02/03 21:07:03 INFO mapreduce.Job: Job job_1328302034486_0003 running in uber mode : false
12/02/03 21:07:03 INFO mapreduce.Job: map 0% reduce 0%
12/02/03 21:07:09 INFO mapreduce.Job: map 5% reduce 0%
12/02/03 21:07:10 INFO mapreduce.Job: map 17% reduce 0%
#KILLED AM with kill -9 here
12/02/03 21:07:16 INFO mapreduce.Job: map 29% reduce 0%
12/02/03 21:07:17 INFO mapreduce.Job: map 35% reduce 0%
12/02/03 21:07:30 INFO mapreduce.Job: map 52% reduce 0%
12/02/03 21:07:35 INFO mapreduce.Job: map 58% reduce 0%
12/02/03 21:07:37 INFO mapreduce.Job: map 70% reduce 0%
12/02/03 21:07:41 INFO mapreduce.Job: map 76% reduce 0%
12/02/03 21:07:43 INFO mapreduce.Job: map 82% reduce 0%
12/02/03 21:07:44 INFO mapreduce.Job: map 88% reduce 0%
12/02/03 21:07:47 INFO mapreduce.Job: map 94% reduce 0%
12/02/03 21:07:49 INFO mapreduce.Job: map 100% reduce 0%
12/02/03 21:07:53 INFO mapreduce.Job: map 100% reduce 3%
12/02/03 21:08:00 INFO mapreduce.Job: map 100% reduce 6%
12/02/03 21:08:06 INFO mapreduce.Job: map 100% reduce 10%
12/02/03 21:08:12 INFO mapreduce.Job: map 100% reduce 13%
12/02/03 21:08:18 INFO mapreduce.Job: map 100% reduce 16%
#killed AM with kill -9 here
12/02/03 21:08:20 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 0 time(s).
12/02/03 21:08:21 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 1 time(s).
12/02/03 21:08:22 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 2 time(s).
12/02/03 21:08:26 INFO mapreduce.Job: map 64% reduce 16%
#It never makes any more progress...
{noformat}
[jira] [Updated] (MAPREDUCE-3348) mapred job -status fails to give info even if the job is present in History
[ https://issues.apache.org/jira/browse/MAPREDUCE-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj K updated MAPREDUCE-3348:
    Attachment: MAPREDUCE-3348.patch

mapred job -status fails to give info even if the job is present in History
    Key: MAPREDUCE-3348
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-3348
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Affects Versions: 0.24.0
    Reporter: Devaraj K
    Assignee: Devaraj K
    Attachments: MAPREDUCE-3348.patch

The client tries to get the app report for the job from the RM. The RM throws an exception when it cannot find the application, and the client then reports that same exception without trying the History Server.

{code}
11/11/03 08:47:27 INFO ipc.HadoopYarnRPC: Creating a HadoopYarnProtoRpc proxy for protocol interface org.apache.hadoop.mapreduce.v2.api.MRClientProtocol
11/11/03 08:47:28 WARN mapred.ClientServiceDelegate: Exception thrown by remote end.
RemoteTrace:
 at LocalTrace:
    org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Trying to get information for an absent application application_1320278804241_0002
    at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:142)
    at $Proxy6.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ClientRMProtocolPBClientImpl.getApplicationReport(ClientRMProtocolPBClientImpl.java:111)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:321)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:137)
    at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:353)
    at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:429)
    at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:186)
    at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
    at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1106)
Exception in thread "main" RemoteTrace:
 at Local Trace:
    org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Trying to get information for an absent application application_1320278804241_0002
    at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:142)
    at $Proxy6.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ClientRMProtocolPBClientImpl.getApplicationReport(ClientRMProtocolPBClientImpl.java:111)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:321)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:137)
    at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:353)
    at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:429)
    at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:186)
    at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
    at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1106)
{code}
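The behaviour this report asks for (try the RM first, then fall back to the History Server when the RM no longer knows the application) can be sketched roughly as below. This is only an illustration and is not taken from MAPREDUCE-3348.patch: `AbsentApplicationException` and `FallbackStatusLookup` are hypothetical names, and the two services are modelled as plain functions from a job ID to a status string rather than the real Hadoop client classes.

```java
import java.util.function.Function;

/** Hypothetical stand-in for the RM's "absent application" error. */
class AbsentApplicationException extends RuntimeException {
    AbsentApplicationException(String jobId) {
        super("Trying to get information for an absent application " + jobId);
    }
}

/** Hypothetical lookup: ask the RM first, then the History Server. */
final class FallbackStatusLookup {
    // Assumption: the RM lookup throws AbsentApplicationException for unknown jobs.
    private final Function<String, String> resourceManager;
    private final Function<String, String> historyServer;

    FallbackStatusLookup(Function<String, String> resourceManager,
                         Function<String, String> historyServer) {
        this.resourceManager = resourceManager;
        this.historyServer = historyServer;
    }

    String getJobStatus(String jobId) {
        try {
            return resourceManager.apply(jobId);
        } catch (AbsentApplicationException e) {
            // The RM has no record of the application, so consult the
            // History Server instead of rethrowing the RM's exception.
            return historyServer.apply(jobId);
        }
    }

    public static void main(String[] args) {
        FallbackStatusLookup lookup = new FallbackStatusLookup(
            id -> { throw new AbsentApplicationException(id); },
            id -> "SUCCEEDED (from history)");
        System.out.println(lookup.getJobStatus("job_1320278804241_0002"));
    }
}
```

The point of the sketch is only the control flow: an "absent application" answer from the RM is treated as a signal to look elsewhere, not as the final result shown to the user.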
[jira] [Updated] (MAPREDUCE-3348) mapred job -status fails to give info even if the job is present in History
[ https://issues.apache.org/jira/browse/MAPREDUCE-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj K updated MAPREDUCE-3348:
    Status: Patch Available (was: Open)

mapred job -status fails to give info even if the job is present in History
    Key: MAPREDUCE-3348
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-3348
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: mrv2
    Affects Versions: 0.24.0
    Reporter: Devaraj K
    Assignee: Devaraj K
    Attachments: MAPREDUCE-3348.patch

The client tries to get the app report for the job from the RM. The RM throws an exception when it cannot find the application, and the client then reports that same exception without trying the History Server.

{code}
11/11/03 08:47:27 INFO ipc.HadoopYarnRPC: Creating a HadoopYarnProtoRpc proxy for protocol interface org.apache.hadoop.mapreduce.v2.api.MRClientProtocol
11/11/03 08:47:28 WARN mapred.ClientServiceDelegate: Exception thrown by remote end.
RemoteTrace:
 at LocalTrace:
    org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Trying to get information for an absent application application_1320278804241_0002
    at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:142)
    at $Proxy6.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ClientRMProtocolPBClientImpl.getApplicationReport(ClientRMProtocolPBClientImpl.java:111)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:321)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:137)
    at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:353)
    at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:429)
    at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:186)
    at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
    at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1106)
Exception in thread "main" RemoteTrace:
 at Local Trace:
    org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Trying to get information for an absent application application_1320278804241_0002
    at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:142)
    at $Proxy6.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ClientRMProtocolPBClientImpl.getApplicationReport(ClientRMProtocolPBClientImpl.java:111)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:321)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:137)
    at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:353)
    at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:429)
    at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:186)
    at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
    at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1106)
{code}
[jira] [Updated] (MAPREDUCE-3348) mapred job -status fails to give info even if the job is present in History
[ https://issues.apache.org/jira/browse/MAPREDUCE-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj K updated MAPREDUCE-3348:
    Component/s: mrv2

mapred job -status fails to give info even if the job is present in History
    Key: MAPREDUCE-3348
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-3348
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: mrv2
    Affects Versions: 0.24.0
    Reporter: Devaraj K
    Assignee: Devaraj K
    Attachments: MAPREDUCE-3348.patch

The client tries to get the app report for the job from the RM. The RM throws an exception when it cannot find the application, and the client then reports that same exception without trying the History Server.

{code}
11/11/03 08:47:27 INFO ipc.HadoopYarnRPC: Creating a HadoopYarnProtoRpc proxy for protocol interface org.apache.hadoop.mapreduce.v2.api.MRClientProtocol
11/11/03 08:47:28 WARN mapred.ClientServiceDelegate: Exception thrown by remote end.
RemoteTrace:
 at LocalTrace:
    org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Trying to get information for an absent application application_1320278804241_0002
    at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:142)
    at $Proxy6.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ClientRMProtocolPBClientImpl.getApplicationReport(ClientRMProtocolPBClientImpl.java:111)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:321)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:137)
    at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:353)
    at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:429)
    at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:186)
    at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
    at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1106)
Exception in thread "main" RemoteTrace:
 at Local Trace:
    org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Trying to get information for an absent application application_1320278804241_0002
    at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:142)
    at $Proxy6.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ClientRMProtocolPBClientImpl.getApplicationReport(ClientRMProtocolPBClientImpl.java:111)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:321)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:137)
    at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:353)
    at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:429)
    at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:186)
    at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
    at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1106)
{code}
[jira] [Commented] (MAPREDUCE-3348) mapred job -status fails to give info even if the job is present in History
[ https://issues.apache.org/jira/browse/MAPREDUCE-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207851#comment-13207851 ]

Hadoop QA commented on MAPREDUCE-3348:

+1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12514501/MAPREDUCE-3348.patch
  against trunk revision .

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 3 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 eclipse:eclipse. The patch built with eclipse:eclipse.
    +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed unit tests in .
    +1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1853//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1853//console

This message is automatically generated.

mapred job -status fails to give info even if the job is present in History
    Key: MAPREDUCE-3348
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-3348
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: mrv2
    Affects Versions: 0.24.0
    Reporter: Devaraj K
    Assignee: Devaraj K
    Attachments: MAPREDUCE-3348.patch

The client tries to get the app report for the job from the RM. The RM throws an exception when it cannot find the application, and the client then reports that same exception without trying the History Server.

{code}
11/11/03 08:47:27 INFO ipc.HadoopYarnRPC: Creating a HadoopYarnProtoRpc proxy for protocol interface org.apache.hadoop.mapreduce.v2.api.MRClientProtocol
11/11/03 08:47:28 WARN mapred.ClientServiceDelegate: Exception thrown by remote end.
RemoteTrace:
 at LocalTrace:
    org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Trying to get information for an absent application application_1320278804241_0002
    at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:142)
    at $Proxy6.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ClientRMProtocolPBClientImpl.getApplicationReport(ClientRMProtocolPBClientImpl.java:111)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:321)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:137)
    at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:353)
    at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:429)
    at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:186)
    at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
    at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1106)
Exception in thread "main" RemoteTrace:
 at Local Trace:
    org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Trying to get information for an absent application application_1320278804241_0002
    at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:142)
    at $Proxy6.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ClientRMProtocolPBClientImpl.getApplicationReport(ClientRMProtocolPBClientImpl.java:111)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:321)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:137)
    at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
at
[jira] [Commented] (MAPREDUCE-3858) Task attempt failure during commit results in task never completing
[ https://issues.apache.org/jira/browse/MAPREDUCE-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207883#comment-13207883 ]

Aaron T. Myers commented on MAPREDUCE-3858:

+1 (non-binding). I tested this patch on a cluster overnight and did not hit this issue at all, whereas without the patch I hit it twice in a 12-hour period under the same load.

Task attempt failure during commit results in task never completing
    Key: MAPREDUCE-3858
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-3858
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: mrv2
    Reporter: Tom White
    Assignee: Tom White
    Priority: Critical
    Attachments: MAPREDUCE-3858.patch

On a terasort job a task attempt failed during the commit phase. Another attempt was rescheduled, but when it tried to commit it failed.

{noformat}
attempt_1329019187148_0083_r_000586_0 already given a go for committing the task output, so killing attempt_1329019187148_0083_r_000586_1
{noformat}

The job hung as new attempts kept getting scheduled only to fail during commit.
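The hang described in this issue can be pictured with a small commit-arbitration sketch. It is an assumption-laden illustration, not the contents of MAPREDUCE-3858.patch: `CommitArbiter`, `canCommit` and `attemptFailed` are hypothetical names. The idea shown is that a task remembers which attempt was "given a go" to commit, and that this record has to be cleared when that attempt fails, otherwise every later attempt is refused.

```java
/** Hypothetical per-task record of which attempt is allowed to commit. */
final class CommitArbiter {
    // ID of the attempt that was given a go for committing, or null if none.
    private String commitAttempt;

    /** Grants commit permission to the first attempt that asks, and only to it. */
    synchronized boolean canCommit(String attemptId) {
        if (commitAttempt == null) {
            commitAttempt = attemptId;
            return true;
        }
        return commitAttempt.equals(attemptId);
    }

    /**
     * Called when an attempt fails. If the failed attempt held the commit
     * permission, release it so a rescheduled attempt can commit.
     */
    synchronized void attemptFailed(String attemptId) {
        if (attemptId.equals(commitAttempt)) {
            commitAttempt = null;
        }
    }

    public static void main(String[] args) {
        CommitArbiter task = new CommitArbiter();
        System.out.println(task.canCommit("attempt_0")); // first attempt gets the go-ahead
        System.out.println(task.canCommit("attempt_1")); // refused while attempt_0 holds it
        task.attemptFailed("attempt_0");                 // attempt_0 dies during commit
        System.out.println(task.canCommit("attempt_1")); // now allowed
    }
}
```

Without the reset in `attemptFailed`, the second call pattern above would keep returning false for every new attempt, which matches the symptom of attempts being scheduled only to be killed at commit time.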
[jira] [Commented] (MAPREDUCE-3858) Task attempt failure during commit results in task never completing
[ https://issues.apache.org/jira/browse/MAPREDUCE-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207903#comment-13207903 ]

Mahadev konar commented on MAPREDUCE-3858:

Good patch to go into 0.23.1 as well. I'll go ahead and commit this to all three branches.

Task attempt failure during commit results in task never completing
    Key: MAPREDUCE-3858
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-3858
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: mrv2
    Reporter: Tom White
    Assignee: Tom White
    Priority: Critical
    Attachments: MAPREDUCE-3858.patch

On a terasort job a task attempt failed during the commit phase. Another attempt was rescheduled, but when it tried to commit it failed.

{noformat}
attempt_1329019187148_0083_r_000586_0 already given a go for committing the task output, so killing attempt_1329019187148_0083_r_000586_1
{noformat}

The job hung as new attempts kept getting scheduled only to fail during commit.
[jira] [Updated] (MAPREDUCE-3802) If an MR AM dies twice it looks like the process freezes
[ https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3802:
    Resolution: Fixed
    Release Note: Added test to validate that AM can crash multiple times and still can recover successfully after MAPREDUCE-3846.
    Hadoop Flags: Reviewed
    Status: Resolved (was: Patch Available)

Thanks for the review/verification, Robert! I just committed this to trunk, 0.23 and 0.23.1.

If an MR AM dies twice it looks like the process freezes
    Key: MAPREDUCE-3802
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
    Project: Hadoop Map/Reduce
    Issue Type: Sub-task
    Components: applicationmaster, mrv2
    Affects Versions: 0.23.1, 0.24.0
    Reporter: Robert Joseph Evans
    Assignee: Vinod Kumar Vavilapalli
    Priority: Critical
    Fix For: 0.23.1
    Attachments: MAPREDUCE-3802-20120213.txt, MAPREDUCE-3802-20120213.txt, syslog

It looks like recovering from an MR AM dying works very well on a single failure. But if it fails multiple times we appear to get into a livelock situation.

{noformat}
yarn jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*-SNAPSHOT.jar wordcount -Dyarn.app.mapreduce.am.log.level=DEBUG -Dmapreduce.job.reduces=30 input output
12/02/03 21:06:57 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
12/02/03 21:06:57 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
12/02/03 21:06:57 INFO input.FileInputFormat: Total input paths to process : 17
12/02/03 21:06:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/02/03 21:06:57 WARN snappy.LoadSnappy: Snappy native library not loaded
12/02/03 21:06:57 INFO mapreduce.JobSubmitter: number of splits:17
12/02/03 21:06:57 INFO mapred.ResourceMgrDelegate: Submitted application application_1328302034486_0003 to ResourceManager at HOST/IP:8040
12/02/03 21:06:57 INFO mapreduce.Job: The url to track the job: http://HOST:8088/proxy/application_1328302034486_0003/
12/02/03 21:06:57 INFO mapreduce.Job: Running job: job_1328302034486_0003
12/02/03 21:07:03 INFO mapreduce.Job: Job job_1328302034486_0003 running in uber mode : false
12/02/03 21:07:03 INFO mapreduce.Job: map 0% reduce 0%
12/02/03 21:07:09 INFO mapreduce.Job: map 5% reduce 0%
12/02/03 21:07:10 INFO mapreduce.Job: map 17% reduce 0%
#KILLED AM with kill -9 here
12/02/03 21:07:16 INFO mapreduce.Job: map 29% reduce 0%
12/02/03 21:07:17 INFO mapreduce.Job: map 35% reduce 0%
12/02/03 21:07:30 INFO mapreduce.Job: map 52% reduce 0%
12/02/03 21:07:35 INFO mapreduce.Job: map 58% reduce 0%
12/02/03 21:07:37 INFO mapreduce.Job: map 70% reduce 0%
12/02/03 21:07:41 INFO mapreduce.Job: map 76% reduce 0%
12/02/03 21:07:43 INFO mapreduce.Job: map 82% reduce 0%
12/02/03 21:07:44 INFO mapreduce.Job: map 88% reduce 0%
12/02/03 21:07:47 INFO mapreduce.Job: map 94% reduce 0%
12/02/03 21:07:49 INFO mapreduce.Job: map 100% reduce 0%
12/02/03 21:07:53 INFO mapreduce.Job: map 100% reduce 3%
12/02/03 21:08:00 INFO mapreduce.Job: map 100% reduce 6%
12/02/03 21:08:06 INFO mapreduce.Job: map 100% reduce 10%
12/02/03 21:08:12 INFO mapreduce.Job: map 100% reduce 13%
12/02/03 21:08:18 INFO mapreduce.Job: map 100% reduce 16%
#killed AM with kill -9 here
12/02/03 21:08:20 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 0 time(s).
12/02/03 21:08:21 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 1 time(s).
12/02/03 21:08:22 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 2 time(s).
12/02/03 21:08:26 INFO mapreduce.Job: map 64% reduce 16%
#It never makes any more progress...
{noformat}
[jira] [Commented] (MAPREDUCE-3838) MapReduce job submission time has increased in 0.23 when compared to 0.20.206
[ https://issues.apache.org/jira/browse/MAPREDUCE-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207943#comment-13207943 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-3838:

Amar, need more information on this. How did you measure the job submit time exactly?

MapReduce job submission time has increased in 0.23 when compared to 0.20.206
    Key: MAPREDUCE-3838
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-3838
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: client
    Affects Versions: 0.24.0
    Reporter: Amar Kamat
    Labels: gridmix, job-submit-time, yarn
    Fix For: 0.23.1, 0.24.0

While running Gridmix on 0.23, we found that the job submission time has increased when compared to 0.20.206. Here are some stats:

||Submit-Time||Total number of jobs in YARN||Total number of jobs in FRED||
|25secs|3|1|
|20secs|6|2|
|15secs|14|4|
|10secs|24|4|
|5secs|67|28|

Note that Gridmix was run using the same trace.
[jira] [Updated] (MAPREDUCE-3838) MapReduce job submission time has increased in 0.23 when compared to 0.20.206
[ https://issues.apache.org/jira/browse/MAPREDUCE-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3838:
    Target Version/s: 0.23.2 (was: 0.24.0)
    Affects Version/s: (was: 0.24.0)
                       0.23.0
    Fix Version/s: (was: 0.23.1)
                   (was: 0.24.0)

Setting a tentative target version of 0.23.2.

MapReduce job submission time has increased in 0.23 when compared to 0.20.206
    Key: MAPREDUCE-3838
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-3838
    Project: Hadoop Map/Reduce
    Issue Type: Bug
    Components: client
    Affects Versions: 0.23.0
    Reporter: Amar Kamat
    Labels: gridmix, job-submit-time, yarn

While running Gridmix on 0.23, we found that the job submission time has increased when compared to 0.20.206. Here are some stats:

||Submit-Time||Total number of jobs in YARN||Total number of jobs in FRED||
|25secs|3|1|
|20secs|6|2|
|15secs|14|4|
|10secs|24|4|
|5secs|67|28|

Note that Gridmix was run using the same trace.
[jira] [Updated] (MAPREDUCE-3838) MapReduce job submission time has increased in 0.23 when compared to 0.20.206
[ https://issues.apache.org/jira/browse/MAPREDUCE-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3838:
    Issue Type: Sub-task (was: Bug)
    Parent: MAPREDUCE-3561

MapReduce job submission time has increased in 0.23 when compared to 0.20.206
    Key: MAPREDUCE-3838
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-3838
    Project: Hadoop Map/Reduce
    Issue Type: Sub-task
    Components: client
    Affects Versions: 0.23.0
    Reporter: Amar Kamat
    Labels: gridmix, job-submit-time, yarn

While running Gridmix on 0.23, we found that the job submission time has increased when compared to 0.20.206. Here are some stats:

||Submit-Time||Total number of jobs in YARN||Total number of jobs in FRED||
|25secs|3|1|
|20secs|6|2|
|15secs|14|4|
|10secs|24|4|
|5secs|67|28|

Note that Gridmix was run using the same trace.
[jira] [Commented] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases
[ https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207953#comment-13207953 ]

Hudson commented on MAPREDUCE-3846:

Integrated in Hadoop-Hdfs-0.23-Commit #537 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/537/])
    MAPREDUCE-3802. Added test to validate that AM can crash multiple times and still can recover successfully after MAPREDUCE-3846. (vinodkv)
    svn merge --ignore-ancestry -c 1244178 ../../trunk/ (Revision 1244180)

    Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244180
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java

Restarted+Recovered AM hangs in some corner cases
    Key: MAPREDUCE-3846
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
    Project: Hadoop Map/Reduce
    Issue Type: Sub-task
    Components: mrv2
    Affects Versions: 0.23.0
    Reporter: Vinod Kumar Vavilapalli
    Assignee: Vinod Kumar Vavilapalli
    Priority: Critical
    Fix For: 0.23.1
    Attachments: MAPREDUCE-3846-20120210.txt, MAPREDUCE-3846-20120210.txt, MAPREDUCE-3846-20120213.txt

[~karams] found this while testing AM restart/recovery feature. After the first generation AM crashes (manually killed by kill -9), the second generation AM starts, but hangs after a while.
[jira] [Commented] (MAPREDUCE-3802) If an MR AM dies twice it looks like the process freezes
[ https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207954#comment-13207954 ]

Hudson commented on MAPREDUCE-3802:

Integrated in Hadoop-Hdfs-0.23-Commit #537 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/537/])
    MAPREDUCE-3802. Added test to validate that AM can crash multiple times and still can recover successfully after MAPREDUCE-3846. (vinodkv)
    svn merge --ignore-ancestry -c 1244178 ../../trunk/ (Revision 1244180)

    Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244180
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java

If an MR AM dies twice it looks like the process freezes
    Key: MAPREDUCE-3802
    URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
    Project: Hadoop Map/Reduce
    Issue Type: Sub-task
    Components: applicationmaster, mrv2
    Affects Versions: 0.23.1, 0.24.0
    Reporter: Robert Joseph Evans
    Assignee: Vinod Kumar Vavilapalli
    Priority: Critical
    Fix For: 0.23.1
    Attachments: MAPREDUCE-3802-20120213.txt, MAPREDUCE-3802-20120213.txt, syslog

It looks like recovering from an MR AM dying works very well on a single failure. But if it fails multiple times we appear to get into a livelock situation.

{noformat}
yarn jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*-SNAPSHOT.jar wordcount -Dyarn.app.mapreduce.am.log.level=DEBUG -Dmapreduce.job.reduces=30 input output
12/02/03 21:06:57 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
12/02/03 21:06:57 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
12/02/03 21:06:57 INFO input.FileInputFormat: Total input paths to process : 17
12/02/03 21:06:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/02/03 21:06:57 WARN snappy.LoadSnappy: Snappy native library not loaded
12/02/03 21:06:57 INFO mapreduce.JobSubmitter: number of splits:17
12/02/03 21:06:57 INFO mapred.ResourceMgrDelegate: Submitted application application_1328302034486_0003 to ResourceManager at HOST/IP:8040
12/02/03 21:06:57 INFO mapreduce.Job: The url to track the job: http://HOST:8088/proxy/application_1328302034486_0003/
12/02/03 21:06:57 INFO mapreduce.Job: Running job: job_1328302034486_0003
12/02/03 21:07:03 INFO mapreduce.Job: Job job_1328302034486_0003 running in uber mode : false
12/02/03 21:07:03 INFO mapreduce.Job: map 0% reduce 0%
12/02/03 21:07:09 INFO mapreduce.Job: map 5% reduce 0%
12/02/03 21:07:10 INFO mapreduce.Job: map 17% reduce 0%
#KILLED AM with kill -9 here
12/02/03 21:07:16 INFO mapreduce.Job: map 29% reduce 0%
12/02/03 21:07:17 INFO mapreduce.Job: map 35% reduce 0%
12/02/03 21:07:30 INFO mapreduce.Job: map 52% reduce 0%
12/02/03 21:07:35 INFO mapreduce.Job: map 58% reduce 0%
12/02/03 21:07:37 INFO mapreduce.Job: map 70% reduce 0%
12/02/03 21:07:41 INFO mapreduce.Job: map 76% reduce 0%
12/02/03 21:07:43 INFO mapreduce.Job: map 82% reduce 0%
12/02/03 21:07:44 INFO mapreduce.Job: map 88% reduce 0%
12/02/03 21:07:47 INFO mapreduce.Job: map 94% reduce 0%
12/02/03 21:07:49 INFO mapreduce.Job: map 100% reduce 0%
12/02/03 21:07:53 INFO mapreduce.Job: map 100% reduce 3%
12/02/03 21:08:00 INFO mapreduce.Job: map 100% reduce 6%
12/02/03 21:08:06 INFO mapreduce.Job: map 100% reduce 10%
12/02/03 21:08:12 INFO mapreduce.Job: map 100% reduce 13%
12/02/03 21:08:18 INFO mapreduce.Job: map 100% reduce 16%
#killed AM with kill -9 here
12/02/03 21:08:20 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 0 time(s).
12/02/03 21:08:21 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 1 time(s).
12/02/03 21:08:22 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 2 time(s).
12/02/03 21:08:26 INFO mapreduce.Job: map 64% reduce 16%
#It never makes any more progress...
{noformat}
[jira] [Commented] (MAPREDUCE-3802) If an MR AM dies twice it looks like the process freezes
[ https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207959#comment-13207959 ]

Hudson commented on MAPREDUCE-3802:
---

Integrated in Hadoop-Common-0.23-Commit #549 (See [https://builds.apache.org/job/Hadoop-Common-0.23-Commit/549/])
MAPREDUCE-3802. Added test to validate that AM can crash multiple times and still can recover successfully after MAPREDUCE-3846. (vinodkv)
svn merge --ignore-ancestry -c 1244178 ../../trunk/ (Revision 1244180)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244180
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java

If an MR AM dies twice it looks like the process freezes
---

Key: MAPREDUCE-3802
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
[jira] [Commented] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases
[ https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207958#comment-13207958 ]

Hudson commented on MAPREDUCE-3846:
---

Integrated in Hadoop-Common-0.23-Commit #549 (See [https://builds.apache.org/job/Hadoop-Common-0.23-Commit/549/])
MAPREDUCE-3802. Added test to validate that AM can crash multiple times and still can recover successfully after MAPREDUCE-3846. (vinodkv)
svn merge --ignore-ancestry -c 1244178 ../../trunk/ (Revision 1244180)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244180
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java

Restarted+Recovered AM hangs in some corner cases
---

Key: MAPREDUCE-3846
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: mrv2
Affects Versions: 0.23.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Critical
Fix For: 0.23.1
Attachments: MAPREDUCE-3846-20120210.txt, MAPREDUCE-3846-20120210.txt, MAPREDUCE-3846-20120213.txt

[~karams] found this while testing the AM restart/recovery feature. After the first-generation AM crashes (manually killed by kill -9), the second-generation AM starts, but hangs after a while.
[jira] [Commented] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases
[ https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207963#comment-13207963 ]

Hudson commented on MAPREDUCE-3846:
---

Integrated in Hadoop-Hdfs-trunk-Commit #1800 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1800/])
MAPREDUCE-3802. Added test to validate that AM can crash multiple times and still can recover successfully after MAPREDUCE-3846. (vinodkv) (Revision 1244178)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244178
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java

Restarted+Recovered AM hangs in some corner cases
---

Key: MAPREDUCE-3846
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
[jira] [Commented] (MAPREDUCE-3802) If an MR AM dies twice it looks like the process freezes
[ https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207966#comment-13207966 ]

Hudson commented on MAPREDUCE-3802:
---

Integrated in Hadoop-Hdfs-trunk-Commit #1800 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1800/])
MAPREDUCE-3802. Added test to validate that AM can crash multiple times and still can recover successfully after MAPREDUCE-3846. (vinodkv) (Revision 1244178)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244178
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java

If an MR AM dies twice it looks like the process freezes
---

Key: MAPREDUCE-3802
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
[jira] [Commented] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases
[ https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207970#comment-13207970 ]

Hudson commented on MAPREDUCE-3846:
---

Integrated in Hadoop-Common-trunk-Commit #1726 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1726/])
MAPREDUCE-3802. Added test to validate that AM can crash multiple times and still can recover successfully after MAPREDUCE-3846. (vinodkv) (Revision 1244178)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244178
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java

Restarted+Recovered AM hangs in some corner cases
---

Key: MAPREDUCE-3846
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
[jira] [Commented] (MAPREDUCE-3802) If an MR AM dies twice it looks like the process freezes
[ https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207974#comment-13207974 ]

Hudson commented on MAPREDUCE-3802:
---

Integrated in Hadoop-Common-trunk-Commit #1726 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1726/])
MAPREDUCE-3802. Added test to validate that AM can crash multiple times and still can recover successfully after MAPREDUCE-3846. (vinodkv) (Revision 1244178)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244178
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java

If an MR AM dies twice it looks like the process freezes
---

Key: MAPREDUCE-3802
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
[jira] [Commented] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases
[ https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207987#comment-13207987 ]

Hudson commented on MAPREDUCE-3846:
---

Integrated in Hadoop-Mapreduce-0.23-Commit #553 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/553/])
MAPREDUCE-3802. Added test to validate that AM can crash multiple times and still can recover successfully after MAPREDUCE-3846. (vinodkv)
svn merge --ignore-ancestry -c 1244178 ../../trunk/ (Revision 1244180)

Result = ABORTED
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244180
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java

Restarted+Recovered AM hangs in some corner cases
---

Key: MAPREDUCE-3846
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
[jira] [Commented] (MAPREDUCE-3802) If an MR AM dies twice it looks like the process freezes
[ https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207989#comment-13207989 ]

Hudson commented on MAPREDUCE-3802:
---

Integrated in Hadoop-Mapreduce-0.23-Commit #553 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/553/])
MAPREDUCE-3802. Added test to validate that AM can crash multiple times and still can recover successfully after MAPREDUCE-3846. (vinodkv)
svn merge --ignore-ancestry -c 1244178 ../../trunk/ (Revision 1244180)

Result = ABORTED
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244180
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java

If an MR AM dies twice it looks like the process freezes
---

Key: MAPREDUCE-3802
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
[jira] [Commented] (MAPREDUCE-3802) If an MR AM dies twice it looks like the process freezes
[ https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207992#comment-13207992 ]

Hudson commented on MAPREDUCE-3802:
---

Integrated in Hadoop-Mapreduce-trunk-Commit #1737 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1737/])
MAPREDUCE-3802. Added test to validate that AM can crash multiple times and still can recover successfully after MAPREDUCE-3846. (vinodkv) (Revision 1244178)

Result = ABORTED
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244178
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java

If an MR AM dies twice it looks like the process freezes
---

Key: MAPREDUCE-3802
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
[jira] [Updated] (MAPREDUCE-3583) ProcfsBasedProcessTree#constructProcessInfo() may throw NumberFormatException
[ https://issues.apache.org/jira/browse/MAPREDUCE-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo (Nicholas), SZE updated MAPREDUCE-3583: -- Hadoop Flags: Reviewed +1 patch looks good. Thanks a lot! ProcfsBasedProcessTree#constructProcessInfo() may throw NumberFormatException - Key: MAPREDUCE-3583 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3583 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.20.205.0 Environment: 64-bit Linux: asf011.sp2.ygridcore.net Linux asf011.sp2.ygridcore.net 2.6.32-33-server #71-Ubuntu SMP Wed Jul 20 17:42:25 UTC 2011 x86_64 GNU/Linux Reporter: Zhihong Yu Assignee: ramkrishna.s.vasudevan Priority: Critical Attachments: mapreduce-3583-v2.txt, mapreduce-3583-v3.txt, mapreduce-3583-v4.txt, mapreduce-3583-v5.txt, mapreduce-3583.txt HBase PreCommit builds frequently gave us NumberFormatException. From https://builds.apache.org/job/PreCommit-HBASE-Build/553//testReport/org.apache.hadoop.hbase.mapreduce/TestHFileOutputFormat/testMRIncrementalLoad/: {code} 2011-12-20 01:44:01,180 WARN [main] mapred.JobClient(784): No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String). 
java.lang.NumberFormatException: For input string: 18446743988060683582 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48) at java.lang.Long.parseLong(Long.java:422) at java.lang.Long.parseLong(Long.java:468) at org.apache.hadoop.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:413) at org.apache.hadoop.util.ProcfsBasedProcessTree.getProcessTree(ProcfsBasedProcessTree.java:148) at org.apache.hadoop.util.LinuxResourceCalculatorPlugin.getProcResourceValues(LinuxResourceCalculatorPlugin.java:401) at org.apache.hadoop.mapred.Task.initialize(Task.java:536) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:353) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083) at org.apache.hadoop.mapred.Child.main(Child.java:249) {code} From hadoop 0.20.205 source code, looks like ppid was 18446743988060683582, causing NFE: {code} // Set (name) (ppid) (pgrpId) (session) (utime) (stime) (vsize) (rss) pinfo.updateProcessInfo(m.group(2), Integer.parseInt(m.group(3)), {code} You can find information on the OS at the beginning of https://builds.apache.org/job/PreCommit-HBASE-Build/553/console: {code} asf011.sp2.ygridcore.net Linux asf011.sp2.ygridcore.net 2.6.32-33-server #71-Ubuntu SMP Wed Jul 20 17:42:25 UTC 2011 x86_64 GNU/Linux core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 20 file size (blocks, -f) unlimited pending signals (-i) 16382 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 6 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 2048 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited 6 Running 
in Jenkins mode {code} From Nicolas Sze: {noformat} It looks like the ppid is a 64-bit positive integer, but Java long is signed and so only works with 63-bit positive integers. In your case, 2^64 > 18446743988060683582 > 2^63. Therefore, there is an NFE. {noformat} I propose changing allProcessInfo to Map<String, ProcessInfo> so that we avoid this problem by not parsing large integers.
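The overflow described in this report can be reproduced with a minimal Java sketch. The class and method names here (PpidParseDemo, fitsInSignedLong) are hypothetical and are not taken from the Hadoop source or from any attached patch; the sketch only shows that the reported value lies between 2^63 and 2^64, so Long.parseLong rejects it, while keeping the value as a String map key avoids parsing altogether.

```java
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of Hadoop.
class PpidParseDemo {
    // Returns true if the decimal string fits in a signed 64-bit long.
    static boolean fitsInSignedLong(String value) {
        try {
            Long.parseLong(value);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String ppid = "18446743988060683582"; // value from the stack trace
        BigInteger big = new BigInteger(ppid);
        BigInteger twoPow63 = BigInteger.ONE.shiftLeft(63);
        BigInteger twoPow64 = BigInteger.ONE.shiftLeft(64);

        System.out.println(fitsInSignedLong(ppid));      // false: too large for long
        System.out.println(big.compareTo(twoPow63) > 0); // true: above 2^63
        System.out.println(big.compareTo(twoPow64) < 0); // true: below 2^64

        // Keying by the raw string sidesteps numeric parsing entirely.
        Map<String, String> infoByPid = new HashMap<>();
        infoByPid.put(ppid, "placeholder process info");
        System.out.println(infoByPid.containsKey(ppid)); // true
    }
}
```

Using String keys is only one option; the sketch does not claim this is how the committed fix works.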
[jira] [Commented] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases
[ https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207990#comment-13207990 ] Hudson commented on MAPREDUCE-3846: --- Integrated in Hadoop-Mapreduce-trunk-Commit #1737 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1737/]) MAPREDUCE-3802. Added test to validate that AM can crash multiple times and still can recover successfully after MAPREDUCE-3846. (vinodkv) (Revision 1244178) Result = ABORTED vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244178 Files : * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java Restarted+Recovered AM hangs in some corner cases - Key: MAPREDUCE-3846 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: mrv2 Affects Versions: 0.23.0 Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Priority: Critical Fix For: 0.23.1 Attachments: MAPREDUCE-3846-20120210.txt, MAPREDUCE-3846-20120210.txt, MAPREDUCE-3846-20120213.txt [~karams] found this while testing AM restart/recovery feature. After the first generation AM crashes (manually killed by kill -9), the second generation AM starts, but hangs after a while.
[jira] [Updated] (MAPREDUCE-3761) AM info in job -list does not reflect the actual AM hostname
[ https://issues.apache.org/jira/browse/MAPREDUCE-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated MAPREDUCE-3761: --- Attachment: MAPREDUCE-3761-20120214.1.txt Fixing the test. AM info in job -list does not reflect the actual AM hostname Key: MAPREDUCE-3761 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3761 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mrv2 Affects Versions: 0.23.1 Reporter: Ramya Sunil Assignee: Vinod Kumar Vavilapalli Fix For: 0.23.1 Attachments: MAPREDUCE-3761-20120202.txt, MAPREDUCE-3761-20120214.1.txt The AM info field on bin/mapred job -list currently has a value <resourcemanager hostname>:8088/proxy/<appID>. This info is irrelevant unless it shows the real information of where the AM was launched. This needs to be fixed to show the AM host details.
[jira] [Updated] (MAPREDUCE-3761) AM info in job -list does not reflect the actual AM hostname
[ https://issues.apache.org/jira/browse/MAPREDUCE-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated MAPREDUCE-3761: --- Status: Patch Available (was: Open) AM info in job -list does not reflect the actual AM hostname Key: MAPREDUCE-3761 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3761 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mrv2 Affects Versions: 0.23.1 Reporter: Ramya Sunil Assignee: Vinod Kumar Vavilapalli Fix For: 0.23.1 Attachments: MAPREDUCE-3761-20120202.txt, MAPREDUCE-3761-20120214.1.txt The AM info field on bin/mapred job -list currently has a value <resourcemanager hostname>:8088/proxy/<appID>. This info is irrelevant unless it shows the real information of where the AM was launched. This needs to be fixed to show the AM host details.
[jira] [Updated] (MAPREDUCE-3812) Change default memory slot sizes to be 1.5GB
[ https://issues.apache.org/jira/browse/MAPREDUCE-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated MAPREDUCE-3812: --- Fix Version/s: (was: 0.23.1) 0.23.2 Status: Open (was: Patch Available) This patch needs more work. Moving to 0.23.2. Change default memory slot sizes to be 1.5GB Key: MAPREDUCE-3812 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3812 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: mrv2, performance Affects Versions: 0.23.0 Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Fix For: 0.23.2 Attachments: MAPREDUCE-3812-20120205.txt, MAPREDUCE-3812-20120206.1.txt, MAPREDUCE-3812-20120206.txt, MAPREDUCE-3812.patch, MAPREDUCE-3812.patch After a few performance improvements tracked at MAPREDUCE-3561, like MAPREDUCE-3511 and MAPREDUCE-3567, even a 100K maps job can also run within 1GB vmem. We earlier increased AM slot size from 1 slot to two slots to work around the issues with AM heap. Now that those are fixed, we should go back to 1GB. This is just a configuration change. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-3838) MapReduce job submission time has increased in 0.23 when compared to 0.20.206
[ https://issues.apache.org/jira/browse/MAPREDUCE-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated MAPREDUCE-3838: --- Fix Version/s: 0.23.2 MapReduce job submission time has increased in 0.23 when compared to 0.20.206 - Key: MAPREDUCE-3838 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3838 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: client Affects Versions: 0.23.0 Reporter: Amar Kamat Labels: gridmix, job-submit-time, yarn Fix For: 0.23.2 While running Gridmix on 0.23, we found that the job submission time has increased when compared to 0.20.206. Here are some stats:
||Submit-Time||Total number of jobs in YARN||Total number of jobs in FRED||
|25secs|3|1|
|20secs|6|2|
|15secs|14|4|
|10secs|24|4|
|5secs|67|28|
Note that Gridmix was run using the same trace.
[jira] [Commented] (MAPREDUCE-3854) Reinstate environment variable tests in TestMiniMRChildTask
[ https://issues.apache.org/jira/browse/MAPREDUCE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208031#comment-13208031 ] Vinod Kumar Vavilapalli commented on MAPREDUCE-3854: Looks good. +1. Pushing this in. Reinstate environment variable tests in TestMiniMRChildTask --- Key: MAPREDUCE-3854 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3854 Project: Hadoop Map/Reduce Issue Type: Test Components: mrv2 Reporter: Tom White Assignee: Tom White Attachments: MAPREDUCE-3854.patch, MAPREDUCE-3854.patch MAPREDUCE-3716 reinstated one of the tests in TestMiniMRChildTask, but there are two more which should be run. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-3849) Change TokenCache's reading of the binary token file
[ https://issues.apache.org/jira/browse/MAPREDUCE-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated MAPREDUCE-3849: --- Attachment: MAPREDUCE-3849-2.patch Add more tests. Change TokenCache's reading of the binary token file Key: MAPREDUCE-3849 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3849 Project: Hadoop Map/Reduce Issue Type: Improvement Components: security Affects Versions: 0.23.1, 0.24.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Attachments: MAPREDUCE-3849-2.patch, MAPREDUCE-3849.patch When obtaining the tokens for a {{FileSystem}}, the {{TokenCache}} will read the binary token file if a token is not already in the {{Credentials}}. However, it will overwrite any existing tokens in the {{Credentials}} with the contents of the binary token file if a single token is missing. This may cause new tokens to be replaced with invalid/cancelled tokens from the binary file. The new tokens will not be canceled, and thus leak in the namenode until they expire. The binary tokens should be merged with, but not replace, existing tokens in the {{Credentials}}. The code that reads the binary token file is prefaced with: {code} //TODO: Need to come up with a better place to put //this block of code to do with reading the file {code} Also, the loading of the binary token file is the only reason that the {{TokenCache}} has to use {{getCanonicalService}}. If this linkage can be broken, then the 1-to-1 filesystem to token service coupling may be removed. And use of {{getCanonicalService}} can be removed in a subsequent jira. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
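The merge-rather-than-replace behaviour requested in this issue can be sketched as follows. This is only an illustration under stated assumptions: plain String maps stand in for the real Credentials and token types, and the names TokenMergeDemo and mergeMissing are hypothetical, not part of the Hadoop API or the attached patches.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; String maps stand in for Credentials and tokens.
class TokenMergeDemo {
    // Adds tokens from the binary file only for services that have no token yet,
    // so tokens already present in the credentials are never overwritten.
    static void mergeMissing(Map<String, String> credentials,
                             Map<String, String> binaryTokens) {
        for (Map.Entry<String, String> e : binaryTokens.entrySet()) {
            credentials.putIfAbsent(e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, String> credentials = new HashMap<>();
        credentials.put("hdfs://nn1", "fresh-token");

        Map<String, String> binaryTokens = new HashMap<>();
        binaryTokens.put("hdfs://nn1", "stale-token");
        binaryTokens.put("hdfs://nn2", "binary-token");

        mergeMissing(credentials, binaryTokens);

        System.out.println(credentials.get("hdfs://nn1")); // fresh-token (kept)
        System.out.println(credentials.get("hdfs://nn2")); // binary-token (added)
    }
}
```

With a wholesale overwrite instead of putIfAbsent, the first lookup would return the stale token, which is the failure mode the issue describes.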
[jira] [Updated] (MAPREDUCE-3854) Reinstate environment variable tests in TestMiniMRChildTask
[ https://issues.apache.org/jira/browse/MAPREDUCE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated MAPREDUCE-3854: --- Resolution: Fixed Fix Version/s: 0.23.1 Release Note: Fixed and reenabled tests related to MR child JVM's environmental variables in TestMiniMRChildTask. Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I just committed this to trunk and branch-0.23. Thanks Tom! Reinstate environment variable tests in TestMiniMRChildTask --- Key: MAPREDUCE-3854 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3854 Project: Hadoop Map/Reduce Issue Type: Test Components: mrv2 Reporter: Tom White Assignee: Tom White Fix For: 0.23.1 Attachments: MAPREDUCE-3854.patch, MAPREDUCE-3854.patch MAPREDUCE-3716 reinstated one of the tests in TestMiniMRChildTask, but there are two more which should be run. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3736) Variable substitution depth too large for fs.default.name causes jobs to fail
[ https://issues.apache.org/jira/browse/MAPREDUCE-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208049#comment-13208049 ] Alejandro Abdelnur commented on MAPREDUCE-3736: --- +1. I'll be committing this later today. Variable substitution depth too large for fs.default.name causes jobs to fail - Key: MAPREDUCE-3736 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3736 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.1 Reporter: Eli Collins Assignee: Ahmed Radwan Priority: Blocker Attachments: MAPREDUCE-3736.patch, MAPREDUCE-3736_rev2.patch I'm seeing the same failure as MAPREDUCE-3462 in downstream projects running against a recent build of branch-23. MR-3462 modified the tests rather than fixing the framework. In that jira Ravi mentioned I'm still ignorant of the change which made the tests start to fail. I should probably understand better the reasons for that change before proposing a more generalized fix. Let's figure out the general fix (rather than require all projects to set mapreduce.job.hdfs-servers in their conf we should fix this in the framework). Perhaps we should not default this config to $fs.default.name? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208050#comment-13208050 ] Thomas Graves commented on MAPREDUCE-3824: -- After some debugging, it appears that the size isn't being calculated properly (set to 0) if the user specifies a directory to go into the distributed cache. It only appears to happen if it's a private cached directory. I'm working on a patch. Allen, can you confirm that your users were specifying a directory to be cached and not a file? Distributed caches are not removed properly --- Key: MAPREDUCE-3824 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824 Project: Hadoop Map/Reduce Issue Type: Bug Components: distributed-cache Affects Versions: 1.0.0 Reporter: Allen Wittenauer Priority: Critical Attachments: MAPREDUCE-3824-branch-1.0.txt Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired.
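For context on why a cached directory needs different size accounting than a single file, here is a small Java sketch that sums the sizes of all regular files under a directory. It is not the TaskTracker code and not the patch under discussion; the class CacheSizeDemo and the method sizeOf are hypothetical names, and it uses java.nio.file rather than Hadoop's own filesystem classes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical demo class, not part of Hadoop.
class CacheSizeDemo {
    // Walks the tree rooted at 'root' and sums the sizes of regular files.
    // Also works when 'root' is itself a regular file.
    static long sizeOf(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("cache-demo");
        Path sub = Files.createDirectory(dir.resolve("sub"));
        Files.write(dir.resolve("a.bin"), new byte[10]);
        Files.write(sub.resolve("b.bin"), new byte[32]);

        System.out.println(sizeOf(dir)); // 42
    }
}
```

If a cache entry's size is recorded as 0, size-based cleanup has nothing to count against its limit, which would be consistent with the symptom reported here.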
[jira] [Assigned] (MAPREDUCE-3824) Distributed caches are not removed properly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned MAPREDUCE-3824: Assignee: Thomas Graves Distributed caches are not removed properly --- Key: MAPREDUCE-3824 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824 Project: Hadoop Map/Reduce Issue Type: Bug Components: distributed-cache Affects Versions: 1.0.0 Reporter: Allen Wittenauer Assignee: Thomas Graves Priority: Critical Attachments: MAPREDUCE-3824-branch-1.0.txt Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3854) Reinstate environment variable tests in TestMiniMRChildTask
[ https://issues.apache.org/jira/browse/MAPREDUCE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208081#comment-13208081 ] Hudson commented on MAPREDUCE-3854: --- Integrated in Hadoop-Mapreduce-trunk-Commit #1739 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1739/]) MAPREDUCE-3854. Fixed and reenabled tests related to MR child JVM's environmental variables in TestMiniMRChildTask. (Tom White via vinodkv) (Revision 1244223) Result = FAILURE vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244223 Files : * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestMiniMRChildTask.java Reinstate environment variable tests in TestMiniMRChildTask --- Key: MAPREDUCE-3854 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3854 Project: Hadoop Map/Reduce Issue Type: Test Components: mrv2 Reporter: Tom White Assignee: Tom White Fix For: 0.23.1 Attachments: MAPREDUCE-3854.patch, MAPREDUCE-3854.patch MAPREDUCE-3716 reinstated one of the tests in TestMiniMRChildTask, but there are two more which should be run.
[jira] [Commented] (MAPREDUCE-3854) Reinstate environment variable tests in TestMiniMRChildTask
[ https://issues.apache.org/jira/browse/MAPREDUCE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208082#comment-13208082 ] Hudson commented on MAPREDUCE-3854: --- Integrated in Hadoop-Mapreduce-0.23-Commit #555 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/555/]) MAPREDUCE-3854. Fixed and reenabled tests related to MR child JVM's environmental variables in TestMiniMRChildTask. (Tom White via vinodkv) svn merge --ignore-ancestry -c 1244223 ../../trunk/ (Revision 1244224) Result = FAILURE vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244224 Files : * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestMiniMRChildTask.java Reinstate environment variable tests in TestMiniMRChildTask --- Key: MAPREDUCE-3854 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3854 Project: Hadoop Map/Reduce Issue Type: Test Components: mrv2 Reporter: Tom White Assignee: Tom White Fix For: 0.23.1 Attachments: MAPREDUCE-3854.patch, MAPREDUCE-3854.patch MAPREDUCE-3716 reinstated one of the tests in TestMiniMRChildTask, but there are two more which should be run.
[jira] [Commented] (MAPREDUCE-3583) ProcfsBasedProcessTree#constructProcessInfo() may throw NumberFormatException
[ https://issues.apache.org/jira/browse/MAPREDUCE-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208083#comment-13208083 ] Zhihong Yu commented on MAPREDUCE-3583: --- For TRUNK, should both of the following be included in patch ? {code} hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/util/ProcfsBasedProcessTree.java hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java {code} ProcfsBasedProcessTree#constructProcessInfo() may throw NumberFormatException - Key: MAPREDUCE-3583 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3583 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.20.205.0 Environment: 64-bit Linux: asf011.sp2.ygridcore.net Linux asf011.sp2.ygridcore.net 2.6.32-33-server #71-Ubuntu SMP Wed Jul 20 17:42:25 UTC 2011 x86_64 GNU/Linux Reporter: Zhihong Yu Assignee: ramkrishna.s.vasudevan Priority: Critical Attachments: mapreduce-3583-v2.txt, mapreduce-3583-v3.txt, mapreduce-3583-v4.txt, mapreduce-3583-v5.txt, mapreduce-3583.txt HBase PreCommit builds frequently gave us NumberFormatException. From https://builds.apache.org/job/PreCommit-HBASE-Build/553//testReport/org.apache.hadoop.hbase.mapreduce/TestHFileOutputFormat/testMRIncrementalLoad/: {code} 2011-12-20 01:44:01,180 WARN [main] mapred.JobClient(784): No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String). 
java.lang.NumberFormatException: For input string: 18446743988060683582 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48) at java.lang.Long.parseLong(Long.java:422) at java.lang.Long.parseLong(Long.java:468) at org.apache.hadoop.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:413) at org.apache.hadoop.util.ProcfsBasedProcessTree.getProcessTree(ProcfsBasedProcessTree.java:148) at org.apache.hadoop.util.LinuxResourceCalculatorPlugin.getProcResourceValues(LinuxResourceCalculatorPlugin.java:401) at org.apache.hadoop.mapred.Task.initialize(Task.java:536) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:353) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083) at org.apache.hadoop.mapred.Child.main(Child.java:249) {code} From hadoop 0.20.205 source code, looks like ppid was 18446743988060683582, causing NFE: {code} // Set (name) (ppid) (pgrpId) (session) (utime) (stime) (vsize) (rss) pinfo.updateProcessInfo(m.group(2), Integer.parseInt(m.group(3)), {code} You can find information on the OS at the beginning of https://builds.apache.org/job/PreCommit-HBASE-Build/553/console: {code} asf011.sp2.ygridcore.net Linux asf011.sp2.ygridcore.net 2.6.32-33-server #71-Ubuntu SMP Wed Jul 20 17:42:25 UTC 2011 x86_64 GNU/Linux core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 20 file size (blocks, -f) unlimited pending signals (-i) 16382 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 6 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 2048 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited 6 Running 
in Jenkins mode {code} From Nicolas Sze: {noformat} It looks like the ppid is a 64-bit positive integer, but Java long is signed and so only works with 63-bit positive integers. In your case, 2^64 > 18446743988060683582 > 2^63. Therefore, there is an NFE. {noformat} I propose changing allProcessInfo to Map<String, ProcessInfo> so that we avoid this problem by not parsing large integers.
[jira] [Updated] (MAPREDUCE-3736) Variable substitution depth too large for fs.default.name causes jobs to fail
[ https://issues.apache.org/jira/browse/MAPREDUCE-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Radwan updated MAPREDUCE-3736: Attachment: MAPREDUCE-3736_rev3.patch Updated patch, since location of yarn-default.xml was changed. Variable substitution depth too large for fs.default.name causes jobs to fail - Key: MAPREDUCE-3736 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3736 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.1 Reporter: Eli Collins Assignee: Ahmed Radwan Priority: Blocker Attachments: MAPREDUCE-3736.patch, MAPREDUCE-3736_rev2.patch, MAPREDUCE-3736_rev3.patch I'm seeing the same failure as MAPREDUCE-3462 in downstream projects running against a recent build of branch-23. MR-3462 modified the tests rather than fixing the framework. In that jira Ravi mentioned I'm still ignorant of the change which made the tests start to fail. I should probably understand better the reasons for that change before proposing a more generalized fix. Let's figure out the general fix (rather than require all projects to set mapreduce.job.hdfs-servers in their conf we should fix this in the framework). Perhaps we should not default this config to $fs.default.name? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208090#comment-13208090 ] Allen Wittenauer commented on MAPREDUCE-3824: - Yes, that corresponds to what I was seeing as well. Sorry, forgot to mention the directory thing. I've been running the patch for so long and so successfully I forgot about that detail. Distributed caches are not removed properly --- Key: MAPREDUCE-3824 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824 Project: Hadoop Map/Reduce Issue Type: Bug Components: distributed-cache Affects Versions: 1.0.0 Reporter: Allen Wittenauer Assignee: Thomas Graves Priority: Critical Attachments: MAPREDUCE-3824-branch-1.0.txt Distributed caches are not being properly removed by the TaskTracker when they are expected to be expired. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-3858) Task attempt failure during commit results in task never completing
[ https://issues.apache.org/jira/browse/MAPREDUCE-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated MAPREDUCE-3858: - Resolution: Fixed Fix Version/s: 0.23.1 Target Version/s: 0.23.1 (was: 0.23.2) Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I just committed to all 3 branches 0.23.1, 0.23 and trunk. Thanks Tom! Task attempt failure during commit results in task never completing --- Key: MAPREDUCE-3858 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3858 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Reporter: Tom White Assignee: Tom White Priority: Critical Fix For: 0.23.1 Attachments: MAPREDUCE-3858.patch On a terasort job a task attempt failed during the commit phase. Another attempt was rescheduled, but when it tried to commit it failed. {noformat} attempt_1329019187148_0083_r_000586_0 already given a go for committing the task output, so killing attempt_1329019187148_0083_r_000586_1 {noformat} The job hung as new attempts kept getting scheduled only to fail during commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3736) Variable substitution depth too large for fs.default.name causes jobs to fail
[ https://issues.apache.org/jira/browse/MAPREDUCE-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208100#comment-13208100 ] Alejandro Abdelnur commented on MAPREDUCE-3736: --- +1 Variable substitution depth too large for fs.default.name causes jobs to fail - Key: MAPREDUCE-3736 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3736 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.1 Reporter: Eli Collins Assignee: Ahmed Radwan Priority: Blocker Attachments: MAPREDUCE-3736.patch, MAPREDUCE-3736_rev2.patch, MAPREDUCE-3736_rev3.patch I'm seeing the same failure as MAPREDUCE-3462 in downstream projects running against a recent build of branch-23. MR-3462 modified the tests rather than fixing the framework. In that jira Ravi mentioned I'm still ignorant of the change which made the tests start to fail. I should probably understand better the reasons for that change before proposing a more generalized fix. Let's figure out the general fix (rather than require all projects to set mapreduce.job.hdfs-servers in their conf we should fix this in the framework). Perhaps we should not default this config to $fs.default.name? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-3736) Variable substitution depth too large for fs.default.name causes jobs to fail
[ https://issues.apache.org/jira/browse/MAPREDUCE-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated MAPREDUCE-3736: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Thanks Ahmed. Committed to trunk and branch-0.23 Variable substitution depth too large for fs.default.name causes jobs to fail - Key: MAPREDUCE-3736 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3736 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.1 Reporter: Eli Collins Assignee: Ahmed Radwan Priority: Blocker Attachments: MAPREDUCE-3736.patch, MAPREDUCE-3736_rev2.patch, MAPREDUCE-3736_rev3.patch I'm seeing the same failure as MAPREDUCE-3462 in downstream projects running against a recent build of branch-23. MR-3462 modified the tests rather than fixing the framework. In that jira Ravi mentioned I'm still ignorant of the change which made the tests start to fail. I should probably understand better the reasons for that change before proposing a more generalized fix. Let's figure out the general fix (rather than require all projects to set mapreduce.job.hdfs-servers in their conf we should fix this in the framework). Perhaps we should not default this config to $fs.default.name? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3858) Task attempt failure during commit results in task never completing
[ https://issues.apache.org/jira/browse/MAPREDUCE-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208118#comment-13208118 ] Hudson commented on MAPREDUCE-3858: --- Integrated in Hadoop-Common-trunk-Commit #1727 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1727/]) MAPREDUCE-3858. Task attempt failure during commit results in task never completing. (Tom White via mahadev) (Revision 1244254) Result = SUCCESS mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244254 Files : * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java Task attempt failure during commit results in task never completing --- Key: MAPREDUCE-3858 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3858 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Reporter: Tom White Assignee: Tom White Priority: Critical Fix For: 0.23.1 Attachments: MAPREDUCE-3858.patch On a terasort job a task attempt failed during the commit phase. Another attempt was rescheduled, but when it tried to commit it failed. {noformat} attempt_1329019187148_0083_r_000586_0 already given a go for committing the task output, so killing attempt_1329019187148_0083_r_000586_1 {noformat} The job hung as new attempts kept getting scheduled only to fail during commit.
[jira] [Commented] (MAPREDUCE-3854) Reinstate environment variable tests in TestMiniMRChildTask
[ https://issues.apache.org/jira/browse/MAPREDUCE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208116#comment-13208116 ] Hudson commented on MAPREDUCE-3854: --- Integrated in Hadoop-Common-trunk-Commit #1727 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1727/]) MAPREDUCE-3854. Fixed and reenabled tests related to MR child JVM's environmental variables in TestMiniMRChildTask. (Tom White via vinodkv) (Revision 1244223) Result = SUCCESS vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244223 Files : * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestMiniMRChildTask.java Reinstate environment variable tests in TestMiniMRChildTask --- Key: MAPREDUCE-3854 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3854 Project: Hadoop Map/Reduce Issue Type: Test Components: mrv2 Reporter: Tom White Assignee: Tom White Fix For: 0.23.1 Attachments: MAPREDUCE-3854.patch, MAPREDUCE-3854.patch MAPREDUCE-3716 reinstated one of the tests in TestMiniMRChildTask, but there are two more which should be run. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-3583) ProcfsBasedProcessTree#constructProcessInfo() may throw NumberFormatException
[ https://issues.apache.org/jira/browse/MAPREDUCE-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated MAPREDUCE-3583: -- Attachment: mapreduce-3583-trunk.txt Patch for TRUNK. All tests under hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core passed. TestProcfsBasedProcessTree passed as well. ProcfsBasedProcessTree#constructProcessInfo() may throw NumberFormatException - Key: MAPREDUCE-3583 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3583 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.20.205.0 Environment: 64-bit Linux: asf011.sp2.ygridcore.net Linux asf011.sp2.ygridcore.net 2.6.32-33-server #71-Ubuntu SMP Wed Jul 20 17:42:25 UTC 2011 x86_64 GNU/Linux Reporter: Zhihong Yu Assignee: ramkrishna.s.vasudevan Priority: Critical Attachments: mapreduce-3583-trunk.txt, mapreduce-3583-v2.txt, mapreduce-3583-v3.txt, mapreduce-3583-v4.txt, mapreduce-3583-v5.txt, mapreduce-3583.txt HBase PreCommit builds frequently gave us NumberFormatException. From https://builds.apache.org/job/PreCommit-HBASE-Build/553//testReport/org.apache.hadoop.hbase.mapreduce/TestHFileOutputFormat/testMRIncrementalLoad/: {code} 2011-12-20 01:44:01,180 WARN [main] mapred.JobClient(784): No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String). 
java.lang.NumberFormatException: For input string: 18446743988060683582 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48) at java.lang.Long.parseLong(Long.java:422) at java.lang.Long.parseLong(Long.java:468) at org.apache.hadoop.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:413) at org.apache.hadoop.util.ProcfsBasedProcessTree.getProcessTree(ProcfsBasedProcessTree.java:148) at org.apache.hadoop.util.LinuxResourceCalculatorPlugin.getProcResourceValues(LinuxResourceCalculatorPlugin.java:401) at org.apache.hadoop.mapred.Task.initialize(Task.java:536) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:353) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083) at org.apache.hadoop.mapred.Child.main(Child.java:249) {code} From hadoop 0.20.205 source code, looks like ppid was 18446743988060683582, causing NFE: {code} // Set (name) (ppid) (pgrpId) (session) (utime) (stime) (vsize) (rss) pinfo.updateProcessInfo(m.group(2), Integer.parseInt(m.group(3)), {code} You can find information on the OS at the beginning of https://builds.apache.org/job/PreCommit-HBASE-Build/553/console: {code} asf011.sp2.ygridcore.net Linux asf011.sp2.ygridcore.net 2.6.32-33-server #71-Ubuntu SMP Wed Jul 20 17:42:25 UTC 2011 x86_64 GNU/Linux core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 20 file size (blocks, -f) unlimited pending signals (-i) 16382 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 6 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 2048 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited 6 Running 
in Jenkins mode {code} From Nicolas Sze: {noformat} It looks like that the ppid is a 64-bit positive integer but Java long is signed and so only works with 63-bit positive integers. In your case, 2^64 > 18446743988060683582 > 2^63. Therefore, there is a NFE. {noformat} I propose changing allProcessInfo to Map<String, ProcessInfo> so that we don't encounter this problem by avoiding parsing large integer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
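The range argument quoted from Nicolas Sze can be checked directly: the reported value lies strictly between 2^63 and 2^64, so it fits in 64 unsigned bits but not in a signed Java long. A small sketch follows; the class name UnsignedPpidDemo and the helper fitsInLong are made up for this illustration, and only standard JDK calls are used.

```java
import java.math.BigInteger;

public class UnsignedPpidDemo {
    /** Returns true if the decimal string can be parsed as a signed 64-bit long. */
    public static boolean fitsInLong(String decimal) {
        try {
            Long.parseLong(decimal);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Returns true if 2^63 < value < 2^64, i.e. the value needs the 64th bit. */
    public static boolean needsUnsigned64(String decimal) {
        BigInteger value = new BigInteger(decimal);
        BigInteger two63 = BigInteger.ONE.shiftLeft(63);
        BigInteger two64 = BigInteger.ONE.shiftLeft(64);
        return value.compareTo(two63) > 0 && value.compareTo(two64) < 0;
    }

    public static void main(String[] args) {
        String ppid = "18446743988060683582"; // value taken from the stack trace
        System.out.println("fits in long: " + fitsInLong(ppid));
        System.out.println("between 2^63 and 2^64: " + needsUnsigned64(ppid));
    }
}
```

Parsing such a field with BigInteger, or not parsing it at all, sidesteps the exception. Long.parseUnsignedLong would also accept the string, but it was only added in Java 8 and so was not an option for the JVMs in use when this issue was filed.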
[jira] [Commented] (MAPREDUCE-3858) Task attempt failure during commit results in task never completing
[ https://issues.apache.org/jira/browse/MAPREDUCE-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208136#comment-13208136 ] Hudson commented on MAPREDUCE-3858: --- Integrated in Hadoop-Hdfs-trunk-Commit #1802 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1802/]) MAPREDUCE-3858. Task attempt failure during commit results in task never completing. (Tom White via mahadev) (Revision 1244254) Result = SUCCESS mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244254 Files : * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java Task attempt failure during commit results in task never completing --- Key: MAPREDUCE-3858 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3858 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Reporter: Tom White Assignee: Tom White Priority: Critical Fix For: 0.23.1 Attachments: MAPREDUCE-3858.patch On a terasort job a task attempt failed during the commit phase. Another attempt was rescheduled, but when it tried to commit it failed. {noformat} attempt_1329019187148_0083_r_000586_0 already given a go for committing the task output, so killing attempt_1329019187148_0083_r_000586_1 {noformat} The job hung as new attempts kept getting scheduled only to fail during commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3854) Reinstate environment variable tests in TestMiniMRChildTask
[ https://issues.apache.org/jira/browse/MAPREDUCE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208135#comment-13208135 ] Hudson commented on MAPREDUCE-3854: --- Integrated in Hadoop-Hdfs-trunk-Commit #1802 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1802/]) MAPREDUCE-3854. Fixed and reenabled tests related to MR child JVM's environmental variables in TestMiniMRChildTask. (Tom White via vinodkv) (Revision 1244223) Result = SUCCESS vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244223 Files : * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestMiniMRChildTask.java Reinstate environment variable tests in TestMiniMRChildTask --- Key: MAPREDUCE-3854 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3854 Project: Hadoop Map/Reduce Issue Type: Test Components: mrv2 Reporter: Tom White Assignee: Tom White Fix For: 0.23.1 Attachments: MAPREDUCE-3854.patch, MAPREDUCE-3854.patch MAPREDUCE-3716 reinstated one of the tests in TestMiniMRChildTask, but there are two more which should be run. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3736) Variable substitution depth too large for fs.default.name causes jobs to fail
[ https://issues.apache.org/jira/browse/MAPREDUCE-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208134#comment-13208134 ] Hudson commented on MAPREDUCE-3736: --- Integrated in Hadoop-Hdfs-trunk-Commit #1802 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1802/]) MAPREDUCE-3736. Variable substitution depth too large for fs.default.name causes jobs to fail (ahmed via tucu) (Revision 1244264) Result = SUCCESS tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244264 Files : * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/resources/core-default.xml * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/test/java/org/apache/hadoop/mapred/TestMRWithDistributedCache.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/conf/TestNoDefaultsJobConf.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/fs/JHLogAnalyzer.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/io/FileBench.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestCombineFileInputFormat.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestConcatenatedCompressedInput.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestTextInputFormat.java *
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/TestMapCollection.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestFileInputFormat.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestMRKeyValueTextInputFormat.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Variable substitution depth too large for fs.default.name causes jobs to fail - Key: MAPREDUCE-3736 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3736 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.1 Reporter: Eli Collins Assignee: Ahmed Radwan Priority: Blocker Attachments: MAPREDUCE-3736.patch, MAPREDUCE-3736_rev2.patch, MAPREDUCE-3736_rev3.patch I'm seeing the same failure as MAPREDUCE-3462 in downstream projects running against a recent build of branch-23. MR-3462 modified the tests rather than fixing the framework. In that jira Ravi mentioned I'm still ignorant of the change which made the tests start to fail. I should probably understand better the reasons for that change before proposing a more generalized fix. Let's figure out the general fix (rather than require all projects to set mapreduce.job.hdfs-servers in their conf we should fix this in the framework). Perhaps we should not default this config to $fs.default.name? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
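For readers unfamiliar with the error in the issue title, the sketch below shows in general terms how a depth-limited variable expansion can fail when a property's default refers to another property, as mapreduce.job.hdfs-servers does with fs.default.name here. It is an illustration only: the class SubstitutionSketch, the ${name} syntax handling, the limit of 20 rounds, and the exception message are assumptions made for this example and are not taken from the patch or from the Hadoop sources. It also does not claim to reproduce the specific cause diagnosed in MAPREDUCE-3462.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubstitutionSketch {
    // Matches ${name}; the name may not contain '}', '$' or whitespace.
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}$\\s]+)\\}");
    // Assumed cap on expansion rounds, chosen for illustration.
    private static final int MAX_SUBST = 20;

    /** Expands ${name} references one at a time, giving up after MAX_SUBST rounds. */
    public static String substitute(Map<String, String> props, String expr) {
        String eval = expr;
        for (int round = 0; round < MAX_SUBST; round++) {
            Matcher m = VAR.matcher(eval);
            if (!m.find()) {
                return eval; // nothing left to expand
            }
            String value = props.get(m.group(1));
            if (value == null) {
                return eval; // unknown variable: leave the text as it is
            }
            eval = eval.substring(0, m.start()) + value + eval.substring(m.end());
        }
        throw new IllegalStateException(
            "Variable substitution depth too large: " + MAX_SUBST + " " + expr);
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("fs.default.name", "hdfs://namenode:8020"); // hypothetical value
        props.put("loop", "${loop}");                          // refers to itself
        System.out.println(substitute(props, "${fs.default.name}/tmp"));
        try {
            substitute(props, "${loop}");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Any chain of references that never bottoms out in a literal value exhausts the round limit, which is why a default that points at another property can break jobs whose configuration does not define that property the way the default expects.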
[jira] [Commented] (MAPREDUCE-3736) Variable substitution depth too large for fs.default.name causes jobs to fail
[ https://issues.apache.org/jira/browse/MAPREDUCE-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208147#comment-13208147 ] Hudson commented on MAPREDUCE-3736: --- Integrated in Hadoop-Hdfs-0.23-Commit #539 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/539/]) Merge -r 1244263:1244264 from trunk to branch. FIXES: MAPREDUCE-3736 (Revision 1244265) Result = SUCCESS tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244265 Files : * /hadoop/common/branches/branch-0.23/hadoop-common-project/hadoop-common/src/main/resources/core-default.xml * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/test/java/org/apache/hadoop/mapred/TestMRWithDistributedCache.java * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/conf/TestNoDefaultsJobConf.java * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/fs/JHLogAnalyzer.java * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/io/FileBench.java * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestCombineFileInputFormat.java * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestConcatenatedCompressedInput.java * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestTextInputFormat.java *
/hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/TestMapCollection.java * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestFileInputFormat.java * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestMRKeyValueTextInputFormat.java * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Variable substitution depth too large for fs.default.name causes jobs to fail - Key: MAPREDUCE-3736 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3736 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.1 Reporter: Eli Collins Assignee: Ahmed Radwan Priority: Blocker Attachments: MAPREDUCE-3736.patch, MAPREDUCE-3736_rev2.patch, MAPREDUCE-3736_rev3.patch I'm seeing the same failure as MAPREDUCE-3462 in downstream projects running against a recent build of branch-23. MR-3462 modified the tests rather than fixing the framework. In that jira Ravi mentioned I'm still ignorant of the change which made the tests start to fail. I should probably understand better the reasons for that change before proposing a more generalized fix. Let's figure out the general fix (rather than require all projects to set mapreduce.job.hdfs-servers in their conf we should fix this in the framework). Perhaps we should not default this config to $fs.default.name? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3854) Reinstate environment variable tests in TestMiniMRChildTask
[ https://issues.apache.org/jira/browse/MAPREDUCE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208148#comment-13208148 ] Hudson commented on MAPREDUCE-3854: --- Integrated in Hadoop-Hdfs-0.23-Commit #539 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/539/]) MAPREDUCE-3854. Fixed and reenabled tests related to MR child JVM's environmental variables in TestMiniMRChildTask. (Tom White via vinodkv) svn merge --ignore-ancestry -c 1244223 ../../trunk/ (Revision 1244224) Result = SUCCESS vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244224 Files : * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestMiniMRChildTask.java Reinstate environment variable tests in TestMiniMRChildTask --- Key: MAPREDUCE-3854 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3854 Project: Hadoop Map/Reduce Issue Type: Test Components: mrv2 Reporter: Tom White Assignee: Tom White Fix For: 0.23.1 Attachments: MAPREDUCE-3854.patch, MAPREDUCE-3854.patch MAPREDUCE-3716 reinstated one of the tests in TestMiniMRChildTask, but there are two more which should be run. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3858) Task attempt failure during commit results in task never completing
[ https://issues.apache.org/jira/browse/MAPREDUCE-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208149#comment-13208149 ] Hudson commented on MAPREDUCE-3858: --- Integrated in Hadoop-Hdfs-0.23-Commit #539 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/539/]) MAPREDUCE-3858. Task attempt failure during commit results in task never completing. (Tom White via mahadev) - Merging r1244254 from trunk. (Revision 1244255) Result = SUCCESS mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1244255 Files : * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java * /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java Task attempt failure during commit results in task never completing --- Key: MAPREDUCE-3858 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3858 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Reporter: Tom White Assignee: Tom White Priority: Critical Fix For: 0.23.1 Attachments: MAPREDUCE-3858.patch On a terasort job a task attempt failed during the commit phase. Another attempt was rescheduled, but when it tried to commit it failed. {noformat} attempt_1329019187148_0083_r_000586_0 already given a go for committing the task output, so killing attempt_1329019187148_0083_r_000586_1 {noformat} The job hung as new attempts kept getting scheduled only to fail during commit. -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-2793) [MR-279] Maintain consistency in naming appIDs, jobIDs and attemptIDs
[ https://issues.apache.org/jira/browse/MAPREDUCE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated MAPREDUCE-2793: -- Attachment: MAPREDUCE-2793-branch-0.23.patch Adding patch with code and test fixes. [MR-279] Maintain consistency in naming appIDs, jobIDs and attemptIDs -- Key: MAPREDUCE-2793 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2793 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.0 Reporter: Ramya Sunil Assignee: Bikas Saha Priority: Critical Fix For: 0.23.2 Attachments: MAPREDUCE-2793-branch-0.23.patch, MAPREDUCE-2793.patch appIDs, jobIDs and attempt/container ids are not consistently named in the logs, console and UI. For consistency purpose, they all have to follow a common naming convention. Currently, For appID = On the RM UI: app_1308259676864_5 On the JHS UI: No appID Console/logs: No appID mapred-local dirs are named as: application_1308259676864_0005 For jobID = On the RM UI: job_1308259676864_5_5 JHS UI: job_1308259676864_5_5 Console/logs: job_1308259676864_0005 mapred-local dirs are named as: No jobID For attemptID On the RM UI: attempt_1308259676864_5_5_m_24_0 JHS attempt_1308259676864_5_5_m_24_0 Console/logs: attempt_1308259676864_0005_m_24_0 mapred-local dirs are named as: container_1308259676864_0005_24 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
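One way to get the consistency this issue asks for is to build every identifier through a single set of formatting helpers, so that the UI, console, logs and local directories cannot drift apart. The sketch below assumes the zero-padded shapes visible elsewhere in this thread (for example application_1308259676864_0005 and attempt_1329019187148_0083_r_000586_0). The class IdNames and its method names are hypothetical, and this is not the code from the attached patches.

```java
import java.util.Locale;

/** Hypothetical helpers that format all ids from the same fields and padding rules. */
public class IdNames {
    /** e.g. application_1308259676864_0005 */
    public static String appId(long clusterTimestamp, int sequence) {
        return String.format(Locale.ROOT, "application_%d_%04d", clusterTimestamp, sequence);
    }

    /** e.g. job_1308259676864_0005 */
    public static String jobId(long clusterTimestamp, int sequence) {
        return String.format(Locale.ROOT, "job_%d_%04d", clusterTimestamp, sequence);
    }

    /** e.g. attempt_1329019187148_0083_r_000586_0; type is 'm' for map or 'r' for reduce. */
    public static String attemptId(long clusterTimestamp, int sequence,
                                   char type, int taskIndex, int attemptIndex) {
        return String.format(Locale.ROOT, "attempt_%d_%04d_%c_%06d_%d",
                clusterTimestamp, sequence, type, taskIndex, attemptIndex);
    }

    public static void main(String[] args) {
        System.out.println(appId(1308259676864L, 5));
        System.out.println(jobId(1308259676864L, 5));
        System.out.println(attemptId(1329019187148L, 83, 'r', 586, 0));
    }
}
```

Because the application and job ids share the timestamp and sequence number, the same two values can be passed to each helper and the resulting names differ only in their prefix.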
[jira] [Commented] (MAPREDUCE-3859) CapacityScheduler incorrectly utilizes extra-resources of queue for high-memory jobs
[ https://issues.apache.org/jira/browse/MAPREDUCE-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208204#comment-13208204 ] Arun C Murthy commented on MAPREDUCE-3859: -- Sergey, I'm pretty sure the reason you are hitting this is that you have a single user in your queue. By default, a single user can't exceed the queue's capacity (10 in this case). You can use 'user-limit-factor' to bump that up: http://hadoop.apache.org/common/docs/r1.0.0/capacity_scheduler.html#Configuration CapacityScheduler incorrectly utilizes extra-resources of queue for high-memory jobs Key: MAPREDUCE-3859 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3859 Project: Hadoop Map/Reduce Issue Type: Bug Components: contrib/capacity-sched Affects Versions: 1.0.0 Environment: CDH3u1 Reporter: Sergey Tryuber Imagine, we have a queue A with capacity 10 slots and 20 as extra-capacity, jobs which use 3 map slots will never consume more than 9 slots, regardless how many free slots on a cluster. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
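The arithmetic behind the "never more than 9 slots" observation can be worked through with a deliberately simplified model: a single user may hold at most capacity times user-limit-factor slots, and a task is scheduled only if all of its slots fit under that limit. The class SlotLimitSketch below is hypothetical, treats the factor as a whole number, and ignores everything else the real CapacityScheduler takes into account.

```java
/** Simplified, hypothetical model of a per-user slot limit in one queue. */
public class SlotLimitSketch {
    /**
     * Returns how many slots a single user ends up holding when every task
     * needs slotsPerTask slots and a task is admitted only if it fits entirely
     * under queueCapacity * userLimitFactor.
     */
    public static int slotsUsed(int queueCapacity, int userLimitFactor, int slotsPerTask) {
        int limit = queueCapacity * userLimitFactor;
        int occupied = 0;
        while (occupied + slotsPerTask <= limit) {
            occupied += slotsPerTask; // schedule one more task
        }
        return occupied;
    }

    public static void main(String[] args) {
        // Capacity 10, default factor 1, tasks of 3 slots: 3, 6, 9, then 12 > 10.
        System.out.println(slotsUsed(10, 1, 3));
        // Same queue with the factor raised to 3: the limit becomes 30.
        System.out.println(slotsUsed(10, 3, 3));
        // One-slot tasks reach the capacity exactly.
        System.out.println(slotsUsed(10, 1, 1));
    }
}
```

Under this model the 3-slot job stalls at 9 because a fourth task would need 12 slots against a limit of 10, and raising the factor lifts the limit rather than changing how tasks are admitted.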
[jira] [Updated] (MAPREDUCE-3634) All daemons should crash instead of hanging around when their EventHandlers get exceptions
[ https://issues.apache.org/jira/browse/MAPREDUCE-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated MAPREDUCE-3634: --- Status: Patch Available (was: Open) All daemons should crash instead of hanging around when their EventHandlers get exceptions -- Key: MAPREDUCE-3634 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3634 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.0 Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Fix For: 0.23.2 Attachments: MAPREDUCE-3634-20120118.1.txt, MAPREDUCE-3634-20120119.txt, MAPREDUCE-3634-20120214.txt We should make sure that the daemons crash in case the dispatchers get exceptions and stop processing. That way we will be debugging RM/NM/AM crashes instead of hard-to-track hanging jobs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-3634) All daemons should crash instead of hanging around when their EventHandlers get exceptions
[ https://issues.apache.org/jira/browse/MAPREDUCE-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated MAPREDUCE-3634: --- Fix Version/s: (was: 0.23.1) 0.23.2 All daemons should crash instead of hanging around when their EventHandlers get exceptions -- Key: MAPREDUCE-3634 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3634 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.0 Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Fix For: 0.23.2 Attachments: MAPREDUCE-3634-20120118.1.txt, MAPREDUCE-3634-20120119.txt, MAPREDUCE-3634-20120214.txt We should make sure that the daemons crash in case the dispatchers get exceptions and stop processing. That way we will be debugging RM/NM/AM crashes instead of hard-to-track hanging jobs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-3634) All daemons should crash instead of hanging around when their EventHandlers get exceptions
[ https://issues.apache.org/jira/browse/MAPREDUCE-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated MAPREDUCE-3634: --- Attachment: MAPREDUCE-3634-20120214.txt Addressing Sharad's and Sid's comments + updating to the latest code. All daemons should crash instead of hanging around when their EventHandlers get exceptions -- Key: MAPREDUCE-3634 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3634 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.0 Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Fix For: 0.23.2 Attachments: MAPREDUCE-3634-20120118.1.txt, MAPREDUCE-3634-20120119.txt, MAPREDUCE-3634-20120214.txt We should make sure that the daemons crash in case the dispatchers get exceptions and stop processing. That way we will be debugging RM/NM/AM crashes instead of hard-to-track hanging jobs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
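The behaviour requested in this issue, failing fast instead of lingering once an event handler throws, can be sketched as a dispatcher that invokes a fatal-error hook and stops processing at the first exception. Everything here is an assumption for illustration: FailFastDispatcher, post, drain and the onFatal hook are invented names, and the actual patches may be structured quite differently. In a real daemon the hook would presumably terminate the process, for example with System.exit.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical dispatcher that stops and reports as soon as a handler throws. */
public class FailFastDispatcher<E> {
    private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
    private final Consumer<E> handler;
    private final Runnable onFatal;

    public FailFastDispatcher(Consumer<E> handler, Runnable onFatal) {
        this.handler = handler;
        this.onFatal = onFatal;
    }

    /** Queues an event for later processing. */
    public void post(E event) {
        queue.add(event);
    }

    /**
     * Processes queued events in order. On the first handler failure it runs
     * the fatal hook and returns without touching the remaining events.
     * Returns the number of events handled successfully.
     */
    public int drain() {
        int handled = 0;
        E event;
        while ((event = queue.poll()) != null) {
            try {
                handler.accept(event);
                handled++;
            } catch (Throwable t) {
                onFatal.run(); // e.g. () -> System.exit(-1) in a daemon
                return handled;
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        FailFastDispatcher<String> d = new FailFastDispatcher<>(
            e -> { if (e.equals("bad")) throw new IllegalStateException(e); },
            () -> System.out.println("fatal: dispatcher stopping"));
        d.post("ok");
        d.post("bad");
        d.post("never handled");
        System.out.println("handled " + d.drain());
    }
}
```

Passing the hook in as a Runnable keeps the sketch testable, since a test can substitute a flag for the process exit.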
[jira] [Commented] (MAPREDUCE-3583) ProcfsBasedProcessTree#constructProcessInfo() may throw NumberFormatException
[ https://issues.apache.org/jira/browse/MAPREDUCE-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208261#comment-13208261 ] Mahadev konar commented on MAPREDUCE-3583: -- Looks like jenkins is down. Will run the trunk patch through hudson as soon as the build machines are up! ProcfsBasedProcessTree#constructProcessInfo() may throw NumberFormatException - Key: MAPREDUCE-3583 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3583 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.20.205.0 Environment: 64-bit Linux: asf011.sp2.ygridcore.net Linux asf011.sp2.ygridcore.net 2.6.32-33-server #71-Ubuntu SMP Wed Jul 20 17:42:25 UTC 2011 x86_64 GNU/Linux Reporter: Zhihong Yu Assignee: ramkrishna.s.vasudevan Priority: Critical Attachments: mapreduce-3583-trunk.txt, mapreduce-3583-v2.txt, mapreduce-3583-v3.txt, mapreduce-3583-v4.txt, mapreduce-3583-v5.txt, mapreduce-3583.txt HBase PreCommit builds frequently gave us NumberFormatException. From https://builds.apache.org/job/PreCommit-HBASE-Build/553//testReport/org.apache.hadoop.hbase.mapreduce/TestHFileOutputFormat/testMRIncrementalLoad/: {code} 2011-12-20 01:44:01,180 WARN [main] mapred.JobClient(784): No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
java.lang.NumberFormatException: For input string: 18446743988060683582 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48) at java.lang.Long.parseLong(Long.java:422) at java.lang.Long.parseLong(Long.java:468) at org.apache.hadoop.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:413) at org.apache.hadoop.util.ProcfsBasedProcessTree.getProcessTree(ProcfsBasedProcessTree.java:148) at org.apache.hadoop.util.LinuxResourceCalculatorPlugin.getProcResourceValues(LinuxResourceCalculatorPlugin.java:401) at org.apache.hadoop.mapred.Task.initialize(Task.java:536) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:353) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083) at org.apache.hadoop.mapred.Child.main(Child.java:249) {code} From hadoop 0.20.205 source code, looks like ppid was 18446743988060683582, causing NFE: {code} // Set (name) (ppid) (pgrpId) (session) (utime) (stime) (vsize) (rss) pinfo.updateProcessInfo(m.group(2), Integer.parseInt(m.group(3)), {code} You can find information on the OS at the beginning of https://builds.apache.org/job/PreCommit-HBASE-Build/553/console: {code} asf011.sp2.ygridcore.net Linux asf011.sp2.ygridcore.net 2.6.32-33-server #71-Ubuntu SMP Wed Jul 20 17:42:25 UTC 2011 x86_64 GNU/Linux core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 20 file size (blocks, -f) unlimited pending signals (-i) 16382 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 6 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 2048 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited 6 Running 
in Jenkins mode {code} From Nicolas Sze: {noformat} It looks like that the ppid is a 64-bit positive integer but Java long is signed and so only works with 63-bit positive integers. In your case, 2^64 > 18446743988060683582 > 2^63. Therefore, there is a NFE. {noformat} I propose changing allProcessInfo to Map<String, ProcessInfo> so that we don't encounter this problem by avoiding parsing large integer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
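The proposal at the end of the description, keying allProcessInfo by the textual pid instead of a parsed number, might look roughly like the sketch below. It is a guess at the shape of the idea rather than the attached patch: StringKeyedProcessTree, its nested ProcessInfo stand-in, and the isDescendant helper are all hypothetical, and the real ProcessInfo class carries many more fields.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical process table that keeps ids as text exactly as they were read. */
public class StringKeyedProcessTree {
    /** Minimal stand-in for a process record. */
    static final class ProcessInfo {
        final String pid;
        final String ppid;

        ProcessInfo(String pid, String ppid) {
            this.pid = pid;
            this.ppid = ppid;
        }
    }

    private final Map<String, ProcessInfo> allProcessInfo = new HashMap<>();

    /** Records a process; neither id is ever converted to a number. */
    public void add(String pid, String ppid) {
        allProcessInfo.put(pid, new ProcessInfo(pid, ppid));
    }

    /** Returns true if ancestor appears somewhere on the parent chain of pid. */
    public boolean isDescendant(String pid, String ancestor) {
        ProcessInfo p = allProcessInfo.get(pid);
        int hops = 0; // bounds the walk in case the recorded parents form a cycle
        while (p != null && hops++ < allProcessInfo.size()) {
            if (p.ppid.equals(ancestor)) {
                return true;
            }
            p = allProcessInfo.get(p.ppid);
        }
        return false;
    }

    public static void main(String[] args) {
        StringKeyedProcessTree tree = new StringKeyedProcessTree();
        tree.add("100", "1");
        tree.add("200", "100");
        tree.add("300", "18446743988060683582"); // oversized ppid from the report
        System.out.println(tree.isDescendant("200", "1"));
        System.out.println(tree.isDescendant("300", "1"));
    }
}
```

Since the ids are only compared and used as map keys, an out-of-range ppid is just another string and cannot raise a NumberFormatException.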