[jira] [Updated] (MAPREDUCE-6062) Use TestDFSIO test random read : job failed
[ https://issues.apache.org/jira/browse/MAPREDUCE-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated MAPREDUCE-6062:
------------------------------------
    Assignee: (was: Koji Noguchi)

> Use TestDFSIO test random read : job failed
> -------------------------------------------
>
>                 Key: MAPREDUCE-6062
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6062
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: benchmarks
>    Affects Versions: 2.2.0
>         Environment: command: hadoop jar $JAR_PATH TestDFSIO -read -random -nrFiles 12 -size 8000
>            Reporter: chongyuanhuang
>
> This is the log:
> 2014-09-01 13:57:29,876 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IllegalArgumentException: n must be positive
> 	at java.util.Random.nextInt(Random.java:300)
> 	at org.apache.hadoop.fs.TestDFSIO$RandomReadMapper.nextOffset(TestDFSIO.java:601)
> 	at org.apache.hadoop.fs.TestDFSIO$RandomReadMapper.doIO(TestDFSIO.java:580)
> 	at org.apache.hadoop.fs.TestDFSIO$RandomReadMapper.doIO(TestDFSIO.java:546)
> 	at org.apache.hadoop.fs.IOMapperBase.map(IOMapperBase.java:134)
> 	at org.apache.hadoop.fs.IOMapperBase.map(IOMapperBase.java:37)
> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
> 2014-09-01 13:57:29,886 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
> 2014-09-01 13:57:29,894 WARN [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete hdfs://m101:8020/benchmarks/TestDFSIO/io_random_read/_temporary/1/_temporary/attempt_1409538816633_0005_m_01_0

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
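The "n must be positive" failure in the trace above comes from java.util.Random.nextInt(int), which throws when its bound is zero or negative. A plausible cause, given that -size 8000 means 8000 MB per file, is an int cast of a long byte count wrapping negative. The sketch below is a hedged, self-contained reconstruction of that failure mode; the method names and arithmetic are illustrative, not the actual TestDFSIO source:

```java
import java.util.Random;

public class NextOffsetDemo {
    // Illustrative recreation of the suspected pattern: casting a >2GB long
    // byte count to int wraps negative, and Random.nextInt(int) then throws
    // IllegalArgumentException ("n must be positive").
    static long nextOffsetUnsafe(Random rnd, long fileSizeBytes, long skipBytes) {
        // (int) cast overflows for fileSizeBytes > 2GB, producing a non-positive bound
        return rnd.nextInt((int) (fileSizeBytes - skipBytes));
    }

    // A guarded variant that keeps the arithmetic in long and never hands
    // Random a non-positive int bound.
    static long nextOffsetSafe(Random rnd, long fileSizeBytes, long skipBytes) {
        long range = fileSizeBytes - skipBytes;
        if (range <= 0) {
            return 0; // nothing to seek over; read from the start
        }
        // draw a long offset in [0, range) without an int cast
        return (long) (rnd.nextDouble() * range);
    }

    public static void main(String[] args) {
        Random rnd = new Random(0);
        long sizeBytes = 8000L * 1024 * 1024; // -size 8000 (MB) in bytes, > Integer.MAX_VALUE
        try {
            nextOffsetUnsafe(rnd, sizeBytes, 1_000_000);
        } catch (IllegalArgumentException e) {
            System.out.println("unsafe version threw: " + e);
        }
        long off = nextOffsetSafe(rnd, sizeBytes, 1_000_000);
        System.out.println("safe offset in range: " + (off >= 0 && off < sizeBytes));
    }
}
```

With -size values of 2048 MB or less the int cast stays positive, which would explain why the benchmark only fails for large per-file sizes.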
[jira] [Commented] (MAPREDUCE-5439) mapred-default.xml has missing properties
[ https://issues.apache.org/jira/browse/MAPREDUCE-5439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725651#comment-13725651 ]

Koji Noguchi commented on MAPREDUCE-5439:
-----------------------------------------

I believe mapreduce.{map,reduce}.java.opts were intentionally left out of mapred-default.xml so that they won't overwrite a user's mapred.child.java.opts setting.

> mapred-default.xml has missing properties
> -----------------------------------------
>
>                 Key: MAPREDUCE-5439
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5439
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.1.0-beta
>            Reporter: Siddharth Wagle
>             Fix For: 2.1.0-beta
>
> Properties that need to be added:
> mapreduce.map.memory.mb
> mapreduce.map.java.opts
> mapreduce.reduce.memory.mb
> mapreduce.reduce.java.opts
> Properties that need to be fixed:
> mapred.child.java.opts should not be in mapred-default.
> yarn.app.mapreduce.am.command-opts description needs fixing

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
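The comment above hinges on a resolution order: a task-specific key should win only when the user actually set it, and shipping a default for it in mapred-default.xml would make it always "set", silently masking the legacy key. A minimal sketch of that logic, using a plain Map rather than Hadoop's Configuration (resolveMapJavaOpts and the default value are hypothetical, for illustration only):

```java
import java.util.HashMap;
import java.util.Map;

public class JavaOptsFallback {
    static final String DEFAULT_OPTS = "-Xmx200m"; // assumed default, for illustration

    // Resolution order described in the comment: the new per-task key
    // (mapreduce.map.java.opts) applies only if explicitly set; otherwise
    // the legacy mapred.child.java.opts applies; otherwise a default.
    static String resolveMapJavaOpts(Map<String, String> conf) {
        String specific = conf.get("mapreduce.map.java.opts");
        if (specific != null) {
            return specific; // user set the new key explicitly
        }
        String legacy = conf.get("mapred.child.java.opts");
        return legacy != null ? legacy : DEFAULT_OPTS;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mapred.child.java.opts", "-Xmx1g");
        // Legacy setting is honored because the new key has no default:
        System.out.println(resolveMapJavaOpts(conf));
    }
}
```

If mapred-default.xml supplied a value for mapreduce.map.java.opts, the first branch would always fire and the user's -Xmx1g above would be ignored, which is the breakage the comment warns about.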
[jira] [Resolved] (MAPREDUCE-114) All reducer tasks are finished, while some mapper tasks are still running
[ https://issues.apache.org/jira/browse/MAPREDUCE-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi resolved MAPREDUCE-114.
------------------------------------
    Resolution: Won't Fix

Fixed in YARN (MAPREDUCE-279). Not getting fixed in 0.20.*/1.*.

> All reducer tasks are finished, while some mapper tasks are still running
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-114
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-114
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Qi Liu
>         Attachments: hadoop-bug-overview.png, hadoop-bug-useless-task.png
>
> In a high-load environment (i.e. multiple jobs are queued up to be executed), when all reducer tasks of a job are finished, some mapper tasks of the same job may still be running (possibly re-executed due to a lost task tracker, etc.). This should not happen when a job has at least one reducer task. When all reducer tasks are in the SUCCEEDED state, the Hadoop JobTracker should kill all running mapper tasks, since their execution would be meaningless. The job should also switch to the SUCCEEDED state when all reducer tasks of that job have completed successfully.
[jira] [Updated] (MAPREDUCE-3688) Need better Error message if AM is killed/throws exception
[ https://issues.apache.org/jira/browse/MAPREDUCE-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated MAPREDUCE-3688:
------------------------------------
    Attachment: mapreduce-3688-h0.23-v02.patch

Another common error is the ApplicationMaster going out of memory when the number of tasks is large. Adding the error message to stdout so that the OOM would show.

{quote}
Diagnostics: Application application_1362579399138_0003 failed 1 times due to AM Container for appattempt_1362579399138_0003_01 exited with exitCode: 255 due to: Error starting MRAppMaster: java.lang.OutOfMemoryError: Java heap space at
{quote}

Forgot to mention, but sending these messages to the UI also means they would show up on the jobclient (console) side as well.

> Need better Error message if AM is killed/throws exception
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-3688
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3688
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.1
>            Reporter: David Capwell
>            Assignee: Sandy Ryza
>             Fix For: 0.23.2
>
>         Attachments: mapreduce-3688-h0.23-v01.patch, mapreduce-3688-h0.23-v02.patch
>
> We need better error messages in the UI if the AM gets killed or throws an Exception.
> If the following error gets thrown:
> java.lang.NumberFormatException: For input string: "9223372036854775807l" // last char is an L
> then the UI should say this exception. Instead I get the following:
> Application application_1326504761991_0018 failed 1 times due to AM Container for appattempt_1326504761991_0018_01 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException
[jira] [Updated] (MAPREDUCE-3688) Need better Error message if AM is killed/throws exception
[ https://issues.apache.org/jira/browse/MAPREDUCE-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated MAPREDUCE-3688:
------------------------------------
    Attachment: mapreduce-3688-h0.23-v01.patch

This has been a pain for our users as well. I don't think this patch will fly well with the reviewers, but maybe it'll help move the discussion forward. I didn't see a good way of communicating the error message to the caller, so I decided to sacrifice the stdout that the current MRAppMaster does not use. After the patch, the webUI would show:

{quote}
Diagnostics: Application application_1362527487477_0005 failed 1 times due to AM Container for appattempt_1362527487477_0005_01 exited with exitCode: 1 due to: Error starting MRAppMaster: org.apache.hadoop.yarn.YarnException: java.io.IOException: Split metadata size exceeded 20. Aborting job job_1362527487477_0005
	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1290)
	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1146)
	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1118)
	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:382)
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:299)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:823)
	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:121)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1094)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.start(MRAppMaster.java:998)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1273)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1221)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1269)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1226)
Caused by: java.io.IOException: Split metadata size exceeded 20. Aborting job job_1362527487477_0005
	at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:53)
	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1285)
	... 16 more
.Failing this attempt.. Failing the application.
{quote}

(This patch is based on 0.23.)
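The approach described above, catching the startup failure and writing it to the stdout that the NodeManager already captures into the container diagnostics, can be sketched as follows. This is a hedged illustration, not the actual MRAppMaster code; runAppMaster and the exception it throws are stand-ins:

```java
public class AmErrorSurfacingSketch {
    // Stand-in for MRAppMaster startup; here it always fails, simulating
    // e.g. the "Split metadata size exceeded" error from the quote above.
    static void runAppMaster() {
        throw new IllegalStateException("Split metadata size exceeded 20. Aborting job");
    }

    public static void main(String[] args) {
        try {
            runAppMaster();
        } catch (Throwable t) {
            // Write the real cause to stdout, which the container log capture
            // picks up, so the diagnostics show more than a bare exit code.
            System.out.println("Error starting MRAppMaster: " + t);
            t.printStackTrace(System.out);
            System.exit(1); // nonzero exit still marks the attempt as failed
        }
    }
}
```

The trade-off named in the comment is visible here: stdout becomes a dedicated error channel, which only works because the current AM writes nothing else to it.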
[jira] [Updated] (MAPREDUCE-4499) Looking for speculative tasks is very expensive in 1.x
[ https://issues.apache.org/jira/browse/MAPREDUCE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated MAPREDUCE-4499:
------------------------------------
    Description:
When there are lots of jobs and tasks active in a cluster, the process of figuring out whether or not to launch a speculative task becomes very expensive.

I could be missing something but it certainly looks like on every heartbeat we could be scanning 10's of thousands of tasks looking for something which might need to be speculatively executed. In most cases, nothing gets chosen so we completely trashed our data cache and didn't even find a task to schedule, just to do it all over again on the next heartbeat.

On busy jobtrackers, the following backtrace is very common:

{noformat}
"IPC Server handler 32 on 50300" daemon prio=10 tid=0x2ab36c74f800 nid=0xb50 runnable [0x45adb000]
   java.lang.Thread.State: RUNNABLE
	at java.util.TreeMap.valEquals(TreeMap.java:1182)
	at java.util.TreeMap.containsValue(TreeMap.java:227)
	at java.util.TreeMap$Values.contains(TreeMap.java:940)
	at org.apache.hadoop.mapred.TaskInProgress.hasRunOnMachine(TaskInProgress.java:1072)
	at org.apache.hadoop.mapred.JobInProgress.findSpeculativeTask(JobInProgress.java:2193)
	- locked <0x2aaefde82338> (a org.apache.hadoop.mapred.JobInProgress)
	at org.apache.hadoop.mapred.JobInProgress.findNewMapTask(JobInProgress.java:2417)
	- locked <0x2aaefde82338> (a org.apache.hadoop.mapred.JobInProgress)
	at org.apache.hadoop.mapred.JobInProgress.obtainNewNonLocalMapTask(JobInProgress.java:1432)
	- locked <0x2aaefde82338> (a org.apache.hadoop.mapred.JobInProgress)
	at org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:525)
	at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.getTaskFromQueue(CapacityTaskScheduler.java:322)
	at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:419)
	at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$500(CapacityTaskScheduler.java:150)
	at org.apache.hadoop.mapred.CapacityTaskScheduler.addMapTasks(CapacityTaskScheduler.java:1075)
	at org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1044)
	- locked <0x2aab6e27a4c8> (a org.apache.hadoop.mapred.CapacityTaskScheduler)
	at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3398)
	- locked <0x2aab6e191278> (a org.apache.hadoop.mapred.JobTracker)
	...
{noformat}

        was:

> Looking for speculative tasks is very expensive in 1.x
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-4499
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4499
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv1, performance
>    Affects Versions: 1.0.3
>            Reporter: Nathan Roberts
>            Assignee: Koji Noguchi
>             Fix For: 1.2.0
>
>         Attachments: mapreduce-4499-v1.0.2-1.patch
>
> When there are lots of jobs and tasks active in a cluster, the process of figuring out whether or not to launch a speculative task becomes very expensive.
> I could be missing something but it certainly looks like on every heartbeat we could be scanning 10's of thousands of tasks looking for something which might need to be speculatively executed. In most cases, nothing gets chosen so we completely trashed our data cache and didn't even find a task to schedule, just to do it all over again on the next heartbeat.
> On busy jobtrackers, the following backtrace is very common:
> "IPC Server handler 32 on 50300" daemon prio=10 tid=0x2ab36c74f800 nid=0xb50 runnable [0x45adb000]
>    java.lang.Thread.State: RUNNABLE
> 	at java.util.TreeMap.valEquals(TreeMap.java:1182)
> 	at java.util.TreeMap.containsValue(TreeMap.java:227)
> 	at java.util.TreeMap$Values.contains(TreeMap.java:940)
> 	at org.apache.hadoop.mapred.TaskInProgress.hasRunOnMachine(TaskInProgress.java:1072)
> 	at org.apache.hadoop.mapred.JobInProgress.findSpeculativeTask(JobInProgress.java:2193)
> 	- locked <0x2aaefde82338> (a org.apache.hadoop.mapred.JobInProgress)
> 	at org.apache.hadoop.mapred.JobInProgress.findNewMapTask(JobInProgress.java:2417)
> 	- locked <0x2aaefde82338> (a org.apache.hadoop.mapred.JobInProgress)
> 	at org.apache.hadoop.mapred.JobInProgress.obtainNewNonLocalMapTask(JobInProgress.java:1432)
> 	- locked <0x2aaefde82338> (a org.apache.hadoop.mapred.JobInProgress)
> 	at org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:525)
> 	at org.apache.hadoop.mapred.CapacityTaskSched
[jira] [Updated] (MAPREDUCE-4499) Looking for speculative tasks is very expensive in 1.x
[ https://issues.apache.org/jira/browse/MAPREDUCE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated MAPREDUCE-4499:
------------------------------------
    Description:     (was:
When there are lots of jobs and tasks active in a cluster, the process of figuring out whether or not to launch a speculative task becomes very expensive.

I could be missing something but it certainly looks like on every heartbeat we could be scanning 10's of thousands of tasks looking for something which might need to be speculatively executed. In most cases, nothing gets chosen so we completely trashed our data cache and didn't even find a task to schedule, just to do it all over again on the next heartbeat.

On busy jobtrackers, the following backtrace is very common:

"IPC Server handler 32 on 50300" daemon prio=10 tid=0x2ab36c74f800 nid=0xb50 runnable [0x45adb000]
   java.lang.Thread.State: RUNNABLE
	at java.util.TreeMap.valEquals(TreeMap.java:1182)
	at java.util.TreeMap.containsValue(TreeMap.java:227)
	at java.util.TreeMap$Values.contains(TreeMap.java:940)
	at org.apache.hadoop.mapred.TaskInProgress.hasRunOnMachine(TaskInProgress.java:1072)
	at org.apache.hadoop.mapred.JobInProgress.findSpeculativeTask(JobInProgress.java:2193)
	- locked <0x2aaefde82338> (a org.apache.hadoop.mapred.JobInProgress)
	at org.apache.hadoop.mapred.JobInProgress.findNewMapTask(JobInProgress.java:2417)
	- locked <0x2aaefde82338> (a org.apache.hadoop.mapred.JobInProgress)
	at org.apache.hadoop.mapred.JobInProgress.obtainNewNonLocalMapTask(JobInProgress.java:1432)
	- locked <0x2aaefde82338> (a org.apache.hadoop.mapred.JobInProgress)
	at org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:525)
	at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.getTaskFromQueue(CapacityTaskScheduler.java:322)
	at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:419)
	at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$500(CapacityTaskScheduler.java:150)
	at org.apache.hadoop.mapred.CapacityTaskScheduler.addMapTasks(CapacityTaskScheduler.java:1075)
	at org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1044)
	- locked <0x2aab6e27a4c8> (a org.apache.hadoop.mapred.CapacityTaskScheduler)
	at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3398)
	- locked <0x2aab6e191278> (a org.apache.hadoop.mapred.JobTracker)
	...)

> Looking for speculative tasks is very expensive in 1.x
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-4499
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4499
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv1, performance
>    Affects Versions: 1.0.3
>            Reporter: Nathan Roberts
>            Assignee: Koji Noguchi
>             Fix For: 1.2.0
>
>         Attachments: mapreduce-4499-v1.0.2-1.patch
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504767#comment-13504767 ]

Koji Noguchi commented on MAPREDUCE-4819:
-----------------------------------------

bq. I don't want the correctness of the job to depend on the marker on hdfs.

I meant hdfs in user space, like the output path. If this is stored elsewhere where the user cannot access it, I have no problem.

> AM can rerun job after reporting final job status to the client
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>
> If the AM reports final job status to the client but then crashes before unregistering with the RM, then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery).
> Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504758#comment-13504758 ]

Koji Noguchi commented on MAPREDUCE-4819:
-----------------------------------------

bq. like the client never being notified at all because the AM crashes after unregistering with the RM but before it notifies the client.

As long as the client eventually fails, that's not a problem. The critical problem we have here is a false positive from the client's perspective: the client is getting 'success' but the output is incomplete or corrupt (due to the retried application/job (over)writing to the same target path).

If we can have the AM and RM agree on the job status before telling the client, I think that would work. There could be a corner case where the AM and RM say the job was successful but the client thinks it failed. This false negative is much better than the false-positive issue we have now. Even in 0.20, we had cases where the JobTracker reported the job as successful but the client thought it failed due to a communication failure with the JobTracker. This is fine to happen, and we should let the client handle the recovery-or-retry.

bq. In general it seems like we need to come up with a set of markers that previous AM's leave behind

I don't want the correctness of the job to depend on a marker on hdfs.
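The false-positive vs. false-negative distinction in the comment above comes down to the order of two steps: recording completion with the RM (unregistering) and notifying the client. A toy model, not Hadoop code, that enumerates what an AM crash between the steps looks like under each ordering:

```java
public class FinalStatusOrdering {
    enum Step { RM_UNREGISTERED, CLIENT_NOTIFIED }

    // Observable outcome when the AM crashes after completing only the first
    // `stepsDone` steps of the given ordering.
    static String outcomeAfterCrash(Step[] ordering, int stepsDone) {
        boolean rmKnows = false, clientKnows = false;
        for (int i = 0; i < stepsDone; i++) {
            if (ordering[i] == Step.RM_UNREGISTERED) rmKnows = true;
            else clientKnows = true;
        }
        if (clientKnows && !rmKnows)
            return "false positive: client saw success, but the RM may rerun the job";
        if (rmKnows && !clientKnows)
            return "false negative: client must retry or poll for status";
        return clientKnows ? "clean finish" : "clean failure: RM reruns, client never told success";
    }

    public static void main(String[] args) {
        Step[] clientFirst = {Step.CLIENT_NOTIFIED, Step.RM_UNREGISTERED};
        Step[] rmFirst = {Step.RM_UNREGISTERED, Step.CLIENT_NOTIFIED};
        // Crash after step 1 under each ordering:
        System.out.println(outcomeAfterCrash(clientFirst, 1)); // the dangerous case in this issue
        System.out.println(outcomeAfterCrash(rmFirst, 1));     // recoverable by the client
    }
}
```

The model only illustrates the comment's point: with client-first ordering the bad crash window produces an unrecoverable false positive, while RM-first at worst produces a false negative the client can resolve itself.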
[jira] [Updated] (MAPREDUCE-4499) Looking for speculative tasks is very expensive in 1.x
[ https://issues.apache.org/jira/browse/MAPREDUCE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated MAPREDUCE-4499:
------------------------------------
    Attachment: mapreduce-4499-v1.0.2-1.patch

Attaching a patch with the if/else rewrite. It changes the order of the boolean conditions without changing the logic.
[jira] [Commented] (MAPREDUCE-4499) Looking for speculative tasks is very expensive in 1.x
[ https://issues.apache.org/jira/browse/MAPREDUCE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436225#comment-13436225 ]

Koji Noguchi commented on MAPREDUCE-4499:
-----------------------------------------

Looked at one of the busy JobTrackers. Attached btrace for a couple of seconds and counted the booleans.

Out of 2093791 JobInProgress.findSpeculativeTask calls, 2437 of them had shouldRemove=true.
Out of 2213670 TaskInProgress.hasSpeculativeTask calls, 137 of them were 'true'.

Of course these numbers differ from cluster to cluster, but I believe they show the possibility of some savings.
[jira] [Commented] (MAPREDUCE-4499) Looking for speculative tasks is very expensive in 1.x
[ https://issues.apache.org/jira/browse/MAPREDUCE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435573#comment-13435573 ]

Koji Noguchi commented on MAPREDUCE-4499:
-----------------------------------------

Assuming the cost of tip.hasRunOnMachine is expensive, we can try reordering the if/else so that we call it less often.

From JobInProgress.java:

{noformat}
2196       if (!tip.hasRunOnMachine(ttStatus.getHost(),
2197                                ttStatus.getTrackerName())) {
2198         if (tip.hasSpeculativeTask(currentTime, avgProgress)) {
2199           // In case of shared list we don't remove it. Since the TIP failed
2200           // on this tracker can be scheduled on some other tracker.
2201           if (shouldRemove) {
2202             iter.remove(); //this tracker is never going to run it again
2203           }
2204           return tip;
2205         }
2206       } else {
2207         // Check if this tip can be removed from the list.
2208         // If the list is shared then we should not remove.
2209         if (shouldRemove) {
2210           // This tracker will never speculate this tip
2211           iter.remove();
2212         }
2213       }
2214     }
{noformat}

Checking the action for each true/false combination:

{noformat}
tip.hasRun  tip.hasSpeculative  shouldRemove  Action
F           F                   F             -
F           F                   T             -
F           T                   F             return tip
F           T                   T             iter.remove() & return tip
T           F                   F             -
T           F                   T             iter.remove()
T           T                   F             -
T           T                   T             iter.remove()
{noformat}

Can we rewrite the above logic to:

{noformat}
if (tip.hasSpeculative) {
  if (shouldRemove) {
    iter.remove();
  }
  if (!tip.hasRun) {
    return tip;
  }
} else {
  if (shouldRemove && tip.hasRun) {
    iter.remove();
  }
}
{noformat}

From the jstacks we see, I can tell that shouldRemove is often 'false' in our case. Depending on the value of tip.hasSpeculative, we may reduce the number of tip.hasRun calls with this rewrite. (I don't know how often 'hasSpeculative' is true.)
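The reordering above can be checked mechanically by encoding both versions and comparing all eight boolean combinations. This is a self-contained sketch: the booleans stand in for the real tip.hasRunOnMachine/hasSpeculativeTask/shouldRemove values, and Outcome records what a single findSpeculativeTask iteration would do:

```java
public class SpeculativeCheckReorder {
    // What one loop iteration does with a candidate TIP.
    static final class Outcome {
        final boolean removed;  // iter.remove() called
        final boolean returned; // tip returned for speculation
        Outcome(boolean removed, boolean returned) {
            this.removed = removed;
            this.returned = returned;
        }
        boolean same(Outcome o) {
            return removed == o.removed && returned == o.returned;
        }
    }

    // Original ordering: the expensive hasRunOnMachine answer (hasRun)
    // is needed on every iteration.
    static Outcome original(boolean hasRun, boolean hasSpec, boolean shouldRemove) {
        if (!hasRun) {
            if (hasSpec) {
                return new Outcome(shouldRemove, true);
            }
            return new Outcome(false, false);
        }
        return new Outcome(shouldRemove, false);
    }

    // Reordered version from the comment above: the cheap hasSpeculativeTask
    // answer (hasSpec) is consulted first, so hasRun only matters when a
    // removal or return is actually possible.
    static Outcome reordered(boolean hasRun, boolean hasSpec, boolean shouldRemove) {
        if (hasSpec) {
            return new Outcome(shouldRemove, !hasRun);
        }
        return new Outcome(shouldRemove && hasRun, false);
    }

    public static void main(String[] args) {
        // Exhaustively verify the two orderings agree on all 8 combinations.
        boolean[] tf = {true, false};
        for (boolean run : tf)
            for (boolean spec : tf)
                for (boolean rm : tf)
                    if (!original(run, spec, rm).same(reordered(run, spec, rm)))
                        throw new AssertionError(run + " " + spec + " " + rm);
        System.out.println("all 8 cases agree");
    }
}
```

In the reordered version, when hasSpec and shouldRemove are both false (the common case per the btrace counts in the earlier comment), the expensive hasRun lookup is never needed at all.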
> Looking for speculative tasks is very expensive in 1.x > -- > > Key: MAPREDUCE-4499 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4499 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mrv1, performance >Affects Versions: 1.0.3 >Reporter: Nathan Roberts > > When there are lots of jobs and tasks active in a cluster, the process of > figuring out whether or not to launch a speculative task becomes very > expensive. > I could be missing something but it certainly looks like on every heartbeat > we could be scanning 10's of thousands of tasks looking for something which > might need to be speculatively executed. In most cases, nothing gets chosen > so we completely trashed our data cache and didn't even find a task to > schedule, just to do it all over again on the next heartbeat. > On busy jobtrackers, the following backtrace is very common: > "IPC Server handler 32 on 50300" daemon prio=10 tid=0x2ab36c74f800 > nid=0xb50 runnable [0x45adb000] >java.lang.Thread.State: RUNNABLE > at java.util.TreeMap.valEquals(TreeMap.java:1182) > at java.util.TreeMap.containsValue(TreeMap.java:227) > at java.util.TreeMap$Values.contains(TreeMap.java:940) > at > org.apache.hadoop.mapred.TaskInProgress.hasRunOnMachine(TaskInProgress.java:1072) > at > org.apache.hadoop.mapred.JobInProgress.findSpeculativeTask(JobInProgress.java:2193) > - locked <0x2aaefde82338> (a > org.apache.hadoop.mapred.JobInProgress) > at > org.apache.hadoop.mapred.JobInProgress.findNewMapTask(JobInProgress.java:2417) > - locked <0x2aaefde82338> (a > org.apache.hadoop.mapred.JobInProgress) > at > org.apache.hadoop.mapred.JobInProgress.obtainNewNonLocalMapTask(JobInProgress.java:1432) > - locked <0x2aaefde82338> (a > org.apache.hadoop.mapred.JobInProgress) > at > org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:525) > at > 
org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.getTaskFromQueue(CapacityTaskScheduler.java:322) > at > org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:419) > at > org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$500(CapacityTaskScheduler.java:150) > at > org.apache.hadoop.mapred.CapacityTaskSchedul
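The equivalence between the original nesting and the proposed reordering can be checked exhaustively over all eight combinations. A minimal stand-alone sketch (the class name and the string encoding of actions are hypothetical, not part of JobInProgress):

```java
// Hypothetical sketch: encode both the original and the reordered condition
// as pure functions over the three booleans, so the equivalence claimed by
// the truth table can be verified by brute force.
public class SpeculativeCheck {
    // Action encoding: "R" = iter.remove(), "T" = return tip, "" = no-op.
    public static String original(boolean hasRun, boolean hasSpeculative, boolean shouldRemove) {
        StringBuilder action = new StringBuilder();
        if (!hasRun) {                       // !tip.hasRunOnMachine(...)
            if (hasSpeculative) {
                if (shouldRemove) action.append("R");
                action.append("T");
            }
        } else {
            if (shouldRemove) action.append("R");
        }
        return action.toString();
    }

    public static String reordered(boolean hasRun, boolean hasSpeculative, boolean shouldRemove) {
        StringBuilder action = new StringBuilder();
        if (hasSpeculative) {                // hasRun checked only when needed
            if (shouldRemove) action.append("R");
            if (!hasRun) action.append("T");
        } else {
            if (shouldRemove && hasRun) action.append("R");
        }
        return action.toString();
    }

    public static void main(String[] args) {
        boolean[] tf = {false, true};
        for (boolean run : tf)
            for (boolean spec : tf)
                for (boolean rm : tf)
                    if (!original(run, spec, rm).equals(reordered(run, spec, rm)))
                        throw new AssertionError("mismatch at " + run + spec + rm);
        System.out.println("equivalent");
    }
}
```

Note that in the reordered version, hasRun (the expensive hasRunOnMachine call) is evaluated only when hasSpeculative is true, or when shouldRemove is true.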
[jira] [Updated] (MAPREDUCE-1684) ClusterStatus can be cached in CapacityTaskScheduler.assignTasks()
[ https://issues.apache.org/jira/browse/MAPREDUCE-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated MAPREDUCE-1684: Attachment: mapreduce-1684-v1.0.2-1.patch bq. Currently, CapacityTaskScheduler.assignTasks() calls getClusterStatus() thrice I think it calls getClusterStatus #jobs times in the worst case. For each heartbeat from a TaskTracker with some slots available:
{noformat}
heartbeat
  --> assignTasks
  --> addMap/ReduceTasks
  --> TaskSchedulingMgr.assignTasks
  --> For each queue : queuesForAssigningTasks
  --> getTaskFromQueue(queue)
  --> For each j : queue.getRunningJobs()
  --> obtainNewTask
  --> **getClusterStatus**
{noformat}
bq. It can be cached in assignTasks() and re-used. Attaching a patch. Would this work? The motivation is that we see getClusterStatus far too often in our jstacks, holding the global lock.
{noformat}
"IPC Server handler 15 on 50300" daemon prio=10 tid=0x5fc5d800 nid=0x6828 runnable [0x44847000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.mapred.JobTracker.getClusterStatus(JobTracker.java:4065)
        - locked <0x2aab6e638bd8> (a org.apache.hadoop.mapred.JobTracker)
        at org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:503)
        at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.getTaskFromQueue(CapacityTaskScheduler.java:322)
        at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:419)
        at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$500(CapacityTaskScheduler.java:150)
        at org.apache.hadoop.mapred.CapacityTaskScheduler.addMapTasks(CapacityTaskScheduler.java:1075)
        at org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1044)
        - locked <0x2aab6e7ffb10> (a org.apache.hadoop.mapred.CapacityTaskScheduler)
        at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3398)
        - locked <0x2aab6e638bd8> (a org.apache.hadoop.mapred.JobTracker)
        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
{noformat}
> ClusterStatus can be cached in CapacityTaskScheduler.assignTasks() > -- > > Key: MAPREDUCE-1684 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-1684 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: capacity-sched >Reporter: Amareshwari Sriramadasu > Attachments: mapreduce-1684-v1.0.2-1.patch > > > Currently, CapacityTaskScheduler.assignTasks() calls getClusterStatus() > thrice: once in assignTasks(), once in MapTaskScheduler and once in > ReduceTaskScheduler. It can be cached in assignTasks() and re-used. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
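The caching idea — fetch the cluster status once per heartbeat and hand the snapshot to every per-job check, instead of re-fetching it for each candidate job — can be sketched as follows. The class and method names are illustrative stand-ins, not the actual CapacityTaskScheduler code:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: getClusterStatus() takes a global lock on the
// JobTracker, so calling it once per heartbeat instead of once per
// running job removes #jobs-1 lock acquisitions per heartbeat.
public class CachedStatusScheduler {
    static final AtomicInteger fetchCount = new AtomicInteger();

    // Stand-in for JobTracker.getClusterStatus() (the lock-protected call).
    static int getClusterStatus() {
        fetchCount.incrementAndGet();
        return 100; // pretend: number of free slots in the cluster
    }

    // Before: one getClusterStatus() per running job (worst case #jobs calls).
    static void assignTasksUncached(List<String> runningJobs) {
        for (String job : runningJobs) {
            int status = getClusterStatus();
            // ... decide whether 'job' gets a task, using 'status' ...
        }
    }

    // After: one getClusterStatus() per heartbeat, reused for every job.
    static void assignTasksCached(List<String> runningJobs) {
        int status = getClusterStatus();
        for (String job : runningJobs) {
            // ... same decision for 'job', using the cached 'status' snapshot ...
        }
    }

    public static void main(String[] args) {
        List<String> jobs = List.of("job1", "job2", "job3");
        fetchCount.set(0);
        assignTasksUncached(jobs);
        int before = fetchCount.get();
        fetchCount.set(0);
        assignTasksCached(jobs);
        int after = fetchCount.get();
        System.out.println(before + " -> " + after); // prints "3 -> 1"
    }
}
```

The trade-off, as with any such cache, is that every job in one heartbeat sees a slightly stale snapshot; for slot counts that staleness is bounded by a single heartbeat interval.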
[jira] [Commented] (MAPREDUCE-4352) Jobs fail during resource localization when directories in file cache reaches to unix directory limit
[ https://issues.apache.org/jira/browse/MAPREDUCE-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396773#comment-13396773 ] Koji Noguchi commented on MAPREDUCE-4352: - Sounds similar to MAPREDUCE-1538 (pre-2.0). > Jobs fail during resource localization when directories in file cache reaches > to unix directory limit > - > > Key: MAPREDUCE-4352 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4352 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.0.1-alpha, 3.0.0 >Reporter: Devaraj K >Assignee: Devaraj K > > If we have multiple jobs which uses distributed cache with small size of > files, the directory limit reaches before reaching the cache size and fails > to create any directories in file cache. The jobs start failing with the > below exception. > {code:xml} > java.io.IOException: mkdir of > /tmp/nm-local-dir/usercache/root/filecache/1701886847734194975 failed > at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909) > at > org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143) > at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703) > at > org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325) > at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > {code} > We should have a mechanism to clean the cache files if it crosses specified > number of directories like cache size. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
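One possible shape for the suggested mechanism — bounding the file cache by directory count as well as by total size, evicting least-recently-used entries before the filesystem's per-directory limit is hit — sketched below with hypothetical names (this is not the actual NodeManager localization code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: an LRU map of localized cache directories bounded
// by entry count, so localization evicts old directories before the
// filesystem's subdirectory limit (e.g. ~32K entries on ext3) is reached.
public class BoundedFileCache {
    private final int maxDirs;
    private final LinkedHashMap<String, Long> dirs; // path -> size, LRU order

    public BoundedFileCache(int maxDirs) {
        this.maxDirs = maxDirs;
        // accessOrder=true makes iteration order least-recently-used first.
        this.dirs = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                // Evict the LRU directory once the cap is crossed
                // (a real implementation would also delete it from disk).
                return size() > BoundedFileCache.this.maxDirs;
            }
        };
    }

    public void localize(String path, long size) {
        dirs.put(path, size);
    }

    public int dirCount() {
        return dirs.size();
    }

    public boolean contains(String path) {
        return dirs.containsKey(path);
    }
}
```

A real fix would additionally refuse to evict directories still referenced by running containers, which is what makes the production change harder than this sketch.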
[jira] [Commented] (MAPREDUCE-2765) DistCp Rewrite
[ https://issues.apache.org/jira/browse/MAPREDUCE-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084215#comment-13084215 ] Koji Noguchi commented on MAPREDUCE-2765: - bq. Would the reviewers/watchers kindly comment on whether it's alright to deprecate the "-filelimit" and "-sizelimit" options, in DistCpV2? bq. +1. I think we (Yahoo) requested but ended up not using it at all. Just to be clear bq. The file-listing isn't filtered until the map-task runs bq. This used to be the case in old old distcp. We changed that when we added this -filelimit feature (that we never used). > DistCp Rewrite > -- > > Key: MAPREDUCE-2765 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2765 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: distcp >Affects Versions: 0.20.203.0 >Reporter: Mithun Radhakrishnan >Assignee: Mithun Radhakrishnan > Attachments: distcpv2.20.203.patch, distcpv2_trunk.patch > > > This is a slightly modified version of the DistCp rewrite that Yahoo uses in > production today. The rewrite was ground-up, with specific focus on: > 1. improved startup time (postponing as much work as possible to the MR job) > 2. support for multiple copy-strategies > 3. new features (e.g. -atomic, -async, -bandwidth.) > 4. improved programmatic use > Some effort has gone into refactoring what used to be achieved by a single > large (1.7 KLOC) source file, into a design that (hopefully) reads better too. > The proposed DistCpV2 preserves command-line-compatibility with the old > version, and should be a drop-in replacement. > New to v2: > 1. Copy-strategies and the DynamicInputFormat: > A copy-strategy determines the policy by which source-file-paths are > distributed between map-tasks. (These boil down to the choice of the > input-format.) > If no strategy is explicitly specified on the command-line, the policy > chosen is "uniform size", where v2 behaves identically to old-DistCp. 
(The > number of bytes transferred by each map-task is roughly equal, at a per-file > granularity.) > Alternatively, v2 ships with a "dynamic" copy-strategy (in the > DynamicInputFormat). This policy acknowledges that > (a) dividing files based only on file-size might not be an > even distribution (E.g. if some datanodes are slower than others, or if some > files are skipped.) > (b) a "static" association of a source-path to a map increases > the likelihood of long-tails during copy. > The "dynamic" strategy divides the list-of-source-paths into a number > (> nMaps) of smaller parts. When each map completes its current list of > paths, it picks up a new list to process, if available. So if a map-task is > stuck on a slow (and not necessarily large) file, other maps can pick up the > slack. The thinner the file-list is sliced, the greater the parallelism (and > the lower the chances of long-tails). Within reason, of course: the number of > these short-lived list-files is capped at an overridable maximum. > Internal benchmarks against source/target clusters with some slow(ish) > datanodes have indicated significant performance gains when using the > dynamic-strategy. Gains are most pronounced when nFiles greatly exceeds nMaps. > Please note that the DynamicInputFormat might prove useful outside of > DistCp. It is hence available as a mapred/lib, unfettered to DistCpV2. Also > note that the copy-strategies have no bearing on the CopyMapper.map() > implementation. > > 2. Improved startup-time and programmatic use: > When the old-DistCp runs with -update, and creates the > list-of-source-paths, it attempts to filter out files that might be skipped > (by comparing file-sizes, checksums, etc.) This significantly increases the > startup time (or the time spent in serial processing till the MR job is > launched), blocking the calling-thread. This becomes pronounced as nFiles > increases. 
(Internal benchmarks have seen situations where more time is spent > setting up the job than on the actual transfer.) > DistCpV2 postpones as much work as possible to the MR job. The > file-listing isn't filtered until the map-task runs (at which time, identical > files are skipped). DistCpV2 can now be run "asynchronously". The program > quits at job-launch, logging the job-id for tracking. Programmatically, the > DistCp.execute() returns a Job instance for progress-tracking. > > 3. New features: > (a) -async: As described in #2. > (b) -atomic: Data is copied to a (user-specifiable) tmp-location, and > then moved atomically to destination. > (c) -bandwidth: Enforces a limit on the bandwidth consumed per map. > (d) -strategy: As above. > > A more comprehensive descri
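The dynamic strategy described above amounts to slicing the source listing into many more chunks than maps and letting each map draw the next chunk when it finishes its current one. A toy model of that queue-draining behaviour (names are hypothetical, not DistCpV2's DynamicInputFormat):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model of the "dynamic" copy-strategy: the file list is sliced into
// nChunks (> nMaps) small lists; each simulated map repeatedly polls the
// shared queue, so a map stuck on a slow file simply claims fewer chunks
// while the other maps pick up the slack.
public class DynamicChunks {
    public static List<List<String>> slice(List<String> files, int nChunks) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < nChunks; i++) chunks.add(new ArrayList<>());
        for (int i = 0; i < files.size(); i++)
            chunks.get(i % nChunks).add(files.get(i));  // round-robin slicing
        return chunks;
    }

    // Simulate: map 0 is "slow" and only ever finishes one chunk; the
    // remaining nMaps-1 maps alternate draining the rest of the queue.
    public static int[] simulate(List<List<String>> chunks, int nMaps) {
        Queue<List<String>> pending = new ArrayDeque<>(chunks);
        int[] filesCopied = new int[nMaps];
        filesCopied[0] += pending.poll().size();  // the slow map's single chunk
        int m = 1;
        while (!pending.isEmpty()) {              // fast maps keep pulling chunks
            filesCopied[m] += pending.poll().size();
            m = 1 + (m % (nMaps - 1));
        }
        return filesCopied;
    }
}
```

With a static assignment the slow map would own 1/nMaps of all files regardless of its speed; here it owns only the chunks it actually manages to finish, which is the long-tail reduction the comment describes.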
[jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere
[ https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073827#comment-13073827 ] Koji Noguchi commented on MAPREDUCE-2324: - bq. Should we just disable that check? +1 > Job should fail if a reduce task can't be scheduled anywhere > > > Key: MAPREDUCE-2324 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 0.20.2, 0.20.205.0 >Reporter: Todd Lipcon >Assignee: Robert Joseph Evans > Fix For: 0.20.205.0 > > Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt, > MR-2324-security-v3.patch, MR-2324-secutiry-just-log-v1.patch > > > If there's a reduce task that needs more disk space than is available on any > mapred.local.dir in the cluster, that task will stay pending forever. For > example, we produced this in a QA cluster by accidentally running terasort > with one reducer - since no mapred.local.dir had 1T free, the job remained in > pending state for several days. The reason for the "stuck" task wasn't clear > from a user perspective until we looked at the JT logs. > Probably better to just fail the job if a reduce task goes through all TTs > and finds that there isn't enough space. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-2510) TaskTracker throw OutOfMemoryError after upgrade to jetty6
[ https://issues.apache.org/jira/browse/MAPREDUCE-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039365#comment-13039365 ] Koji Noguchi commented on MAPREDUCE-2510: - > > Is there an issue about upgrading Jetty to 6.1.26? > None that I'm aware of. Upgrading from Jetty5 to Jetty6 was painful, but > upgrading within Jetty6 has been very easy. > After 6.1.26 upgrade, I think we started seeing various fetch failure issues that persists on TaskTracker/jetty delaying the jobs. (MAPREDUCE-2529, MAPREDUCE-2530, etc) So far we haven't found any fixes and instead working on a workaround. > TaskTracker throw OutOfMemoryError after upgrade to jetty6 > -- > > Key: MAPREDUCE-2510 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2510 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Liyin Liang > > Our product cluster's TaskTracker sometimes throw OutOfMemoryError after > upgrade to jetty6. The exception in TT's log is as follows: > 2011-05-17 19:16:40,756 ERROR org.mortbay.log: Error for /mapOutput > java.lang.OutOfMemoryError: Java heap space > at java.io.BufferedInputStream.(BufferedInputStream.java:178) > at > org.apache.hadoop.fs.BufferedFSInputStream.(BufferedFSInputStream.java:44) > at > org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:176) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:359) > at > org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3040) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502) > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363) > at > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181) > at > 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) > at > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417) > at > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) > at > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) > at org.mortbay.jetty.Server.handle(Server.java:324) > at > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534) > at > org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864) > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533) > at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207) > at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403) > at > org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) > at > org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522) > Exceptions in .out file: > java.lang.OutOfMemoryError: Java heap space > Exception in thread "process reaper" java.lang.OutOfMemoryError: Java heap > space > Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap > space > java.lang.OutOfMemoryError: Java heap space > java.lang.reflect.InvocationTargetException > Exception in thread "IPC Server handler 6 on 50050" at > sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.mortbay.log.Slf4jLog.warn(Slf4jLog.java:126) > at org.mortbay.log.Log.warn(Log.java:181) > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:449) > at > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181) > at > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) > at > 
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417) > at > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) > at > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) > at org.mortbay.jetty.Server.handle(Server.java:324) > at > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534) > at > org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864) > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533) > at org.mortbay.jetty.
[jira] [Commented] (MAPREDUCE-2476) Set mapreduce scheduler to capacity scheduler for RPM/Debian packages by default
[ https://issues.apache.org/jira/browse/MAPREDUCE-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029625#comment-13029625 ] Koji Noguchi commented on MAPREDUCE-2476: - In addition to Todd's point, do many users really need the features from capacity scheduler (and/or fair scheduler) ? > Set mapreduce scheduler to capacity scheduler for RPM/Debian packages by > default > > > Key: MAPREDUCE-2476 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2476 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: build >Affects Versions: 0.20.203.1 > Environment: Redhat 5.5, Java 6 >Reporter: Eric Yang >Assignee: Eric Yang > > Hadoop RPM/Debian package is default to use the default scheduler. It would > be nice to setup the packages to use capacity scheduler instead. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (MAPREDUCE-2291) TaskTracker Decommission that waits for all map(intermediate) outputs to be pulled
TaskTracker Decommission that waits for all map(intermediate) outputs to be pulled --- Key: MAPREDUCE-2291 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2291 Project: Hadoop Map/Reduce Issue Type: Improvement Components: jobtracker Reporter: Koji Noguchi On our clusters, users were getting affected when ops were decommissioning a large number of TaskTracker nodes. Correct me if I'm wrong, but the current decommission of TaskTrackers only waits for the running tasks to finish, not for the jobs whose map outputs are kept on those decommissioning TaskTrackers. Is there any way we can handle this better? -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (MAPREDUCE-2075) Show why the job failed (e.g. Job ___ failed because task ____ failed 4 times)
[ https://issues.apache.org/jira/browse/MAPREDUCE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi resolved MAPREDUCE-2075. - Resolution: Duplicate Duplicate of MAPREDUCE-343. > Show why the job failed (e.g. Job ___ failed because task ____ failed 4 > times) > --- > > Key: MAPREDUCE-2075 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2075 > Project: Hadoop Map/Reduce > Issue Type: New Feature >Reporter: Koji Noguchi >Priority: Minor > Fix For: 0.22.0 > > > When our users have questions about their jobs' failure, they tend to > copy&paste all the userlog exceptions they see on the webui/console. > However, most of them are not the ones that caused the job to fail. When we > tell them "This task failed 4 times", sometimes that's enough information for > them to solve the problem on their own. > It would be nice if the jobclient or the job status page showed the reason for the job > being flagged as failed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAPREDUCE-2076) Showing inputsplit filename/offset inside the webui or tasklog
Showing inputsplit filename/offset inside the webui or tasklog -- Key: MAPREDUCE-2076 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2076 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Koji Noguchi Priority: Minor Fix For: 0.22.0 For debugging purposes, it would be nice to have the inputsplit's filename and offset for FileInputFormat and the like (in addition to the input split's node list that is already shown). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAPREDUCE-2075) Show why the job failed (e.g. Job ___ failed because task ____ failed 4 times)
Show why the job failed (e.g. Job ___ failed because task ____ failed 4 times) --- Key: MAPREDUCE-2075 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2075 Project: Hadoop Map/Reduce Issue Type: New Feature Reporter: Koji Noguchi Priority: Minor Fix For: 0.22.0 When our users have questions about their jobs' failure, they tend to copy&paste all the userlog exceptions they see on the webui/console. However, most of them are not the ones that caused the job to fail. When we tell them "This task failed 4 times", sometimes that's enough information for them to solve the problem on their own. It would be nice if the jobclient or the job status page showed the reason for the job being flagged as failed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAPREDUCE-2074) Task should fail when symlink creation fail
Task should fail when symlink creation fail --- Key: MAPREDUCE-2074 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2074 Project: Hadoop Map/Reduce Issue Type: Bug Components: distributed-cache Affects Versions: 0.20.2 Reporter: Koji Noguchi Priority: Minor Fix For: 0.22.0 If I pass an invalid symlink as -Dmapred.cache.files=/user/knoguchi/onerecord.txt#abc/abc Task only reports a WARN and goes on. {noformat} 2010-09-16 21:38:49,782 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /0/tmp/mapred-local/taskTracker/knoguchi/distcache/-5031501808205559510_-128488332_1354038698/abc-nn1.def.com/user/knoguchi/onerecord.txt <- /0/tmp/mapred-local/taskTracker/knoguchi/jobcache/job_201008310107_15105/attempt_201008310107_15105_m_00_0/work/./abc/abc 2010-09-16 21:38:49,789 WARN org.apache.hadoop.mapred.TaskRunner: Failed to create symlink: /0/tmp/mapred-local/taskTracker/knoguchi/distcache/-5031501808205559510_-128488332_1354038698/abc-nn1.def.com/user/knoguchi/onerecord.txt <- /0/tmp/mapred-local/taskTracker/knoguchi/jobcache/job_201008310107_15105/attempt_201008310107_15105_m_00_0/work/./abc/abc {noformat} I believe we should fail the task at this point. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
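The proposed fail-fast behaviour — turning the WARN into an exception that fails the task attempt — can be sketched as follows. The helper class is hypothetical, not the actual TaskRunner code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: create the distributed-cache symlink and fail the
// task (by throwing) instead of merely logging a WARN when creation fails,
// e.g. because the link path names a non-existent parent such as "abc/abc".
public class SymlinkOrFail {
    public static void createSymlinkOrFail(Path link, Path target) throws IOException {
        try {
            Files.createSymbolicLink(link, target);
        } catch (IOException e) {
            // Old behaviour: LOG.warn("Failed to create symlink: ...") and continue.
            // Proposed behaviour: propagate, so the task attempt fails right here
            // rather than later with a confusing FileNotFoundException.
            throw new IOException("Failed to create symlink: " + target + " <- " + link, e);
        }
    }
}
```

Failing at symlink-creation time surfaces the bad -Dmapred.cache.files value directly, instead of letting the task run and fail (or silently misbehave) when it first touches the missing link.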
[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again
[ https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898990#action_12898990 ] Koji Noguchi commented on MAPREDUCE-1901: - {quote} The TaskTracker, on being requested to run a task requiring CAR resource md5_F checks whether md5_F is localized. * If md5_F is already localized - then nothing more needs to be done. the localized version is used by the Task * If md5_F is not localized - then its fetched from the CAR repository {quote} What are we gaining by using md5_F on the TaskTracker side? Can we use the existing 'cacheStatus.mtime == confFileStamp' check and change the order of the check so that no unnecessary getFileStatus call is made (MAPREDUCE-2011)? Otherwise, this can only be used for dist files loaded by this framework and would require two separate code paths on the TaskTracker side. > Jobs should not submit the same jar files over and over again > - > > Key: MAPREDUCE-1901 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Joydeep Sen Sarma > Attachments: 1901.PATCH > > > Currently each Hadoop job uploads the required resources > (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in > executing this job would then download these resources into local disk. > In an environment where most of the users are using a standard set of jars > and files (because they are using a framework like Hive/Pig) - the same jars > keep getting uploaded and downloaded repeatedly. The overhead of this > protocol (primarily in terms of end-user latency) is significant when: > - the jobs are small (and conversantly - large in number) > - Namenode is under load (meaning hdfs latencies are high and made worse, in > part, by this protocol) > Hadoop should provide a way for jobs in a cooperative environment to not > submit the same files over and again. 
Identifying and caching execution > resources by a content signature (md5/sha) would be a good alternative to > have available. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-2011) Reduce number of getFileStatus call made from every task(TaskDistributedCache) setup
[ https://issues.apache.org/jira/browse/MAPREDUCE-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898981#action_12898981 ] Koji Noguchi commented on MAPREDUCE-2011: - MAPREDUCE-1901 has a detailed proposal for handling the distributed cache better for files loaded by the jobclient (-libjars). As part of it, it mentions: {quote} The TaskTracker, on being requested to run a task requiring CAR resource md5_F checks whether md5_F is localized. * If md5_F is already localized - then nothing more needs to be done. the localized version is used by the Task * If md5_F is not localized - then its fetched from the CAR repository {quote} This Jira is basically asking for the same thing, except using the existing mtime instead of the newly proposed md5_F. Just to reduce the mtime/getFileStatus calls, the mtime check is enough and keeps the change small. > Reduce number of getFileStatus call made from every > task(TaskDistributedCache) setup > > > Key: MAPREDUCE-2011 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2011 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: distributed-cache >Reporter: Koji Noguchi > > On our cluster, we had jobs with 20 dist cache and very short-lived tasks > resulting in 500 map tasks launched per second resulting in 10,000 > getFileStatus calls to the namenode. Namenode can handle this but asking to > see if we can reduce this somehow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again
[ https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898484#action_12898484 ] Koji Noguchi commented on MAPREDUCE-1901: - bq. For me, that's not of a worry. It may delay individual job submissions, but the overall load to the hdfs isn't much. bq. (at least compared to later phase of hundreds and thousands of tasktrackers looking up mtime of 'all those jars'.) Since my problem is just about lookup of mtime, created a new jira MAPREDUCE-2011. > Jobs should not submit the same jar files over and over again > - > > Key: MAPREDUCE-1901 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Joydeep Sen Sarma > Attachments: 1901.PATCH > > > Currently each Hadoop job uploads the required resources > (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in > executing this job would then download these resources into local disk. > In an environment where most of the users are using a standard set of jars > and files (because they are using a framework like Hive/Pig) - the same jars > keep getting uploaded and downloaded repeatedly. The overhead of this > protocol (primarily in terms of end-user latency) is significant when: > - the jobs are small (and conversantly - large in number) > - Namenode is under load (meaning hdfs latencies are high and made worse, in > part, by this protocol) > Hadoop should provide a way for jobs in a cooperative environment to not > submit the same files over and again. Identifying and caching execution > resources by a content signature (md5/sha) would be a good alternative to > have available. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-2011) Reduce number of getFileStatus call made from every task(TaskDistributedCache) setup
[ https://issues.apache.org/jira/browse/MAPREDUCE-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898481#action_12898481 ] Koji Noguchi commented on MAPREDUCE-2011: - When a task is initialized, it calls getFileStatus for every distributed cache file/archive entry it has (_dfsFileStamp_) and compares it with the task's timestamp specified in the config (_confFileStamp_). This makes sure that tasks fail *at start up* if distributed cache files were changed after the job was submitted and before the task started. (It still doesn't guarantee that the job would fail reliably, since all the tasks could have been started before the modification.) Now asking if we can change this logic to: if the exact localized cache exists ('lcacheStatus.mtime == confFileStamp') on the TaskTracker, use that and do not call getFileStatus (_dfsFileStamp_). With this, no getFileStatus calls are made if the TaskTracker already has the localized cache with the same timestamp. This should reduce the number of getFileStatus calls significantly when people submit jobs using the same distributed cache files. This still makes sure that all the tasks use the same dist cache files specified at the job startup (correctness). But with this change, tasks that would have failed at start-up due to (_dfsFileStamp_ != _confFileStamp_) can now succeed. > Reduce number of getFileStatus call made from every > task(TaskDistributedCache) setup > > > Key: MAPREDUCE-2011 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2011 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: distributed-cache >Reporter: Koji Noguchi > > On our cluster, we had jobs with 20 dist cache and very short-lived tasks > resulting in 500 map tasks launched per second resulting in 10,000 > getFileStatus calls to the namenode. Namenode can handle this but asking to > see if we can reduce this somehow. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
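The check proposed in the comment above can be sketched as follows. This is a minimal, self-contained model, not the actual TaskTracker code: the names setupTask, getFileStatusCalls, and the stubbed namenode lookup are all hypothetical, assuming only that the localized copy's mtime and the job's confFileStamp are available at task setup.

```java
// Sketch of the proposed task-setup logic: if the TaskTracker's localized
// cache mtime already equals the stamp recorded at job submission, reuse it
// and skip the getFileStatus round trip to the namenode entirely.
public class DistCacheStampCheck {

    static int getFileStatusCalls = 0;   // stands in for load on the namenode

    // Stub for the namenode lookup that returns the file's current mtime.
    static long getFileStatus(long currentDfsMtime) {
        getFileStatusCalls++;
        return currentDfsMtime;
    }

    /**
     * Returns true if the task may proceed. localizedMtime is null when no
     * localized copy exists yet on this TaskTracker.
     */
    static boolean setupTask(Long localizedMtime, long confFileStamp, long currentDfsMtime) {
        if (localizedMtime != null && localizedMtime == confFileStamp) {
            return true;                          // reuse localized copy, no namenode call
        }
        long dfsFileStamp = getFileStatus(currentDfsMtime);
        return dfsFileStamp == confFileStamp;     // old behavior: fail task on mismatch
    }

    public static void main(String[] args) {
        // 500 tasks reusing an already-localized cache: zero namenode calls.
        for (int i = 0; i < 500; i++) setupTask(100L, 100L, 100L);
        System.out.println(getFileStatusCalls);   // 0

        // First task on a fresh TaskTracker still pays one lookup.
        setupTask(null, 100L, 100L);
        System.out.println(getFileStatusCalls);   // 1
    }
}
```

This also makes the trade-off in the comment visible: a task whose localized mtime matches confFileStamp succeeds even when the file was later modified in HDFS (currentDfsMtime differs), which under the old logic would have failed at startup.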
[jira] Created: (MAPREDUCE-2011) Reduce number of getFileStatus call made from every task(TaskDistributedCache) setup
Reduce number of getFileStatus call made from every task(TaskDistributedCache) setup Key: MAPREDUCE-2011 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2011 Project: Hadoop Map/Reduce Issue Type: Improvement Components: distributed-cache Reporter: Koji Noguchi On our cluster, we had jobs with 20 dist cache and very short-lived tasks resulting in 500 map tasks launched per second resulting in 10,000 getFileStatus calls to the namenode. Namenode can handle this but asking to see if we can reduce this somehow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again
[ https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898300#action_12898300 ] Koji Noguchi commented on MAPREDUCE-1901: - bq. u would get a trace of all those jars being uploaded to hdfs. it's ridiculous. For me, that's not much of a worry. It may delay individual job submissions, but the overall load on hdfs isn't much (at least compared to the later phase of hundreds and thousands of tasktrackers looking up the mtime of 'all those jars'). > Jobs should not submit the same jar files over and over again > - > > Key: MAPREDUCE-1901 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Joydeep Sen Sarma > Attachments: 1901.PATCH > > > Currently each Hadoop job uploads the required resources > (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in > executing this job would then download these resources into local disk. > In an environment where most of the users are using a standard set of jars > and files (because they are using a framework like Hive/Pig) - the same jars > keep getting uploaded and downloaded repeatedly. The overhead of this > protocol (primarily in terms of end-user latency) is significant when: > - the jobs are small (and, conversely, large in number) > - Namenode is under load (meaning hdfs latencies are high and made worse, in > part, by this protocol) > Hadoop should provide a way for jobs in a cooperative environment to not > submit the same files over and again. Identifying and caching execution > resources by a content signature (md5/sha) would be a good alternative to > have available. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again
[ https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897833#action_12897833 ] Koji Noguchi commented on MAPREDUCE-1901: - bq. we have started testing this patch internally and this would become production in a couple of weeks. Joydeep, is this being tested on your production? What does the load look like? I don't know the details, but I like the "part of the goal here is to not have to look up mtimes again and again." part. We certainly have applications with many small tasks having multiple libjar/distributed-caches resulting in too many getfileinfo calls to the namenode. > Jobs should not submit the same jar files over and over again > - > > Key: MAPREDUCE-1901 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Reporter: Joydeep Sen Sarma > Attachments: 1901.PATCH > > > Currently each Hadoop job uploads the required resources > (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in > executing this job would then download these resources into local disk. > In an environment where most of the users are using a standard set of jars > and files (because they are using a framework like Hive/Pig) - the same jars > keep getting uploaded and downloaded repeatedly. The overhead of this > protocol (primarily in terms of end-user latency) is significant when: > - the jobs are small (and, conversely, large in number) > - Namenode is under load (meaning hdfs latencies are high and made worse, in > part, by this protocol) > Hadoop should provide a way for jobs in a cooperative environment to not > submit the same files over and again. Identifying and caching execution > resources by a content signature (md5/sha) would be a good alternative to > have available. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1839) HadoopArchives should provide a way to configure replication
[ https://issues.apache.org/jira/browse/MAPREDUCE-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875222#action_12875222 ] Koji Noguchi commented on MAPREDUCE-1839: - bq. I tested it yesterday on Hadoop 0.20 and it doesn't work. Could you clarify what didn't work? If the mapreduce archive job failed with an unknown param, maybe you don't have MAPREDUCE-826 which sets the ToolRunner. I tried just now. Got {noformat} % hadoop dfs -lsr mytest1.har -rw--- 5 knoguchi users 947 2010-06-03 17:47 /user/knoguchi/mytest1.har/_index -rw--- 5 knoguchi users 23 2010-06-03 17:47 /user/knoguchi/mytest1.har/_masterindex -rw--- 2 knoguchi users 68064 2010-06-03 17:46 /user/knoguchi/mytest1.har/part-0 % {noformat} Replication was successfully set to 2. Maybe you're talking about the replication shown when doing listStatus on the files inside the har? When I do hadoop dfs -lsr har:///user/knoguchi/mytest1.har, it shows {noformat} ... -rw--- 5 knoguchi users 17018 2010-06-03 17:47 /user/knoguchi/mytest1.har/tmptmp/abc {noformat} This is because the permission and replication factor are simply taken from the _index file. This is fixed in MAPREDUCE-1628. > HadoopArchives should provide a way to configure replication > > > Key: MAPREDUCE-1839 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-1839 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: harchive >Affects Versions: 0.20.1 >Reporter: Ramkumar Vadali >Priority: Minor > > When creating HAR archives, the part files use the default replication of the > filesystem. This should be made configurable through either the configuration > file or command line. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1839) HadoopArchives should provide a way to configure replication
[ https://issues.apache.org/jira/browse/MAPREDUCE-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874909#action_12874909 ] Koji Noguchi commented on MAPREDUCE-1839: - Probably a silly question, but can't we set it through the command line? {noformat} % hadoop archive -Ddfs.replication=2 ... {noformat} > HadoopArchives should provide a way to configure replication > > > Key: MAPREDUCE-1839 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-1839 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: harchive >Affects Versions: 0.20.1 >Reporter: Ramkumar Vadali >Priority: Minor > > When creating HAR archives, the part files use the default replication of the > filesystem. This should be made configurable through either the configuration > file or command line. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1648) Use RollingFileAppender to limit tasklogs
[ https://issues.apache.org/jira/browse/MAPREDUCE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854051#action_12854051 ] Koji Noguchi commented on MAPREDUCE-1648: - When reviewing the patch, please test the performance and make sure we don't re-introduce the slowness observed at HADOOP-1553. > Use RollingFileAppender to limit tasklogs > - > > Key: MAPREDUCE-1648 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-1648 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: tasktracker >Reporter: Guilin Sun >Priority: Minor > > There are at least two types of task-logs: syslog and stdlog > Task-Jvm outputs syslog by log4j with TaskLogAppender, TaskLogAppender looks > just like "tail -c", it stores last N byte/line logs in memory(via queue), > and do real output only if all logs is commit and Appender is going to close. > The common problem of TaskLogAppender and 'tail -c' is keep everything in > memory and user can't see any log output while task is in progress. > So I'm going to try RollingFileAppender instead of TaskLogAppender, use > MaxFileSize&MaxBackupIndex to limit log file size. > RollingFileAppender is also suitable for stdout/stderr, just redirect > stdout/stderr to log4j via LoggingOutputStream, no client code have to be > changed, and RollingFileAppender seems better than 'tail -c' too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-114) All reducer tasks are finished, while some mapper tasks are still running
[ https://issues.apache.org/jira/browse/MAPREDUCE-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844220#action_12844220 ] Koji Noguchi commented on MAPREDUCE-114: Is this related to MAPREDUCE-1060 ? > All reducer tasks are finished, while some mapper tasks are still running > - > > Key: MAPREDUCE-114 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-114 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Qi Liu > Attachments: hadoop-bug-overview.png, hadoop-bug-useless-task.png > > > In a high load environment (i.e. multiple jobs are queued up to be executed), > when all reducer tasks of a job are finished, some mapper tasks of the same > job may still running (possibly re-executed due to lost task tracker, etc). > This should not happen when a job has at least one reducer task. When all > reducer tasks are in SUCCEEDED state, the Hadoop JobTracker should kill all > running mapper tasks, since execution would be meaningless. The job should > also switch to SUCCEEDED state when all reducer tasks of that job succeeded > successfully. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (MAPREDUCE-837) harchive fail when output directory has URI with default port of 8020
[ https://issues.apache.org/jira/browse/MAPREDUCE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi reassigned MAPREDUCE-837: -- Assignee: Mahadev konar > harchive fail when output directory has URI with default port of 8020 > - > > Key: MAPREDUCE-837 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-837 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: harchive >Affects Versions: 0.20.1 >Reporter: Koji Noguchi >Assignee: Mahadev konar >Priority: Minor > > % hadoop archive -archiveName abc.har /user/knoguchi/abc > hdfs://mynamenode:8020/user/knoguchi > doesn't work on 0.18 nor 0.20 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1305) Massive performance problem with DistCp and -delete
[ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831893#action_12831893 ] Koji Noguchi commented on MAPREDUCE-1305: - bq. Is supporting Trash useful for DistCp users running with -delete? To me, yes. I've seen many of our users deleting their files accidentally. Trash has saved us great time. I'd like to request the Trash part to stay if there's not much performance problem. > Massive performance problem with DistCp and -delete > --- > > Key: MAPREDUCE-1305 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-1305 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: distcp >Affects Versions: 0.20.1 >Reporter: Peter Romianowski >Assignee: Peter Romianowski > Attachments: M1305-1.patch, MAPREDUCE-1305.patch > > > *First problem* > In org.apache.hadoop.tools.DistCp#deleteNonexisting we serialize FileStatus > objects when the path is all we need. > The performance problem comes from > org.apache.hadoop.fs.RawLocalFileSystem.RawLocalFileStatus#write which tries > to retrieve file permissions by issuing a "ls -ld " which is painfully > slow. > Changed that to just serialize Path and not FileStatus. > *Second problem* > To delete the files we invoke the "hadoop" command line tool with option > "-rmr ". Again, for each file. > Changed that to dstfs.delete(path, true) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1372) ConcurrentModificationException in JobInProgress
[ https://issues.apache.org/jira/browse/MAPREDUCE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799247#action_12799247 ] Koji Noguchi commented on MAPREDUCE-1372: - When we hit this, that task never gets scheduled and the job would be stuck forever. > ConcurrentModificationException in JobInProgress > > > Key: MAPREDUCE-1372 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-1372 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: jobtracker >Affects Versions: 0.20.1 >Reporter: Amareshwari Sriramadasu >Priority: Blocker > Fix For: 0.21.0 > > > We have seen the following ConcurrentModificationException in one of our > clusters > {noformat} > java.io.IOException: java.util.ConcurrentModificationException > at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793) > at java.util.HashMap$KeyIterator.next(HashMap.java:828) > at > org.apache.hadoop.mapred.JobInProgress.findNewMapTask(JobInProgress.java:2018) > at > org.apache.hadoop.mapred.JobInProgress.obtainNewMapTask(JobInProgress.java:1077) > at > org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:796) > at > org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.getTaskFromQueue(CapacityTaskScheduler.java:589) > at > org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:677) > at > org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$500(CapacityTaskScheduler.java:348) > at > org.apache.hadoop.mapred.CapacityTaskScheduler.addMapTask(CapacityTaskScheduler.java:1397) > at > org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1349) > at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2976) > at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at 
java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) > {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-826) harchive doesn't use ToolRunner / harchive returns 0 even if the job fails with exception
[ https://issues.apache.org/jira/browse/MAPREDUCE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated MAPREDUCE-826: --- Attachment: mapreduce-826-2.patch Thanks Mahadev. Made the change. Since this is a patch around main, didn't find a straightforward way to do a unit test. One manual test. Before the patch: $ hadoop archive -archiveName myhar.har -p /tmp/somenonexistdir somedir /user/knoguchi null $ echo $? 0 After the patch: $ hadoop archive -archiveName myhar.har -p /tmp/somenonexistdir somedir /user/knoguchi Exception in archives null lieliftbean-lm:trunk knoguchi$ echo $? 1 I guess we should also fix the NPE when src doesn't exist. I'm leaving it for now since this was a good manual test case. > harchive doesn't use ToolRunner / harchive returns 0 even if the job fails > with exception > - > > Key: MAPREDUCE-826 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-826 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: harchive >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Fix For: 0.21.0 > > Attachments: mapreduce-826-1.patch, mapreduce-826-2.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-883) harchive: Document how to unarchive
[ https://issues.apache.org/jira/browse/MAPREDUCE-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated MAPREDUCE-883: --- Attachment: mapreduce-883-0.patch Simple doc suggesting to use cp/distcp for unarchiving. > harchive: Document how to unarchive > --- > > Key: MAPREDUCE-883 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-883 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: documentation, harchive >Reporter: Koji Noguchi >Priority: Minor > Attachments: mapreduce-883-0.patch > > > I was thinking of implementing harchive's 'unarchive' feature, but realized > it has been implemented already ever since harchive was introduced. > It just needs to be documented. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-865) harchive: Reduce the number of open calls to _index and _masterindex
[ https://issues.apache.org/jira/browse/MAPREDUCE-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744331#action_12744331 ] Koji Noguchi commented on MAPREDUCE-865: Simple testing. Created har file with /a/b/2000files/xa to xaadnj and /a/b/2000files/2000files/xa to xaadnj Created har archive called myarchive.har. About 4500 files. Without the patch, /usr/bin/time hadoop dfs -lsr har:///user/knoguchi/myarchive.har > /dev/null 31.72user 5.23system *1:13.19* elapsed 50%CPU (0avgtext+0avgdata 0maxresident) with 9000 open calls to the Namenode (_masterindex and _index) and also 4500 filestatus calls to _index (I think). With the patch, 23.59user 0.58system *0:22.97* elapsed 105%CPU (0avgtext+0avgdata 0maxresident) with one _masterindex open call and five _index open calls. Setting -Dfs.har.indexcache.num=1 changed the number of _index open calls to 10, but elapsed time didn't change much. The goal of the patch is more for reducing the load/calls to the namenode than speeding up the 'ls' commands. Note that since the client caches the entire _masterindex and also caches each STORE (cache range) it reads, the initial call would be slower. > harchive: Reduce the number of open calls to _index and _masterindex > -- > > Key: MAPREDUCE-865 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-865 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: harchive >Reporter: Koji Noguchi >Priority: Minor > Attachments: mapreduce-865-0.patch > > > When I have har file with 1000 files in it, >% hadoop dfs -lsr har:///user/knoguchi/myhar.har/ > would open/read/close the _index/_masterindex files 1000 times. > This makes the client slow and add some load to the namenode as well. > Any ways to reduce this number? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-865) harchive: Reduce the number of open calls to _index and _masterindex
[ https://issues.apache.org/jira/browse/MAPREDUCE-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated MAPREDUCE-865: --- Attachment: mapreduce-865-0.patch Primitive patch for discussion. bq. So instead of open->read->close _index for each part file, thinking of keeping the index file open when possible. Instead of keeping an open handle, this one simply reads 'Stores' (range of caches) and keep last 5 of them (configurable) in memory. If the files are typical mapreduce outputs with many part-* files, number of open calls to _index will be significantly reduced. > harchive: Reduce the number of open calls to _index and _masterindex > -- > > Key: MAPREDUCE-865 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-865 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: harchive >Reporter: Koji Noguchi >Priority: Minor > Attachments: mapreduce-865-0.patch > > > When I have har file with 1000 files in it, >% hadoop dfs -lsr har:///user/knoguchi/myhar.har/ > would open/read/close the _index/_masterindex files 1000 times. > This makes the client slow and add some load to the namenode as well. > Any ways to reduce this number? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
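The store-caching idea described in this patch comment can be sketched with a small LRU keyed by index range. This is a self-contained illustration with hypothetical names (StoreCache, lookup, the "store@" payload) rather than the patch's actual classes; it assumes only that many part-* files hash into the same contiguous range of _index.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: keep the last N "Stores" (cached ranges of _index) in memory so
// that looking up many files falling into the same range opens _index once
// per range instead of once per file.
public class StoreCache {
    private final int maxStores;
    private int indexOpens = 0;    // counts simulated open/read/close of _index

    // access-order LinkedHashMap gives a simple LRU eviction policy
    private final LinkedHashMap<Long, String> stores =
        new LinkedHashMap<Long, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, String> eldest) {
                return size() > maxStores;   // evict once more than N stores are held
            }
        };

    public StoreCache(int maxStores) { this.maxStores = maxStores; }

    /** Returns the cached store covering rangeStart, reading it on a miss. */
    public String lookup(long rangeStart) {
        return stores.computeIfAbsent(rangeStart, start -> {
            indexOpens++;                    // one real _index open per missed range
            return "store@" + start;         // stands in for the parsed index range
        });
    }

    public int indexOpens() { return indexOpens; }

    public static void main(String[] args) {
        StoreCache cache = new StoreCache(5);   // "keep last 5", as in the patch
        // 4500 file lookups that fall into only 5 ranges: 5 opens, not 4500.
        for (int i = 0; i < 4500; i++) cache.lookup(i % 5);
        System.out.println(cache.indexOpens()); // 5
    }
}
```

With the cache sized down to 1 (the -Dfs.har.indexcache.num=1 experiment above), alternating between two ranges forces an eviction and a re-open on every switch, which matches the observed increase in _index open calls.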
[jira] Created: (MAPREDUCE-883) harchive: Document how to unarchive
harchive: Document how to unarchive --- Key: MAPREDUCE-883 URL: https://issues.apache.org/jira/browse/MAPREDUCE-883 Project: Hadoop Map/Reduce Issue Type: Improvement Components: documentation, harchive Reporter: Koji Noguchi Priority: Minor I was thinking of implementing harchive's 'unarchive' feature, but realized it has been implemented already ever since harchive was introduced. It just needs to be documented. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-865) harchive: Reduce the number of open calls to _index and _masterindex
[ https://issues.apache.org/jira/browse/MAPREDUCE-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743041#action_12743041 ] Koji Noguchi commented on MAPREDUCE-865: I believe _masterindex is probably small enough to fit in memory (cache). For the _index file, 1 million files can correspond to an _index size of 100 MBytes. (It depends on the path length.) Creating a local copy could be costly. In our clusters, most of the files are mapreduce output files: /a/b/part-0 /a/b/part-1 /a/b/part-2 ... These show up as a set in the _index file in this order since HarFileSystem.getHarHash is written that way. So instead of open->read->close of _index for each part file, thinking of keeping the index file open when possible. > harchive: Reduce the number of open calls to _index and _masterindex > -- > > Key: MAPREDUCE-865 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-865 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: harchive >Reporter: Koji Noguchi >Priority: Minor > > When I have har file with 1000 files in it, >% hadoop dfs -lsr har:///user/knoguchi/myhar.har/ > would open/read/close the _index/_masterindex files 1000 times. > This makes the client slow and add some load to the namenode as well. > Any ways to reduce this number? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAPREDUCE-865) harchive: Reduce the number of open calls to _index and _masterindex
harchive: Reduce the number of open calls to _index and _masterindex -- Key: MAPREDUCE-865 URL: https://issues.apache.org/jira/browse/MAPREDUCE-865 Project: Hadoop Map/Reduce Issue Type: Improvement Components: harchive Reporter: Koji Noguchi Priority: Minor When I have har file with 1000 files in it, % hadoop dfs -lsr har:///user/knoguchi/myhar.har/ would open/read/close the _index/_masterindex files 1000 times. This makes the client slow and add some load to the namenode as well. Any ways to reduce this number? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-837) harchive fail when output directory has URI with default port of 8020
[ https://issues.apache.org/jira/browse/MAPREDUCE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740697#action_12740697 ] Koji Noguchi commented on MAPREDUCE-837: bq. I'll create a separate Jira for the 0.20 job succeeding part. Created MAPREDUCE-838 > harchive fail when output directory has URI with default port of 8020 > - > > Key: MAPREDUCE-837 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-837 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: harchive >Affects Versions: 0.20.1 >Reporter: Koji Noguchi >Priority: Minor > > % hadoop archive -archiveName abc.har /user/knoguchi/abc > hdfs://mynamenode:8020/user/knoguchi > doesn't work on 0.18 nor 0.20 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAPREDUCE-838) Task succeeds even when committer.commitTask fails with IOException
Task succeeds even when committer.commitTask fails with IOException --- Key: MAPREDUCE-838 URL: https://issues.apache.org/jira/browse/MAPREDUCE-838 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Affects Versions: 0.20.1 Reporter: Koji Noguchi In MAPREDUCE-837, the job succeeded with empty output even though all the tasks were throwing IOException at committer.commitTask. {noformat} 2009-08-07 17:51:47,458 INFO org.apache.hadoop.mapred.TaskRunner: Task attempt_200907301448_8771_r_00_0 is allowed to commit now 2009-08-07 17:51:47,466 WARN org.apache.hadoop.mapred.TaskRunner: Failure committing: java.io.IOException: Can not get the relative path: \ base = hdfs://mynamenode:8020/user/knoguchi/test2.har/_temporary/_attempt_200907301448_8771_r_00_0 \ child = hdfs://mynamenode/user/knoguchi/test2.har/_temporary/_attempt_200907301448_8771_r_00_0/_index at org.apache.hadoop.mapred.FileOutputCommitter.getFinalPath(FileOutputCommitter.java:150) at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:106) at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126) at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86) at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171) at org.apache.hadoop.mapred.Task.commit(Task.java:768) at org.apache.hadoop.mapred.Task.done(Task.java:692) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417) at org.apache.hadoop.mapred.Child.main(Child.java:170) 2009-08-07 17:51:47,468 WARN org.apache.hadoop.mapred.TaskRunner: Failure asking whether task can commit: java.io.IOException: \ Can not get the relative path: base = hdfs://mynamenode:8020/user/knoguchi/test2.har/_temporary/_attempt_200907301448_8771_r_00_0 \ child = hdfs://mynamenode/user/knoguchi/test2.har/_temporary/_attempt_200907301448_8771_r_00_0/_index at org.apache.hadoop.mapred.FileOutputCommitter.getFinalPath(FileOutputCommitter.java:150) at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:106) at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126) at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86) at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171) at org.apache.hadoop.mapred.Task.commit(Task.java:768) at org.apache.hadoop.mapred.Task.done(Task.java:692) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417) at org.apache.hadoop.mapred.Child.main(Child.java:170) 2009-08-07 17:51:47,469 INFO org.apache.hadoop.mapred.TaskRunner: Task attempt_200907301448_8771_r_00_0 is allowed to commit now 2009-08-07 17:51:47,472 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200907301448_8771_r_00_0' done. {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAPREDUCE-837) harchive fail when output directory has URI with default port of 8020
harchive fail when output directory has URI with default port of 8020 - Key: MAPREDUCE-837 URL: https://issues.apache.org/jira/browse/MAPREDUCE-837 Project: Hadoop Map/Reduce Issue Type: Bug Components: harchive Affects Versions: 0.20.1 Reporter: Koji Noguchi Priority: Minor % hadoop archive -archiveName abc.har /user/knoguchi/abc hdfs://mynamenode:8020/user/knoguchi doesn't work on 0.18 nor 0.20 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-837) harchive fail when output directory has URI with default port of 8020
[ https://issues.apache.org/jira/browse/MAPREDUCE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740692#action_12740692 ] Koji Noguchi commented on MAPREDUCE-837: hadoop archive -archiveName abc.har /user/knoguchi/abc hdfs://mynamenode:8020/user/knoguchi in 0.18, job fails with {noformat} 09/08/07 19:41:57 INFO mapred.JobClient: Task Id : attempt_200908071938_0001_m_00_2, Status : FAILED Failed to rename output with the exception: java.io.IOException: Can not get the relative path: base = hdfs://mynamenode:8020/user/knoguchi/abc.har/_temporary/_attempt_200908071938_0001_m_00_2 child = hdfs://mynamenode/user/knoguchi/abc.har/_temporary/_attempt_200908071938_0001_m_00_2/part-0 at org.apache.hadoop.mapred.Task.getFinalPath(Task.java:590) at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:603) at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:621) at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:565) at org.apache.hadoop.mapred.JobTracker$TaskCommitQueue.run(JobTracker.java:2616) {noformat} in 0.20, it logs the above warning but job succeeds with empty output directory. (which is worse) I'll create a separate Jira for the 0.20 job succeeding part. > harchive fail when output directory has URI with default port of 8020 > - > > Key: MAPREDUCE-837 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-837 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: harchive >Affects Versions: 0.20.1 >Reporter: Koji Noguchi >Priority: Minor > > % hadoop archive -archiveName abc.har /user/knoguchi/abc > hdfs://mynamenode:8020/user/knoguchi > doesn't work on 0.18 nor 0.20 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
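The failure quoted above can be illustrated with a self-contained sketch using only java.net.URI. This paraphrases what getFinalPath's base/child comparison effectively runs into; it is not the Hadoop code itself: the output URI the user passed carries the explicit default port 8020, while the child paths come back without it, so the child no longer "starts with" the base even though both name the same namenode.

```java
import java.net.URI;

// Illustration of the port-mismatch behind "Can not get the relative path":
// base authority is "mynamenode:8020", child authority is "mynamenode".
public class HarPortMismatch {
    public static void main(String[] args) {
        URI base  = URI.create("hdfs://mynamenode:8020/user/knoguchi/abc.har/_temporary");
        URI child = URI.create("hdfs://mynamenode/user/knoguchi/abc.har/_temporary/part-0");

        // A textual prefix check fails because the authorities differ:
        System.out.println(child.toString().startsWith(base.toString())); // false
        System.out.println(base.getAuthority());   // mynamenode:8020
        System.out.println(child.getAuthority());  // mynamenode

        // Comparing only the path components would succeed:
        System.out.println(child.getPath().startsWith(base.getPath()));   // true
    }
}
```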
[jira] Updated: (MAPREDUCE-826) harchive doesn't use ToolRunner / harchive returns 0 even if the job fails with exception
[ https://issues.apache.org/jira/browse/MAPREDUCE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated MAPREDUCE-826: --- Attachment: mapreduce-826-1.patch 1) Calls ToolRunner.run 2) Took out catch(Exception e) and let main fail with a stack dump. At least the return value would be non-zero. > harchive doesn't use ToolRunner / harchive returns 0 even if the job fails > with exception > - > > Key: MAPREDUCE-826 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-826 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: harchive >Reporter: Koji Noguchi >Priority: Trivial > Attachments: mapreduce-826-1.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
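The effect of the two changes in this patch can be sketched as follows. All names here are hypothetical (archiveJob, mainBeforePatch, mainAfterPatch), and the real fix routes the status through ToolRunner.run's return value rather than a hand-written catch; this only models why swallowing the exception made the shell's $? always 0.

```java
// Sketch of the exit-code bug: a driver main() that catches Exception and
// falls through always hands the shell exit status 0, even when the job fails.
public class ArchiveExitCodes {

    static void archiveJob() { throw new RuntimeException("job failed"); }

    static int mainBeforePatch() {
        try {
            archiveJob();
        } catch (Exception e) {
            System.err.println(e.getMessage()); // failure is logged but swallowed
        }
        return 0;                               // caller (the shell) sees success
    }

    static int mainAfterPatch() {
        try {
            archiveJob();
            return 0;
        } catch (Exception e) {
            System.err.println("Exception in archives " + e.getMessage());
            return 1;                           // shell's $? is now non-zero
        }
    }

    public static void main(String[] args) {
        System.exit(mainAfterPatch());
    }
}
```

This matches the manual test in the later comment on this issue: before the patch, `echo $?` after a failing archive prints 0; after the patch it prints 1.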
[jira] Created: (MAPREDUCE-826) harchive doesn't use ToolRunner / harchive returns 0 even if the job fails with exception
harchive doesn't use ToolRunner / harchive returns 0 even if the job fails with exception - Key: MAPREDUCE-826 URL: https://issues.apache.org/jira/browse/MAPREDUCE-826 Project: Hadoop Map/Reduce Issue Type: Bug Components: harchive Reporter: Koji Noguchi Priority: Trivial -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.