[jira] [Commented] (MAPREDUCE-6954) Disable erasure coding for files that are uploaded to the MR staging area
[ https://issues.apache.org/jira/browse/MAPREDUCE-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162283#comment-16162283 ]

Robert Kanter commented on MAPREDUCE-6954:
------------------------------------------

Overall looks good. A couple of minor things:
- Update mapred-default like you suggested.
- The staging dir config is {{yarn.app.mapreduce.am.staging-dir}} and is defined in MRJobConfig as {{MR_AM_STAGING_DIR = MR_AM_PREFIX + "staging-dir"}}. I think we should be consistent with this and change {{MR_JOB_STAGING_ERASURECODING_ENABLED = "mapreduce.job.staging-dir.erasurecoding.enabled"}} to {{MR_AM_STAGING_DIR_ERASURECODING_ENABLED = MR_AM_STAGING_DIR + ".erasurecoding.enabled"}}, which would resolve to {{yarn.app.mapreduce.am.staging-dir.erasurecoding.enabled}}.
- It would be good to add a unit test.

> Disable erasure coding for files that are uploaded to the MR staging area
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6954
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6954
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>         Attachments: MAPREDUCE-6954-001.patch
>
> Depending on the encoder/decoder used and the type of MR workload, EC might
> negatively affect the performance of an MR job if too many files are
> localized.
> In such a scenario, users might want to disable EC in the staging area to
> speed up the execution.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
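The naming convention Robert describes is plain constant concatenation. A minimal sketch of how the proposed constant would be derived — the {{MR_AM_PREFIX}} value is MRJobConfig's, the new constant name is the one proposed in the comment, and the wrapper class is hypothetical (this is not the committed patch):

```java
public class StagingDirKeys {
    // MR_AM_PREFIX resolves to "yarn.app.mapreduce.am." in MRJobConfig
    static final String MR_AM_PREFIX = "yarn.app.mapreduce.am.";
    static final String MR_AM_STAGING_DIR = MR_AM_PREFIX + "staging-dir";

    // Proposed constant from the comment; note the leading dot on the suffix,
    // without which the key would collapse to "...staging-direrasurecoding.enabled"
    static final String MR_AM_STAGING_DIR_ERASURECODING_ENABLED =
        MR_AM_STAGING_DIR + ".erasurecoding.enabled";
}
```

Deriving the key from {{MR_AM_STAGING_DIR}} rather than hard-coding the full string keeps the two configs consistent if the staging-dir prefix ever changes.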
[jira] [Work started] (MAPREDUCE-6957) shuffle hangs after a node manager connection timeout
[ https://issues.apache.org/jira/browse/MAPREDUCE-6957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on MAPREDUCE-6957 started by Jooseong Kim.
-----------------------------------------------

> shuffle hangs after a node manager connection timeout
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-6957
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6957
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Jooseong Kim
>            Assignee: Jooseong Kim
>         Attachments: MAPREDUCE-6957.001.patch
>
> After a connection failure from the reducer to the node manager, shuffles
> started to hang with the following message:
> {code}
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 - MergeManager
> returned status WAIT ...
> {code}
> There are two problems that lead to the hang.
>
> Problem 1.
> When a reducer has an issue connecting to the node manager, copyFromHost may
> call putBackKnownMapOutput on the same task attempt multiple times.
> There are two call sites of putBackKnownMapOutput in copyFromHost since
> MAPREDUCE-6303:
> 1. In the finally block of copyFromHost
> 2. In the catch block of openShuffleUrl.
> When openShuffleUrl fails to connect from the catch block in copyFromHost, it
> returns null.
> By the time openShuffleUrl returns null, putBackKnownMapOutput would already
> have been called for all remaining map outputs.
> However, the finally block calls putBackKnownMapOutput one more time on the
> map outputs.
>
> Problem 2. Problem 1 causes a leak in MergeManager.
> The problem occurs when multiple fetchers get the same set of map attempt
> outputs to fetch.
> Different fetchers reserve memory from MergeManager in Fetcher.copyMapOutput
> for the same map outputs.
> When the fetch succeeds, only the first map output gets committed through
> ShuffleSchedulerImpl.copySucceeded -> InMemoryMapOutput.commit, because
> commit() is gated by !finishedMaps[mapIndex].
> This may lead to a condition where usedMemory > memoryLimit, while
> commitMemory < mergeThreshold.
> This gets the MergeManager into a deadlock where a merge is never triggered
> while MergeManager cannot reserve additional space for map outputs.
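The accounting failure in Problem 2 can be reproduced with a toy model. Everything below — the class and field names, the 100/60 limits, and the 40-unit map outputs — is illustrative, not Hadoop's actual MergeManagerImpl; it only mirrors the reserve/commit gating described above:

```java
// Toy model of the MergeManager accounting described above. Each fetch
// reserves memory; commit is gated by finishedMaps, so a redundant
// fetch's reservation is never released.
class MiniMergeManager {
    final long memoryLimit = 100, mergeThreshold = 60;
    long usedMemory = 0, commitMemory = 0;
    boolean[] finishedMaps = new boolean[8];

    boolean reserve(long size) {            // false => fetcher sees status WAIT
        if (usedMemory + size > memoryLimit) return false;
        usedMemory += size;
        return true;
    }

    void copySucceeded(int mapIndex, long size) {
        if (!finishedMaps[mapIndex]) {      // commit() gated by !finishedMaps[mapIndex]
            finishedMaps[mapIndex] = true;
            commitMemory += size;
            usedMemory -= size;
        }
        // else: the duplicate's reservation leaks -- nothing unreserves it
    }
}
```

Two fetchers that reserve the same 40-unit output, followed by the single gated commit, leave 40 units reserved forever: a later reserve(70) fails (so the fetcher loops on WAIT) while commitMemory stays below mergeThreshold, so no merge ever runs to free the space.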
[jira] [Commented] (MAPREDUCE-6441) Improve temporary directory name generation in LocalDistributedCacheManager for concurrent processes
[ https://issues.apache.org/jira/browse/MAPREDUCE-6441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162181#comment-16162181 ]

Haibo Chen commented on MAPREDUCE-6441:
---------------------------------------

I see. Talking with Daniel, there is really no good way to do this. Can we then add a javadoc to this new method to explain the issue with the test? As suggested by Daniel, we should probably use a barrier (all threads wait on the barrier and get notified at the same time), which would give us a better chance of reproducing this.

> Improve temporary directory name generation in LocalDistributedCacheManager
> for concurrent processes
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6441
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6441
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: William Watson
>            Assignee: Ray Chiang
>         Attachments: HADOOP-10924.02.patch,
> HADOOP-10924.03.jobid-plus-uuid.patch, MAPREDUCE-6441.004.patch,
> MAPREDUCE-6441.005.patch, MAPREDUCE-6441.006.patch
>
> Kicking off many sqoop processes in different threads results in:
> {code}
> 2014-08-01 13:47:24 -0400: INFO - 14/08/01 13:47:22 ERROR tool.ImportTool:
> Encountered IOException running import job: java.io.IOException:
> java.util.concurrent.ExecutionException: java.io.IOException: Rename cannot
> overwrite non empty destination directory
> /tmp/hadoop-hadoop/mapred/local/1406915233073
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:149)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:163)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
> 2014-08-01 13:47:24 -0400: INFO -    at java.security.AccessController.doPrivileged(Native Method)
> 2014-08-01 13:47:24 -0400: INFO -    at javax.security.auth.Subject.doAs(Subject.java:415)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:186)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:159)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:239)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:645)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:415)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:502)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
> 2014-08-01 13:47:24 -0400: INFO -    at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
> {code}
> If two are kicked off in the same second, the issue is the following lines of
> code in the org.apache.hadoop.mapred.LocalDistributedCacheManager class:
> {code}
> // Generating unique numbers for FSDownload.
> AtomicLong uniqueNumberGenerator =
>    new AtomicLong(System.currentTimeMillis());
> {code}
> and
> {code}
> Long.toString(uniqueNumberGenerator.incrementAndGet())),
> {code}
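The collision in the quoted code is easy to demonstrate: two JVMs that start within the same millisecond seed identical AtomicLongs and therefore produce the same "unique" directory names. The sketch below is illustrative — the class name is made up, and the jobid-plus-uuid alternative only echoes the approach hinted at by the HADOOP-10924.03.jobid-plus-uuid.patch attachment name:

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

public class TempDirNames {
    // Same generation scheme as the quoted LocalDistributedCacheManager code:
    // an AtomicLong seeded with the current time.
    static String timeBasedName(AtomicLong gen) {
        return Long.toString(gen.incrementAndGet());
    }

    public static void main(String[] args) {
        long sameMillis = 1406915233072L;          // two processes, same clock tick
        AtomicLong jvm1 = new AtomicLong(sameMillis);
        AtomicLong jvm2 = new AtomicLong(sameMillis);
        // Identical seeds => identical sequences => colliding directory names
        System.out.println(timeBasedName(jvm1).equals(timeBasedName(jvm2))); // true

        // A jobid-plus-uuid name is unique across processes regardless of timing
        String name1 = "job_1406915233073_0001_" + UUID.randomUUID();
        String name2 = "job_1406915233073_0001_" + UUID.randomUUID();
        System.out.println(name1.equals(name2)); // false
    }
}
```

The time-based scheme is only unique within one process; any cross-process uniqueness needs entropy (UUID) or a process-scoped identifier (job ID) mixed into the name.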
[jira] [Commented] (MAPREDUCE-6892) Issues with the count of failed/killed tasks in the jhist file
[ https://issues.apache.org/jira/browse/MAPREDUCE-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162092#comment-16162092 ]

Andrew Wang commented on MAPREDUCE-6892:
----------------------------------------

Peter, do you mind adding a release note to this JIRA summarizing the impact for our end users? Thanks!

> Issues with the count of failed/killed tasks in the jhist file
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-6892
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6892
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: client, jobhistoryserver
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>             Fix For: 3.0.0-beta1
>
>         Attachments: MAPREDUCE-6892-001.patch, MAPREDUCE-6892-002.PATCH,
> MAPREDUCE-6892-003.patch, MAPREDUCE-6892-004.patch, MAPREDUCE-6892-005.patch,
> MAPREDUCE-6892-006.patch
>
> Recently we encountered some issues with the value of failed tasks. After
> parsing the jhist file, {{JobInfo.getFailedMaps()}} returned 0, but actually
> there were failures.
> Another minor thing is that you cannot get the number of killed tasks
> (although this can be calculated).
> The root cause is that {{JobUnsuccessfulCompletionEvent}} contains only the
> successful map/reduce task counts. The numbers of failed and killed tasks are
> not stored.
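The "this can be calculated" remark is simple arithmetic over per-job totals; a sketch with assumed inputs — these are not the actual jhist event fields, just the relationship the description implies:

```java
public class TaskCounts {
    // Killed tasks = everything launched that neither succeeded nor failed.
    // Only the succeeded count is stored in JobUnsuccessfulCompletionEvent,
    // so the other two must come from elsewhere (or be recorded, per this fix).
    static int killed(int totalLaunched, int succeeded, int failed) {
        return totalLaunched - succeeded - failed;
    }
}
```

For example, a job that launched 10 map attempts of which 6 succeeded and 1 failed must have had 3 killed.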
[jira] [Updated] (MAPREDUCE-6870) Add configuration for MR job to finish when all reducers are complete (even with unfinished mappers)
[ https://issues.apache.org/jira/browse/MAPREDUCE-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Krogen updated MAPREDUCE-6870:
-----------------------------------
    Release Note: Enables {{mapreduce.job.finish-when-all-reducers-done}} by default. With this enabled, a MapReduce job will complete as soon as all of its reducers are complete, even if some mappers are still running. This can occur if a mapper was relaunched after node failure but the relaunched task's output is not actually needed. Previously the job would wait for all mappers to complete.

> Add configuration for MR job to finish when all reducers are complete (even
> with unfinished mappers)
> ---------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6870
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6870
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.6.1
>            Reporter: Zhe Zhang
>            Assignee: Peter Bacsko
>             Fix For: 3.0.0-beta1
>
>         Attachments: MAPREDUCE-6870-001.patch, MAPREDUCE-6870-002.patch,
> MAPREDUCE-6870-003.patch, MAPREDUCE-6870-004.patch, MAPREDUCE-6870-005.patch,
> MAPREDUCE-6870-006.patch, MAPREDUCE-6870-007.patch
>
> Even with MAPREDUCE-5817, there could still be cases where mappers get
> scheduled before all reducers are complete, but those mappers run for a long
> time, even after all reducers are complete. This could hurt the performance
> of large MR jobs.
> In some cases, mappers don't have any materialize-able outcome other than
> providing intermediate data to reducers. In that case, the job owner should
> have the config option to finish the job once all reducers are complete.
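Since the release note describes a behavior change that is on by default, users who want the old semantics back can flip the flag per job or cluster-wide; a minimal mapred-site.xml fragment (sketch of ordinary Hadoop configuration usage, not part of the patch):

```xml
<!-- Opt out of the new default: wait for all mappers even after reducers finish -->
<property>
  <name>mapreduce.job.finish-when-all-reducers-done</name>
  <value>false</value>
</property>
```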
[jira] [Commented] (MAPREDUCE-6870) Add configuration for MR job to finish when all reducers are complete (even with unfinished mappers)
[ https://issues.apache.org/jira/browse/MAPREDUCE-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162015#comment-16162015 ]

Erik Krogen commented on MAPREDUCE-6870:
----------------------------------------

Good idea, [~andrew.wang], thanks for the reminder. Done.

> Add configuration for MR job to finish when all reducers are complete (even
> with unfinished mappers)
> ---------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6870
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6870
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.6.1
>            Reporter: Zhe Zhang
>            Assignee: Peter Bacsko
>             Fix For: 3.0.0-beta1
>
>         Attachments: MAPREDUCE-6870-001.patch, MAPREDUCE-6870-002.patch,
> MAPREDUCE-6870-003.patch, MAPREDUCE-6870-004.patch, MAPREDUCE-6870-005.patch,
> MAPREDUCE-6870-006.patch, MAPREDUCE-6870-007.patch
[jira] [Updated] (MAPREDUCE-6870) Add configuration for MR job to finish when all reducers are complete (even with unfinished mappers)
[ https://issues.apache.org/jira/browse/MAPREDUCE-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Krogen updated MAPREDUCE-6870:
-----------------------------------
    Release Note: Enables mapreduce.job.finish-when-all-reducers-done by default. With this enabled, a MapReduce job will complete as soon as all of its reducers are complete, even if some mappers are still running. This can occur if a mapper was relaunched after node failure but the relaunched task's output is not actually needed. Previously the job would wait for all mappers to complete.

        (was: Enables {{mapreduce.job.finish-when-all-reducers-done}} by default. With this enabled, a MapReduce job will complete as soon as all of its reducers are complete, even if some mappers are still running. This can occur if a mapper was relaunched after node failure but the relaunched task's output is not actually needed. Previously the job would wait for all mappers to complete.)

> Add configuration for MR job to finish when all reducers are complete (even
> with unfinished mappers)
> ---------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6870
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6870
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.6.1
>            Reporter: Zhe Zhang
>            Assignee: Peter Bacsko
>             Fix For: 3.0.0-beta1
>
>         Attachments: MAPREDUCE-6870-001.patch, MAPREDUCE-6870-002.patch,
> MAPREDUCE-6870-003.patch, MAPREDUCE-6870-004.patch, MAPREDUCE-6870-005.patch,
> MAPREDUCE-6870-006.patch, MAPREDUCE-6870-007.patch
[jira] [Commented] (MAPREDUCE-6870) Add configuration for MR job to finish when all reducers are complete (even with unfinished mappers)
[ https://issues.apache.org/jira/browse/MAPREDUCE-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161990#comment-16161990 ]

Andrew Wang commented on MAPREDUCE-6870:
----------------------------------------

Hi Erik, do you mind adding a release note summarizing the incompatibility? It would be nice for our end users.

> Add configuration for MR job to finish when all reducers are complete (even
> with unfinished mappers)
> ---------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6870
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6870
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.6.1
>            Reporter: Zhe Zhang
>            Assignee: Peter Bacsko
>             Fix For: 3.0.0-beta1
>
>         Attachments: MAPREDUCE-6870-001.patch, MAPREDUCE-6870-002.patch,
> MAPREDUCE-6870-003.patch, MAPREDUCE-6870-004.patch, MAPREDUCE-6870-005.patch,
> MAPREDUCE-6870-006.patch, MAPREDUCE-6870-007.patch
[jira] [Commented] (MAPREDUCE-6937) Backport MAPREDUCE-6870 to branch-2 while preserving compatibility
[ https://issues.apache.org/jira/browse/MAPREDUCE-6937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161910#comment-16161910 ]

Erik Krogen commented on MAPREDUCE-6937:
----------------------------------------

Big thanks to [~pbacsko] and [~haibo.chen] for working on this and helping us to backport! It is much appreciated.

> Backport MAPREDUCE-6870 to branch-2 while preserving compatibility
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6937
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6937
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Zhe Zhang
>            Assignee: Peter Bacsko
>             Fix For: 2.9.0, 2.8.2, 2.7.5
>
>         Attachments: MAPREDUCE-6870-branch-2.01.patch,
> MAPREDUCE-6870-branch-2.02.patch, MAPREDUCE-6870-branch-2.7.03.patch,
> MAPREDUCE-6870-branch-2.7.04.patch, MAPREDUCE-6870-branch-2.7.05.patch,
> MAPREDUCE-6870_branch2.7.patch, MAPREDUCE-6870_branch2.7v2.patch,
> MAPREDUCE-6870-branch-2.8.03.patch, MAPREDUCE-6870-branch-2.8.04.patch,
> MAPREDUCE-6870_branch2.8.patch, MAPREDUCE-6870_branch2.8v2.patch
>
> To maintain compatibility we need to disable this by default per discussion
> on MAPREDUCE-6870.
> Using a separate JIRA to correctly track incompatibilities.
[jira] [Updated] (MAPREDUCE-6957) shuffle hangs after a node manager connection timeout
[ https://issues.apache.org/jira/browse/MAPREDUCE-6957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-6957:
----------------------------------
    Target Version/s: 2.9.0, 3.0.0-beta1, 2.7.5, 2.8.3

> shuffle hangs after a node manager connection timeout
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-6957
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6957
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Jooseong Kim
>            Assignee: Jooseong Kim
>         Attachments: MAPREDUCE-6957.001.patch
[jira] [Assigned] (MAPREDUCE-6957) shuffle hangs after a node manager connection timeout
[ https://issues.apache.org/jira/browse/MAPREDUCE-6957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe reassigned MAPREDUCE-6957:
-------------------------------------
    Assignee: Jooseong Kim

Thanks for the report and the patch!

bq. When the fetch succeeds, only the first map output gets committed through ShuffleSchedulerImpl.copySucceeded -> InMemoryMapOutput.commit, because commit() is gated by !finishedMaps\[mapIndex\].

This looks like another latent bug. If, for whatever reason, we try to report a fetch completed for a map that has already completed fetching, then it should call output.abort() so we unreserve the memory. Even with the redundant fetching caused by the double put-back of known map outputs, that unreserve fix would have prevented the merge manager hang.

Would you mind updating the patch to address the missing unreserve? The rest of the patch looks good to me.

> shuffle hangs after a node manager connection timeout
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-6957
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6957
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Jooseong Kim
>            Assignee: Jooseong Kim
>         Attachments: MAPREDUCE-6957.001.patch
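Jason's suggested unreserve can be sketched on a toy model of the scheduler accounting (illustrative names and sizes, not the actual ShuffleSchedulerImpl code): a copySucceeded for an already-finished map aborts its reservation instead of leaking it.

```java
// Toy model: with the suggested fix, a redundant fetch success releases
// its memory reservation via abort() rather than silently falling through.
class MiniScheduler {
    long usedMemory = 0, commitMemory = 0;
    boolean[] finishedMaps = new boolean[8];

    void reserve(long size) { usedMemory += size; }

    void abort(long size) { usedMemory -= size; }   // unreserve

    void copySucceeded(int mapIndex, long size) {
        if (finishedMaps[mapIndex]) {
            abort(size);        // map already fetched: release the duplicate
            return;
        }
        finishedMaps[mapIndex] = true;
        commitMemory += size;
        usedMemory -= size;
    }
}
```

With this guard, even if two fetchers redundantly fetch the same 40-unit output, all reserved memory is returned after both report success, so the merge manager can keep reserving space.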
[jira] [Commented] (MAPREDUCE-6957) shuffle hangs after a node manager connection timeout
[ https://issues.apache.org/jira/browse/MAPREDUCE-6957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161742#comment-16161742 ]

daemeon reiydelle commented on MAPREDUCE-6957:
----------------------------------------------

I always wondered what you were doing buried in the Java code every time I walked up. Thank you for your hard work! It has made supporting Hadoop at scale so much easier.

*Daemeon C.M. Reiydelle*
*San Francisco 1.415.501.0198*
*London 44 020 8144 9872*

On Mon, Sep 11, 2017 at 11:26 AM, Jooseong Kim (JIRA)

> shuffle hangs after a node manager connection timeout
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-6957
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6957
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Jooseong Kim
>         Attachments: MAPREDUCE-6957.001.patch
[jira] [Updated] (MAPREDUCE-6957) shuffle hangs after a node manager connection timeout
[ https://issues.apache.org/jira/browse/MAPREDUCE-6957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jooseong Kim updated MAPREDUCE-6957:
------------------------------------
    Attachment: MAPREDUCE-6957.001.patch

The patch removes the call to putBackKnownMapOutput from openShuffleUrl and leaves only one call site in copyFromHost.

> shuffle hangs after a node manager connection timeout
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-6957
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6957
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Jooseong Kim
>         Attachments: MAPREDUCE-6957.001.patch
[jira] [Created] (MAPREDUCE-6957) shuffle hangs after a node manager connection timeout
Jooseong Kim created MAPREDUCE-6957: --- Summary: shuffle hangs after a node manager connection timeout Key: MAPREDUCE-6957 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6957 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Reporter: Jooseong Kim After a connection failure from the reducer to the node manager, shuffles started to hang with the following message: {code} org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 - MergeManager returned status WAIT ... {code} There are two problems that leads to the hang. Problem 1. When a reducer has an issue connecting to the node manager, copyFromHost may call putBackKnownMapOutput on the same task attempt multiple times. There are two call sites of putBackKnownMapOutput in copyFromHost since MAPREDUCE-6303: 1. In the finally block of copyFromHost 2. In the catch block of openShuffleUrl. When openShuffleUrl fails to connect from the catch block in copyFromHost, it returns null. By the time openShuffleUrl returns null, putBackKnownMapOutput would have been called already for all remaining map outputs. However, the finally block calls putBackKnownMapOutput one more time on the map outputs. Problem 2. Problem 1 causes a leak in MergeManager. The problem occurs when multiple fetchers get the same set of map attempt outputs to fetch. Different fetchers reserves memory from MergeManager in Fetcher.copyMapOutput for the same map outputs. When the fetch succeeds, only the first map output gets committed through ShuffleSchedulerImpl.copySucceeded -> InMemoryMapOutput.commit, because commit() is gated by !finishedMaps[mapIndex]. This may lead to a condition where usedMemory > memoryLimit, while commitMemory < mergeThreshold. This gets the MergeManager into a deadlock where a merge is never triggered while MergeManager cannot reserve additional space for map outputs. 
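Problem 1 above can be sketched in a few lines. This is a hypothetical toy, not the real Fetcher/ShuffleScheduler code: the method name putBackKnownMapOutput comes from the report, but the queue and everything else is invented for illustration. It shows how calling the put-back twice for one failed attempt leaves two copies of the same map output pending, so two fetchers later pick it up and each reserve memory for it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the double put-back described in Problem 1 --
// not the actual Hadoop Fetcher code.
public class PutBackSketch {
    // Pending map outputs waiting to be (re)fetched; no duplicate check,
    // mirroring the behavior the report describes.
    static final Deque<String> pendingMapOutputs = new ArrayDeque<>();

    static void putBackKnownMapOutput(String attempt) {
        pendingMapOutputs.add(attempt);
    }

    public static void main(String[] args) {
        String attempt = "attempt_001";
        // openShuffleUrl's catch block puts the attempt back on failure...
        putBackKnownMapOutput(attempt);
        // ...and copyFromHost's finally block puts it back again.
        putBackKnownMapOutput(attempt);
        // Two fetchers now each dequeue a copy and each reserve memory,
        // but only the first successful copy will ever be committed.
        System.out.println(pendingMapOutputs.size()); // 2 -- duplicated work
    }
}
```

The duplicate entry is what feeds Problem 2: the second fetcher's reservation is never committed, leaking memory in MergeManager.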
[jira] [Commented] (MAPREDUCE-5889) Deprecate FileInputFormat.setInputPaths(Job, String) and FileInputFormat.addInputPaths(Job, String)
[ https://issues.apache.org/jira/browse/MAPREDUCE-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160933#comment-16160933 ]

Hadoop QA commented on MAPREDUCE-5889:
--

| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 8 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 16s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 18s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 33s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 6s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 11m 40s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 11m 40s{color} | {color:red} root generated 1 new + 1283 unchanged - 0 fixed = 1284 total (was 1283) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 2m 21s{color} | {color:red} root: The patch generated 5 new + 1228 unchanged - 24 fixed = 1233 total (was 1252) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 21s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 29s{color} | {color:red} hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core generated 2 new + 2540 unchanged - 0 fixed = 2542 total (was 2540) {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 48s{color} | {color:green} hadoop-mapreduce-client-core in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 98m 14s{color} | {color:green} hadoop-mapreduce-client-jobclient in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 47s{color} | {color:green} hadoop-mapreduce-examples in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 5m 33s{color} | {color:green} hadoop-streaming in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 14m 9s{color} | {color:green} hadoop-gridmix in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 38s{color} | {color:green} hadoop-datajoin in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 39s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}192m 43s{color} | {color:black} {color} |
\\ \\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:71bbb86 |
| JIRA Issue | MAPREDUCE-5889 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12772347/MAPREDUCE-5889.5.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux 05da4cad89ea 3.13.0-119-generic #166-Ub
[jira] [Commented] (MAPREDUCE-6498) ClientServiceDelegate should not retry upon AccessControlException
[ https://issues.apache.org/jira/browse/MAPREDUCE-6498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160850#comment-16160850 ]

Hadoop QA commented on MAPREDUCE-6498:
--

| (/) *{color:green}+1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 12s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s{color} | {color:green} hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient: The patch generated 0 new + 86 unchanged - 4 fixed = 86 total (was 90) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 12s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}118m 27s{color} | {color:green} hadoop-mapreduce-client-jobclient in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 25s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}141m 15s{color} | {color:black} {color} |
\\ \\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:71bbb86 |
| JIRA Issue | MAPREDUCE-6498 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12764388/MAPREDUCE-6498.1.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux fcda4e16e93c 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 722ee84 |
| Default Java | 1.8.0_144 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7127/testReport/ |
| modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7127/console |
| Powered by | Apache Yetus 0.5.0 http://yetus.apache.org |

This message was automatically generated.

> ClientServiceDelegate should not retry upon AccessControlException
> --
>
> Key: MAPREDUCE-6498
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6498
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Peng Zhang
> Assignee: Peng Zhang
>