[jira] [Commented] (MAPREDUCE-5951) Add support for the YARN Shared Cache
[ https://issues.apache.org/jira/browse/MAPREDUCE-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16198205#comment-16198205 ] Ming Ma commented on MAPREDUCE-5951:

+1.

> Add support for the YARN Shared Cache
> -------------------------------------
>
> Key: MAPREDUCE-5951
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5951
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Reporter: Chris Trezzo
> Assignee: Chris Trezzo
> Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5951-Overview.001.pdf, MAPREDUCE-5951-trunk-020.patch, MAPREDUCE-5951-trunk-021.patch, MAPREDUCE-5951-trunk-v1.patch, MAPREDUCE-5951-trunk-v2.patch, MAPREDUCE-5951-trunk-v3.patch, MAPREDUCE-5951-trunk-v4.patch, MAPREDUCE-5951-trunk-v5.patch, MAPREDUCE-5951-trunk-v6.patch, MAPREDUCE-5951-trunk-v7.patch, MAPREDUCE-5951-trunk-v8.patch, MAPREDUCE-5951-trunk-v9.patch, MAPREDUCE-5951-trunk-v10.patch, MAPREDUCE-5951-trunk-v11.patch, MAPREDUCE-5951-trunk-v12.patch, MAPREDUCE-5951-trunk-v13.patch, MAPREDUCE-5951-trunk-v14.patch, MAPREDUCE-5951-trunk-v15.patch, MAPREDUCE-5951-trunk.016.patch, MAPREDUCE-5951-trunk.017.patch, MAPREDUCE-5951-trunk.018.patch, MAPREDUCE-5951-trunk.019.patch
>
> Implement the necessary changes so that the MapReduce application can leverage the new YARN shared cache (i.e. YARN-1492).
> Specifically, allow per-job configuration so that MapReduce jobs can specify which set of resources they would like to cache (i.e. jobjar, libjars, archives, files).

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Commented] (MAPREDUCE-5951) Add support for the YARN Shared Cache
[ https://issues.apache.org/jira/browse/MAPREDUCE-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194044#comment-16194044 ] Ming Ma commented on MAPREDUCE-5951:

Thanks [~ctrezzo]. The code looks good overall. The only question I have at this point is whether any code should be moved from MR to YARN to make it easier for other YARN applications to use the shared cache. For example, other applications might benefit from part of LocalResourceBuilder or from the special handling of fragments.
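The per-job configuration the issue calls for might be expressed as an ordinary job configuration property; the property name and value syntax below are illustrative assumptions for the sake of a sketch, not necessarily the keys committed with the final patch.

```xml
<!-- Hypothetical per-job shared cache setting; the property name and
     comma-separated value list are assumptions, not committed keys. -->
<property>
  <name>mapreduce.job.sharedcache.mode</name>
  <!-- Resource categories this job would like served from the shared cache. -->
  <value>jobjar,libjars,files,archives</value>
</property>
```

A job that wants only its libjars cached would list just that category; omitting the property would leave the shared cache disabled for the job.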
[jira] [Commented] (MAPREDUCE-6829) Add peak memory usage counter for each task
[ https://issues.apache.org/jira/browse/MAPREDUCE-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15979625#comment-15979625 ] Ming Ma commented on MAPREDUCE-6829:

[~miklos.szeg...@cloudera.com] If each MR task can somehow record a reference to its container id, then end users can get the data via taskId -> containerId -> containerUsage. Of course, such an approach is only useful if we expect more container metrics to be added at the YARN layer, so that frameworks like MR can pick up the new metrics automatically.

> Add peak memory usage counter for each task
> -------------------------------------------
>
> Key: MAPREDUCE-6829
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6829
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mrv2
> Reporter: Yufei Gu
> Assignee: Miklos Szegedi
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: MAPREDUCE-6829.000.patch, MAPREDUCE-6829.001.patch, MAPREDUCE-6829.002.patch, MAPREDUCE-6829.003.patch, MAPREDUCE-6829.004.patch, MAPREDUCE-6829.005.patch
>
> Each task has counters PHYSICAL_MEMORY_BYTES and VIRTUAL_MEMORY_BYTES, which are snapshots of the memory usage of that task. They are not sufficient for users to understand the peak memory usage of that task, e.g. in order to diagnose task failures, tune job parameters or change application design. This new feature will add two more counters for each task: PHYSICAL_MEMORY_BYTES_MAX and VIRTUAL_MEMORY_BYTES_MAX.
> This JIRA has the same feature as MAPREDUCE-4710. I filed this new JIRA since MAPREDUCE-4710 is a pretty old one from the MR 1.x era; it more or less assumes a branch-1 architecture and should be closed at this point.
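The distinction between the existing snapshot counters and the proposed `*_MAX` counters can be sketched minimally: a snapshot alone can miss the peak, while a running maximum over the same snapshots captures it. The class and method names below are illustrative, not the actual counter implementation.

```java
// Sketch: PHYSICAL_MEMORY_BYTES is a point-in-time snapshot, so it can
// miss the peak; the proposed PHYSICAL_MEMORY_BYTES_MAX is just a
// running maximum over those snapshots. Names are illustrative.
public class PeakCounterSketch {
    private long physicalMemoryBytes;     // latest snapshot
    private long physicalMemoryBytesMax;  // peak observed so far

    public void updateSnapshot(long bytes) {
        physicalMemoryBytes = bytes;
        physicalMemoryBytesMax = Math.max(physicalMemoryBytesMax, bytes);
    }

    public long snapshot() { return physicalMemoryBytes; }
    public long peak() { return physicalMemoryBytesMax; }

    public static void main(String[] args) {
        PeakCounterSketch c = new PeakCounterSketch();
        c.updateSnapshot(100);
        c.updateSnapshot(500); // transient spike
        c.updateSnapshot(200);
        System.out.println("snapshot=" + c.snapshot() + " peak=" + c.peak());
    }
}
```

If the final snapshot happens after a transient spike, only the running maximum reflects the true high-water mark.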
[jira] [Commented] (MAPREDUCE-6829) Add peak memory usage counter for each task
[ https://issues.apache.org/jira/browse/MAPREDUCE-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954401#comment-15954401 ] Ming Ma commented on MAPREDUCE-6829:

With YARN-3045, is this still necessary? Container-level metrics like this seem quite useful for frameworks other than MR, and they are something YARN can provide if it hasn't been done already.
[jira] [Commented] (MAPREDUCE-6846) Fragments specified for libjar paths are not handled correctly
[ https://issues.apache.org/jira/browse/MAPREDUCE-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951332#comment-15951332 ] Ming Ma commented on MAPREDUCE-6846:

+1. I will wait until EOD to commit in case [~dan...@cloudera.com], [~jlowe] and [~sjlee0] have other suggestions.

> Fragments specified for libjar paths are not handled correctly
> --------------------------------------------------------------
>
> Key: MAPREDUCE-6846
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6846
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.6.0, 2.7.3, 3.0.0-alpha2
> Reporter: Chris Trezzo
> Assignee: Chris Trezzo
> Priority: Minor
> Attachments: MAPREDUCE-6846-trunk.001.patch, MAPREDUCE-6846-trunk.002.patch, MAPREDUCE-6846-trunk.003.patch, MAPREDUCE-6846-trunk.004.patch, MAPREDUCE-6846-trunk.005.patch
>
> If a user specifies a fragment for a libjars path via the generic options parser, the client crashes with a FileNotFoundException:
> {noformat}
> java.io.FileNotFoundException: File file:/home/mapred/test.txt#testFrag.txt does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:638)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:864)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:628)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:363)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:314)
>         at org.apache.hadoop.mapreduce.JobResourceUploader.copyRemoteFiles(JobResourceUploader.java:387)
>         at org.apache.hadoop.mapreduce.JobResourceUploader.uploadLibJars(JobResourceUploader.java:154)
>         at org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:105)
>         at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:102)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:197)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1344)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1892)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1341)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1362)
>         at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)
>         at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:359)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:367)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
>         at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
>         at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> {noformat}
> This is actually inconsistent with the behavior for files and archives. Here is a table showing the current behavior for each type of path and resource:
> || || Qualified path (i.e. file://home/mapred/test.txt#frag.txt) || Absolute path (i.e. /home/mapred/test.txt#frag.txt) || Relative path (i.e. test.txt#frag.txt) ||
> || -libjars | FileNotFound | FileNotFound | FileNotFound |
> || -files | (/) | (/) | (/) |
> || -archives | (/) | (/) | (/) |
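The `#testFrag.txt` suffix in the crash above is a standard URI fragment (the client-side link name for the localized resource); the failure comes from passing the whole string, fragment included, to the local file system. A minimal stdlib sketch of the split the uploader has to perform before stat-ing the file (the helper is illustrative, not the actual JobResourceUploader code):

```java
import java.net.URI;

public class FragmentDemo {
    // Split "path#linkName" into the underlying path (what must exist
    // on disk) and the fragment (the name the resource is linked as).
    public static String[] splitFragment(String pathWithFragment) {
        URI uri = URI.create(pathWithFragment);
        return new String[] { uri.getPath(), uri.getFragment() };
    }

    public static void main(String[] args) {
        String[] parts = splitFragment("file:/home/mapred/test.txt#testFrag.txt");
        System.out.println("path=" + parts[0] + " link=" + parts[1]);
    }
}
```

Stat-ing `parts[0]` succeeds where stat-ing the raw string fails, which is exactly the inconsistency the table above shows between -libjars and -files/-archives.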
[jira] [Commented] (MAPREDUCE-6846) Fragments specified for libjar paths are not handled correctly
[ https://issues.apache.org/jira/browse/MAPREDUCE-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950109#comment-15950109 ] Ming Ma commented on MAPREDUCE-6846:

Overall looks good. Thanks [~ctrezzo] for working on this. Any idea whether the {{DistributedCache.addCacheFile}} call in the following block is necessary?

{noformat}
if (useWildcard && !foundFragment) {
  ...
} else {
  for (URI uri : libjarURIs) {
    DistributedCache.addCacheFile(uri, conf);
  }
}
{noformat}
[jira] [Updated] (MAPREDUCE-6862) Fragments are not handled correctly by resource limit checking
[ https://issues.apache.org/jira/browse/MAPREDUCE-6862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-6862:
-------------------------------
    Resolution: Fixed
    Hadoop Flags: Reviewed
    Fix Version/s: 3.0.0-alpha3
                   2.9.0
    Status: Resolved (was: Patch Available)

+1. Committed to trunk and branch-2. [~ctrezzo] thanks for the contribution!

> Fragments are not handled correctly by resource limit checking
> --------------------------------------------------------------
>
> Key: MAPREDUCE-6862
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6862
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.9.0, 3.0.0-alpha1
> Reporter: Chris Trezzo
> Assignee: Chris Trezzo
> Priority: Minor
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: MAPREDUCE-6862-trunk.001.patch
>
> If a user specifies a fragment for a libjars, files or archives path via the generic options parser and resource limit checking is enabled, the client crashes with a FileNotFoundException:
> {noformat}
> java.io.FileNotFoundException: File file:/home/mapred/test.txt#testFrag.txt does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:638)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:864)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:628)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
>         at org.apache.hadoop.mapreduce.JobResourceUploader.getFileStatus(JobResourceUploader.java:413)
>         at org.apache.hadoop.mapreduce.JobResourceUploader.explorePath(JobResourceUploader.java:395)
>         at org.apache.hadoop.mapreduce.JobResourceUploader.checkLocalizationLimits(JobResourceUploader.java:304)
>         at org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:103)
>         at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:102)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:197)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1344)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1892)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1341)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1362)
>         at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)
>         at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:359)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:367)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
>         at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
>         at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> {noformat}
[jira] [Updated] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them
[ https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5044:
-------------------------------
    Resolution: Fixed
    Hadoop Flags: Reviewed
    Fix Version/s: 2.8.0
    Status: Resolved (was: Patch Available)

> Have AM trigger jstack on task attempts that timeout before killing them
> -------------------------------------------------------------------------
>
> Key: MAPREDUCE-5044
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mr-am
> Affects Versions: 2.1.0-beta
> Reporter: Jason Lowe
> Assignee: Eric Payne
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-5044.008.patch, MAPREDUCE-5044.009.patch, MAPREDUCE-5044.010.patch, MAPREDUCE-5044.011.patch, MAPREDUCE-5044.012.patch, MAPREDUCE-5044.013.patch, MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch, MAPREDUCE-5044.v03.patch, MAPREDUCE-5044.v04.patch, MAPREDUCE-5044.v05.patch, MAPREDUCE-5044.v06.patch, MAPREDUCE-5044.v07.local.patch, Screen Shot 2013-11-12 at 1.05.32 PM.png, Screen Shot 2013-11-12 at 1.06.04 PM.png
>
> When an AM expires a task attempt, it would be nice if it triggered a jstack output via SIGQUIT before killing the task attempt. This would be invaluable for helping users debug their hung tasks, especially if they do not have shell access to the nodes.
[jira] [Commented] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them
[ https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317343#comment-15317343 ] Ming Ma commented on MAPREDUCE-5044:

I have committed the patch to trunk, branch-2 and branch-2.8. Thank you [~eepayne] and [~jira.shegalov] for the contribution, and [~vinodkv], [~jlowe] and [~aw] for the review.
[jira] [Commented] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them
[ https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317271#comment-15317271 ] Ming Ma commented on MAPREDUCE-5044:

+1 on the latest patch. Thanks [~eepayne]. The patch doesn't apply cleanly to branch-2 and branch-2.8, but the conflicts are straightforward and I will resolve them for those two branches.
[jira] [Commented] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them
[ https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296554#comment-15296554 ] Ming Ma commented on MAPREDUCE-5044:

Thanks [~eepayne]. Besides the checkstyle, whitespace and javadoc issues:
* There is some commented-out code left over after the function was moved to {{internalSignalToContainer}}.
* Given that {{signalContainer}} was renamed to {{signalToContainer}} for ContainerManagementProtocol, it may be better to fix that for ApplicationClientProtocol as well, as long as we agree to include this patch in 2.8.

Otherwise, it looks good overall.
[jira] [Commented] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them
[ https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15292038#comment-15292038 ] Ming Ma commented on MAPREDUCE-5044:

bq. In that case, do we want to call it something like signalsToContainers?

Sounds good. signalsToContainers could take an array of {{SignalContainerRequest}}, each of which has a list of commands belonging to the same container. When we decide to add signalsToContainers later, we can deprecate signalToContainer; the NM will still support signalToContainer until a major upgrade. That way, we don't need to fix the {{required}} issue, given that only the new signalsToContainers method will use the list-based {{SignalContainerRequest}}.
[jira] [Commented] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them
[ https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290193#comment-15290193 ] Ming Ma commented on MAPREDUCE-5044:

[~eepayne] I agree with your suggestion. Let us postpone it to a later time.
* {{signalContainers}} was initially suggested as an ordered list of {{signalContainer}} requests, so it could include requests for the same container or for different containers. It is true that the only use case we know of so far involves requests for the same container.
* We also discussed introducing other commands besides Linux signals, for example a sleep command used to pause between signals. In that case, the new API could look like:
{noformat}
public static SignalContainerRequest newInstance(ContainerId containerId,
    Iterable<SignalContainerCommand> signals) {
  ...
}
{noformat}
* Will the {{required}} in the protocol buffer definition create any issue if we do a rolling upgrade from 2.8 to 2.9 and the 2.9 MR AM sends a list of SignalContainerCommandProto to a 2.8 NM? Maybe the 2.8 NM just discards the message; not a big deal. Regardless, that is a separate issue that we don't need to address here.
{noformat}
message SignalContainerRequestProto {
  required SignalContainerCommandProto command = 2;
}
{noformat}
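The batched, list-based request shape discussed above can be sketched as a plain value type. Everything here is a hypothetical illustration of the proposal, not the committed YARN API; only the command names mirror the ones discussed in this thread.

```java
import java.util.List;

// Hypothetical sketch of the batched signal request under discussion;
// class and accessor names are assumptions, not the committed YARN API.
public class SignalBatchSketch {

    public enum SignalContainerCommand {
        OUTPUT_THREAD_DUMP, GRACEFUL_SHUTDOWN, FORCEFUL_SHUTDOWN
    }

    // One request carries an ordered command list for a single container,
    // e.g. [OUTPUT_THREAD_DUMP, GRACEFUL_SHUTDOWN] for "dump then stop".
    public static final class SignalContainerRequest {
        private final String containerId;
        private final List<SignalContainerCommand> commands;

        public SignalContainerRequest(String containerId,
                                      List<SignalContainerCommand> commands) {
            this.containerId = containerId;
            this.commands = commands;
        }

        public String getContainerId() { return containerId; }
        public List<SignalContainerCommand> getCommands() { return commands; }
    }

    public static void main(String[] args) {
        SignalContainerRequest r = new SignalContainerRequest(
            "container_1",
            List.of(SignalContainerCommand.OUTPUT_THREAD_DUMP,
                    SignalContainerCommand.GRACEFUL_SHUTDOWN));
        System.out.println(r.getContainerId() + " " + r.getCommands());
    }
}
```

Ordering inside the list is what makes sequences like "SIGTERM + delay + SIGKILL" expressible once a sleep-style command is added.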
[jira] [Commented] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them
[ https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285894#comment-15285894 ] Ming Ma commented on MAPREDUCE-5044: [~eepayne], my apologies for the delay. * There was some discussion about combining signalContainer and stopContainers so that stopContainer is just a special case of signalContainer. And to support the "SIGTERM + delay + SIGKILL" used in stopContainers, we then need an ordered list of commands, thus the need for signalContainers. We don't need to deal with that at this point. But it might be useful to rename signalContainer to signalContainers so that we don't need to modify the API later, which means some new structure like {{SignalContainersRequest}}. What is your take? * ContainerManagerImpl. It might be cleaner to abstract the common signal-container code into a function used for both the {{AM -> NM}} and {{RM -> NM}} cases. * TaskAttemptImpl#PreemptedTransition. Given it is called only when the attempt is preempted, {{event.getType() == TaskAttemptEventType.TA_TIMED_OUT}} can be replaced by {{false}}. * It would be useful to add a new end-to-end unit test; an example can be found in Gera's original patch. * Nit: ContainerLauncherImpl. The return value of {{getContainerManagementProtocol().signalContainer}} isn't used and can be removed. * Nit: ContainerLauncherEvent has an indent format issue. 
> Have AM trigger jstack on task attempts that timeout before killing them > > > Key: MAPREDUCE-5044 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am >Affects Versions: 2.1.0-beta >Reporter: Jason Lowe >Assignee: Gera Shegalov > Attachments: MAPREDUCE-5044.008.patch, MAPREDUCE-5044.009.patch, > MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch, MAPREDUCE-5044.v03.patch, > MAPREDUCE-5044.v04.patch, MAPREDUCE-5044.v05.patch, MAPREDUCE-5044.v06.patch, > MAPREDUCE-5044.v07.local.patch, Screen Shot 2013-11-12 at 1.05.32 PM.png, > Screen Shot 2013-11-12 at 1.06.04 PM.png > > > When an AM expires a task attempt it would be nice if it triggered a jstack > output via SIGQUIT before killing the task attempt. This would be invaluable > for helping users debug their hung tasks, especially if they do not have > shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Created] (MAPREDUCE-6694) Make AM more resilient to potential loss of any completed container notification
Ming Ma created MAPREDUCE-6694: -- Summary: Make AM more resilient to potential loss of any completed container notification Key: MAPREDUCE-6694 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6694 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Ming Ma YARN tries to guarantee that any completed-container notification is delivered to the AM under any circumstance; YARN-1372 is an example that makes sure of this for the RM-restart case. However, in some corner cases it is still possible for a completed-container notification to be lost or significantly delayed, for example if the NM host dies while the RM fails over. The AM won't preempt reducers if it thinks there is at least one mapper running. {noformat} void preemptReducesIfNeeded() { ... if (assignedRequests.maps.size() > 0) { // there are assigned mappers return; } ... } {noformat} Instead of completely depending on notifications from the RM, the AM can use TaskUmbilicalProtocol to help decide whether any mapper is running. That will make the AM more resilient to any bugs in YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
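The cross-check proposed above can be sketched with plain Java types: before concluding that a mapper is still running, also require a recent task heartbeat of the kind that would arrive over TaskUmbilicalProtocol. All class and attempt-id names here are illustrative, not the real MR AM implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: track the last heartbeat per map attempt and treat a
// mapper as running only if it pinged recently. A mapper whose
// completed-container notification was lost stops pinging and therefore
// eventually stops blocking reducer preemption.
public class MapperLiveness {
  private final Map<String, Long> lastPingMs = new HashMap<>();
  private final long timeoutMs;

  MapperLiveness(long timeoutMs) { this.timeoutMs = timeoutMs; }

  // Called on each umbilical heartbeat from a map attempt.
  void recordPing(String attemptId, long nowMs) {
    lastPingMs.put(attemptId, nowMs);
  }

  // True only if some assigned mapper has pinged within the timeout window.
  boolean anyMapperRunning(Iterable<String> assignedMaps, long nowMs) {
    for (String id : assignedMaps) {
      Long t = lastPingMs.get(id);
      if (t != null && nowMs - t <= timeoutMs) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    MapperLiveness ml = new MapperLiveness(60_000);
    ml.recordPing("attempt_m_000001", 0);
    System.out.println(ml.anyMapperRunning(Arrays.asList("attempt_m_000001"), 30_000));
    System.out.println(ml.anyMapperRunning(Arrays.asList("attempt_m_000001"), 120_000));
  }
}
```

A check like this would replace the bare {{assignedRequests.maps.size() > 0}} test quoted above with one that degrades gracefully when a notification is lost.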
[jira] [Commented] (MAPREDUCE-6315) Implement retrieval of logs for crashed MR-AM via jhist in the staging directory
[ https://issues.apache.org/jira/browse/MAPREDUCE-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256607#comment-15256607 ] Ming Ma commented on MAPREDUCE-6315: Thanks [~jira.shegalov]! We definitely need to support this scenario. There was some discussion about providing clean-up functionality in YARN-2261 and MAPREDUCE-4428 so that history files can be moved to the final location properly. But it isn't clear when we plan to provide such functionality at the YARN layer, and it seems like a larger effort. The patch here is more targeted and can take care of the issue we have had, at least until the end-to-end clean-up functionality is available. What do you think? Specific comments for the jira: * This will enable "mapred job -logs" usage. How about the "jobhistory URL http://./jobhistory/job/job_ returns job not found" scenario: is it easy to add the redirect at that level? * globStatus might return an empty list. So it might be better to change from {{if (jhStats != null) }} to something like {{if (jhStats != null && jhStats.length > 0)}}. * Is the output format change in JobHistoryParser required? Wonder if there is any backward compatibility issue if some tools make assumptions about this. > Implement retrieval of logs for crashed MR-AM via jhist in the staging > directory > > > Key: MAPREDUCE-6315 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6315 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: client, mr-am >Affects Versions: 2.7.0 >Reporter: Gera Shegalov >Assignee: Gera Shegalov >Priority: Critical > Labels: BB2015-05-TBR > Attachments: MAPREDUCE-6315.001.patch, MAPREDUCE-6315.002.patch, > MAPREDUCE-6315.003.patch > > > When all AM attempts crash, there is no record of them in JHS. Thus no easy > way to get the logs. This JIRA automates the procedure by utilizing the jhist > file in the staging directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
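The globStatus review point above boils down to a two-part guard. In this sketch a plain {{Object[]}} stands in for the {{FileStatus[]}} a real listing call would return:

```java
// Sketch of the review point above: globStatus-style listing calls can
// return null or an empty array, so both cases must be guarded before the
// result is dereferenced.
public class GlobGuard {
  static boolean hasEntries(Object[] stats) {
    return stats != null && stats.length > 0;
  }

  public static void main(String[] args) {
    System.out.println(hasEntries(null));                          // no listing at all
    System.out.println(hasEntries(new Object[0]));                 // glob matched nothing
    System.out.println(hasEntries(new Object[] { "job_1.jhist" })); // usable result
  }
}
```

Checking only for null, as in {{if (jhStats != null)}}, would let the empty-match case fall through and dereference element 0 of an empty array.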
[jira] [Commented] (MAPREDUCE-6660) Add MR Counters for bytes-read-by-network-distance FileSystem metrics
[ https://issues.apache.org/jira/browse/MAPREDUCE-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243982#comment-15243982 ] Ming Ma commented on MAPREDUCE-6660: Thank you [~djp] and [~liuml07]! Sure, let me update the unit test. Regarding the rename to BYTES_READ_LOCAL_DATACENTER, etc., it seems reasonable as it covers the common scenario; however, it might not be general enough to cover all sorts of topologies. For a large cluster with 4 tiers, /edge router/core switch/TOR/local machine, BYTES_READ_SECOND_OR_MORE_DEGREE_REMOTE_RACK might mean BYTES_READ_LOCAL_DATACENTER. What do you think? > Add MR Counters for bytes-read-by-network-distance FileSystem metrics > - > > Key: MAPREDUCE-6660 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6660 > Project: Hadoop Map/Reduce > Issue Type: New Feature >Reporter: Ming Ma >Assignee: Ming Ma > Attachments: MAPREDUCE-6660.patch, MAPREDUCE-6660.png > > > This is the MR part of the change which is to consume > bytes-read-by-network-distance metrics generated by > https://issues.apache.org/jira/browse/HDFS-9579. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6660) Add MR Counters for bytes-read-by-network-distance FileSystem metrics
[ https://issues.apache.org/jira/browse/MAPREDUCE-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-6660: --- Assignee: Ming Ma Status: Patch Available (was: Open) > Add MR Counters for bytes-read-by-network-distance FileSystem metrics > - > > Key: MAPREDUCE-6660 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6660 > Project: Hadoop Map/Reduce > Issue Type: New Feature >Reporter: Ming Ma >Assignee: Ming Ma > Attachments: MAPREDUCE-6660.patch, MAPREDUCE-6660.png > > > This is the MR part of the change which is to consume > bytes-read-by-network-distance metrics generated by > https://issues.apache.org/jira/browse/HDFS-9579. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6660) Add MR Counters for bytes-read-by-network-distance FileSystem metrics
[ https://issues.apache.org/jira/browse/MAPREDUCE-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-6660: --- Attachment: MAPREDUCE-6660.png MAPREDUCE-6660.patch Here is the draft patch and the MR webUI. The webUI becomes somewhat busy, given these new counters will be created for each FileSystem. We can consider skipping the rows whose values are zero, if there is no compatibility issue here. > Add MR Counters for bytes-read-by-network-distance FileSystem metrics > - > > Key: MAPREDUCE-6660 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6660 > Project: Hadoop Map/Reduce > Issue Type: New Feature >Reporter: Ming Ma > Attachments: MAPREDUCE-6660.patch, MAPREDUCE-6660.png > > > This is the MR part of the change which is to consume > bytes-read-by-network-distance metrics generated by > https://issues.apache.org/jira/browse/HDFS-9579. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAPREDUCE-6660) Add MR Counters for bytes-read-by-network-distance FileSystem metrics
Ming Ma created MAPREDUCE-6660: -- Summary: Add MR Counters for bytes-read-by-network-distance FileSystem metrics Key: MAPREDUCE-6660 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6660 Project: Hadoop Map/Reduce Issue Type: New Feature Reporter: Ming Ma This is the MR part of the change which is to consume bytes-read-by-network-distance metrics generated by https://issues.apache.org/jira/browse/HDFS-9579. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAPREDUCE-6456) Support configurable log aggregation policy
Ming Ma created MAPREDUCE-6456: -- Summary: Support configurable log aggregation policy Key: MAPREDUCE-6456 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6456 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Ming Ma YARN-221 provides a way for a YARN application to specify log aggregation policy via LogAggregationContext. This jira covers the necessary changes in MR to use that feature so that any MR job can specify its log aggregation policy via job configuration. That includes: * Have MR define its own configurations to configure these policies. * Make code changes in YarnRunner to retrieve these configurations and set the values via LogAggregationContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-5762) Port MAPREDUCE-3223 and MAPREDUCE-4695 (Remove MRv1 config from mapred-default.xml) to branch-2
[ https://issues.apache.org/jira/browse/MAPREDUCE-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628434#comment-14628434 ] Ming Ma commented on MAPREDUCE-5762: +1. Thanks [~ajisakaa]. > Port MAPREDUCE-3223 and MAPREDUCE-4695 (Remove MRv1 config from > mapred-default.xml) to branch-2 > --- > > Key: MAPREDUCE-5762 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5762 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: documentation >Affects Versions: 2.3.0 >Reporter: Akira AJISAKA >Assignee: Akira AJISAKA >Priority: Minor > Attachments: MAPREDUCE-5762-branch-2-002.patch, > MAPREDUCE-5762-branch-2.03.patch, MAPREDUCE-5762-branch-2.patch, > MAPREDUCE-5762.003.branch-2.patch > > > MRv1 configs are removed in trunk, but they are not removed in branch-2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-5762) Port MAPREDUCE-3223 and MAPREDUCE-4695 (Remove MRv1 config from mapred-default.xml) to branch-2
[ https://issues.apache.org/jira/browse/MAPREDUCE-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542203#comment-14542203 ] Ming Ma commented on MAPREDUCE-5762: LGTM. > Port MAPREDUCE-3223 and MAPREDUCE-4695 (Remove MRv1 config from > mapred-default.xml) to branch-2 > --- > > Key: MAPREDUCE-5762 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5762 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: documentation >Affects Versions: 2.3.0 >Reporter: Akira AJISAKA >Assignee: Akira AJISAKA >Priority: Minor > Attachments: MAPREDUCE-5762-branch-2-002.patch, > MAPREDUCE-5762-branch-2.03.patch, MAPREDUCE-5762-branch-2.patch, > MAPREDUCE-5762.003.branch-2.patch > > > MRv1 configs are removed in trunk, but they are not removed in branch-2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-5762) Port MAPREDUCE-3223 (Remove MRv1 config from mapred-default.xml) to branch-2
[ https://issues.apache.org/jira/browse/MAPREDUCE-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540199#comment-14540199 ] Ming Ma commented on MAPREDUCE-5762: [~ajisakaa], here is an example of test failures. The property is defined in trunk. {noformat} Running org.apache.hadoop.mapreduce.task.reduce.TestMerger Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 1.625 sec <<< FAILURE! - in org.apache.hadoop.mapreduce.task.reduce.TestMerger testEncryptedMerger(org.apache.hadoop.mapreduce.task.reduce.TestMerger) Time elapsed: 0.092 sec <<< ERROR! java.lang.NullPointerException: null at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:268) at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131) at org.apache.hadoop.mapred.MROutputFiles.getInputFileForWrite(MROutputFiles.java:206) at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl$InMemoryMerger.merge(MergeManagerImpl.java:459) at org.apache.hadoop.mapreduce.task.reduce.TestMerger.testInMemoryAndOnDiskMerger(TestMerger.java:136) at org.apache.hadoop.mapreduce.task.reduce.TestMerger.testEncryptedMerger(TestMerger.java:92) {noformat} > Port MAPREDUCE-3223 (Remove MRv1 config from mapred-default.xml) to branch-2 > > > Key: MAPREDUCE-5762 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5762 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: documentation >Affects Versions: 2.3.0 >Reporter: Akira AJISAKA >Assignee: Akira AJISAKA >Priority: Minor > Fix For: 2.8.0 > > Attachments: MAPREDUCE-5762-branch-2-002.patch, > MAPREDUCE-5762-branch-2.patch > > > MRv1 configs are removed in trunk, but they are not removed in branch-2. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-5465) Tasks are often killed before they exit on their own
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538833#comment-14538833 ] Ming Ma commented on MAPREDUCE-5465: Thanks Jason and Ray. > Tasks are often killed before they exit on their own > > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Fix For: 2.8.0 > > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465-9.patch, > MAPREDUCE-5465-branch-2.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-5762) Port MAPREDUCE-3223 (Remove MRv1 config from mapred-default.xml) to branch-2
[ https://issues.apache.org/jira/browse/MAPREDUCE-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538608#comment-14538608 ] Ming Ma commented on MAPREDUCE-5762: [~ajisakaa] it looks like mapreduce.cluster.local.dir was removed as part of this and caused some branch-2 unit tests to fail. > Port MAPREDUCE-3223 (Remove MRv1 config from mapred-default.xml) to branch-2 > > > Key: MAPREDUCE-5762 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5762 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: documentation >Affects Versions: 2.3.0 >Reporter: Akira AJISAKA >Assignee: Akira AJISAKA >Priority: Minor > Fix For: 2.8.0 > > Attachments: MAPREDUCE-5762-branch-2-002.patch, > MAPREDUCE-5762-branch-2.patch > > > MRv1 configs are removed in trunk, but they are not removed in branch-2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
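The failure mode described above can be illustrated with a small sketch, where a plain {{Map}} stands in for a Hadoop {{Configuration}}. The key name comes from the comment above; the fallback directory value is made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: once mapreduce.cluster.local.dir loses its
// mapred-default.xml entry, a lookup returns null, and downstream code that
// splits the value (as LocalDirAllocator effectively does) hits an NPE.
public class MissingDefaultDemo {
  static String[] localDirs(Map<String, String> conf) {
    String v = conf.get("mapreduce.cluster.local.dir");
    if (v == null) {
      // Defensive alternative to the NPE seen in TestMerger: fall back to an
      // explicit default instead of dereferencing null. The path here is
      // only an example value.
      v = "/tmp/mapred/local";
    }
    return v.split(",");
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>(); // key absent, as in branch-2
    System.out.println(localDirs(conf)[0]);
  }
}
```

The actual fix for the JIRA was to keep the property defined rather than guard every caller, but the sketch shows why removing the default broke the branch-2 tests.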
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Attachment: MAPREDUCE-5465-branch-2.patch Thanks [~jlowe] for spending time on this. For the handling of TA_KILL event when TA is in SUCCESS_CONTAINER_CLEANUP, do you mean https://issues.apache.org/jira/browse/MAPREDUCE-5776? If so, we can address the issue in that jira. Here is the patch for branch-2. > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465-9.patch, > MAPREDUCE-5465-branch-2.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Labels: BB2015-05-RFC (was: BB2015-05-TBR) > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Labels: BB2015-05-RFC > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465-9.patch, > MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Attachment: MAPREDUCE-5465-9.patch Ray, here is the rebased patch. > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465-9.patch, > MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378161#comment-14378161 ] Ming Ma commented on MAPREDUCE-5465: [~rchiang], thanks for looking into this. SUCCESS_CONTAINER_CLEANUP can be transitioned from SUCCESS_FINISHING_CONTAINER. For ExitFinishingOnTimeoutTransition, you can search for FINISHING_ON_TIMEOUT_TRANSITION. We have been running a slightly different version of this patch in our production clusters for a while. I can rebase the patch for trunk if people are interested in it. > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MAPREDUCE-6135) Job staging directory remains if MRAppMaster is OOM
[ https://issues.apache.org/jira/browse/MAPREDUCE-6135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma resolved MAPREDUCE-6135. Resolution: Duplicate Thanks, Jason. Resolving this as a dup. Will continue the discussion over at MAPREDUCE-5502. It looks like Robert in MAPREDUCE-4428 also mentioned the approach of re-running the AM for cleanup. > Job staging directory remains if MRAppMaster is OOM > --- > > Key: MAPREDUCE-6135 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6135 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Ming Ma > > If MRAppMaster attempts run out of memory, the job won't go through the normal > job clean-up process to move history files to the history server location. When > customers try to find out why the job failed, the data won't be available on the > history server webUI. > The workaround is to extract the container id and NM id from the jhist file > in the job staging directory, then use the "yarn logs" command to get the AM logs. > It would be great if the platform could take care of this by moving these hist > files automatically to the history server when AM attempts don't exit properly. > We would like to discuss ideas on how to address this and get suggestions from > others. Not sure if the timeline server design covers this scenario. > 1. Define some protocol for YARN to tell the AppMaster "you have exceeded AM max > attempts, please clean up". For example, YARN can launch the AppMaster one more > time after AM max attempts and MRAppMaster uses that as the indication that this > is a clean-up-only attempt. > 2. Have some program periodically check job statuses and move files from the job > staging directory to the history server for those finished jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAPREDUCE-6135) Job staging directory remains if MRAppMaster is OOM
Ming Ma created MAPREDUCE-6135: -- Summary: Job staging directory remains if MRAppMaster is OOM Key: MAPREDUCE-6135 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6135 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Ming Ma If MRAppMaster attempts run out of memory, the job won't go through the normal job clean-up process to move history files to the history server location. When customers try to find out why the job failed, the data won't be available on the history server webUI. The workaround is to extract the container id and NM id from the jhist file in the job staging directory, then use the "yarn logs" command to get the AM logs. It would be great if the platform could take care of this by moving these hist files automatically to the history server when AM attempts don't exit properly. We would like to discuss ideas on how to address this and get suggestions from others. Not sure if the timeline server design covers this scenario. 1. Define some protocol for YARN to tell the AppMaster "you have exceeded AM max attempts, please clean up". For example, YARN can launch the AppMaster one more time after AM max attempts and MRAppMaster uses that as the indication that this is a clean-up-only attempt. 2. Have some program periodically check job statuses and move files from the job staging directory to the history server for those finished jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts
[ https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139518#comment-14139518 ] Ming Ma commented on MAPREDUCE-5891: Junping, Jason, the patch looks good to me. > Improved shuffle error handling across NM restarts > -- > > Key: MAPREDUCE-5891 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Junping Du > Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, > MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891-v5.patch, > MAPREDUCE-5891-v6.patch, MAPREDUCE-5891.patch > > > To minimize the number of map fetch failures reported by reducers across an > NM restart it would be nice if reducers only reported a fetch failure after > trying for at specified period of time to retrieve the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts
[ https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127212#comment-14127212 ] Ming Ma commented on MAPREDUCE-5891: The patch looks good. I like Jason's idea to have mapreduce.reduce.shuffle.fetch.retry.enabled use ${yarn.nodemanager.recovery.enabled} as the default value. As for the other approaches: a) dynamic MR-to-YARN query. Given the NM recovery flag is a global cluster-level setting (although it is possible to configure it on a per-NM basis), can we derive the value of mapreduce.reduce.shuffle.fetch.retry.enabled at job submission time from some YARN API call to the RM? b) shuffle protocol change. It seems Fetcher and ShuffleHandler check HTTP headers via property key names. So if we add a new property to indicate whether recovery is supported and keep the same HTTP "version" property, a new version of the fetcher might be able to work with an old version of the shuffle handler, and vice versa. > Improved shuffle error handling across NM restarts > -- > > Key: MAPREDUCE-5891 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Junping Du > Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, > MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch > > > To minimize the number of map fetch failures reported by reducers across an > NM restart it would be nice if reducers only reported a fetch failure after > trying for at specified period of time to retrieve the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
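The default-value idea discussed above can be sketched as follows. A plain {{Map}} stands in for a Hadoop {{Configuration}} with {{${...}}} substitution; only the two property names are taken from the discussion, the helper itself is illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: if the job does not set the fetch-retry flag
// explicitly, fall back to the NM recovery flag, so fetchers only retry
// where a restarted NM can actually serve map output again.
public class FetchRetryDefault {
  static final String RETRY_KEY = "mapreduce.reduce.shuffle.fetch.retry.enabled";
  static final String NM_RECOVERY_KEY = "yarn.nodemanager.recovery.enabled";

  static boolean fetchRetryEnabled(Map<String, String> conf) {
    String explicit = conf.get(RETRY_KEY);
    if (explicit != null) {
      return Boolean.parseBoolean(explicit); // job-level override wins
    }
    // Otherwise inherit the cluster's NM recovery setting (false by default).
    return Boolean.parseBoolean(conf.getOrDefault(NM_RECOVERY_KEY, "false"));
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(NM_RECOVERY_KEY, "true");
    System.out.println(fetchRetryEnabled(conf)); // follows the NM flag
    conf.put(RETRY_KEY, "false");
    System.out.println(fetchRetryEnabled(conf)); // explicit setting wins
  }
}
```

This removes the mismatch called out in the thread, where retries default to on even though the shuffle handler cannot serve data after a restart.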
[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts
[ https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126643#comment-14126643 ] Ming Ma commented on MAPREDUCE-5891: Thanks, Junping. Regarding the default value, mapreduce.reduce.shuffle.fetch.retry.enabled is set to true by default, while NM recovery is set to false by default. That means by default the fetcher will retry even though the shuffle handler won't be able to serve map outputs after a restart. It doesn't seem like a big deal; just want to confirm that is intentional. Do we foresee other scenarios where fetch retry will be useful? If not, reducers can ask YARN whether NM recovery is enabled, or ask the shuffle handler whether recovery is enabled, without defining this retry property. > Improved shuffle error handling across NM restarts > -- > > Key: MAPREDUCE-5891 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Junping Du > Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, > MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch > > > To minimize the number of map fetch failures reported by reducers across an > NM restart it would be nice if reducers only reported a fetch failure after > trying for at specified period of time to retrieve the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Attachment: (was: MAPREDUCE-5465-8.patch) > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Attachment: MAPREDUCE-5465-8.patch > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Attachment: MAPREDUCE-5465-8.patch Updated patch to address javac warnings. > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Status: Patch Available (was: Open) > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Status: Open (was: Patch Available) > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts
[ https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118290#comment-14118290 ] Ming Ma commented on MAPREDUCE-5891: Thanks, Junping, Jason for the useful patch. In the case slowstart is set to some small value, the reducer will fetch some mapper output and wait for the rest. Is it possible Fetcher.retryStartTime is set to some old value due to early NM host A restart, and thus mark fetcher retry timed out when it later tries to handle NM host B restart? To make sure fetcher doesn't unnecessarily retry for the decommission scenario, it seems the assumption is we will have some sort of graceful decommission support so that during decommission process the fetcher will still be able to get mapper output. Is it true? If we get time to do YARN-1593, that will further reduce the chance of shuffle handler restart. Any opinion on that? > Improved shuffle error handling across NM restarts > -- > > Key: MAPREDUCE-5891 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891 > Project: Hadoop Map/Reduce > Issue Type: Improvement >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Junping Du > Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, > MAPREDUCE-5891-v3.patch, MAPREDUCE-5891.patch > > > To minimize the number of map fetch failures reported by reducers across an > NM restart it would be nice if reducers only reported a fetch failure after > trying for at specified period of time to retrieve the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
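The retryStartTime concern can be made concrete with a small sketch (invented class and method names, not the actual Fetcher code): if the retry window start is a single field shared across all shuffle hosts, a window opened when NM host A restarts early can spuriously expire later retries against host B; tracking the start per host avoids that.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; the real Fetcher is structured differently.
class FetchRetryTracker {
    private final long maxRetryWindowMs;
    // Track the retry window per shuffle host instead of one shared field.
    private final Map<String, Long> retryStartByHost = new HashMap<>();

    FetchRetryTracker(long maxRetryWindowMs) {
        this.maxRetryWindowMs = maxRetryWindowMs;
    }

    // Called on the first fetch failure against a given host.
    void onFetchFailure(String host, long nowMs) {
        retryStartByHost.putIfAbsent(host, nowMs);
    }

    // A host's retries time out based only on its own window.
    boolean retriesExhausted(String host, long nowMs) {
        Long start = retryStartByHost.get(host);
        return start != null && nowMs - start > maxRetryWindowMs;
    }

    public static void main(String[] args) {
        FetchRetryTracker t = new FetchRetryTracker(30_000);
        t.onFetchFailure("hostA", 0);       // host A restarts early
        t.onFetchFailure("hostB", 60_000);  // host B restarts much later
        // With a single shared retryStartTime, host B would already look
        // timed out here even though its retries just began.
        System.out.println(t.retriesExhausted("hostA", 61_000)); // true
        System.out.println(t.retriesExhausted("hostB", 61_000)); // false
    }
}
```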
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Attachment: MAPREDUCE-5465-7.patch > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Attachment: (was: MAPREDUCE-5465-7.patch) > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Status: Patch Available (was: Open) > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Status: Open (was: Patch Available) > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Affects Version/s: (was: 2.0.3-alpha) Status: Open (was: Patch Available) > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Affects Versions: trunk >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Affects Version/s: (was: trunk) Status: Patch Available (was: Open) > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Attachment: MAPREDUCE-5465-7.patch Sorry for the delay. Jason, here is the new patch with your suggestion of "to have TA notify task TA_ATTEMPT_SUCCEEDED or TA_ATTEMPT_FAILED after it receives notification from TaskUmbilicalProtocol" and code clean up. Appreciate your input. > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Affects Versions: trunk, 2.0.3-alpha >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
[ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106250#comment-14106250 ] Ming Ma commented on MAPREDUCE-4815: To have first task's recoverTask recover all succeeded tasks seems to work functionality wise. If the first task fails to recoverTask due to fs.rename exception, it will be rescheduled; the second task's recoverTask can continue to recover the succeeded tasks. It does change the semantics of recoverTask. It is no longer done on per task basis. But perhaps we can treat it as an optimization; other OutputCommitter implementations can still choose to have recovery on per task basis. For the upgrade scenario, how does it clean up the succeeded task attempt data in the old scheme? > FileOutputCommitter.commitJob can be very slow for jobs with many output files > -- > > Key: MAPREDUCE-4815 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mrv2 >Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1 >Reporter: Jason Lowe >Assignee: Siqi Li > Attachments: MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch, > MAPREDUCE-4815.v5.patch > > > If a job generates many files to commit then the commitJob method call at the > end of the job can take minutes. This is a performance regression from 1.x, > as 1.x had the tasks commit directly to the final output directory as they > were completing and commitJob had very little to do. The commit work was > processed in parallel and overlapped the processing of outstanding tasks. In > 0.23/2.x, the commit is single-threaded and waits until all tasks have > completed before commencing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
[ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101801#comment-14101801 ] Ming Ma commented on MAPREDUCE-4815: Siqi, thanks for the patch. It looks good overall. Will the upgrade scenario work for task recovery? It seems the data stored in the old output structure won't be honored by the new scheme after the cluster restarts with the patch; it skips the recoverTask given TempOutputPath doesn't exist. If that is the case, perhaps the patch can throw an exception so that the task attempt state changes to the KILLED state for retry. Alternatively, the new patch can be modified to handle recovery of the old directory structure, but that seems overcomplicated. > FileOutputCommitter.commitJob can be very slow for jobs with many output files > -- > > Key: MAPREDUCE-4815 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mrv2 >Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1 >Reporter: Jason Lowe >Assignee: Siqi Li > Attachments: MAPREDUCE-4815.v3.patch > > > If a job generates many files to commit then the commitJob method call at the > end of the job can take minutes. This is a performance regression from 1.x, > as 1.x had the tasks commit directly to the final output directory as they > were completing and commitJob had very little to do. The commit work was > processed in parallel and overlapped the processing of outstanding tasks. In > 0.23/2.x, the commit is single-threaded and waits until all tasks have > completed before commencing. -- This message was sent by Atlassian JIRA (v6.2#6252)
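The throw-to-retry fallback suggested above can be sketched as follows (hypothetical method; booleans stand in for the actual filesystem checks, and the real FileOutputCommitter would throw an IOException rather than an unchecked exception):

```java
// Hypothetical sketch of the upgrade fallback discussed in the comment;
// not the real FileOutputCommitter.recoverTask.
class RecoverySketch {
    // Returns true when a previous attempt's output was recovered.
    static boolean recoverTask(boolean newLayoutExists, boolean oldLayoutExists) {
        if (newLayoutExists) {
            return true; // normal path: rename recovered output into place
        }
        if (oldLayoutExists) {
            // Pre-upgrade output exists but the new scheme cannot honor it;
            // failing recovery moves the attempt to KILLED so it is retried
            // instead of silently skipping recovery.
            throw new IllegalStateException("cannot recover pre-upgrade task output");
        }
        return false; // nothing to recover; the task simply re-runs
    }

    public static void main(String[] args) {
        System.out.println(recoverTask(true, false));  // true
        System.out.println(recoverTask(false, false)); // false
    }
}
```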
[jira] [Commented] (MAPREDUCE-207) Computing Input Splits on the MR Cluster
[ https://issues.apache.org/jira/browse/MAPREDUCE-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046497#comment-14046497 ] Ming Ma commented on MAPREDUCE-207: --- Thanks, Gera. Nice work and this will be quite useful. Overall it looks good. Per offline discussion with Gera, 1. It is unclear if there is any security related implication such as https://issues.apache.org/jira/browse/MAPREDUCE-5663. 2. The compatibility between new MR client with this feature and cluster with old MR. Given new MR client won't compute the split by default; the job will fail if the cluster still uses old MR. So in this case, new MR client needs to be configured to compute split. For a more general case where new MR client can talk to some cluster with old MR and some cluster with new MR, it will be nice if client can discover if the cluster supports this feature. > Computing Input Splits on the MR Cluster > > > Key: MAPREDUCE-207 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-207 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: applicationmaster, mrv2 >Reporter: Philip Zeyliger >Assignee: Arun C Murthy > Attachments: MAPREDUCE-207.patch, MAPREDUCE-207.v02.patch, > MAPREDUCE-207.v03.patch, MAPREDUCE-207.v05.patch > > > Instead of computing the input splits as part of job submission, Hadoop could > have a separate "job task type" that computes the input splits, therefore > allowing that computation to happen on the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5790) Default map hprof profile options do not work
[ https://issues.apache.org/jira/browse/MAPREDUCE-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002782#comment-14002782 ] Ming Ma commented on MAPREDUCE-5790: It appears https://issues.apache.org/jira/browse/MAPREDUCE-5650 set mapreduce.task.profile.map.params ${mapreduce.task.profile.params} The reading code doesn't like it. If you remove the default setting, things will work fine. Perhaps we can leave the default value empty so profile works out of box. > Default map hprof profile options do not work > - > > Key: MAPREDUCE-5790 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5790 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.3.0 > Environment: java version "1.6.0_31" > Java(TM) SE Runtime Environment (build 1.6.0_31-b04) > Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode) >Reporter: Andrew Wang > > I have an MR job doing the following: > {code} > Job job = Job.getInstance(conf); > // Enable profiling > job.setProfileEnabled(true); > job.setProfileTaskRange(true, "0"); > job.setProfileTaskRange(false, "0"); > {code} > When I run this job, some of my map tasks fail with this error message: > {noformat} > org.apache.hadoop.util.Shell$ExitCodeException: > /data/5/yarn/nm/usercache/hdfs/appcache/application_1394482121761_0012/container_1394482121761_0012_01_41/launch_container.sh: > line 32: $JAVA_HOME/bin/java -Djava.net.preferIPv4Stack=true > -Dhadoop.metrics.log.level=WARN -Xmx825955249 -Djava.io.tmpdir=$PWD/tmp > -Dlog4j.configuration=container-log4j.properties > -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1394482121761_0012/container_1394482121761_0012_01_41 > -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA > ${mapreduce.task.profile.params} org.apache.hadoop.mapred.YarnChild > 10.20.212.12 43135 attempt_1394482121761_0012_r_00_0 41 > 
1>/var/log/hadoop-yarn/container/application_1394482121761_0012/container_1394482121761_0012_01_41/stdout > > 2>/var/log/hadoop-yarn/container/application_1394482121761_0012/container_1394482121761_0012_01_41/stderr > : bad substitution > {noformat} > It looks like ${mapreduce.task.profile.params} is not getting subbed in > correctly. -- This message was sent by Atlassian JIRA (v6.2#6252)
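A toy illustration of the failure mode above (invented class and method names; this is not Hadoop's Configuration expansion code): when mapreduce.task.profile.map.params defaults to ${mapreduce.task.profile.params} and that key is never resolved while the container launch command is assembled, the literal placeholder reaches the shell, which rejects it with "bad substitution".

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy expansion; unknown keys are passed through untouched,
// which is exactly what the shell then chokes on.
class ProfileParamExpansion {
    static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)}");

    static String expand(String value, Map<String, String> props) {
        Matcher m = VAR.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String sub = props.get(m.group(1));
            // Unresolved keys survive verbatim in the launch command.
            m.appendReplacement(out,
                Matcher.quoteReplacement(sub != null ? sub : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The key is never defined, so the placeholder is left literal.
        System.out.println(expand("${mapreduce.task.profile.params}", Map.of()));
    }
}
```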
[jira] [Commented] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994140#comment-13994140 ] Ming Ma commented on MAPREDUCE-5465: Thanks, Jason! We have discussed the performance implication in https://issues.apache.org/jira/browse/YARN-221. It is good to revisit the issue. 1. I assume job latency is the metric we want to use. The question is how much such a change impacts job latency. 2. Say the umbilical notification is at t1, the task receives T_ATTEMPT_SUCCEEDED or T_ATTEMPT_FAILED at t2, and MRAppMaster acquires new containers from RM for the next set of tasks at t3. 3. How much does (t2-t1) impact job latency? It depends on the job characteristics: mapper output can be available sooner, reducer containers can be scheduled sooner, etc. But it isn't going to be linear in the number of tasks, given tasks run in parallel, so it should be much smaller. I don't have the formula; it will be useful to compare the performance difference using actual jobs. 4. Your suggestion of notifying task/job right after t1 is a good idea to improve (t2-t1). I assume it doesn't change the state transitions of the task attempt. We need to confirm correctness from the state machine point of view, given there might be some assumptions between the task attempt and task state machines. 5. (t3-t1) can also impact job latency. Notifying task/job earlier won't help to improve (t3-t1). 6. To improve (t3-t1), perhaps when a container exits, it should send OutofBandHeartBeat. Currently OutofBandHeartBeat is sent only when stopContainer is called. This is useful when the NM->RM heartbeat interval is big. 7. It appears there is some issue w.r.t. the current stopContainer's calling NodeStatusUpdaterImpl's OutofBandHeartBeat processing. stopContainer first enqueues the "kill" container event before calling NodeStatusUpdaterImpl's OutofBandHeartBeat. 
So it is possible the NodeStatusUpdaterImpl heartbeat thread sends the heartbeat to RM before the main Dispatcher thread processes the event and mark the container as completed. Thus the OutofBandHeartBeat doesn't include that container in the completed container list. Does it really need to call NodeStatusUpdaterImpl's OutofBandHeartBeat in stopContainer? It seems it is better to call it only when a container exits. > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Affects Versions: trunk, 2.0.3-alpha >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Attachment: MAPREDUCE-5465-6.patch Merged patch with latest trunk. > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Affects Versions: trunk, 2.0.3-alpha >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, > MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5652) NM Recovery. ShuffleHandler should handle NM restarts
[ https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987115#comment-13987115 ] Ming Ma commented on MAPREDUCE-5652: Sounds good, we can use a new jira to cover "best effort" work. The patch looks good. Just to confirm, protobuf should be backward compatible, e.g., the store state serialized with version 2.4 should be readable by NM/MR compiled with version 2.5. On an unrelated note, based on how NM's AuxServices' serviceStart handles error for each AuxService' serviceStart, if one AuxService throws some exception, the rest of AuxServices' serviceStart will be skipped. That isn't important given we only have one AuxService. Perhaps there is some policy around that as well, should NM skip failed AuxService? It seems in general we might need to improve AuxService handling if there are other AuxServices. > NM Recovery. ShuffleHandler should handle NM restarts > - > > Key: MAPREDUCE-5652 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: Karthik Kambatla >Assignee: Jason Lowe > Labels: shuffle > Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch, > MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652-v6.patch, > MAPREDUCE-5652-v7.patch, MAPREDUCE-5652-v8.patch, > MAPREDUCE-5652-v9-and-YARN-1987.patch, MAPREDUCE-5652.patch > > > ShuffleHandler should work across NM restarts and not require re-running > map-tasks. On NM restart, the map outputs are cleaned up requiring > re-execution of map tasks and should be avoided. -- This message was sent by Atlassian JIRA (v6.2#6252)
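The backward-compatibility expectation above rests on protobuf's wire rules: fields are keyed by tag number, so a 2.5 reader can parse state serialized by a 2.4 writer as long as existing tag numbers and types are never changed and new fields are only added. A hypothetical fragment (invented message and field names, not the actual recovery proto):

```proto
// Compatibility holds as long as existing field numbers/types stay fixed;
// new optional fields may be appended and old readers simply skip them.
message ShuffleServiceStateProto {
  optional string user = 1;
  optional bytes job_token = 2;
  // added in a later version; a 2.4-era reader would ignore it
  optional int64 schema_version = 3;
}
```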
[jira] [Commented] (MAPREDUCE-5652) NM Recovery. ShuffleHandler should handle NM restarts
[ https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985932#comment-13985932 ] Ming Ma commented on MAPREDUCE-5652: 1. Regarding a generic interface for restore/recover, I agree there is not much benefit in generalizing things for the sake of it. One scenario could be something like ShuffleHandler: some ShuffleHandlers support recovery, some don't. NM can ask a specific ShuffleHandler if it supports recovery; NM will manage the underlying store and pass the store object to ShuffleHandler, and ShuffleHandler manages the serialization and deserialization, etc. If NM decides to change the underlying store, ShuffleHandler doesn't need to change. But at this point, it seems unnecessary. 2. If ShuffleHandler gets DBException during recoverState as part of serviceStart, should ShuffleHandler ignore the exception and continue as if the store doesn't exist? The argument for ignoring it is that it is soft state and ShuffleHandler can still run without it. Or maybe this can be configurable. > NM Recovery. ShuffleHandler should handle NM restarts > - > > Key: MAPREDUCE-5652 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: Karthik Kambatla >Assignee: Jason Lowe > Labels: shuffle > Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch, > MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652-v6.patch, > MAPREDUCE-5652-v7.patch, MAPREDUCE-5652-v8.patch, MAPREDUCE-5652.patch > > > ShuffleHandler should work across NM restarts and not require re-running > map-tasks. On NM restart, the map outputs are cleaned up requiring > re-execution of map tasks and should be avoided. -- This message was sent by Atlassian JIRA (v6.2#6252)
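The generic recovery contract sketched in point 1, and the soft-state handling asked about in point 2, could look roughly like this (all names invented for illustration; no such YARN interface exists in the patch):

```java
import java.nio.file.Path;

// Hypothetical sketch of a generic recovery contract for aux services.
class AuxRecoverySketch {
    interface RecoverableAuxService {
        // NM asks the service whether it supports recovery at all.
        boolean supportsRecovery();
        // NM owns the store location; the service owns its (de)serialization,
        // so NM can swap the underlying store without the service changing.
        void setRecoveryStore(Path storeDir);
        // Called during serviceStart.
        void recoverState();
    }

    // A service treating its recovery data as soft state: a missing or
    // corrupt store means starting empty rather than failing serviceStart.
    static class SoftStateShuffleService implements RecoverableAuxService {
        private Path store;
        public boolean supportsRecovery() { return true; }
        public void setRecoveryStore(Path storeDir) { this.store = storeDir; }
        public void recoverState() {
            // real code would deserialize here and, when configured as
            // best-effort, swallow DBException-style errors and continue
        }
    }

    public static void main(String[] args) {
        RecoverableAuxService svc = new SoftStateShuffleService();
        System.out.println(svc.supportsRecovery()); // true
    }
}
```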
[jira] [Commented] (MAPREDUCE-5652) NM Recovery. ShuffleHandler should handle NM restarts
[ https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978830#comment-13978830 ] Ming Ma commented on MAPREDUCE-5652: Thanks, Jason. It is good to know it will be taken care of at the YARN layer. I will post some more comments at YARN-1336. 1. Does leveldb's delete method throw exceptions? JNI has some exception handling, and the caller needs to retrieve the exceptions, etc. 2. It seems like recover/restore are common in NM/RM restart. Is there any abstract interface defined for that? > NM Recovery. ShuffleHandler should handle NM restarts > - > > Key: MAPREDUCE-5652 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: Karthik Kambatla >Assignee: Jason Lowe > Labels: shuffle > Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch, > MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652-v6.patch, > MAPREDUCE-5652.patch > > > ShuffleHandler should work across NM restarts and not require re-running > map-tasks. On NM restart, the map outputs are cleaned up requiring > re-execution of map tasks and should be avoided. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5652) NM Recovery. ShuffleHandler should handle NM restarts
[ https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977938#comment-13977938 ] Ming Ma commented on MAPREDUCE-5652: Nice work, Jason. I would like to clarify how the following scenarios are handled. Perhaps they are covered at the YARN layer as part of https://issues.apache.org/jira/browse/YARN-1336. 1. NM crash scenario. There is a corner case: after RM notifies NM regarding the completion of a specific application, but right before AuxServices get the chance to process the event, NM crashes. The app entry won't be removed from the recovery store after NM is restarted, as APPLICATION_STOP won't be delivered to NM for that application after NM restart. 2. NM graceful shutdown. It seems ContainerManagerImpl's serviceStop will generate the ContainerManagerEventType.FINISH_APPS event. That means AuxServices could clean up and remove the entry from the recovery store as part of NM shutdown. > NM Recovery. ShuffleHandler should handle NM restarts > - > > Key: MAPREDUCE-5652 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: Karthik Kambatla >Assignee: Jason Lowe > Labels: shuffle > Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch, > MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652-v6.patch, > MAPREDUCE-5652.patch > > > ShuffleHandler should work across NM restarts and not require re-running > map-tasks. On NM restart, the map outputs are cleaned up requiring > re-execution of map tasks and should be avoided. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Attachment: MAPREDUCE-5465-5.patch Updated version that fixes javac and findbug warning. > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Affects Versions: trunk, 2.0.3-alpha >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Attachment: MAPREDUCE-5465-4.patch Updates per Jason's suggestions. 1. This patch also includes fix for https://issues.apache.org/jira/browse/MAPREDUCE-5835. Otherwise, some unit tests might fail due to new states introduced. 2. Fix the handling of TA_CONTAINER_COMPLETED for other cases as well. For example if TA receives TA_CONTAINER_COMPLETED when it is in RUNNING state, it doesn't need to transition to FAIL_CONTAINER_CLEANUP to clean up container. > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Affects Versions: trunk, 2.0.3-alpha >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465-4.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5835) Killing Task might cause the job to go to ERROR state
[ https://issues.apache.org/jira/browse/MAPREDUCE-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5835: --- Attachment: MAPREDUCE-5835.patch Here is the patch with the fix and the unit test that reproduces the race condition. There might be other ways to fix the issues. > Killing Task might cause the job to go to ERROR state > - > > Key: MAPREDUCE-5835 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5835 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Ming Ma >Assignee: Ming Ma > Attachments: MAPREDUCE-5835.patch > > > There could be a race condition if job is killed right after task attempt > receives TA_DONE event. In that case, TaskImpl might receive > T_ATTEMPT_SUCCEEDED followed by T_ATTEMPTED_KILLED for the same attempt, thus > transition job to ERROR state. > a. The task is in KILL_WAIT. > b. TA receives TA_DONE event. > c. Before TA transitions to SUCCEEDED state, Task sends TA_KILL event. > d. TA transitions to SUCCEEDED state and thus send T_ATTEMPT_SUCCEEDED to the > task. The task transitions to KILLED state. > e. TA processes TA_KILL event and sends T_ATTEMPT_KILLED to the task. > f. When task is in KILLED state, it can't handle T_ATTEMPT_KILLED event, thus > transition job to ERROR state. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5835) Killing Task might cause the job to go to ERROR state
[ https://issues.apache.org/jira/browse/MAPREDUCE-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5835: --- Status: Patch Available (was: Open) > Killing Task might cause the job to go to ERROR state > - > > Key: MAPREDUCE-5835 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5835 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Ming Ma >Assignee: Ming Ma > Attachments: MAPREDUCE-5835.patch > > > There could be a race condition if job is killed right after task attempt > receives TA_DONE event. In that case, TaskImpl might receive > T_ATTEMPT_SUCCEEDED followed by T_ATTEMPTED_KILLED for the same attempt, thus > transition job to ERROR state. > a. The task is in KILL_WAIT. > b. TA receives TA_DONE event. > c. Before TA transitions to SUCCEEDED state, Task sends TA_KILL event. > d. TA transitions to SUCCEEDED state and thus send T_ATTEMPT_SUCCEEDED to the > task. The task transitions to KILLED state. > e. TA processes TA_KILL event and sends T_ATTEMPT_KILLED to the task. > f. When task is in KILLED state, it can't handle T_ATTEMPT_KILLED event, thus > transition job to ERROR state. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MAPREDUCE-5835) Killing Task might cause the job to go to ERROR state
Ming Ma created MAPREDUCE-5835: -- Summary: Killing Task might cause the job to go to ERROR state Key: MAPREDUCE-5835 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5835 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Ming Ma Assignee: Ming Ma There could be a race condition if the job is killed right after a task attempt receives the TA_DONE event. In that case, TaskImpl might receive T_ATTEMPT_SUCCEEDED followed by T_ATTEMPT_KILLED for the same attempt, thus transitioning the job to the ERROR state. a. The task is in KILL_WAIT. b. TA receives the TA_DONE event. c. Before TA transitions to the SUCCEEDED state, Task sends the TA_KILL event. d. TA transitions to the SUCCEEDED state and thus sends T_ATTEMPT_SUCCEEDED to the task. The task transitions to the KILLED state. e. TA processes the TA_KILL event and sends T_ATTEMPT_KILLED to the task. f. When the task is in the KILLED state, it can't handle the T_ATTEMPT_KILLED event, thus transitioning the job to the ERROR state. -- This message was sent by Atlassian JIRA (v6.2#6252)
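The race in steps a-f can be reduced to a tiny state-machine model: the problem is that the KILLED state has no transition for a stale T_ATTEMPT_KILLED, and one fix direction is to tolerate (ignore) that late event. This is an illustrative sketch only, not the actual TaskImpl state machine; the enum and method names are made up.

```java
public class RaceDemo {
    enum TaskState { KILL_WAIT, KILLED, ERROR }
    enum Event { T_ATTEMPT_SUCCEEDED, T_ATTEMPT_KILLED }

    // Toy transition function; tolerateLateKill models the fix.
    static TaskState step(TaskState s, Event e, boolean tolerateLateKill) {
        switch (s) {
            case KILL_WAIT:
                // A late success from the attempt moves the task to KILLED.
                if (e == Event.T_ATTEMPT_SUCCEEDED) return TaskState.KILLED;
                return s;
            case KILLED:
                if (e == Event.T_ATTEMPT_KILLED) {
                    // With no transition defined for this event, the
                    // framework reports an invalid transition -> job ERROR.
                    return tolerateLateKill ? TaskState.KILLED
                                            : TaskState.ERROR;
                }
                return s;
            default:
                return s;
        }
    }

    public static void main(String[] args) {
        // Event order from the race: T_ATTEMPT_SUCCEEDED, then the stale
        // T_ATTEMPT_KILLED arriving while the task is already KILLED.
        TaskState s = step(TaskState.KILL_WAIT,
                           Event.T_ATTEMPT_SUCCEEDED, false);
        System.out.println("buggy:  " + step(s, Event.T_ATTEMPT_KILLED, false));
        System.out.println("fixed:  " + step(s, Event.T_ATTEMPT_KILLED, true));
    }
}
```

The model shows why adding an ignore-transition (or an equivalent valid transition) for T_ATTEMPT_KILLED in the KILLED state is enough to keep the job out of ERROR.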
[jira] [Commented] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964753#comment-13964753 ] Ming Ma commented on MAPREDUCE-5465: Thanks Jason for the review. I will upload the updated patch soon. I want to comment on a couple of points you mentioned. 1. Yes, putting finishTaskMonitor under TaskAttemptListenerImpl isn't clean, given that TaskAttemptListenerImpl should only deal with TaskUmbilicalProtocol-related matters. I will move it out to the AppContext layer. 2. Handling of the TA_FAILMSG event. TA_FAILMSG can be triggered by the task JVM as well as by a user via the "hadoop job -fail-task" command. For the case where the task JVM reports failure, yes, it can wait for the container to exit. For the case where end users send the command, it will need to clean up the container right away. I skipped that for simplicity. If we want to support that, it seems we will need a new event like TA_FAILMSG_BY_USER. 3. Why are we transitioning from FINISHING_CONTAINER to SUCCESS_CONTAINER_CLEANUP rather than to SUCCEEDED when we receive a container completed event? It was done for simplicity so that all successful states will go to SUCCESS_CONTAINER_CLEANUP first. But I agree it can go directly to SUCCEEDED when we receive a container completed event. > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Affects Versions: trunk, 2.0.3-alpha >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. 
If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Attachment: MAPREDUCE-5465-3.patch Updated patch for the latest trunk. > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Affects Versions: trunk, 2.0.3-alpha >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, > MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them
[ https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955652#comment-13955652 ] Ming Ma commented on MAPREDUCE-5044: This is quite useful. Can we get this and YARN-1515 in 2.4.0 release? > Have AM trigger jstack on task attempts that timeout before killing them > > > Key: MAPREDUCE-5044 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am >Affects Versions: 2.1.0-beta >Reporter: Jason Lowe >Assignee: Gera Shegalov > Attachments: MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch, > MAPREDUCE-5044.v03.patch, MAPREDUCE-5044.v04.patch, Screen Shot 2013-11-12 at > 1.05.32 PM.png, Screen Shot 2013-11-12 at 1.06.04 PM.png > > > When an AM expires a task attempt it would be nice if it triggered a jstack > output via SIGQUIT before killing the task attempt. This would be invaluable > for helping users debug their hung tasks, especially if they do not have > shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MAPREDUCE-5784) CLI update so that people can send signal to a specific task
Ming Ma created MAPREDUCE-5784: -- Summary: CLI update so that people can send signal to a specific task Key: MAPREDUCE-5784 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5784 Project: Hadoop Map/Reduce Issue Type: Task Reporter: Ming Ma This depends on https://issues.apache.org/jira/browse/YARN-445. MR client will first find out the container id for the specified task. Then it will use YARN API to signal the container. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MAPREDUCE-5783) web UI update to allow people to request thread dump of a running task
Ming Ma created MAPREDUCE-5783: -- Summary: web UI update to allow people to request thread dump of a running task Key: MAPREDUCE-5783 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5783 Project: Hadoop Map/Reduce Issue Type: Task Components: webapps Reporter: Ming Ma This depends on https://issues.apache.org/jira/browse/YARN-445. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5776) Improve TaskAttempt's handling of TA_KILL event when TA is in SUCCESS_CONTAINER_CLEANUP state
[ https://issues.apache.org/jira/browse/MAPREDUCE-5776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5776: --- Summary: Improve TaskAttempt's handling of TA_KILL event when TA is in SUCCESS_CONTAINER_CLEANUP state (was: TaskAttempt should honor TA_KILL event when TA is in SUCCESS_CONTAINER_CLEANUP state) > Improve TaskAttempt's handling of TA_KILL event when TA is in > SUCCESS_CONTAINER_CLEANUP state > - > > Key: MAPREDUCE-5776 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5776 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Ming Ma > > In most states that a TaskAttempt goes through, such as ASSIGNED, RUNNING, > SUCCEEDED etc. If a TA receives TA_KILL, the state will transit to KILLED (if > the TA is in SUCCEEDED state, it depends on if it is a reducer task). > However, If the TA is in SUCCESS_CONTAINER_CLEANUP state, TA just ignores > TA_KILL. Later on, SUCCESS_CONTAINER_CLEANUP will move to SUCCEEDED state > after the container is cleaned up. So it is possible after a client issue a > kill request, the TA will eventually be in SUCCEEDED state. It isn't a major > issue. But from consistency's point of view, it is better if TA_KILL is > handled in a similar way as how it is handled when TA is in SUCCEEDED state. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MAPREDUCE-5776) TaskAttempt should honor TA_KILL event when TA is in SUCCESS_CONTAINER_CLEANUP state
Ming Ma created MAPREDUCE-5776: -- Summary: TaskAttempt should honor TA_KILL event when TA is in SUCCESS_CONTAINER_CLEANUP state Key: MAPREDUCE-5776 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5776 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Ming Ma In most states that a TaskAttempt goes through, such as ASSIGNED, RUNNING, SUCCEEDED, etc., if a TA receives TA_KILL, the state will transition to KILLED (if the TA is in the SUCCEEDED state, it depends on whether it is a reducer task). However, if the TA is in the SUCCESS_CONTAINER_CLEANUP state, the TA just ignores TA_KILL. Later on, SUCCESS_CONTAINER_CLEANUP will move to the SUCCEEDED state after the container is cleaned up. So it is possible that after a client issues a kill request, the TA will eventually be in the SUCCEEDED state. It isn't a major issue, but from a consistency point of view, it is better if TA_KILL is handled in a similar way to how it is handled when the TA is in the SUCCEEDED state. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Affects Version/s: 2.0.3-alpha > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Affects Versions: trunk, 2.0.3-alpha >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Affects Version/s: (was: 2.0.3-alpha) trunk Status: Patch Available (was: Open) > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Affects Versions: trunk >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-5465: --- Attachment: MAPREDUCE-5465-2.patch Here is the patch. > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Affects Versions: 2.0.3-alpha >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13916964#comment-13916964 ] Ming Ma commented on MAPREDUCE-5465: I discussed with Ravi offline and will provide the patch for review soon. The basic approach is to define a new state called FINISHING_CONTAINER for TaskAttemptStateInternal. TaskAttempt will transition to this new state after it receives TaskUmbilicalProtocol's done notification from the task JVM. This will give a chance for the container to exit by itself. Normally the attempt will receive container exit notification via NM -> RM -> AM route; if it doesn't get the notification in time, it will time out and clean up the container via stopContainer. > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Affects Versions: 2.0.3-alpha >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
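The FINISHING_CONTAINER approach described above (wait a bounded time for the container-completed notification via NM -> RM -> AM, and only fall back to stopContainer on timeout) can be sketched with plain concurrency primitives. The class name FinishingMonitor and the timeout handling below are illustrative assumptions, not the actual MR AM code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch: after the task JVM reports done over TaskUmbilicalProtocol,
// give the container a chance to exit by itself before force-stopping it.
public class FinishingMonitor {
    private final CountDownLatch containerCompleted = new CountDownLatch(1);
    private boolean forcedStop = false;

    // Called when the AM learns the container exited on its own.
    void onContainerCompleted() { containerCompleted.countDown(); }

    // Called after the "done" notification; returns true if we had to
    // fall back to a forced stopContainer.
    boolean awaitExit(long timeoutMs) {
        try {
            if (!containerCompleted.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                forcedStop = true; // would call stopContainer here
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            forcedStop = true;
        }
        return forcedStop;
    }

    public static void main(String[] args) {
        // Case 1: notification arrives in time; no forced stop, so hprof
        // (or any exit hook) gets to finish its dump.
        FinishingMonitor m1 = new FinishingMonitor();
        m1.onContainerCompleted();
        System.out.println("forced stop? " + m1.awaitExit(100));

        // Case 2: no notification; we time out and clean up the container.
        FinishingMonitor m2 = new FinishingMonitor();
        System.out.println("forced stop? " + m2.awaitExit(50));
    }
}
```

The timeout bounds how long a hung container can linger, which is why the fallback to stopContainer is still needed even though the common path is the graceful exit.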
[jira] [Assigned] (MAPREDUCE-5465) Container killed before hprof dumps profile.out
[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma reassigned MAPREDUCE-5465: -- Assignee: Ming Ma (was: Ravi Prakash) > Container killed before hprof dumps profile.out > --- > > Key: MAPREDUCE-5465 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mr-am, mrv2 >Affects Versions: 2.0.3-alpha >Reporter: Radim Kolar >Assignee: Ming Ma > Attachments: MAPREDUCE-5465.patch > > > If there is profiling enabled for mapper or reducer then hprof dumps > profile.out at process exit. It is dumped after task signaled to AM that work > is finished. > AM kills container with finished work without waiting for hprof to finish > dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 > works) , it could not finish dump in time before being killed making entire > dump unusable because cpu and heap stats are missing. > There needs to be better delay before container is killed if profiling is > enabled. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAPREDUCE-4710) Add peak memory usage counter for each task
[ https://issues.apache.org/jira/browse/MAPREDUCE-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864833#comment-13864833 ] Ming Ma commented on MAPREDUCE-4710: A general question: should NM provide such data at the container level? It seems we need that information to support preemption and fairness anyway; NM needs to inform RM of the actual resource utilization at the container level, and memory usage is one of the resource metrics. Currently ContainerStatus doesn't provide that level of detail. > Add peak memory usage counter for each task > --- > > Key: MAPREDUCE-4710 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4710 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: task >Affects Versions: 1.0.2, trunk >Reporter: Cindy Li >Assignee: Cindy Li >Priority: Minor > Labels: patch > Fix For: trunk > > Attachments: MAPREDUCE-4710-trunk.patch, mapreduce-4710-v1.0.2.patch, > mapreduce-4710.patch, mapreduce4710-v3.patch, mapreduce4710-v6.patch, > mapreduce4710.patch > > > Each task has counters PHYSICAL_MEMORY_BYTES and VIRTUAL_MEMORY_BYTES, which > are snapshots of memory usage of that task. They are not sufficient for users > to understand peak memory usage by that task, e.g. in order to diagnose task > failures, tune job parameters or change application design. This new feature > will add two more counters for each task: PHYSICAL_MEMORY_BYTES_MAX and > VIRTUAL_MEMORY_BYTES_MAX. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAPREDUCE-4710) Add peak memory usage counter for each task
[ https://issues.apache.org/jira/browse/MAPREDUCE-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816725#comment-13816725 ] Ming Ma commented on MAPREDUCE-4710: It doesn't seem to be MR-application specific; other YARN applications might want this as well. Should it be done at the NM level so that there is general container-level peak memory usage data? > Add peak memory usage counter for each task > --- > > Key: MAPREDUCE-4710 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4710 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: task >Affects Versions: 1.0.2 >Reporter: Cindy Li >Assignee: Cindy Li >Priority: Minor > Labels: patch > Attachments: MAPREDUCE-4710-trunk.patch, mapreduce-4710-v1.0.2.patch, > mapreduce-4710.patch, mapreduce4710.patch > > > Each task has counters PHYSICAL_MEMORY_BYTES and VIRTUAL_MEMORY_BYTES, which > are snapshots of memory usage of that task. They are not sufficient for users > to understand peak memory usage by that task, e.g. in order to diagnose task > failures, tune job parameters or change application design. This new feature > will add two more counters for each task: PHYSICAL_MEMORY_BYTES_MAX and > VIRTUAL_MEMORY_BYTES_MAX. -- This message was sent by Atlassian JIRA (v6.1#6144)
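Deriving PHYSICAL_MEMORY_BYTES_MAX from periodic snapshots is simple to model: keep a running max of the sampled values, whichever layer (task counters or NM container monitoring) ends up owning it. The class and field names below are illustrative, not the actual patch.

```java
// Toy model of the proposed counters: a snapshot value plus a running
// peak, updated once per monitoring interval.
public class PeakMemoryTracker {
    private long physicalBytes;      // latest snapshot (existing counter)
    private long physicalBytesMax;   // running peak (proposed counter)

    // Called on each monitoring interval with the sampled usage.
    void update(long sampledBytes) {
        physicalBytes = sampledBytes;
        physicalBytesMax = Math.max(physicalBytesMax, sampledBytes);
    }

    long snapshot() { return physicalBytes; }
    long peak() { return physicalBytesMax; }

    public static void main(String[] args) {
        PeakMemoryTracker t = new PeakMemoryTracker();
        t.update(100);
        t.update(500);   // transient spike
        t.update(200);
        // The snapshot alone would hide the spike; the peak keeps it,
        // which is exactly what users need to tune container sizes.
        System.out.println("snapshot=" + t.snapshot() + " peak=" + t.peak());
    }
}
```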
[jira] [Commented] (MAPREDUCE-2779) JobSplitWriter.java can't handle large job.split file
[ https://issues.apache.org/jira/browse/MAPREDUCE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101649#comment-13101649 ] Ming Ma commented on MAPREDUCE-2779: Arun, the bug is still in the trunk. Thanks. > JobSplitWriter.java can't handle large job.split file > - > > Key: MAPREDUCE-2779 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2779 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: job submission >Affects Versions: 0.20.205.0, 0.22.0, 0.23.0 >Reporter: Ming Ma >Assignee: Ming Ma > Fix For: 0.22.0 > > Attachments: MAPREDUCE-2779-trunk.patch > > > We use cascading MultiInputFormat. MultiInputFormat sometimes generates big > job.split used internally by hadoop, sometimes it can go beyond 2GB. > In JobSplitWriter.java, the function that generates such file uses 32bit > signed integer to compute offset into job.split. > writeNewSplits > ... > int prevCount = out.size(); > ... > int currCount = out.size(); > writeOldSplits > ... > long offset = out.size(); > ... > int currLen = out.size(); -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
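The quoted snippet shows why job.split files beyond 2GB break: offsets accumulated in a 32-bit int wrap negative once the stream position passes Integer.MAX_VALUE, while a long does not. A minimal demonstration of the failure mode (the variable names mirror the quoted code; this is not the JobSplitWriter source itself):

```java
public class SplitOffsetOverflow {
    public static void main(String[] args) {
        long fileSize = 2_200_000_000L; // a job.split just over 2GB

        // Buggy pattern: "int prevCount = out.size();" truncates the
        // stream position to 32 bits, wrapping negative past 2^31 - 1.
        int intOffset = (int) fileSize;
        System.out.println("int offset:  " + intOffset);  // negative

        // Fixed pattern: track offsets as long, so positions beyond
        // 2GB remain valid split offsets.
        long longOffset = fileSize;
        System.out.println("long offset: " + longOffset);
    }
}
```

A corrupted (negative) offset written into the split index makes the JobTracker/AM seek to a garbage position, which is why the fix is simply to carry the position as a long end to end.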
[jira] [Commented] (MAPREDUCE-2779) JobSplitWriter.java can't handle large job.split file
[ https://issues.apache.org/jira/browse/MAPREDUCE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096548#comment-13096548 ] Ming Ma commented on MAPREDUCE-2779: It is tested on 0.20-security-* branches. Testing on 0.22 will be conducted later. > JobSplitWriter.java can't handle large job.split file > - > > Key: MAPREDUCE-2779 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2779 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: job submission >Affects Versions: 0.20.205.0, 0.22.0, 0.23.0 >Reporter: Ming Ma >Assignee: Ming Ma > Fix For: 0.22.0 > > Attachments: MAPREDUCE-2779-trunk.patch > > > We use cascading MultiInputFormat. MultiInputFormat sometimes generates big > job.split used internally by hadoop, sometimes it can go beyond 2GB. > In JobSplitWriter.java, the function that generates such file uses 32bit > signed integer to compute offset into job.split. > writeNewSplits > ... > int prevCount = out.size(); > ... > int currCount = out.size(); > writeOldSplits > ... > long offset = out.size(); > ... > int currLen = out.size(); -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-2779) JobSplitWriter.java can't handle large job.split file
[ https://issues.apache.org/jira/browse/MAPREDUCE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-2779: --- Affects Version/s: 0.20.205.0 Status: Patch Available (was: Open) > JobSplitWriter.java can't handle large job.split file > - > > Key: MAPREDUCE-2779 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2779 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: job submission >Affects Versions: 0.20.205.0, 0.22.0, 0.23.0 >Reporter: Ming Ma > Attachments: MAPREDUCE-2779-trunk.patch > > > We use cascading MultiInputFormat. MultiInputFormat sometimes generates big > job.split used internally by hadoop, sometimes it can go beyond 2GB. > In JobSplitWriter.java, the function that generates such file uses 32bit > signed integer to compute offset into job.split. > writeNewSplits > ... > int prevCount = out.size(); > ... > int currCount = out.size(); > writeOldSplits > ... > long offset = out.size(); > ... > int currLen = out.size(); -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-2779) JobSplitWriter.java can't handle large job.split file
[ https://issues.apache.org/jira/browse/MAPREDUCE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-2779:
-------------------------------

    Affects Version/s: 0.23.0
                       0.22.0
[jira] [Updated] (MAPREDUCE-2779) JobSplitWriter.java can't handle large job.split file
[ https://issues.apache.org/jira/browse/MAPREDUCE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated MAPREDUCE-2779:
-------------------------------

    Attachment: MAPREDUCE-2779-trunk.patch
[jira] [Created] (MAPREDUCE-2779) JobSplitWriter.java can't handle large job.split file
JobSplitWriter.java can't handle large job.split file
-----------------------------------------------------

                 Key: MAPREDUCE-2779
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2779
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: job submission
            Reporter: Ming Ma

We use Cascading's MultiInputFormat, which sometimes generates a large job.split file (used internally by Hadoop) that can exceed 2GB.

In JobSplitWriter.java, the methods that generate this file use a 32-bit signed integer to compute offsets into job.split:

writeNewSplits
  ...
  int prevCount = out.size();
  ...
  int currCount = out.size();

writeOldSplits
  ...
  long offset = out.size();
  ...
  int currLen = out.size();
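The failure mode described above can be shown with a minimal sketch (this is an illustration of the overflow class, not the actual JobSplitWriter code): once the stream position passes 2GB, a 32-bit signed int can no longer represent it, so the narrowing conversion wraps to a negative value and any split offset recorded through an int is garbage, while a long tracks the position correctly.

```java
// Hypothetical sketch of the MAPREDUCE-2779 bug class: storing a file
// position larger than 2 GB (Integer.MAX_VALUE bytes) in a signed int.
public class SplitOffsetOverflow {
    public static void main(String[] args) {
        // Stream position after writing 3 GB of split data.
        long bytesWritten = 3L * 1024 * 1024 * 1024;

        // Buggy pattern: the narrowing cast wraps past Integer.MAX_VALUE,
        // producing a negative "offset" that corrupts the split index.
        int badOffset = (int) bytesWritten;

        // Correct pattern: keep the offset in a long end to end.
        long goodOffset = bytesWritten;

        System.out.println("as int:  " + badOffset);   // -1073741824
        System.out.println("as long: " + goodOffset);  // 3221225472
    }
}
```

This is why the patch direction is to compute offsets from the stream position as a long (as `writeOldSplits` already does for `offset`) instead of assigning `out.size()` into int locals.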