[jira] [Commented] (MAPREDUCE-5951) Add support for the YARN Shared Cache

2017-10-09 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16198205#comment-16198205
 ] 

Ming Ma commented on MAPREDUCE-5951:


+1.

> Add support for the YARN Shared Cache
> -
>
> Key: MAPREDUCE-5951
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5951
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5951-Overview.001.pdf, 
> MAPREDUCE-5951-trunk-020.patch, MAPREDUCE-5951-trunk-021.patch, 
> MAPREDUCE-5951-trunk-v1.patch, MAPREDUCE-5951-trunk-v10.patch, 
> MAPREDUCE-5951-trunk-v11.patch, MAPREDUCE-5951-trunk-v12.patch, 
> MAPREDUCE-5951-trunk-v13.patch, MAPREDUCE-5951-trunk-v14.patch, 
> MAPREDUCE-5951-trunk-v15.patch, MAPREDUCE-5951-trunk-v2.patch, 
> MAPREDUCE-5951-trunk-v3.patch, MAPREDUCE-5951-trunk-v4.patch, 
> MAPREDUCE-5951-trunk-v5.patch, MAPREDUCE-5951-trunk-v6.patch, 
> MAPREDUCE-5951-trunk-v7.patch, MAPREDUCE-5951-trunk-v8.patch, 
> MAPREDUCE-5951-trunk-v9.patch, MAPREDUCE-5951-trunk.016.patch, 
> MAPREDUCE-5951-trunk.017.patch, MAPREDUCE-5951-trunk.018.patch, 
> MAPREDUCE-5951-trunk.019.patch
>
>
> Implement the necessary changes so that the MapReduce application can 
> leverage the new YARN shared cache (i.e. YARN-1492).
> Specifically, allow per-job configuration so that MapReduce jobs can specify 
> which set of resources they would like to cache (i.e. jobjar, libjars, 
> archives, files).






[jira] [Commented] (MAPREDUCE-5951) Add support for the YARN Shared Cache

2017-10-05 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194044#comment-16194044
 ] 

Ming Ma commented on MAPREDUCE-5951:


Thanks [~ctrezzo]. The code looks good overall. The only question I have at 
this point is whether any code should be moved from MR to YARN to make it easier 
for other YARN applications to use the shared cache. For example, maybe other 
applications could benefit from part of LocalResourceBuilder or from the special 
care taken when dealing with fragments.
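
For context, the "fragment" here is the URI fragment on a cache-file path; it 
names the symlink the NodeManager creates in the container's working directory. 
A minimal illustration, with a made-up path:

{noformat}
// Made-up path for illustration: the fragment ("renamed.jar") becomes the
// name of the localized symlink in the container's working directory.
URI uri = new URI("hdfs:///libs/dep-1.0.jar#renamed.jar");  // java.net.URI
String linkName = uri.getFragment();                        // "renamed.jar"
{noformat}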

> Add support for the YARN Shared Cache
> -
>
> Key: MAPREDUCE-5951
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5951
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5951-Overview.001.pdf, 
> MAPREDUCE-5951-trunk.016.patch, MAPREDUCE-5951-trunk.017.patch, 
> MAPREDUCE-5951-trunk.018.patch, MAPREDUCE-5951-trunk.019.patch, 
> MAPREDUCE-5951-trunk-020.patch, MAPREDUCE-5951-trunk-021.patch, 
> MAPREDUCE-5951-trunk-v10.patch, MAPREDUCE-5951-trunk-v11.patch, 
> MAPREDUCE-5951-trunk-v12.patch, MAPREDUCE-5951-trunk-v13.patch, 
> MAPREDUCE-5951-trunk-v14.patch, MAPREDUCE-5951-trunk-v15.patch, 
> MAPREDUCE-5951-trunk-v1.patch, MAPREDUCE-5951-trunk-v2.patch, 
> MAPREDUCE-5951-trunk-v3.patch, MAPREDUCE-5951-trunk-v4.patch, 
> MAPREDUCE-5951-trunk-v5.patch, MAPREDUCE-5951-trunk-v6.patch, 
> MAPREDUCE-5951-trunk-v7.patch, MAPREDUCE-5951-trunk-v8.patch, 
> MAPREDUCE-5951-trunk-v9.patch
>
>
> Implement the necessary changes so that the MapReduce application can 
> leverage the new YARN shared cache (i.e. YARN-1492).
> Specifically, allow per-job configuration so that MapReduce jobs can specify 
> which set of resources they would like to cache (i.e. jobjar, libjars, 
> archives, files).






[jira] [Commented] (MAPREDUCE-6829) Add peak memory usage counter for each task

2017-04-21 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15979625#comment-15979625
 ] 

Ming Ma commented on MAPREDUCE-6829:


[~miklos.szeg...@cloudera.com] if each MR task can somehow record a reference 
to its container id, then end users can get the data via taskId -> containerId 
-> containerUsage. Granted, such an approach is only useful if we expect more 
container metrics to be added at the YARN layer, so that frameworks like MR can 
pick up the new metrics automatically.
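
One possible building block, sketched below (a sketch under stated assumptions, 
not a committed design): the NM already exports the container id to every 
container's environment, so a task could read it and publish it.

{noformat}
// Sketch only: the NM sets the CONTAINER_ID environment variable
// (ApplicationConstants.Environment.CONTAINER_ID) for every container it launches.
String containerId = System.getenv("CONTAINER_ID");
// The task could then surface this value, e.g. through its status string,
// so that taskId -> containerId -> containerUsage lookups become possible.
{noformat}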

> Add peak memory usage counter for each task
> ---
>
> Key: MAPREDUCE-6829
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6829
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mrv2
>Reporter: Yufei Gu
>Assignee: Miklos Szegedi
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: MAPREDUCE-6829.000.patch, MAPREDUCE-6829.001.patch, 
> MAPREDUCE-6829.002.patch, MAPREDUCE-6829.003.patch, MAPREDUCE-6829.004.patch, 
> MAPREDUCE-6829.005.patch
>
>
> Each task has counters PHYSICAL_MEMORY_BYTES and VIRTUAL_MEMORY_BYTES, which 
> are snapshots of memory usage of that task. They are not sufficient for users 
> to understand peak memory usage by that task, e.g. in order to diagnose task 
> failures, tune job parameters or change application design. This new feature 
> will add two more counters for each task: PHYSICAL_MEMORY_BYTES_MAX and 
> VIRTUAL_MEMORY_BYTES_MAX.
> This JIRA covers the same feature as MAPREDUCE-4710. I filed this new 
> JIRA since MAPREDUCE-4710 is a pretty old one from the MR 1.x era; it more or 
> less assumes a branch-1 architecture and should be closed at this point.






[jira] [Commented] (MAPREDUCE-6829) Add peak memory usage counter for each task

2017-04-03 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954401#comment-15954401
 ] 

Ming Ma commented on MAPREDUCE-6829:


With YARN-3045, is this still necessary? Container-level metrics like this seem 
quite useful for frameworks other than MR, and they are something YARN can 
provide if it hasn't been done already.

> Add peak memory usage counter for each task
> ---
>
> Key: MAPREDUCE-6829
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6829
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mrv2
>Reporter: Yufei Gu
>Assignee: Miklos Szegedi
> Fix For: 2.9.0
>
> Attachments: MAPREDUCE-6829.000.patch, MAPREDUCE-6829.001.patch, 
> MAPREDUCE-6829.002.patch, MAPREDUCE-6829.003.patch, MAPREDUCE-6829.004.patch, 
> MAPREDUCE-6829.005.patch
>
>
> Each task has counters PHYSICAL_MEMORY_BYTES and VIRTUAL_MEMORY_BYTES, which 
> are snapshots of memory usage of that task. They are not sufficient for users 
> to understand peak memory usage by that task, e.g. in order to diagnose task 
> failures, tune job parameters or change application design. This new feature 
> will add two more counters for each task: PHYSICAL_MEMORY_BYTES_MAX and 
> VIRTUAL_MEMORY_BYTES_MAX.
> This JIRA covers the same feature as MAPREDUCE-4710. I filed this new 
> JIRA since MAPREDUCE-4710 is a pretty old one from the MR 1.x era; it more or 
> less assumes a branch-1 architecture and should be closed at this point.






[jira] [Commented] (MAPREDUCE-6846) Fragments specified for libjar paths are not handled correctly

2017-03-31 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951332#comment-15951332
 ] 

Ming Ma commented on MAPREDUCE-6846:


+1. I will wait until EOD to commit in case [~dan...@cloudera.com] [~jlowe] 
[~sjlee0] have other suggestions.

> Fragments specified for libjar paths are not handled correctly
> --
>
> Key: MAPREDUCE-6846
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6846
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.7.3, 3.0.0-alpha2
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
>Priority: Minor
> Attachments: MAPREDUCE-6846-trunk.001.patch, 
> MAPREDUCE-6846-trunk.002.patch, MAPREDUCE-6846-trunk.003.patch, 
> MAPREDUCE-6846-trunk.004.patch, MAPREDUCE-6846-trunk.005.patch
>
>
> If a user specifies a fragment for a libjars path via generic options parser, 
> the client crashes with a FileNotFoundException:
> {noformat}
> java.io.FileNotFoundException: File file:/home/mapred/test.txt#testFrag.txt 
> does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:638)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:864)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:628)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:363)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:314)
>   at 
> org.apache.hadoop.mapreduce.JobResourceUploader.copyRemoteFiles(JobResourceUploader.java:387)
>   at 
> org.apache.hadoop.mapreduce.JobResourceUploader.uploadLibJars(JobResourceUploader.java:154)
>   at 
> org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:105)
>   at 
> org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:102)
>   at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:197)
>   at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1344)
>   at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1892)
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:1341)
>   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1362)
>   at 
> org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)
>   at 
> org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:359)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at 
> org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:367)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
>   at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
>   at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> {noformat}
> This is actually inconsistent with the behavior for files and archives. Here 
> is a table showing the current behavior for each type of path and resource:
> | || Qualified path (e.g. file://home/mapred/test.txt#frag.txt) || Absolute path (e.g. /home/mapred/test.txt#frag.txt) || Relative path (e.g. test.txt#frag.txt) ||
> || -libjars | FileNotFound | FileNotFound | FileNotFound |
> || -files | (/) | (/) | (/) |
> || -archives | (/) | (/) | (/) |






[jira] [Commented] (MAPREDUCE-6846) Fragments specified for libjar paths are not handled correctly

2017-03-30 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950109#comment-15950109
 ] 

Ming Ma commented on MAPREDUCE-6846:


Overall looks good. Thanks [~ctrezzo] for working on this. Any idea if 
{{DistributedCache.addCacheFile}} in the following block is necessary?

{noformat}
  if (useWildcard && !foundFragment) {
  ...
  } else {
for (URI uri : libjarURIs) {
  DistributedCache.addCacheFile(uri, conf);
}
  }
{noformat}
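
For reference, a minimal sketch of the call in isolation (with a made-up URI); 
addCacheFile appends the URI to mapreduce.job.cache.files so the NM localizes 
it for the task, with the fragment naming the localized symlink:

{noformat}
// Minimal sketch with a made-up URI.
Configuration conf = new Configuration();
URI uri = new URI("hdfs:///libs/dep-1.0.jar#dep.jar");
DistributedCache.addCacheFile(uri, conf);
{noformat}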

> Fragments specified for libjar paths are not handled correctly
> --
>
> Key: MAPREDUCE-6846
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6846
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.7.3, 3.0.0-alpha2
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
>Priority: Minor
> Attachments: MAPREDUCE-6846-trunk.001.patch, 
> MAPREDUCE-6846-trunk.002.patch, MAPREDUCE-6846-trunk.003.patch
>
>
> If a user specifies a fragment for a libjars path via generic options parser, 
> the client crashes with a FileNotFoundException:
> {noformat}
> java.io.FileNotFoundException: File file:/home/mapred/test.txt#testFrag.txt 
> does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:638)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:864)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:628)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:363)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:314)
>   at 
> org.apache.hadoop.mapreduce.JobResourceUploader.copyRemoteFiles(JobResourceUploader.java:387)
>   at 
> org.apache.hadoop.mapreduce.JobResourceUploader.uploadLibJars(JobResourceUploader.java:154)
>   at 
> org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:105)
>   at 
> org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:102)
>   at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:197)
>   at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1344)
>   at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1892)
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:1341)
>   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1362)
>   at 
> org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)
>   at 
> org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:359)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at 
> org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:367)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
>   at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
>   at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> {noformat}
> This is actually inconsistent with the behavior for files and archives. Here 
> is a table showing the current behavior for each type of path and resource:
> | || Qualified path (e.g. file://home/mapred/test.txt#frag.txt) || Absolute path (e.g. /home/mapred/test.txt#frag.txt) || Relative path (e.g. test.txt#frag.txt) ||
> || -libjars | FileNotFound | FileNotFound | FileNotFound |
> || -files | (/) | (/) | (/) |
> || -archives | (/) | (/) | (/) |




[jira] [Updated] (MAPREDUCE-6862) Fragments are not handled correctly by resource limit checking

2017-03-29 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-6862:
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.0.0-alpha3
   2.9.0
   Status: Resolved  (was: Patch Available)

+1. Committed to trunk and branch-2. [~ctrezzo] thanks for the contribution!

> Fragments are not handled correctly by resource limit checking
> --
>
> Key: MAPREDUCE-6862
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6862
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.0.0-alpha1
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
>Priority: Minor
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: MAPREDUCE-6862-trunk.001.patch
>
>
> If a user specifies a fragment for a libjar, files, archives path via generic 
> options parser and resource limit checking is enabled, the client crashes 
> with a FileNotFoundException:
> {noformat}
> java.io.FileNotFoundException: File file:/home/mapred/test.txt#testFrag.txt 
> does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:638)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:864)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:628)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
>   at 
> org.apache.hadoop.mapreduce.JobResourceUploader.getFileStatus(JobResourceUploader.java:413)
>   at 
> org.apache.hadoop.mapreduce.JobResourceUploader.explorePath(JobResourceUploader.java:395)
>   at 
> org.apache.hadoop.mapreduce.JobResourceUploader.checkLocalizationLimits(JobResourceUploader.java:304)
>   at 
> org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:103)
>   at 
> org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:102)
>   at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:197)
>   at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1344)
>   at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1892)
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:1341)
>   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1362)
>   at 
> org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)
>   at 
> org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:359)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at 
> org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:367)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
>   at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
>   at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> {noformat}






[jira] [Updated] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them

2016-06-06 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5044:
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

> Have AM trigger jstack on task attempts that timeout before killing them
> 
>
> Key: MAPREDUCE-5044
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am
>Affects Versions: 2.1.0-beta
>Reporter: Jason Lowe
>Assignee: Eric Payne
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-5044.008.patch, MAPREDUCE-5044.009.patch, 
> MAPREDUCE-5044.010.patch, MAPREDUCE-5044.011.patch, MAPREDUCE-5044.012.patch, 
> MAPREDUCE-5044.013.patch, MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch, 
> MAPREDUCE-5044.v03.patch, MAPREDUCE-5044.v04.patch, MAPREDUCE-5044.v05.patch, 
> MAPREDUCE-5044.v06.patch, MAPREDUCE-5044.v07.local.patch, Screen Shot 
> 2013-11-12 at 1.05.32 PM.png, Screen Shot 2013-11-12 at 1.06.04 PM.png
>
>
> When an AM expires a task attempt it would be nice if it triggered a jstack 
> output via SIGQUIT before killing the task attempt.  This would be invaluable 
> for helping users debug their hung tasks, especially if they do not have 
> shell access to the nodes.






[jira] [Commented] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them

2016-06-06 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317343#comment-15317343
 ] 

Ming Ma commented on MAPREDUCE-5044:


I have committed the patch to trunk, branch-2 and branch-2.8. Thank you 
[~eepayne] and [~jira.shegalov] for the contribution and [~vinodkv] [~jlowe] 
and [~aw] for the review.

> Have AM trigger jstack on task attempts that timeout before killing them
> 
>
> Key: MAPREDUCE-5044
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am
>Affects Versions: 2.1.0-beta
>Reporter: Jason Lowe
>Assignee: Eric Payne
> Attachments: MAPREDUCE-5044.008.patch, MAPREDUCE-5044.009.patch, 
> MAPREDUCE-5044.010.patch, MAPREDUCE-5044.011.patch, MAPREDUCE-5044.012.patch, 
> MAPREDUCE-5044.013.patch, MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch, 
> MAPREDUCE-5044.v03.patch, MAPREDUCE-5044.v04.patch, MAPREDUCE-5044.v05.patch, 
> MAPREDUCE-5044.v06.patch, MAPREDUCE-5044.v07.local.patch, Screen Shot 
> 2013-11-12 at 1.05.32 PM.png, Screen Shot 2013-11-12 at 1.06.04 PM.png
>
>
> When an AM expires a task attempt it would be nice if it triggered a jstack 
> output via SIGQUIT before killing the task attempt.  This would be invaluable 
> for helping users debug their hung tasks, especially if they do not have 
> shell access to the nodes.






[jira] [Commented] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them

2016-06-06 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317271#comment-15317271
 ] 

Ming Ma commented on MAPREDUCE-5044:


+1 on the latest patch. Thanks [~eepayne]. The patch doesn't apply cleanly to 
branch-2 and branch-2.8. The conflicts are straightforward, and I will resolve 
them for those two branches.

> Have AM trigger jstack on task attempts that timeout before killing them
> 
>
> Key: MAPREDUCE-5044
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am
>Affects Versions: 2.1.0-beta
>Reporter: Jason Lowe
>Assignee: Eric Payne
> Attachments: MAPREDUCE-5044.008.patch, MAPREDUCE-5044.009.patch, 
> MAPREDUCE-5044.010.patch, MAPREDUCE-5044.011.patch, MAPREDUCE-5044.012.patch, 
> MAPREDUCE-5044.013.patch, MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch, 
> MAPREDUCE-5044.v03.patch, MAPREDUCE-5044.v04.patch, MAPREDUCE-5044.v05.patch, 
> MAPREDUCE-5044.v06.patch, MAPREDUCE-5044.v07.local.patch, Screen Shot 
> 2013-11-12 at 1.05.32 PM.png, Screen Shot 2013-11-12 at 1.06.04 PM.png
>
>
> When an AM expires a task attempt it would be nice if it triggered a jstack 
> output via SIGQUIT before killing the task attempt.  This would be invaluable 
> for helping users debug their hung tasks, especially if they do not have 
> shell access to the nodes.






[jira] [Commented] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them

2016-05-23 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296554#comment-15296554
 ] 

Ming Ma commented on MAPREDUCE-5044:


Thanks [~eepayne]. Besides the checkstyle, whitespace and javadoc issues:

* There is some commented-out code left over after the function was moved to 
{{internalSignalToContainer}}.
* Given that {{signalContainer}} was renamed to {{signalToContainer}} for 
ContainerManagementProtocol, it may be better to fix that for 
ApplicationClientProtocol as well, as long as we agree to include this patch in 
2.8.

Otherwise, it looks good overall.

> Have AM trigger jstack on task attempts that timeout before killing them
> 
>
> Key: MAPREDUCE-5044
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am
>Affects Versions: 2.1.0-beta
>Reporter: Jason Lowe
>Assignee: Gera Shegalov
> Attachments: MAPREDUCE-5044.008.patch, MAPREDUCE-5044.009.patch, 
> MAPREDUCE-5044.010.patch, MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch, 
> MAPREDUCE-5044.v03.patch, MAPREDUCE-5044.v04.patch, MAPREDUCE-5044.v05.patch, 
> MAPREDUCE-5044.v06.patch, MAPREDUCE-5044.v07.local.patch, Screen Shot 
> 2013-11-12 at 1.05.32 PM.png, Screen Shot 2013-11-12 at 1.06.04 PM.png
>
>
> When an AM expires a task attempt it would be nice if it triggered a jstack 
> output via SIGQUIT before killing the task attempt.  This would be invaluable 
> for helping users debug their hung tasks, especially if they do not have 
> shell access to the nodes.






[jira] [Commented] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them

2016-05-19 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15292038#comment-15292038
 ] 

Ming Ma commented on MAPREDUCE-5044:


bq. In that case, do we want to call it something like signalsToContainers? 
Sounds good. signalsToContainers can take an array of 
{{SignalContainerRequest}}, each of which has a list of commands belonging to 
the same container. When we add signalsToContainers later, we can deprecate 
signalToContainer, and the NM will still support signalToContainer until a 
major upgrade. That way, we don't need to fix the {{required}} issue, given 
that only the new signalsToContainers method will use the list-based 
{{SignalContainerRequest}}.
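
To make the shape concrete, a hypothetical sketch (the message names below are 
illustrative only, not committed code):

{noformat}
// Hypothetical wire shape for the batched API: an ordered list of commands
// per container. Message names here are invented for illustration.
message SignalContainerCommandsProto {
  optional ContainerIdProto container_id = 1;
  repeated SignalContainerCommandProto commands = 2;
}
message SignalContainersRequestProto {
  repeated SignalContainerCommandsProto requests = 1;
}
{noformat}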

> Have AM trigger jstack on task attempts that timeout before killing them
> 
>
> Key: MAPREDUCE-5044
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am
>Affects Versions: 2.1.0-beta
>Reporter: Jason Lowe
>Assignee: Gera Shegalov
> Attachments: MAPREDUCE-5044.008.patch, MAPREDUCE-5044.009.patch, 
> MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch, MAPREDUCE-5044.v03.patch, 
> MAPREDUCE-5044.v04.patch, MAPREDUCE-5044.v05.patch, MAPREDUCE-5044.v06.patch, 
> MAPREDUCE-5044.v07.local.patch, Screen Shot 2013-11-12 at 1.05.32 PM.png, 
> Screen Shot 2013-11-12 at 1.06.04 PM.png
>
>
> When an AM expires a task attempt it would be nice if it triggered a jstack 
> output via SIGQUIT before killing the task attempt.  This would be invaluable 
> for helping users debug their hung tasks, especially if they do not have 
> shell access to the nodes.






[jira] [Commented] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them

2016-05-18 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290193#comment-15290193
 ] 

Ming Ma commented on MAPREDUCE-5044:


[~eepayne] I agree with your suggestion. Let us postpone it to a later time.

* {{signalContainers}} was initially suggested as an ordered list of 
{{signalContainer}}. So it could include requests from the same container or 
requests from different containers. It is true that the only use case we know 
of so far is to include requests from the same container.

* We also discussed introducing other commands besides Linux signals, for 
example a sleep command used to pause between signals. In that way, the new API 
could look like:
{noformat}
public static SignalContainerRequest newInstance(ContainerId containerId,
    Iterable<SignalContainerCommand> signals) {
  ...
}
{noformat}

* Will the {{required}} in the protocol buffer definition create any issue if 
we do a rolling upgrade from 2.8 to 2.9 and the 2.9 MR AM sends a list of 
SignalContainerCommandProto to a 2.8 NM? Maybe the 2.8 NM just discards the 
message, which is not a big deal. Regardless, that is a separate issue that we 
don't need to address here.

{noformat}
message SignalContainerRequestProto {

required SignalContainerCommandProto command = 2;
}
{noformat}

> Have AM trigger jstack on task attempts that timeout before killing them
> 
>
> Key: MAPREDUCE-5044
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am
>Affects Versions: 2.1.0-beta
>Reporter: Jason Lowe
>Assignee: Gera Shegalov
> Attachments: MAPREDUCE-5044.008.patch, MAPREDUCE-5044.009.patch, 
> MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch, MAPREDUCE-5044.v03.patch, 
> MAPREDUCE-5044.v04.patch, MAPREDUCE-5044.v05.patch, MAPREDUCE-5044.v06.patch, 
> MAPREDUCE-5044.v07.local.patch, Screen Shot 2013-11-12 at 1.05.32 PM.png, 
> Screen Shot 2013-11-12 at 1.06.04 PM.png
>
>
> When an AM expires a task attempt it would be nice if it triggered a jstack 
> output via SIGQUIT before killing the task attempt.  This would be invaluable 
> for helping users debug their hung tasks, especially if they do not have 
> shell access to the nodes.






[jira] [Commented] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them

2016-05-16 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285894#comment-15285894
 ] 

Ming Ma commented on MAPREDUCE-5044:


[~eepayne], my apologies for the delay.

* There was some discussion about combining signalContainer and stopContainers 
so that stopContainer is just a special case of signalContainer. To support the 
"SIGTERM + delay + SIGKILL" sequence used in stopContainers, we would then need 
an ordered list of commands, thus the need for signalContainers. We don't need 
to deal with that at this point. But it might be useful to rename 
signalContainer to signalContainers now so that we don't need to modify the API 
later, which means some new structure like {{SignalContainersRequest}}. What is 
your take?
* ContainerManagerImpl: it might be cleaner to abstract the common 
signal-container code into a function used for both the {{AM -> NM}} and 
{{RM -> NM}} cases.
* TaskAttemptImpl#PreemptedTransition: given that it is called only when the 
attempt is preempted, {{event.getType() == TaskAttemptEventType.TA_TIMED_OUT}} 
can be replaced by {{false}}.
* It would be useful to add an end-to-end unit test; one can be found in 
Gera's original patch.
* Nit: ContainerLauncherImpl. The return value of 
{{getContainerManagementProtocol().signalContainer}} isn't used and can be 
removed.
* Nit: ContainerLauncherEvent has an indentation issue.

> Have AM trigger jstack on task attempts that timeout before killing them
> 
>
> Key: MAPREDUCE-5044
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am
>Affects Versions: 2.1.0-beta
>Reporter: Jason Lowe
>Assignee: Gera Shegalov
> Attachments: MAPREDUCE-5044.008.patch, MAPREDUCE-5044.009.patch, 
> MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch, MAPREDUCE-5044.v03.patch, 
> MAPREDUCE-5044.v04.patch, MAPREDUCE-5044.v05.patch, MAPREDUCE-5044.v06.patch, 
> MAPREDUCE-5044.v07.local.patch, Screen Shot 2013-11-12 at 1.05.32 PM.png, 
> Screen Shot 2013-11-12 at 1.06.04 PM.png
>
>
> When an AM expires a task attempt it would be nice if it triggered a jstack 
> output via SIGQUIT before killing the task attempt.  This would be invaluable 
> for helping users debug their hung tasks, especially if they do not have 
> shell access to the nodes.






[jira] [Created] (MAPREDUCE-6694) Make AM more resilient to potential loss of any completed container notification

2016-05-10 Thread Ming Ma (JIRA)
Ming Ma created MAPREDUCE-6694:
--

 Summary: Make AM more resilient to potential loss of any completed 
container notification
 Key: MAPREDUCE-6694
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6694
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Ming Ma


YARN tries to guarantee that any completed-container notification is delivered 
to the AM under any circumstance; YARN-1372 is an example of ensuring this for 
the case of RM restart. However, in some corner cases it is still possible for 
a completed-container notification to be lost or significantly delayed, for 
example if the NM host dies while the RM fails over.

The AM won't preempt reducers if it thinks there is at least one mapper running.
{noformat}
  void preemptReducesIfNeeded() {
    ...
    if (assignedRequests.maps.size() > 0) {
      // there are assigned mappers
      return;
    }
    ...
{noformat}

Instead of depending entirely on notifications from the RM, the AM can use 
TaskUmbilicalProtocol to help decide whether any mapper is running. That 
will make the AM more resilient to any bugs in YARN.
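
A rough sketch of the idea; the listener accessors below are invented for 
illustration and are not existing TaskAttemptListener methods:

{noformat}
// Hypothetical sketch only: cross-check RM-reported completions against
// umbilical liveness before preempting reducers. getRegisteredAttempts() and
// getLastContactTime() are invented names.
private boolean anyMapperStillAlive(TaskAttemptListener listener, long timeoutMs) {
  long now = System.currentTimeMillis();
  for (TaskAttemptId id : listener.getRegisteredAttempts()) {
    if (id.getTaskId().getTaskType() == TaskType.MAP
        && now - listener.getLastContactTime(id) < timeoutMs) {
      return true;  // a mapper pinged recently over the umbilical; don't preempt
    }
  }
  return false;
}
{noformat}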
 






[jira] [Commented] (MAPREDUCE-6315) Implement retrieval of logs for crashed MR-AM via jhist in the staging directory

2016-04-25 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256607#comment-15256607
 ] 

Ming Ma commented on MAPREDUCE-6315:


Thanks [~jira.shegalov]! We definitely need to support this scenario. There was 
some discussion about providing clean-up functionality in YARN-2261 and 
MAPREDUCE-4428 so that history files can be moved to their final location 
properly. But it isn't clear when we plan to provide such functionality at the 
YARN layer, and it seems like a larger effort. The patch here is more targeted 
and can take care of the issue we have had, at least until the end-to-end 
clean-up functionality is available. What do you think? Specific comments for 
the jira:

* This will enable "mapred job -logs" usage. How about the scenario where the 
jobhistory URL http://./jobhistory/job/job_ returns "job not found": is it easy 
to add the redirect at that level?
* globStatus might return an empty list, so it might be better to change {{if 
(jhStats != null)}} to something like {{if (jhStats != null && jhStats.length 
> 0)}} (see the sketch after this list).
* Is the output format change in JobHistoryParser required? I wonder if there 
is any backward compatibility issue if some tools have assumptions about this.
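
A minimal sketch of the suggested check, assuming {{jhStats}} comes from 
{{FileSystem#globStatus}}, which may return null or an empty array:

{noformat}
// Guard both the null case (nonexistent path) and the empty-match case
// before indexing into the result of globStatus.
FileStatus[] jhStats = fs.globStatus(jhistPattern);
if (jhStats != null && jhStats.length > 0) {
  // safe to pick a .jhist file from jhStats here
}
{noformat}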

> Implement retrieval of logs for crashed MR-AM via jhist in the staging 
> directory
> 
>
> Key: MAPREDUCE-6315
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6315
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client, mr-am
>Affects Versions: 2.7.0
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
>Priority: Critical
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-6315.001.patch, MAPREDUCE-6315.002.patch, 
> MAPREDUCE-6315.003.patch
>
>
> When all AM attempts crash, there is no record of them in JHS. Thus no easy 
> way to get the logs. This JIRA automates the procedure by utilizing the jhist 
> file in the staging directory. 





[jira] [Commented] (MAPREDUCE-6660) Add MR Counters for bytes-read-by-network-distance FileSystem metrics

2016-04-15 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243982#comment-15243982
 ] 

Ming Ma commented on MAPREDUCE-6660:


Thank you [~djp] and [~liuml07]! Sure, let me update the unit test. Regarding 
the rename to BYTES_READ_LOCAL_DATACENTER, etc., it seems reasonable as it 
covers the common scenario; however, it might not be general enough to cover 
all sorts of topologies. For a large cluster with 4 tiers (edge router/core 
switch/TOR/local machine), BYTES_READ_SECOND_OR_MORE_DEGREE_REMOTE_RACK might 
mean BYTES_READ_LOCAL_DATACENTER. What do you think?

> Add MR Counters for bytes-read-by-network-distance FileSystem metrics
> -
>
> Key: MAPREDUCE-6660
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6660
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: MAPREDUCE-6660.patch, MAPREDUCE-6660.png
>
>
> This is the MR part of the change which is to consume 
> bytes-read-by-network-distance metrics generated by 
> https://issues.apache.org/jira/browse/HDFS-9579.





[jira] [Updated] (MAPREDUCE-6660) Add MR Counters for bytes-read-by-network-distance FileSystem metrics

2016-03-28 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-6660:
---
Assignee: Ming Ma
  Status: Patch Available  (was: Open)

> Add MR Counters for bytes-read-by-network-distance FileSystem metrics
> -
>
> Key: MAPREDUCE-6660
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6660
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: MAPREDUCE-6660.patch, MAPREDUCE-6660.png
>
>
> This is the MR part of the change which is to consume 
> bytes-read-by-network-distance metrics generated by 
> https://issues.apache.org/jira/browse/HDFS-9579.





[jira] [Updated] (MAPREDUCE-6660) Add MR Counters for bytes-read-by-network-distance FileSystem metrics

2016-03-25 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-6660:
---
Attachment: MAPREDUCE-6660.png
MAPREDUCE-6660.patch

Here is the draft patch and the MR webUI. The webUI becomes somewhat busy given 
that these new counters will be created for each FileSystem. We can consider 
skipping rows whose values are zero, if there is no compatibility issue with that.

> Add MR Counters for bytes-read-by-network-distance FileSystem metrics
> -
>
> Key: MAPREDUCE-6660
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6660
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: Ming Ma
> Attachments: MAPREDUCE-6660.patch, MAPREDUCE-6660.png
>
>
> This is the MR part of the change which is to consume 
> bytes-read-by-network-distance metrics generated by 
> https://issues.apache.org/jira/browse/HDFS-9579.





[jira] [Created] (MAPREDUCE-6660) Add MR Counters for bytes-read-by-network-distance FileSystem metrics

2016-03-24 Thread Ming Ma (JIRA)
Ming Ma created MAPREDUCE-6660:
--

 Summary: Add MR Counters for bytes-read-by-network-distance 
FileSystem metrics
 Key: MAPREDUCE-6660
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6660
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Ming Ma


This is the MR part of the change which is to consume 
bytes-read-by-network-distance metrics generated by 
https://issues.apache.org/jira/browse/HDFS-9579.





[jira] [Created] (MAPREDUCE-6456) Support configurable log aggregation policy

2015-08-18 Thread Ming Ma (JIRA)
Ming Ma created MAPREDUCE-6456:
--

 Summary: Support configurable log aggregation policy
 Key: MAPREDUCE-6456
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6456
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Ming Ma


YARN-221 provides a way for a YARN application to specify its log aggregation 
policy via LogAggregationContext.

This jira covers the necessary changes in MR to use that feature so that any MR 
job can specify its log aggregation policy via job configuration. That includes:

* Have MR define its own configurations to configure these policies.
* Make code changes in YarnRunner to retrieve these configurations and set the 
values via LogAggregationContext.
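
For illustration, a rough sketch of the YarnRunner side, using the existing 
include/exclude pattern fields of LogAggregationContext as a stand-in (the MR 
config keys below are placeholders, not final names; the policy fields added by 
YARN-221 would be wired up the same way):

{noformat}
// Rough sketch; the MR config keys are placeholders, not final names.
String include = conf.get("mapreduce.job.log-aggregation.include-pattern", ".*");
String exclude = conf.get("mapreduce.job.log-aggregation.exclude-pattern", "");
LogAggregationContext logAggregationContext =
    LogAggregationContext.newInstance(include, exclude);
// appContext is the ApplicationSubmissionContext that YarnRunner already builds
appContext.setLogAggregationContext(logAggregationContext);
{noformat}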





[jira] [Commented] (MAPREDUCE-5762) Port MAPREDUCE-3223 and MAPREDUCE-4695 (Remove MRv1 config from mapred-default.xml) to branch-2

2015-07-15 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628434#comment-14628434
 ] 

Ming Ma commented on MAPREDUCE-5762:


+1. Thanks [~ajisakaa].

> Port MAPREDUCE-3223 and MAPREDUCE-4695 (Remove MRv1 config from 
> mapred-default.xml) to branch-2
> ---
>
> Key: MAPREDUCE-5762
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5762
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.3.0
>Reporter: Akira AJISAKA
>Assignee: Akira AJISAKA
>Priority: Minor
> Attachments: MAPREDUCE-5762-branch-2-002.patch, 
> MAPREDUCE-5762-branch-2.03.patch, MAPREDUCE-5762-branch-2.patch, 
> MAPREDUCE-5762.003.branch-2.patch
>
>
> MRv1 configs are removed in trunk, but they are not removed in branch-2.





[jira] [Commented] (MAPREDUCE-5762) Port MAPREDUCE-3223 and MAPREDUCE-4695 (Remove MRv1 config from mapred-default.xml) to branch-2

2015-05-13 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542203#comment-14542203
 ] 

Ming Ma commented on MAPREDUCE-5762:


LGTM.

> Port MAPREDUCE-3223 and MAPREDUCE-4695 (Remove MRv1 config from 
> mapred-default.xml) to branch-2
> ---
>
> Key: MAPREDUCE-5762
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5762
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.3.0
>Reporter: Akira AJISAKA
>Assignee: Akira AJISAKA
>Priority: Minor
> Attachments: MAPREDUCE-5762-branch-2-002.patch, 
> MAPREDUCE-5762-branch-2.03.patch, MAPREDUCE-5762-branch-2.patch, 
> MAPREDUCE-5762.003.branch-2.patch
>
>
> MRv1 configs are removed in trunk, but they are not removed in branch-2.





[jira] [Commented] (MAPREDUCE-5762) Port MAPREDUCE-3223 (Remove MRv1 config from mapred-default.xml) to branch-2

2015-05-12 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540199#comment-14540199
 ] 

Ming Ma commented on MAPREDUCE-5762:


[~ajisakaa], here is an example of the test failures. The property is still 
defined in trunk.

{noformat}
Running org.apache.hadoop.mapreduce.task.reduce.TestMerger
Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 1.625 sec <<< 
FAILURE! - in org.apache.hadoop.mapreduce.task.reduce.TestMerger
testEncryptedMerger(org.apache.hadoop.mapreduce.task.reduce.TestMerger)  Time 
elapsed: 0.092 sec  <<< ERROR!
java.lang.NullPointerException: null
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:268)
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
at 
org.apache.hadoop.mapred.MROutputFiles.getInputFileForWrite(MROutputFiles.java:206)
at 
org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl$InMemoryMerger.merge(MergeManagerImpl.java:459)
at 
org.apache.hadoop.mapreduce.task.reduce.TestMerger.testInMemoryAndOnDiskMerger(TestMerger.java:136)
at 
org.apache.hadoop.mapreduce.task.reduce.TestMerger.testEncryptedMerger(TestMerger.java:92)
{noformat}

> Port MAPREDUCE-3223 (Remove MRv1 config from mapred-default.xml) to branch-2
> 
>
> Key: MAPREDUCE-5762
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5762
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.3.0
>Reporter: Akira AJISAKA
>Assignee: Akira AJISAKA
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-5762-branch-2-002.patch, 
> MAPREDUCE-5762-branch-2.patch
>
>
> MRv1 configs are removed in trunk, but they are not removed in branch-2.





[jira] [Commented] (MAPREDUCE-5465) Tasks are often killed before they exit on their own

2015-05-11 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538833#comment-14538833
 ] 

Ming Ma commented on MAPREDUCE-5465:


Thanks Jason and Ray.

> Tasks are often killed before they exit on their own
> 
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465-9.patch, 
> MAPREDUCE-5465-branch-2.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, then hprof dumps 
> profile.out at process exit. It is dumped after the task has signaled to the 
> AM that its work is finished.
> The AM kills a container whose work is finished without waiting for hprof to 
> finish its dumps. If hprof is producing larger output (such as with depth=4, 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed if profiling 
> is enabled.





[jira] [Commented] (MAPREDUCE-5762) Port MAPREDUCE-3223 (Remove MRv1 config from mapred-default.xml) to branch-2

2015-05-11 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538608#comment-14538608
 ] 

Ming Ma commented on MAPREDUCE-5762:


[~ajisakaa] it looks like mapreduce.cluster.local.dir was removed as part of 
this and caused some branch-2 unit tests to fail.

> Port MAPREDUCE-3223 (Remove MRv1 config from mapred-default.xml) to branch-2
> 
>
> Key: MAPREDUCE-5762
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5762
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.3.0
>Reporter: Akira AJISAKA
>Assignee: Akira AJISAKA
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-5762-branch-2-002.patch, 
> MAPREDUCE-5762-branch-2.patch
>
>
> MRv1 configs are removed in trunk, but they are not removed in branch-2.





[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2015-05-11 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---
Attachment: MAPREDUCE-5465-branch-2.patch

Thanks [~jlowe] for spending time on this. For the handling of TA_KILL event 
when TA is in SUCCESS_CONTAINER_CLEANUP, do you mean 
https://issues.apache.org/jira/browse/MAPREDUCE-5776? If so, we can address the 
issue in that jira. Here is the patch for branch-2.

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465-9.patch, 
> MAPREDUCE-5465-branch-2.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, then hprof dumps 
> profile.out at process exit. It is dumped after the task has signaled to the 
> AM that its work is finished.
> The AM kills a container whose work is finished without waiting for hprof to 
> finish its dumps. If hprof is producing larger output (such as with depth=4, 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed if profiling 
> is enabled.





[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2015-05-08 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---
Labels: BB2015-05-RFC  (was: BB2015-05-TBR)

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
>  Labels: BB2015-05-RFC
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465-9.patch, 
> MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, then hprof dumps 
> profile.out at process exit. It is dumped after the task has signaled to the 
> AM that its work is finished.
> The AM kills a container whose work is finished without waiting for hprof to 
> finish its dumps. If hprof is producing larger output (such as with depth=4, 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed if profiling 
> is enabled.





[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2015-04-02 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---
Attachment: MAPREDUCE-5465-9.patch

Ray, here is the rebased patch.

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465-9.patch, 
> MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, then hprof dumps 
> profile.out at process exit. It is dumped after the task has signaled to the 
> AM that its work is finished.
> The AM kills a container whose work is finished without waiting for hprof to 
> finish its dumps. If hprof is producing larger output (such as with depth=4, 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed if profiling 
> is enabled.





[jira] [Commented] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2015-03-24 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378161#comment-14378161
 ] 

Ming Ma commented on MAPREDUCE-5465:


[~rchiang], thanks for looking into this.

SUCCESS_CONTAINER_CLEANUP can be reached from SUCCESS_FINISHING_CONTAINER. 
For ExitFinishingOnTimeoutTransition, you can search for 
FINISHING_ON_TIMEOUT_TRANSITION.

We have been running a slightly different version of this patch in our 
production clusters for a while. I can rebase the patch for trunk if people are 
interested in it.

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAPREDUCE-6135) Job staging directory remains if MRAppMaster is OOM

2014-10-23 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma resolved MAPREDUCE-6135.

Resolution: Duplicate

Thanks, Jason. Resolving this as a duplicate. Will continue the discussion over 
at MAPREDUCE-5502. It looks like Robert in MAPREDUCE-4428 also mentioned the 
approach of re-running the AM for cleanup.

> Job staging directory remains if MRAppMaster is OOM
> ---
>
> Key: MAPREDUCE-6135
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6135
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Ming Ma
>
> If MRAppMaster attempts run out of memory, the job won't go through the normal 
> cleanup process that moves history files to the history server location. When 
> customers try to find out why the job failed, the data won't be available on 
> the history server web UI.
> The workaround is to extract the container id and NM id from the jhist file 
> in the job staging directory, then use the "yarn logs" command to get the AM 
> logs.
> It would be great if the platform could take care of this by moving these 
> hist files to the history server automatically when AM attempts don't exit 
> properly.
> We are discussing ideas on how to address this and would like to get 
> suggestions from others. It is not clear whether the timeline server design 
> covers this scenario.
> 1. Define some protocol for YARN to tell the AppMaster "you have exceeded the 
> AM max attempt count, please clean up". For example, YARN can launch the 
> AppMaster one more time after the max attempt count is reached, and 
> MRAppMaster can use that as the indication that this is a clean-up-only 
> attempt.
> 2. Have some program periodically check job statuses and move files from the 
> job staging directory to the history server for finished jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAPREDUCE-6135) Job staging directory remains if MRAppMaster is OOM

2014-10-22 Thread Ming Ma (JIRA)
Ming Ma created MAPREDUCE-6135:
--

 Summary: Job staging directory remains if MRAppMaster is OOM
 Key: MAPREDUCE-6135
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6135
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Ming Ma


If MRAppMaster attempts run out of memory, the job won't go through the normal 
cleanup process that moves history files to the history server location. When 
customers try to find out why the job failed, the data won't be available on 
the history server web UI.

The workaround is to extract the container id and NM id from the jhist file in 
the job staging directory, then use the "yarn logs" command to get the AM logs.

It would be great if the platform could take care of this by moving these hist 
files to the history server automatically when AM attempts don't exit properly.

We are discussing ideas on how to address this and would like to get 
suggestions from others. It is not clear whether the timeline server design 
covers this scenario.

1. Define some protocol for YARN to tell the AppMaster "you have exceeded the 
AM max attempt count, please clean up". For example, YARN can launch the 
AppMaster one more time after the max attempt count is reached, and MRAppMaster 
can use that as the indication that this is a clean-up-only attempt.

2. Have some program periodically check job statuses and move files from the 
job staging directory to the history server for finished jobs (see the sketch 
below).
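
For approach 2, here is a minimal sketch of what such a periodic checker could 
look like. The staging/done paths and the isJobFinished check are illustrative 
assumptions, not actual Hadoop defaults or an attached patch:

{code}
// Hypothetical periodic cleaner for approach 2. The directory locations and
// the isJobFinished() check are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagingDirSweeper {
  public void sweep(Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Path staging = new Path("/user/history/staging"); // assumed location
    Path done = new Path("/user/history/done");       // assumed location
    for (FileStatus jobDir : fs.listStatus(staging)) {
      String jobId = jobDir.getPath().getName();
      // Only move history files for jobs whose AM attempts are all gone.
      if (isJobFinished(jobId)) {
        for (FileStatus f : fs.listStatus(jobDir.getPath())) {
          if (f.getPath().getName().endsWith(".jhist")) {
            fs.rename(f.getPath(), new Path(done, f.getPath().getName()));
          }
        }
      }
    }
  }

  private boolean isJobFinished(String jobId) {
    // Would query the RM for the application's state; stubbed out here.
    return false;
  }
}
{code}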



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-18 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139518#comment-14139518
 ] 

Ming Ma commented on MAPREDUCE-5891:


Junping, Jason, the patch looks good to me.

> Improved shuffle error handling across NM restarts
> --
>
> Key: MAPREDUCE-5891
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.5.0
>Reporter: Jason Lowe
>Assignee: Junping Du
> Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
> MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891-v5.patch, 
> MAPREDUCE-5891-v6.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart, it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-09 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127212#comment-14127212
 ] 

Ming Ma commented on MAPREDUCE-5891:


The patch looks good. I like Jason's idea to have 
mapreduce.reduce.shuffle.fetch.retry.enabled use 
${yarn.nodemanager.recovery.enabled} as the default value. As for the other 
approaches:

a) Dynamic MR-to-YARN query: given that the NM recovery flag is a global, 
cluster-level setting (although it is possible to configure it on a per-NM 
basis), can we derive the value of mapreduce.reduce.shuffle.fetch.retry.enabled 
at job submission time from some YARN API call to the RM?

b) Shuffle protocol change: it seems Fetcher and ShuffleHandler check the HTTP 
header via property key names. So if we add a new property to indicate whether 
recovery is supported and continue to keep the same HTTP "version" property, a 
new version of the fetcher might be able to work with an old version of the 
shuffle handler, and vice versa.
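
A minimal sketch of the defaulting idea, using the property names discussed 
above (this is an illustration of the proposal, not the committed patch):

{code}
// Sketch only: derive the fetch-retry default from the NM recovery flag.
// An explicit job-level setting still wins.
import org.apache.hadoop.conf.Configuration;

public class FetchRetryDefault {
  public static boolean fetchRetryEnabled(Configuration conf) {
    boolean nmRecovery =
        conf.getBoolean("yarn.nodemanager.recovery.enabled", false);
    return conf.getBoolean(
        "mapreduce.reduce.shuffle.fetch.retry.enabled", nmRecovery);
  }
}
{code}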

> Improved shuffle error handling across NM restarts
> --
>
> Key: MAPREDUCE-5891
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.5.0
>Reporter: Jason Lowe
>Assignee: Junping Du
> Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
> MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart, it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-08 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126643#comment-14126643
 ] 

Ming Ma commented on MAPREDUCE-5891:


Thanks, Junping. Regarding the default value: 
mapreduce.reduce.shuffle.fetch.retry.enabled is set to true by default, while 
NM recovery is set to false by default. That means that, by default, the 
fetcher will retry even though the shuffle handler won't be able to serve 
mapper outputs after a restart. It doesn't seem like a big deal; I just want to 
call out whether that is intentional. Do we foresee other scenarios where fetch 
retry will be useful? If not, reducers can ask YARN whether NM recovery is 
enabled, or reducers can ask the shuffle handler whether recovery is enabled, 
without defining this retry property.

> Improved shuffle error handling across NM restarts
> --
>
> Key: MAPREDUCE-5891
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.5.0
>Reporter: Jason Lowe
>Assignee: Junping Du
> Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
> MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart, it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-09-03 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---
Attachment: (was: MAPREDUCE-5465-8.patch)

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-09-03 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---
Attachment: MAPREDUCE-5465-8.patch

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-09-02 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---
Attachment: MAPREDUCE-5465-8.patch

Updated patch to address javac warnings.

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465-8.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-09-02 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---
Status: Patch Available  (was: Open)

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-09-02 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---
Status: Open  (was: Patch Available)

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts

2014-09-02 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118290#comment-14118290
 ] 

Ming Ma commented on MAPREDUCE-5891:


Thanks, Junping and Jason, for the useful patch.

In the case where slowstart is set to some small value, the reducer will fetch 
some mapper output and wait for the rest. Is it possible that 
Fetcher.retryStartTime is set to some old value due to an early restart of NM 
host A, and thus marks the fetcher retry as timed out when it later tries to 
handle a restart of NM host B?
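
To make the concern concrete, here is a hypothetical sketch of per-host retry 
bookkeeping; the actual Fetcher may track this differently, so treat the names 
and structure as assumptions:

{code}
// Hypothetical per-host retry window: a restart of host B cannot inherit a
// stale retryStartTime recorded for host A. Not the actual Fetcher code.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PerHostRetryWindow {
  private final long maxRetryMillis;
  private final Map<String, Long> retryStartByHost = new ConcurrentHashMap<>();

  public PerHostRetryWindow(long maxRetryMillis) {
    this.maxRetryMillis = maxRetryMillis;
  }

  /** True while fetches from this host are still within the retry window. */
  public boolean shouldRetry(String host, long nowMillis) {
    Long start = retryStartByHost.putIfAbsent(host, nowMillis);
    long begin = (start == null) ? nowMillis : start;
    return nowMillis - begin <= maxRetryMillis;
  }

  /** Reset the window after a successful fetch from this host. */
  public void onFetchSuccess(String host) {
    retryStartByHost.remove(host);
  }
}
{code}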

To make sure the fetcher doesn't unnecessarily retry in the decommission 
scenario, it seems the assumption is that we will have some sort of graceful 
decommission support, so that during the decommission process the fetcher will 
still be able to get mapper output. Is that true?

If we get time to do YARN-1593, that will further reduce the chance of a 
shuffle handler restart. Any opinion on that?

> Improved shuffle error handling across NM restarts
> --
>
> Key: MAPREDUCE-5891
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 2.5.0
>Reporter: Jason Lowe
>Assignee: Junping Du
> Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
> MAPREDUCE-5891-v3.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart, it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-08-28 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---

Attachment: MAPREDUCE-5465-7.patch

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-08-28 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---

Attachment: (was: MAPREDUCE-5465-7.patch)

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-08-28 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---

Status: Patch Available  (was: Open)

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-08-28 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---

Status: Open  (was: Patch Available)

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-08-28 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---

Affects Version/s: (was: 2.0.3-alpha)
   Status: Open  (was: Patch Available)

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Affects Versions: trunk
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-08-28 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---

Affects Version/s: (was: trunk)
   Status: Patch Available  (was: Open)

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-08-28 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---

Attachment: MAPREDUCE-5465-7.patch

Sorry for the delay.

Jason, here is the new patch with your suggestion of having the TA notify the 
task of TA_ATTEMPT_SUCCEEDED or TA_ATTEMPT_FAILED after it receives 
notification from TaskUmbilicalProtocol, plus code cleanup. I appreciate your 
input.

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Affects Versions: trunk, 2.0.3-alpha
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465-7.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files

2014-08-21 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106250#comment-14106250
 ] 

Ming Ma commented on MAPREDUCE-4815:


Having the first task's recoverTask recover all succeeded tasks seems to work 
functionality-wise. If the first task fails in recoverTask due to an fs.rename 
exception, it will be rescheduled; the second task's recoverTask can continue 
to recover the succeeded tasks.

It does change the semantics of recoverTask: it is no longer done on a per-task 
basis. But perhaps we can treat it as an optimization; other OutputCommitter 
implementations can still choose to recover on a per-task basis.

For the upgrade scenario, how does it clean up the succeeded task attempt data 
from the old scheme?

> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> --
>
> Key: MAPREDUCE-4815
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2
>Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>Reporter: Jason Lowe
>Assignee: Siqi Li
> Attachments: MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch, 
> MAPREDUCE-4815.v5.patch
>
>
> If a job generates many files to commit, then the commitJob method call at 
> the end of the job can take minutes. This is a performance regression from 
> 1.x, as 1.x had the tasks commit directly to the final output directory as 
> they were completing, so commitJob had very little to do. The commit work was 
> processed in parallel and overlapped the processing of outstanding tasks. In 
> 0.23/2.x, the commit is single-threaded and waits until all tasks have 
> completed before commencing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files

2014-08-18 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101801#comment-14101801
 ] 

Ming Ma commented on MAPREDUCE-4815:


Siqi, thanks for the patch. It looks good overall.

Will the upgrade scenario work for task recovery? It seems the data stored in 
the old output structure won't be honored by the new scheme after the cluster 
restarts with the patch; it skips recoverTask given that TempOutputPath doesn't 
exist.

If that is the case, perhaps the patch can throw some exception so that the 
task attempt state changes to the KILLED state for retry. Alternatively, the 
new patch can be modified to handle the recovery of the old directory 
structure, but that seems overcomplicated.


> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> --
>
> Key: MAPREDUCE-4815
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2
>Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>Reporter: Jason Lowe
>Assignee: Siqi Li
> Attachments: MAPREDUCE-4815.v3.patch
>
>
> If a job generates many files to commit, then the commitJob method call at 
> the end of the job can take minutes. This is a performance regression from 
> 1.x, as 1.x had the tasks commit directly to the final output directory as 
> they were completing, so commitJob had very little to do. The commit work was 
> processed in parallel and overlapped the processing of outstanding tasks. In 
> 0.23/2.x, the commit is single-threaded and waits until all tasks have 
> completed before commencing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-207) Computing Input Splits on the MR Cluster

2014-06-27 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046497#comment-14046497
 ] 

Ming Ma commented on MAPREDUCE-207:
---

Thanks, Gera. Nice work; this will be quite useful. Overall it looks good. 
Per offline discussion with Gera:

1. It is unclear whether there is any security-related implication, such as 
https://issues.apache.org/jira/browse/MAPREDUCE-5663.
2. Compatibility between a new MR client with this feature and a cluster with 
old MR: given that the new MR client won't compute the splits by default, the 
job will fail if the cluster still uses old MR. So in this case, the new MR 
client needs to be configured to compute splits. For the more general case 
where a new MR client can talk to some clusters with old MR and some with new 
MR, it would be nice if the client could discover whether the cluster supports 
this feature.

> Computing Input Splits on the MR Cluster
> 
>
> Key: MAPREDUCE-207
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-207
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: applicationmaster, mrv2
>Reporter: Philip Zeyliger
>Assignee: Arun C Murthy
> Attachments: MAPREDUCE-207.patch, MAPREDUCE-207.v02.patch, 
> MAPREDUCE-207.v03.patch, MAPREDUCE-207.v05.patch
>
>
> Instead of computing the input splits as part of job submission, Hadoop could 
> have a separate "job task type" that computes the input splits, thereby 
> allowing that computation to happen on the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5790) Default map hprof profile options do not work

2014-05-19 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002782#comment-14002782
 ] 

Ming Ma commented on MAPREDUCE-5790:


It appears https://issues.apache.org/jira/browse/MAPREDUCE-5650 set the default 
of mapreduce.task.profile.map.params to ${mapreduce.task.profile.params}, and 
the reading code doesn't like it.

If you remove the default setting, things will work fine. Perhaps we can leave 
the default value empty so profiling works out of the box (see the sketch 
below).
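
For illustration, a per-job workaround along those lines might look like the 
following sketch; unsetting the map/reduce-specific params in the job conf is 
assumed to approximate removing the bad default, and this is not the committed 
fix:

{code}
// Sketch of a per-job workaround: drop the map/reduce-specific profile
// params so the unexpanded ${mapreduce.task.profile.params} placeholder
// never reaches the container launch command.
import org.apache.hadoop.conf.Configuration;

public class ProfileParamsWorkaround {
  public static void clearBrokenDefaults(Configuration conf) {
    conf.unset("mapreduce.task.profile.map.params");
    conf.unset("mapreduce.task.profile.reduce.params");
  }
}
{code}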

> Default map hprof profile options do not work
> -
>
> Key: MAPREDUCE-5790
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5790
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.3.0
> Environment: java version "1.6.0_31"
> Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
> Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
>Reporter: Andrew Wang
>
> I have an MR job doing the following:
> {code}
> Job job = Job.getInstance(conf);
> // Enable profiling
> job.setProfileEnabled(true);
> job.setProfileTaskRange(true, "0");
> job.setProfileTaskRange(false, "0");
> {code}
> When I run this job, some of my map tasks fail with this error message:
> {noformat}
> org.apache.hadoop.util.Shell$ExitCodeException: 
> /data/5/yarn/nm/usercache/hdfs/appcache/application_1394482121761_0012/container_1394482121761_0012_01_41/launch_container.sh:
>  line 32: $JAVA_HOME/bin/java -Djava.net.preferIPv4Stack=true 
> -Dhadoop.metrics.log.level=WARN   -Xmx825955249 -Djava.io.tmpdir=$PWD/tmp 
> -Dlog4j.configuration=container-log4j.properties 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1394482121761_0012/container_1394482121761_0012_01_41
>  -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA 
> ${mapreduce.task.profile.params} org.apache.hadoop.mapred.YarnChild 
> 10.20.212.12 43135 attempt_1394482121761_0012_r_00_0 41 
> 1>/var/log/hadoop-yarn/container/application_1394482121761_0012/container_1394482121761_0012_01_41/stdout
>  
> 2>/var/log/hadoop-yarn/container/application_1394482121761_0012/container_1394482121761_0012_01_41/stderr
>  : bad substitution
> {noformat}
> It looks like ${mapreduce.task.profile.params} is not getting subbed in 
> correctly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-05-11 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994140#comment-13994140
 ] 

Ming Ma commented on MAPREDUCE-5465:


Thanks, Jason! We have discussed the performance implications in 
https://issues.apache.org/jira/browse/YARN-221. It is good to revisit the issue.

1. I assume job latency is the metric we want to use. The question is how much 
such a change impacts job latency.

2. Say the umbilical notification is at t1, the task receives 
T_ATTEMPT_SUCCEEDED or T_ATTEMPT_FAILED at t2, and MRAppMaster acquires new 
containers from the RM for the next set of tasks at t3.

3. How much does (t2-t1) impact job latency? It depends on the job 
characteristics: mapper output can be available sooner, reducer containers can 
be scheduled sooner, etc. But it isn't going to be linear in the number of 
tasks, given that tasks run in parallel, so it should be much smaller. I don't 
have a formula; it will be useful to compare the performance difference using 
actual jobs.

4. Your suggestion of notifying the task/job right after t1 is a good idea to 
improve (t2-t1). I assume it doesn't change the state transitions of the task 
attempt. We need to confirm correctness from the state machine point of view, 
given there might be some assumptions between the task attempt and task state 
machines.

5. (t3-t1) can also impact job latency. Notifying the task/job earlier won't 
help to improve (t3-t1).

6. To improve (t3-t1), perhaps when a container exits, it should send an 
OutofBandHeartBeat. Currently OutofBandHeartBeat is sent only when 
stopContainer is called. Perhaps this is useful when the NM->RM heartbeat 
interval is big.

7. It appears there is some issue w.r.t. the current stopContainer's calling of 
NodeStatusUpdaterImpl's OutofBandHeartBeat processing. stopContainer first 
enqueues the "kill" container event before calling NodeStatusUpdaterImpl's 
OutofBandHeartBeat. So it is possible that the NodeStatusUpdaterImpl heartbeat 
thread sends the heartbeat to the RM before the main Dispatcher thread 
processes the event and marks the container as completed; thus the 
OutofBandHeartBeat doesn't include that container in the completed container 
list. Does it really need to call NodeStatusUpdaterImpl's OutofBandHeartBeat in 
stopContainer? It seems better to call it only when a container exits.

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Affects Versions: trunk, 2.0.3-alpha
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-05-05 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---

Attachment: MAPREDUCE-5465-6.patch

Merged the patch with the latest trunk.

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Affects Versions: trunk, 2.0.3-alpha
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5652) NM Recovery. ShuffleHandler should handle NM restarts

2014-05-01 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987115#comment-13987115
 ] 

Ming Ma commented on MAPREDUCE-5652:


Sounds good; we can use a new JIRA to cover the "best effort" work.

The patch looks good. Just to confirm: protobuf should be backward compatible, 
e.g., the store state serialized with version 2.4 should be readable by an 
NM/MR compiled with version 2.5.

On an unrelated note, based on how NM's AuxServices' serviceStart handles 
errors from each AuxService's serviceStart, if one AuxService throws an 
exception, the remaining AuxServices' serviceStart calls will be skipped. That 
isn't important given we only have one AuxService. Perhaps there should be some 
policy around that as well: should the NM skip a failed AuxService? It seems 
that, in general, we might need to improve AuxService handling if there are 
other AuxServices (see the sketch below).
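
A rough sketch of the per-service policy idea; the interface and class names 
here are hypothetical illustrations, not the actual AuxServices code:

{code}
// Start each auxiliary service under its own try/catch so one failure does
// not silently skip the rest. Whether to skip or fail fast is the policy
// question raised above.
import java.util.List;

interface NmAuxService {
  String getName();
  void serviceStart() throws Exception;
}

class AuxServicesStarter {
  void startAll(List<NmAuxService> services) {
    for (NmAuxService s : services) {
      try {
        s.serviceStart();
      } catch (Exception e) {
        // Policy choice: log and continue, or rethrow to abort NM startup.
        System.err.println("Aux service " + s.getName() + " failed: " + e);
      }
    }
  }
}
{code}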

> NM Recovery. ShuffleHandler should handle NM restarts
> -
>
> Key: MAPREDUCE-5652
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Jason Lowe
>  Labels: shuffle
> Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch, 
> MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652-v6.patch, 
> MAPREDUCE-5652-v7.patch, MAPREDUCE-5652-v8.patch, 
> MAPREDUCE-5652-v9-and-YARN-1987.patch, MAPREDUCE-5652.patch
>
>
> ShuffleHandler should work across NM restarts and not require re-running map 
> tasks. On NM restart, the map outputs are cleaned up, forcing re-execution of 
> map tasks; this should be avoided.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5652) NM Recovery. ShuffleHandler should handle NM restarts

2014-04-30 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985932#comment-13985932
 ] 

Ming Ma commented on MAPREDUCE-5652:


1. Regarding a generic interface for restore/recover, I agree there is not much 
benefit in generalizing things for the sake of it. One scenario could be 
something like ShuffleHandler: some ShuffleHandlers support recovery, some 
don't. The NM can ask a specific ShuffleHandler whether it supports recovery; 
the NM will manage the underlying store and pass the store object to the 
ShuffleHandler, and the ShuffleHandler manages the serialization and 
deserialization, etc. If the NM decides to change the underlying store, the 
ShuffleHandler doesn't need to change (a rough interface sketch follows). But 
at this point, it seems unnecessary.
2. If ShuffleHandler gets a DBException during recoverState as part of 
serviceStart, should ShuffleHandler ignore the exception and continue as if the 
store doesn't exist? The argument for ignoring it is that it is soft state and 
ShuffleHandler can still run without it. Or maybe this can be configurable.
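
A minimal sketch of what such a contract could look like, assuming the NM owns 
the store; these are hypothetical names, not an existing YARN API:

{code}
// Hypothetical recover/restore contract: NM owns the store; the service
// owns its serialization format. Changing the store implementation would
// then not require changing the service.
interface NMStateStore {
  byte[] load(String key) throws Exception;
  void save(String key, byte[] value) throws Exception;
  void delete(String key) throws Exception;
}

interface RecoverableService {
  /** Some services support recovery, some don't. */
  boolean supportsRecovery();

  /** Called during serviceStart with the NM-managed store. */
  void recoverState(NMStateStore store) throws Exception;
}
{code}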

> NM Recovery. ShuffleHandler should handle NM restarts
> -
>
> Key: MAPREDUCE-5652
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Jason Lowe
>  Labels: shuffle
> Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch, 
> MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652-v6.patch, 
> MAPREDUCE-5652-v7.patch, MAPREDUCE-5652-v8.patch, MAPREDUCE-5652.patch
>
>
> ShuffleHandler should work across NM restarts and not require re-running map 
> tasks. On NM restart, the map outputs are cleaned up, forcing re-execution of 
> map tasks; this should be avoided.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5652) NM Recovery. ShuffleHandler should handle NM restarts

2014-04-23 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978830#comment-13978830
 ] 

Ming Ma commented on MAPREDUCE-5652:


Thanks, Jason. It is good to know it will be taken care of at the YARN layer. I 
will post some more comments at YARN-1336.

1. Does LevelDB's delete method throw exceptions? JNI has some exception 
handling, and the caller needs to retrieve the exceptions, etc.
2. It seems like recover/restore are common across NM/RM restart. Is any 
abstract interface defined for that?

> NM Recovery. ShuffleHandler should handle NM restarts
> -
>
> Key: MAPREDUCE-5652
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Jason Lowe
>  Labels: shuffle
> Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch, 
> MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652-v6.patch, 
> MAPREDUCE-5652.patch
>
>
> ShuffleHandler should work across NM restarts and not require re-running map 
> tasks. On NM restart, the map outputs are cleaned up, forcing re-execution of 
> map tasks; this should be avoided.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5652) NM Recovery. ShuffleHandler should handle NM restarts

2014-04-23 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977938#comment-13977938
 ] 

Ming Ma commented on MAPREDUCE-5652:


Nice work, Jason. I would like to clarify how the following scenarios are 
handled. Perhaps they are covered at the YARN layer as part of 
https://issues.apache.org/jira/browse/YARN-1336.

1. NM crash scenario. There is a corner case: after the RM notifies the NM 
about the completion of a specific application, but right before AuxServices 
get the chance to process the event, the NM crashes. The app entry won't be 
removed from the recovery store after the NM is restarted, as APPLICATION_STOP 
won't be delivered to the NM for that application after the restart.

2. NM graceful shutdown. It seems ContainerManagerImpl's serviceStop will 
generate a ContainerManagerEventType.FINISH_APPS event. That means AuxServices 
could clean up and remove the entry from the recovery store as part of NM 
shutdown.

> NM Recovery. ShuffleHandler should handle NM restarts
> -
>
> Key: MAPREDUCE-5652
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Jason Lowe
>  Labels: shuffle
> Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch, 
> MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652-v6.patch, 
> MAPREDUCE-5652.patch
>
>
> ShuffleHandler should work across NM restarts and not require re-running map 
> tasks. On NM restart, the map outputs are cleaned up, forcing re-execution of 
> map tasks; this should be avoided.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-04-22 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---

Attachment: MAPREDUCE-5465-5.patch

Updated version that fixes the javac and findbugs warnings.

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Affects Versions: trunk, 2.0.3-alpha
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-04-18 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---

Attachment: MAPREDUCE-5465-4.patch

Updates per Jason's suggestions.

1. This patch also includes the fix for 
https://issues.apache.org/jira/browse/MAPREDUCE-5835. Otherwise, some unit 
tests might fail due to the new states introduced.
2. Fix the handling of TA_CONTAINER_COMPLETED for other cases as well. For 
example, if the TA receives TA_CONTAINER_COMPLETED when it is in the RUNNING 
state, it doesn't need to transition to FAIL_CONTAINER_CLEANUP to clean up the 
container.

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Affects Versions: trunk, 2.0.3-alpha
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5835) Killing Task might cause the job to go to ERROR state

2014-04-14 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5835:
---

Attachment: MAPREDUCE-5835.patch

Here is the patch with the fix and a unit test that reproduces the race 
condition. There might be other ways to fix the issue.

> Killing Task might cause the job to go to ERROR state
> -
>
> Key: MAPREDUCE-5835
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5835
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5835.patch
>
>
> There could be a race condition if the job is killed right after a task 
> attempt receives the TA_DONE event. In that case, TaskImpl might receive 
> T_ATTEMPT_SUCCEEDED followed by T_ATTEMPT_KILLED for the same attempt, which 
> transitions the job to the ERROR state.
> a. The task is in KILL_WAIT.
> b. The TA receives the TA_DONE event.
> c. Before the TA transitions to the SUCCEEDED state, the Task sends the 
> TA_KILL event.
> d. The TA transitions to the SUCCEEDED state and thus sends 
> T_ATTEMPT_SUCCEEDED to the task. The task transitions to the KILLED state.
> e. The TA processes the TA_KILL event and sends T_ATTEMPT_KILLED to the task.
> f. When the task is in the KILLED state, it can't handle the T_ATTEMPT_KILLED 
> event, which transitions the job to the ERROR state.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5835) Killing Task might cause the job to go to ERROR state

2014-04-14 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5835:
---

Status: Patch Available  (was: Open)

> Killing Task might cause the job to go to ERROR state
> -
>
> Key: MAPREDUCE-5835
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5835
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5835.patch
>
>
> There could be a race condition if the job is killed right after a task 
> attempt receives the TA_DONE event. In that case, TaskImpl might receive 
> T_ATTEMPT_SUCCEEDED followed by T_ATTEMPT_KILLED for the same attempt, which 
> transitions the job to the ERROR state.
> a. The task is in KILL_WAIT.
> b. The TA receives the TA_DONE event.
> c. Before the TA transitions to the SUCCEEDED state, the Task sends the 
> TA_KILL event.
> d. The TA transitions to the SUCCEEDED state and thus sends 
> T_ATTEMPT_SUCCEEDED to the task. The task transitions to the KILLED state.
> e. The TA processes the TA_KILL event and sends T_ATTEMPT_KILLED to the task.
> f. When the task is in the KILLED state, it can't handle the T_ATTEMPT_KILLED 
> event, which transitions the job to the ERROR state.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAPREDUCE-5835) Killing Task might cause the job to go to ERROR state

2014-04-14 Thread Ming Ma (JIRA)
Ming Ma created MAPREDUCE-5835:
--

 Summary: Killing Task might cause the job to go to ERROR state
 Key: MAPREDUCE-5835
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5835
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Ming Ma
Assignee: Ming Ma


There could be a race condition if the job is killed right after a task attempt 
receives the TA_DONE event. In that case, TaskImpl might receive 
T_ATTEMPT_SUCCEEDED followed by T_ATTEMPT_KILLED for the same attempt, which 
transitions the job to the ERROR state.

a. The task is in KILL_WAIT.
b. The TA receives the TA_DONE event.
c. Before the TA transitions to the SUCCEEDED state, the Task sends the TA_KILL 
event.
d. The TA transitions to the SUCCEEDED state and thus sends T_ATTEMPT_SUCCEEDED 
to the task. The task transitions to the KILLED state.
e. The TA processes the TA_KILL event and sends T_ATTEMPT_KILLED to the task.
f. When the task is in the KILLED state, it can't handle the T_ATTEMPT_KILLED 
event, which transitions the job to the ERROR state.
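
For illustration, one way to break the race is to make the task state machine 
tolerate the late kill notification; a simplified sketch follows, and it is 
not the attached patch:

{code}
// Simplified sketch: a T_ATTEMPT_KILLED arriving after the task already
// reached a terminal state is ignored instead of being treated as an
// invalid transition that escalates the job to ERROR.
enum TaskState { RUNNING, KILL_WAIT, SUCCEEDED, KILLED }
enum TaskEventType { T_ATTEMPT_SUCCEEDED, T_ATTEMPT_KILLED }

class TaskStateHandler {
  private TaskState state = TaskState.KILL_WAIT;

  void handle(TaskEventType event) {
    boolean terminal =
        state == TaskState.KILLED || state == TaskState.SUCCEEDED;
    if (terminal && event == TaskEventType.T_ATTEMPT_KILLED) {
      return; // late notification for an already-finished task: drop it
    }
    // ... normal transition table; an unhandled event is what currently
    // pushes the job into the ERROR state.
  }
}
{code}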




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-04-09 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964753#comment-13964753
 ] 

Ming Ma commented on MAPREDUCE-5465:


Thanks, Jason, for the review. I will upload the updated patch soon. I want to 
comment on the couple of points you mentioned.

1. Yes, putting finishTaskMonitor under TaskAttemptListenerImpl isn't clean, 
given that TaskAttemptListenerImpl should only deal with 
TaskUmbilicalProtocol-related matters. I will move it out to the AppContext 
layer.
2. Handling of the TA_FAILMSG event. TA_FAILMSG can be triggered by the task 
JVM as well as by the user via the "hadoop job -fail-task" command. For the 
case where the task JVM reports failure, yes, it can wait for the container to 
exit. For the case where end users send the command, it will need to clean up 
the container right away. I skipped that for simplicity. If we want to support 
that, it seems we will need a new event like TA_FAILMSG_BY_USER.
3. Why are we transitioning from FINISHING_CONTAINER to 
SUCCESS_CONTAINER_CLEANUP rather than to SUCCEEDED when we receive a 
container-completed event? It was done for simplicity, so that all successful 
states go to SUCCESS_CONTAINER_CLEANUP first. But I agree it can go directly to 
SUCCEEDED when we receive a container-completed event.

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Affects Versions: trunk, 2.0.3-alpha
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465.patch
>
>
> If profiling is enabled for the mapper or reducer, hprof dumps profile.out at 
> process exit. It is dumped after the task has signaled to the AM that its 
> work is finished.
> The AM kills a container with finished work without waiting for hprof to 
> finish its dumps. If hprof is dumping larger outputs (such as with depth=4 
> while depth=3 works), it cannot finish the dump in time before being killed, 
> making the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed when 
> profiling is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-04-04 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---

Attachment: MAPREDUCE-5465-3.patch

Updated patch for the latest trunk.

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Affects Versions: trunk, 2.0.3-alpha
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465.patch
>
>
> If profiling is enabled for a mapper or reducer, then hprof dumps 
> profile.out at process exit. The dump happens after the task has signaled to 
> the AM that its work is finished.
> The AM kills a container whose work is finished without waiting for hprof to 
> finish dumping. If hprof is producing larger output (such as with depth=4 
> while depth=3 works), it may not finish the dump in time before being 
> killed, making the entire dump unusable because the CPU and heap stats are 
> missing.
> There needs to be a better delay before the container is killed if profiling 
> is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them

2014-03-31 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955652#comment-13955652
 ] 

Ming Ma commented on MAPREDUCE-5044:


This is quite useful. Can we get this and YARN-1515 into the 2.4.0 release?

> Have AM trigger jstack on task attempts that timeout before killing them
> 
>
> Key: MAPREDUCE-5044
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am
>Affects Versions: 2.1.0-beta
>Reporter: Jason Lowe
>Assignee: Gera Shegalov
> Attachments: MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch, 
> MAPREDUCE-5044.v03.patch, MAPREDUCE-5044.v04.patch, Screen Shot 2013-11-12 at 
> 1.05.32 PM.png, Screen Shot 2013-11-12 at 1.06.04 PM.png
>
>
> When an AM expires a task attempt, it would be nice if it triggered a jstack 
> output via SIGQUIT before killing the task attempt. This would be invaluable 
> for helping users debug their hung tasks, especially if they do not have 
> shell access to the nodes.
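
A minimal sketch of the idea, not the AM's actual expiry path; the pid 
plumbing and the grace period are illustrative.

import java.util.concurrent.TimeUnit;

public class JstackBeforeKill {

  // SIGQUIT makes a HotSpot JVM print a full thread dump to its stdout
  // (captured in the task's container logs) without terminating it.
  public static void dumpThenKill(long pid) throws Exception {
    new ProcessBuilder("kill", "-QUIT", Long.toString(pid))
        .start().waitFor();
    TimeUnit.SECONDS.sleep(2); // give the JVM a moment to write the dump
    new ProcessBuilder("kill", "-TERM", Long.toString(pid))
        .start().waitFor();
  }
}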



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAPREDUCE-5784) CLI update so that people can send signal to a specific task

2014-03-07 Thread Ming Ma (JIRA)
Ming Ma created MAPREDUCE-5784:
--

 Summary: CLI update so that people can send signal to a specific 
task
 Key: MAPREDUCE-5784
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5784
 Project: Hadoop Map/Reduce
  Issue Type: Task
Reporter: Ming Ma


This depends on https://issues.apache.org/jira/browse/YARN-445. The MR client 
will first look up the container id for the specified task, then use the YARN 
API to signal the container, along the lines of the sketch below.
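
A hedged sketch of that client flow, assuming the YARN-445 work surfaces as 
YarnClient#signalToContainer (as it eventually did in later Hadoop releases); 
the containerId lookup for the task attempt is elided here.

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.SignalContainerCommand;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SignalTaskContainer {

  public static void signal(ContainerId containerId) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();
    try {
      // OUTPUT_THREAD_DUMP asks the NM to SIGQUIT the container process.
      yarnClient.signalToContainer(containerId,
          SignalContainerCommand.OUTPUT_THREAD_DUMP);
    } finally {
      yarnClient.stop();
    }
  }
}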



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAPREDUCE-5783) web UI update to allow people to request thread dump of a running task

2014-03-07 Thread Ming Ma (JIRA)
Ming Ma created MAPREDUCE-5783:
--

 Summary: web UI update to allow people to request thread dump of a 
running task
 Key: MAPREDUCE-5783
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5783
 Project: Hadoop Map/Reduce
  Issue Type: Task
  Components: webapps
Reporter: Ming Ma


This depends on https://issues.apache.org/jira/browse/YARN-445.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5776) Improve TaskAttempt's handling of TA_KILL event when TA is in SUCCESS_CONTAINER_CLEANUP state

2014-03-04 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5776:
---

Summary: Improve TaskAttempt's handling of TA_KILL event when TA is in 
SUCCESS_CONTAINER_CLEANUP state  (was: TaskAttempt should honor TA_KILL event 
when TA is in SUCCESS_CONTAINER_CLEANUP state)

> Improve TaskAttempt's handling of TA_KILL event when TA is in 
> SUCCESS_CONTAINER_CLEANUP state
> -
>
> Key: MAPREDUCE-5776
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5776
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Ming Ma
>
> In most states that a TaskAttempt goes through, such as ASSIGNED, RUNNING, 
> and SUCCEEDED, if a TA receives TA_KILL, the state will transition to KILLED 
> (if the TA is in the SUCCEEDED state, it depends on whether it is a reducer 
> task).
> However, if the TA is in the SUCCESS_CONTAINER_CLEANUP state, the TA just 
> ignores TA_KILL. Later on, SUCCESS_CONTAINER_CLEANUP will move to the 
> SUCCEEDED state after the container is cleaned up. So it is possible that 
> after a client issues a kill request, the TA will eventually end up in the 
> SUCCEEDED state. It isn't a major issue, but from a consistency point of 
> view, it is better if TA_KILL is handled in a similar way to how it is 
> handled when the TA is in the SUCCEEDED state.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAPREDUCE-5776) TaskAttempt should honor TA_KILL event when TA is in SUCCESS_CONTAINER_CLEANUP state

2014-03-04 Thread Ming Ma (JIRA)
Ming Ma created MAPREDUCE-5776:
--

 Summary: TaskAttempt should honor TA_KILL event when TA is in 
SUCCESS_CONTAINER_CLEANUP state
 Key: MAPREDUCE-5776
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5776
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Ming Ma


In most states that a TaskAttempt goes through, such as ASSIGNED, RUNNING, and 
SUCCEEDED, if a TA receives TA_KILL, the state will transition to KILLED (if 
the TA is in the SUCCEEDED state, it depends on whether it is a reducer task).

However, if the TA is in the SUCCESS_CONTAINER_CLEANUP state, the TA just 
ignores TA_KILL. Later on, SUCCESS_CONTAINER_CLEANUP will move to the 
SUCCEEDED state after the container is cleaned up. So it is possible that 
after a client issues a kill request, the TA will eventually end up in the 
SUCCEEDED state. It isn't a major issue, but from a consistency point of view, 
it is better if TA_KILL is handled in a similar way to how it is handled when 
the TA is in the SUCCEEDED state.
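
A simplified stand-in for the TaskAttemptImpl transition table (not the real 
StateMachineFactory wiring) showing the proposed explicit TA_KILL arc out of 
SUCCESS_CONTAINER_CLEANUP; the class name is illustrative.

import java.util.EnumMap;
import java.util.Map;

public class TaKillTransitions {

  enum State { SUCCESS_CONTAINER_CLEANUP, SUCCEEDED, KILLED }
  enum Event { TA_CONTAINER_CLEANED, TA_KILL }

  static final Map<Event, State> FROM_SUCCESS_CONTAINER_CLEANUP =
      new EnumMap<>(Event.class);
  static {
    FROM_SUCCESS_CONTAINER_CLEANUP.put(Event.TA_CONTAINER_CLEANED,
        State.SUCCEEDED);
    // The proposed new arc: mirror the SUCCEEDED-state handling (a killed map
    // attempt moves to KILLED so the map is rerun; a successful reduce
    // attempt would still ignore the kill) instead of dropping TA_KILL
    // silently.
    FROM_SUCCESS_CONTAINER_CLEANUP.put(Event.TA_KILL, State.KILLED);
  }
}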



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-03-03 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---

Affects Version/s: 2.0.3-alpha

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Affects Versions: trunk, 2.0.3-alpha
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for a mapper or reducer, then hprof dumps 
> profile.out at process exit. The dump happens after the task has signaled to 
> the AM that its work is finished.
> The AM kills a container whose work is finished without waiting for hprof to 
> finish dumping. If hprof is producing larger output (such as with depth=4 
> while depth=3 works), it may not finish the dump in time before being 
> killed, making the entire dump unusable because the CPU and heap stats are 
> missing.
> There needs to be a better delay before the container is killed if profiling 
> is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-03-03 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---

Affects Version/s: (was: 2.0.3-alpha)
   trunk
   Status: Patch Available  (was: Open)

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Affects Versions: trunk
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for a mapper or reducer, then hprof dumps 
> profile.out at process exit. The dump happens after the task has signaled to 
> the AM that its work is finished.
> The AM kills a container whose work is finished without waiting for hprof to 
> finish dumping. If hprof is producing larger output (such as with depth=4 
> while depth=3 works), it may not finish the dump in time before being 
> killed, making the entire dump unusable because the CPU and heap stats are 
> missing.
> There needs to be a better delay before the container is killed if profiling 
> is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-03-03 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-5465:
---

Attachment: MAPREDUCE-5465-2.patch

Here is the patch.

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Affects Versions: 2.0.3-alpha
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465.patch
>
>
> If profiling is enabled for a mapper or reducer, then hprof dumps 
> profile.out at process exit. The dump happens after the task has signaled to 
> the AM that its work is finished.
> The AM kills a container whose work is finished without waiting for hprof to 
> finish dumping. If hprof is producing larger output (such as with depth=4 
> while depth=3 works), it may not finish the dump in time before being 
> killed, making the entire dump unusable because the CPU and heap stats are 
> missing.
> There needs to be a better delay before the container is killed if profiling 
> is enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-02-28 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13916964#comment-13916964
 ] 

Ming Ma commented on MAPREDUCE-5465:


I discussed this with Ravi offline and will provide the patch for review soon.

The basic approach is to define a new state called FINISHING_CONTAINER for 
TaskAttemptStateInternal. The TaskAttempt will transition to this new state 
after it receives TaskUmbilicalProtocol's done notification from the task JVM. 
This gives the container a chance to exit by itself. Normally the attempt will 
receive the container exit notification via the NM -> RM -> AM route; if it 
doesn't get the notification in time, it will time out and clean up the 
container via stopContainer, as sketched below.
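
A minimal sketch of that timeout, with an illustrative scheduler and grace 
period; the real patch would wire this into the AM's event dispatcher rather 
than invoking stopContainer directly.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class FinishingContainerMonitor {

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  // Called when the attempt enters FINISHING_CONTAINER; returns a handle
  // that is cancelled if the NM -> RM -> AM completion event arrives first.
  ScheduledFuture<?> armTimeout(Runnable stopContainer, long graceMillis) {
    return scheduler.schedule(stopContainer, graceMillis,
        TimeUnit.MILLISECONDS);
  }

  // Called when the container-completed event arrives in time.
  void onContainerCompleted(ScheduledFuture<?> timeout) {
    timeout.cancel(false);
  }
}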

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Affects Versions: 2.0.3-alpha
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465.patch
>
>
> If profiling is enabled for a mapper or reducer, then hprof dumps 
> profile.out at process exit. The dump happens after the task has signaled to 
> the AM that its work is finished.
> The AM kills a container whose work is finished without waiting for hprof to 
> finish dumping. If hprof is producing larger output (such as with depth=4 
> while depth=3 works), it may not finish the dump in time before being 
> killed, making the entire dump unusable because the CPU and heap stats are 
> missing.
> There needs to be a better delay before the container is killed if profiling 
> is enabled.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Assigned] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

2014-02-28 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma reassigned MAPREDUCE-5465:
--

Assignee: Ming Ma  (was: Ravi Prakash)

> Container killed before hprof dumps profile.out
> ---
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: mr-am, mrv2
>Affects Versions: 2.0.3-alpha
>Reporter: Radim Kolar
>Assignee: Ming Ma
> Attachments: MAPREDUCE-5465.patch
>
>
> If profiling is enabled for a mapper or reducer, then hprof dumps 
> profile.out at process exit. The dump happens after the task has signaled to 
> the AM that its work is finished.
> The AM kills a container whose work is finished without waiting for hprof to 
> finish dumping. If hprof is producing larger output (such as with depth=4 
> while depth=3 works), it may not finish the dump in time before being 
> killed, making the entire dump unusable because the CPU and heap stats are 
> missing.
> There needs to be a better delay before the container is killed if profiling 
> is enabled.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAPREDUCE-4710) Add peak memory usage counter for each task

2014-01-07 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864833#comment-13864833
 ] 

Ming Ma commented on MAPREDUCE-4710:


A general question: should the NM provide such data at the container level? It 
seems we need that information to support preemption and fairness anyway; the 
NM needs to inform the RM of the actual resource utilization at the container 
level, and memory usage is one of the resource metrics. Currently, 
ContainerStatus doesn't provide that level of detail.


> Add peak memory usage counter for each task
> ---
>
> Key: MAPREDUCE-4710
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4710
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: task
>Affects Versions: 1.0.2, trunk
>Reporter: Cindy Li
>Assignee: Cindy Li
>Priority: Minor
>  Labels: patch
> Fix For: trunk
>
> Attachments: MAPREDUCE-4710-trunk.patch, mapreduce-4710-v1.0.2.patch, 
> mapreduce-4710.patch, mapreduce4710-v3.patch, mapreduce4710-v6.patch, 
> mapreduce4710.patch
>
>
> Each task has the counters PHYSICAL_MEMORY_BYTES and VIRTUAL_MEMORY_BYTES, 
> which are snapshots of the memory usage of that task. They are not 
> sufficient for users to understand the peak memory usage of that task, e.g. 
> in order to diagnose task failures, tune job parameters, or change the 
> application design. This new feature will add two more counters for each 
> task: PHYSICAL_MEMORY_BYTES_MAX and VIRTUAL_MEMORY_BYTES_MAX.
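
A minimal sketch of how such a peak counter can be maintained, assuming the 
existing periodic memory sampler reports each snapshot; the class name is 
illustrative.

public class PeakMemoryTracker {

  private long physicalMemoryBytesMax;
  private long virtualMemoryBytesMax;

  // Invoked on each existing PHYSICAL/VIRTUAL_MEMORY_BYTES snapshot; the
  // peak is monotonic, so updating on every sample is sufficient.
  void onSample(long physicalBytes, long virtualBytes) {
    physicalMemoryBytesMax = Math.max(physicalMemoryBytesMax, physicalBytes);
    virtualMemoryBytesMax = Math.max(virtualMemoryBytesMax, virtualBytes);
  }

  long getPhysicalMemoryBytesMax() { return physicalMemoryBytesMax; }
  long getVirtualMemoryBytesMax() { return virtualMemoryBytesMax; }
}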



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAPREDUCE-4710) Add peak memory usage counter for each task

2013-11-07 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816725#comment-13816725
 ] 

Ming Ma commented on MAPREDUCE-4710:


It doesn't seem to be MR-application specific; other YARN applications might 
want this as well. Should it be done at the NM level so that there is general 
container peak memory usage data?

> Add peak memory usage counter for each task
> ---
>
> Key: MAPREDUCE-4710
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4710
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: task
>Affects Versions: 1.0.2
>Reporter: Cindy Li
>Assignee: Cindy Li
>Priority: Minor
>  Labels: patch
> Attachments: MAPREDUCE-4710-trunk.patch, mapreduce-4710-v1.0.2.patch, 
> mapreduce-4710.patch, mapreduce4710.patch
>
>
> Each task has the counters PHYSICAL_MEMORY_BYTES and VIRTUAL_MEMORY_BYTES, 
> which are snapshots of the memory usage of that task. They are not 
> sufficient for users to understand the peak memory usage of that task, e.g. 
> in order to diagnose task failures, tune job parameters, or change the 
> application design. This new feature will add two more counters for each 
> task: PHYSICAL_MEMORY_BYTES_MAX and VIRTUAL_MEMORY_BYTES_MAX.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAPREDUCE-2779) JobSplitWriter.java can't handle large job.split file

2011-09-09 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101649#comment-13101649
 ] 

Ming Ma commented on MAPREDUCE-2779:


Arun, the bug is still in the trunk. Thanks.

> JobSplitWriter.java can't handle large job.split file
> -
>
> Key: MAPREDUCE-2779
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2779
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: job submission
>Affects Versions: 0.20.205.0, 0.22.0, 0.23.0
>Reporter: Ming Ma
>Assignee: Ming Ma
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-2779-trunk.patch
>
>
> We use the Cascading MultiInputFormat. MultiInputFormat sometimes generates 
> a big job.split file, used internally by Hadoop, that can go beyond 2GB.
> In JobSplitWriter.java, the functions that generate this file use a 32-bit 
> signed integer to compute the offset into job.split.
> writeNewSplits
> ...
> int prevCount = out.size();
> ...
> int currCount = out.size();
> writeOldSplits
> ...
>   long offset = out.size();
> ...
>   int currLen = out.size();

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2779) JobSplitWriter.java can't handle large job.split file

2011-09-02 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096548#comment-13096548
 ] 

Ming Ma commented on MAPREDUCE-2779:


It has been tested on the 0.20-security-* branches. Testing on 0.22 will be 
conducted later.

> JobSplitWriter.java can't handle large job.split file
> -
>
> Key: MAPREDUCE-2779
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2779
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: job submission
>Affects Versions: 0.20.205.0, 0.22.0, 0.23.0
>Reporter: Ming Ma
>Assignee: Ming Ma
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-2779-trunk.patch
>
>
> We use the Cascading MultiInputFormat. MultiInputFormat sometimes generates 
> a big job.split file, used internally by Hadoop, that can go beyond 2GB.
> In JobSplitWriter.java, the functions that generate this file use a 32-bit 
> signed integer to compute the offset into job.split.
> writeNewSplits
> ...
> int prevCount = out.size();
> ...
> int currCount = out.size();
> writeOldSplits
> ...
>   long offset = out.size();
> ...
>   int currLen = out.size();

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-2779) JobSplitWriter.java can't handle large job.split file

2011-08-05 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-2779:
---

Affects Version/s: 0.20.205.0
   Status: Patch Available  (was: Open)

> JobSplitWriter.java can't handle large job.split file
> -
>
> Key: MAPREDUCE-2779
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2779
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: job submission
>Affects Versions: 0.20.205.0, 0.22.0, 0.23.0
>Reporter: Ming Ma
> Attachments: MAPREDUCE-2779-trunk.patch
>
>
> We use the Cascading MultiInputFormat. MultiInputFormat sometimes generates 
> a big job.split file, used internally by Hadoop, that can go beyond 2GB.
> In JobSplitWriter.java, the functions that generate this file use a 32-bit 
> signed integer to compute the offset into job.split.
> writeNewSplits
> ...
> int prevCount = out.size();
> ...
> int currCount = out.size();
> writeOldSplits
> ...
>   long offset = out.size();
> ...
>   int currLen = out.size();

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-2779) JobSplitWriter.java can't handle large job.split file

2011-08-05 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-2779:
---

Affects Version/s: 0.23.0
   0.22.0

> JobSplitWriter.java can't handle large job.split file
> -
>
> Key: MAPREDUCE-2779
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2779
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: job submission
>Affects Versions: 0.22.0, 0.23.0
>Reporter: Ming Ma
> Attachments: MAPREDUCE-2779-trunk.patch
>
>
> We use the Cascading MultiInputFormat. MultiInputFormat sometimes generates 
> a big job.split file, used internally by Hadoop, that can go beyond 2GB.
> In JobSplitWriter.java, the functions that generate this file use a 32-bit 
> signed integer to compute the offset into job.split.
> writeNewSplits
> ...
> int prevCount = out.size();
> ...
> int currCount = out.size();
> writeOldSplits
> ...
>   long offset = out.size();
> ...
>   int currLen = out.size();

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-2779) JobSplitWriter.java can't handle large job.split file

2011-08-04 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated MAPREDUCE-2779:
---

Attachment: MAPREDUCE-2779-trunk.patch

> JobSplitWriter.java can't handle large job.split file
> -
>
> Key: MAPREDUCE-2779
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2779
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: job submission
>Reporter: Ming Ma
> Attachments: MAPREDUCE-2779-trunk.patch
>
>
> We use the Cascading MultiInputFormat. MultiInputFormat sometimes generates 
> a big job.split file, used internally by Hadoop, that can go beyond 2GB.
> In JobSplitWriter.java, the functions that generate this file use a 32-bit 
> signed integer to compute the offset into job.split.
> writeNewSplits
> ...
> int prevCount = out.size();
> ...
> int currCount = out.size();
> writeOldSplits
> ...
>   long offset = out.size();
> ...
>   int currLen = out.size();

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (MAPREDUCE-2779) JobSplitWriter.java can't handle large job.split file

2011-08-04 Thread Ming Ma (JIRA)
JobSplitWriter.java can't handle large job.split file
-

 Key: MAPREDUCE-2779
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2779
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: job submission
Reporter: Ming Ma


We use the Cascading MultiInputFormat. MultiInputFormat sometimes generates a 
big job.split file, used internally by Hadoop, that can go beyond 2GB.

In JobSplitWriter.java, the functions that generate this file use a 32-bit 
signed integer to compute the offset into job.split (a fix is sketched after 
the fragment below).


writeNewSplits
...
int prevCount = out.size();
...
int currCount = out.size();

writeOldSplits
...
  long offset = out.size();
...
  int currLen = out.size();
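
A hedged sketch of the kind of fix needed, assuming out is an 
FSDataOutputStream: DataOutputStream#size() returns an int that is capped at 
Integer.MAX_VALUE once 2GB have been written, while 
FSDataOutputStream#getPos() returns a long. The helper below is a simplified 
stand-in for the writeNewSplits/writeOldSplits bookkeeping, not the real code.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;

public class SplitOffsets {

  interface SplitSerializer {
    void writeOneSplit(FSDataOutputStream out) throws IOException;
  }

  static long splitLength(FSDataOutputStream out, SplitSerializer serializer)
      throws IOException {
    long prevCount = out.getPos();   // was: int prevCount = out.size();
    serializer.writeOneSplit(out);
    long currCount = out.getPos();   // was: int currCount = out.size();
    return currCount - prevCount;    // stays correct past the 2GB boundary
  }
}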


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira