[jira] [Moved] (MAPREDUCE-5874) Creating MapReduce REST API section

2014-05-02 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen moved YARN-1999 to MAPREDUCE-5874:
--

  Component/s: documentation  (was: documentation)
 Target Version/s: 2.5.0  (was: 2.5.0)
Affects Version/s: 2.4.0  (was: 2.4.0)
  Key: MAPREDUCE-5874  (was: YARN-1999)
  Project: Hadoop Map/Reduce  (was: Hadoop YARN)

 Creating MapReduce REST API section
 ---

 Key: MAPREDUCE-5874
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5874
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.4.0
Reporter: Ravi Prakash
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1999.1.patch


 Now that we have the YARN HistoryServer, perhaps we should move 
 HistoryServerRest.apt.vm and MapRedAppMasterRest.apt.vm into the MapReduce 
 section where they really belong?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5874) Creating MapReduce REST API section

2014-05-02 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987433#comment-13987433
 ] 

Zhijie Shen commented on MAPREDUCE-5874:


+1 for reorganizing the web pages. For clarity, maybe it's better to say "Job 
History Server" instead?

Moved the ticket to MR.



[jira] [Commented] (MAPREDUCE-5874) Creating MapReduce REST API section

2014-05-02 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987440#comment-13987440
 ] 

Zhijie Shen commented on MAPREDUCE-5874:


Or "MR History Server" as the section header.



[jira] [Commented] (MAPREDUCE-5874) Creating MapReduce REST API section

2014-05-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987465#comment-13987465
 ] 

Hadoop QA commented on MAPREDUCE-5874:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12642614/YARN-1999.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+0 tests included{color}.  The patch appears to be a 
documentation patch that doesn't require tests.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4575//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4575//console

This message is automatically generated.



[jira] [Commented] (MAPREDUCE-5809) Enhance distcp to support preserving HDFS ACLs.

2014-05-02 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987512#comment-13987512
 ] 

Akira AJISAKA commented on MAPREDUCE-5809:
--

There are some unused imports in ScopedAclEntries.java and DistCpUtils.java.
I'm +1 (non-binding) once that is addressed.

 Enhance distcp to support preserving HDFS ACLs.
 ---

 Key: MAPREDUCE-5809
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5809
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: distcp
Affects Versions: 2.4.0
Reporter: Chris Nauroth
Assignee: Chris Nauroth
 Attachments: MAPREDUCE-5809.1.patch, MAPREDUCE-5809.2.patch


 This issue tracks enhancing distcp to add a new command-line argument for 
 preserving HDFS ACLs from the source at the copy destination.





[jira] [Commented] (MAPREDUCE-5652) NM Recovery. ShuffleHandler should handle NM restarts

2014-05-02 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987668#comment-13987668
 ] 

Jason Lowe commented on MAPREDUCE-5652:
---

bq. Just to confirm, protobuf should be backward compatible, e.g., the store 
state serialized with version 2.4 should be readable by NM/MR compiled with 
version 2.5.

Yes, the protobuf incompatibility between 2.4 and 2.5 is an issue with the 
interfaces to the protobuf code and not an incompatibility with the data layout 
of protobuf messages.

bq. On an unrelated note, based on how NM's AuxServices' serviceStart handles 
error for each AuxService' serviceStart, if one AuxService throws some 
exception, the rest of AuxServices' serviceStart will be skipped.

I might be reading the code incorrectly, but it looks like 
AuxServices#serviceStart doesn't try to handle exceptions coming from 
individual aux services at all.  If one of their startups throws then it will 
be converted into a RuntimeException (by AbstractService#start) which will 
bubble up out of AuxServices and likely all the way up such that the NM startup 
will fail.
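
The propagation described above can be illustrated with a minimal sketch. The 
classes SketchService and SketchAuxServices are hypothetical stand-ins invented 
for this example; they are not the Hadoop AbstractService or AuxServices 
classes, and the real code may differ in detail. The sketch only assumes what 
the comment states: a start wrapper converts checked exceptions into a 
RuntimeException, and the parent loop has no per-child error handling.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical base class: start() wraps serviceStart() failures. */
abstract class SketchService {
    protected abstract void serviceStart() throws Exception;

    public final void start() {
        try {
            serviceStart();
        } catch (RuntimeException e) {
            throw e; // already unchecked, pass through unchanged
        } catch (Exception e) {
            // Checked exceptions are converted to an unchecked one.
            throw new RuntimeException("service failed to start", e);
        }
    }
}

/** Hypothetical parent that starts its children in order. */
class SketchAuxServices extends SketchService {
    private final List<SketchService> children = new ArrayList<>();
    private int started = 0;

    void add(SketchService child) { children.add(child); }

    int startedCount() { return started; }

    @Override
    protected void serviceStart() {
        // No try/catch here: the first child failure aborts the loop,
        // so later children are never started and the caller sees the error.
        for (SketchService child : children) {
            child.start();
            started++;
        }
    }
}
```

In this sketch, if the second of three children throws, only the first child 
counts as started and the exception escapes from the parent's start() call.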

As you pointed out before, a better way to handle aux services would be to run 
them outside of the NM (maybe even within containers).  It'd also be nice to 
make them more dynamic, such that application submissions can provide an aux 
service they require.  They could be started on demand, ref-counted, and 
stopped accordingly based on usage, which would be a smoother answer to rolling 
upgrades and sharing multiple versions of the aux services within the cluster.  
This is of course a non-trivial amount of work for another JIRA. ;-)
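
As a rough illustration of the ref-counting idea only, here is a minimal 
sketch. RefCountedServices and its methods are hypothetical names made up for 
this example; nothing like this exists in YARN today, and a real design would 
also have to start and stop the actual service processes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry that tracks how many applications use each service. */
class RefCountedServices {
    private final Map<String, Integer> refCounts = new HashMap<>();

    /**
     * Called when an application that requires the service is submitted.
     * Returns true if this is the first user, i.e. the service must be started.
     */
    synchronized boolean acquire(String name) {
        int count = refCounts.merge(name, 1, Integer::sum);
        return count == 1;
    }

    /**
     * Called when an application using the service finishes.
     * Returns true if this was the last user, i.e. the service can be stopped.
     */
    synchronized boolean release(String name) {
        Integer count = refCounts.get(name);
        if (count == null) {
            throw new IllegalStateException("service not in use: " + name);
        }
        if (count == 1) {
            refCounts.remove(name);
            return true;
        }
        refCounts.put(name, count - 1);
        return false;
    }

    synchronized boolean isRunning(String name) {
        return refCounts.containsKey(name);
    }
}
```

Keying the registry by a name that includes the version (for example a 
hypothetical "shuffle-v1" and "shuffle-v2") is one way several versions of an 
aux service could coexist during a rolling upgrade.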

 NM Recovery. ShuffleHandler should handle NM restarts
 -

 Key: MAPREDUCE-5652
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Jason Lowe
  Labels: shuffle
 Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch, 
 MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652-v6.patch, 
 MAPREDUCE-5652-v7.patch, MAPREDUCE-5652-v8.patch, 
 MAPREDUCE-5652-v9-and-YARN-1987.patch, MAPREDUCE-5652.patch


 ShuffleHandler should work across NM restarts and not require re-running 
 map-tasks. On NM restart, the map outputs are cleaned up requiring 
 re-execution of map tasks and should be avoided.





[jira] [Updated] (MAPREDUCE-5874) Creating MapReduce REST API section

2014-05-02 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5874:
--

Attachment: MAPREDUCE-5874.2.patch



[jira] [Commented] (MAPREDUCE-5874) Creating MapReduce REST API section

2014-05-02 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987753#comment-13987753
 ] 

Tsuyoshi OZAWA commented on MAPREDUCE-5874:
---

Thanks for your review, Akira and Zhijie. Updated the patch as follows:
* Changed the section name from History Server to MR History Server
* Removed dead links from the YARN REST APIs section
* Updated the sentence Akira pointed out
* Moved HistoryServerRest.apt.vm correctly



[jira] [Commented] (MAPREDUCE-5874) Creating MapReduce REST API section

2014-05-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987793#comment-13987793
 ] 

Hadoop QA commented on MAPREDUCE-5874:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12643047/MAPREDUCE-5874.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+0 tests included{color}.  The patch appears to be a 
documentation patch that doesn't require tests.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4576//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4576//console

This message is automatically generated.



[jira] [Commented] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-05-02 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987797#comment-13987797
 ] 

Tsuyoshi OZAWA commented on MAPREDUCE-5402:
---

Waiting for Jenkins.

 DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE
 --

 Key: MAPREDUCE-5402
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5402
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: distcp, mrv2
Reporter: David Rosenstrauch
Assignee: Tsuyoshi OZAWA
 Attachments: MAPREDUCE-5402.1.patch, MAPREDUCE-5402.2.patch, 
 MAPREDUCE-5402.3.patch, MAPREDUCE-5402.4.patch


 In MAPREDUCE-2765, which provided the design spec for DistCpV2, the author 
 describes the implementation of DynamicInputFormat, with one of the main 
 motivations cited being to reduce the chance of long-tails where a few 
 leftover mappers run much longer than the rest.
 However, today I ran into a situation where I experienced exactly such a long 
 tail using DistCpV2 and DynamicInputFormat.  And when I tried to alleviate 
 the problem by overriding the number of mappers and the split ratio used by 
 the DynamicInputFormat, I was prevented from doing so by the hard-coded limit 
 set in the code by the MAX_CHUNKS_TOLERABLE constant.  (Currently set to 400.)
 This constant is actually set quite low for production use.  (See a 
 description of my use case below.)  And although MAPREDUCE-2765 states that 
 this is an overridable maximum, when reading through the code there does 
 not actually appear to be any mechanism available to override it.
 This should be changed.  It should be possible to expand the maximum # of 
 chunks beyond this arbitrary limit.
 For example, here is the situation I ran into today:
 I ran a distcpv2 job on a cluster with 8 machines containing 128 map slots.  
 The job consisted of copying ~2800 files from HDFS to Amazon S3.  I overrode 
 the number of mappers for the job from the default of 20 to 128, so as to 
 more properly parallelize the copy across the cluster.  The number of chunk 
 files created was calculated as 241, and mapred.num.entries.per.chunk was 
 calculated as 12.
 As the job ran on, it reached a point where there were only 4 remaining map 
 tasks, which had each been running for over 2 hours.  The reason for this was 
 that each of the 12 files that those mappers were copying was quite large 
 (several hundred megabytes in size) and took ~20 minutes each.  However, 
 during this time, all the other 124 mappers sat idle.
 In theory I should be able to alleviate this problem with DynamicInputFormat. 
  If I were able to, say, quadruple the number of chunk files created, that 
 would have made each chunk contain only 3 files, and these large files would 
 have gotten distributed better around the cluster and copied in parallel.
 However, when I tried to do that - by overriding mapred.listing.split.ratio 
 to, say, 10 - DynamicInputFormat responded with an exception (Too many 
 chunks created with splitRatio:10, numMaps:128. Reduce numMaps or decrease 
 split-ratio to proceed.) - presumably because I exceeded the 
 MAX_CHUNKS_TOLERABLE value of 400.
 Is there any particular logic behind this MAX_CHUNKS_TOLERABLE limit?  I 
 can't personally see any.
 If this limit has no particular logic behind it, then it should be 
 overridable - or even better:  removed altogether.  After all, I'm not sure I 
 see any need for it.  Even if numMaps * splitRatio resulted in an 
 extraordinarily large number, if the code were modified so that the number of 
 chunks got calculated as Math.min( numMaps * splitRatio, numFiles), then 
 there would be no need for MAX_CHUNKS_TOLERABLE.  In this worst-case scenario 
 where the product of numMaps and splitRatio is large, capping the number of 
 chunks at the number of files (numberOfChunks = numberOfFiles) would result 
 in 1 file per chunk - the maximum parallelization possible.  That may not be 
 the best-tuned solution for some users, but I would think that it should be 
 left up to the user to deal with the potential consequence of not having 
 tuned their job properly.  Certainly that would be better than having an 
 arbitrary hard-coded limit that *prevents* proper parallelization when 
 dealing with large files and/or large numbers of mappers.
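
 The proposed calculation can be sketched as follows. This is only an 
 illustration of the Math.min idea from the description, not the actual 
 DynamicInputFormat code; the class ChunkMath and its method names are 
 hypothetical, and the rounding-up of files per chunk is an assumption that 
 happens to reproduce the 241 chunks / 12 entries figures quoted above.

```java
/** Hypothetical helper illustrating the proposed cap; not Hadoop code. */
final class ChunkMath {
    private ChunkMath() {}

    /** Caps the requested chunk count at the number of files. */
    static int numberOfChunks(int numMaps, int splitRatio, int numFiles) {
        // Multiply as long so a large numMaps * splitRatio cannot overflow int.
        long requested = (long) numMaps * splitRatio;
        return (int) Math.min(requested, numFiles);
    }

    /** Files per chunk, rounded up (assumed rounding, for illustration). */
    static int filesPerChunk(int numFiles, int numChunks) {
        if (numChunks <= 0) {
            throw new IllegalArgumentException("numChunks must be positive");
        }
        return (numFiles + numChunks - 1) / numChunks;
    }
}
```

 With the numbers from the report (128 maps, split ratio 10, about 2800 
 files), this gives 1280 chunks of at most 3 files each. If numMaps * 
 splitRatio exceeds the file count, the result is capped at 2800 chunks of 1 
 file each.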





[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-05-02 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Status: Open  (was: Patch Available)



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-05-02 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Status: Patch Available  (was: Open)



[jira] [Updated] (MAPREDUCE-5871) Estimate Job Endtime

2014-05-02 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAPREDUCE-5871:
---

Status: Patch Available  (was: Open)

 Estimate Job Endtime
 

 Key: MAPREDUCE-5871
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5871
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Maysam Yabandeh
Assignee: Maysam Yabandeh
 Attachments: MAPREDUCE-5871.patch


 YARN-1969 adds a new earliest-endtime-first policy to the fair scheduler. As 
 a prerequisite step, the AppMaster should estimate its end time and send it 
 to the RM via the heartbeat. This jira focuses on how the AppMaster performs 
 this estimation.





[jira] [Updated] (MAPREDUCE-5871) Estimate Job Endtime

2014-05-02 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAPREDUCE-5871:
---

Status: Open  (was: Patch Available)



[jira] [Updated] (MAPREDUCE-5809) Enhance distcp to support preserving HDFS ACLs.

2014-05-02 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated MAPREDUCE-5809:
-

Attachment: MAPREDUCE-5809.3.patch

Thanks, Akira.  Here is patch version 3.  I cleaned up the unused imports in 
{{DistCpUtils}}.  However, I didn't see any unused imports in 
{{ScopedAclEntries}}.



[jira] [Commented] (MAPREDUCE-5866) TestFixedLengthInputFormat fails in windows

2014-05-02 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988081#comment-13988081
 ] 

Chris Nauroth commented on MAPREDUCE-5866:
--

Hi, [~vvasudev].  Nice catch.  For 
{{org.apache.hadoop.mapreduce.lib.input.TestFixedLengthInputFormat}}, there is 
still a small risk of a resource leak, because {{RecordReader#initialize}} can 
throw an exception.  I recommend moving that initialize call inside the try 
block.
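
The recommended pattern can be sketched like this. SketchReader and 
ReaderUsage are hypothetical classes written for this example; they are not 
Hadoop's RecordReader or the actual test code, and the real initialize call 
takes arguments that are omitted here.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical reader whose initialize() may fail; not Hadoop's RecordReader. */
class SketchReader implements Closeable {
    boolean closed = false;
    private final boolean failOnInit;

    SketchReader(boolean failOnInit) {
        this.failOnInit = failOnInit;
    }

    void initialize() throws IOException {
        if (failOnInit) {
            throw new IOException("initialize failed");
        }
    }

    @Override
    public void close() {
        closed = true;
    }
}

class ReaderUsage {
    /** initialize() is inside the try, so close() runs even if it throws. */
    static void readAll(SketchReader reader) throws IOException {
        try {
            reader.initialize();
            // ... read records here ...
        } finally {
            reader.close();
        }
    }
}
```

If initialize() were called before the try block, an exception from it would 
skip the finally clause and leave the reader open.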

 TestFixedLengthInputFormat fails in windows
 ---

 Key: MAPREDUCE-5866
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5866
 Project: Hadoop Map/Reduce
  Issue Type: Test
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-1992.0.patch


 The org.apache.hadoop.mapred.TestFixedLengthInputFormat and 
 org.apache.hadoop.mapreduce.lib.input.TestFixedLengthInputFormat tests fail 
 on Windows.





[jira] [Commented] (MAPREDUCE-5809) Enhance distcp to support preserving HDFS ACLs.

2014-05-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988159#comment-13988159
 ] 

Hadoop QA commented on MAPREDUCE-5809:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12643073/MAPREDUCE-5809.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs 
hadoop-tools/hadoop-distcp.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4578//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4578//console

This message is automatically generated.

 Enhance distcp to support preserving HDFS ACLs.
 ---

 Key: MAPREDUCE-5809
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5809
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: distcp
Affects Versions: 2.4.0
Reporter: Chris Nauroth
Assignee: Chris Nauroth
 Attachments: MAPREDUCE-5809.1.patch, MAPREDUCE-5809.2.patch, 
 MAPREDUCE-5809.3.patch


 This issue tracks enhancing distcp to add a new command-line argument for 
 preserving HDFS ACLs from the source at the copy destination.





[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-05-02 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Attachment: MAPREDUCE-5402.4-2.patch

 DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE
 --

 Key: MAPREDUCE-5402
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5402
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: distcp, mrv2
Reporter: David Rosenstrauch
Assignee: Tsuyoshi OZAWA
 Attachments: MAPREDUCE-5402.1.patch, MAPREDUCE-5402.2.patch, 
 MAPREDUCE-5402.3.patch, MAPREDUCE-5402.4-2.patch, MAPREDUCE-5402.4.patch


 In MAPREDUCE-2765, which provided the design spec for DistCpV2, the author 
 describes the implementation of DynamicInputFormat, with one of the main 
 motivations cited being to reduce the chance of long-tails where a few 
 leftover mappers run much longer than the rest.
 However, today I ran into exactly such a long tail using DistCpV2 and 
 DynamicInputFormat.  When I tried to alleviate the problem by overriding the 
 number of mappers and the split ratio used by DynamicInputFormat, I was 
 prevented from doing so by the hard-coded limit set in the code by the 
 MAX_CHUNKS_TOLERABLE constant (currently set to 400).
 This constant is set quite low for production use.  (See the description of 
 my use case below.)  And although MAPREDUCE-2765 states that this is an 
 overridable maximum, the code does not appear to provide any mechanism for 
 overriding it.
 This should be changed.  It should be possible to raise the maximum number 
 of chunks beyond this arbitrary limit.
 For example, here is the situation I ran into today:
 I ran a distcpv2 job on a cluster with 8 machines containing 128 map slots.  
 The job consisted of copying ~2800 files from HDFS to Amazon S3.  I overrode 
 the number of mappers for the job from the default of 20 to 128, so as to 
 more properly parallelize the copy across the cluster.  The number of chunk 
 files created was calculated as 241, and mapred.num.entries.per.chunk was 
 calculated as 12.
 As the job ran on, it reached a point where there were only 4 remaining map 
 tasks, each of which had been running for over 2 hours.  The reason was that 
 each of the 12 files those mappers were copying was quite large (several 
 hundred megabytes in size) and took ~20 minutes to copy.  Meanwhile, the 
 other 124 mappers sat idle.
 In theory I should be able to alleviate this problem with DynamicInputFormat. 
  If I were able to, say, quadruple the number of chunk files created, that 
 would have made each chunk contain only 3 files, and these large files would 
 have gotten distributed better around the cluster and copied in parallel.
 However, when I tried to do that - by overriding mapred.listing.split.ratio 
 to, say, 10 - DynamicInputFormat responded with an exception (Too many 
 chunks created with splitRatio:10, numMaps:128. Reduce numMaps or decrease 
 split-ratio to proceed.) - presumably because I exceeded the 
 MAX_CHUNKS_TOLERABLE value of 400.
 Is there any particular logic behind this MAX_CHUNKS_TOLERABLE limit?  I 
 can't personally see any.
 If this limit has no particular logic behind it, then it should be 
 overridable - or even better:  removed altogether.  After all, I'm not sure I 
 see any need for it.  Even if numMaps * splitRatio resulted in an 
 extraordinarily large number, if the code were modified so that the number of 
 chunks got calculated as Math.min( numMaps * splitRatio, numFiles), then 
 there would be no need for MAX_CHUNKS_TOLERABLE.  In this worst-case scenario 
 where the product of numMaps and splitRatio is large, capping the number of 
 chunks at the number of files (numberOfChunks = numberOfFiles) would result 
 in 1 file per chunk - the maximum parallelization possible.  That may not be 
 the best-tuned solution for some users, but I would think that it should be 
 left up to the user to deal with the potential consequence of not having 
 tuned their job properly.  Certainly that would be better than having an 
 arbitrary hard-coded limit that *prevents* proper parallelization when 
 dealing with large files and/or large numbers of mappers.
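
The arithmetic in the report above can be sketched in a few lines of Java. 
This is only an illustration, not the DistCp code: the class ChunkMath and 
its methods are hypothetical names, the limit check is an assumed form 
inferred from the exception message quoted in the report, and 2800 stands in 
for the approximate file count (~2800) given there.

```java
// Hypothetical sketch; names are illustrative and not part of DistCp.
final class ChunkMath {
    // Hard-coded limit quoted in the report.
    static final int MAX_CHUNKS_TOLERABLE = 400;

    // Assumed form of the existing check, inferred from the exception text.
    static boolean exceedsLimit(int numMaps, int splitRatio) {
        return (long) numMaps * splitRatio > MAX_CHUNKS_TOLERABLE;
    }

    // Proposal from the report: never create more chunks than there are files.
    static long cappedChunks(int numMaps, int splitRatio, long numFiles) {
        return Math.min((long) numMaps * splitRatio, numFiles);
    }

    // Files per chunk, rounded up.
    static long filesPerChunk(long numFiles, long numChunks) {
        return (numFiles + numChunks - 1) / numChunks;
    }

    public static void main(String[] args) {
        long numFiles = 2800; // approximate figure from the report
        System.out.println("128 maps x ratio 10 over limit: " + exceedsLimit(128, 10));
        System.out.println("files per chunk at 241 chunks: " + filesPerChunk(numFiles, 241));
        System.out.println("files per chunk at 4 x 241 chunks: " + filesPerChunk(numFiles, 4 * 241));
        long chunks = cappedChunks(128, 10, numFiles);
        System.out.println("capped chunk count: " + chunks
                + ", files per chunk: " + filesPerChunk(numFiles, chunks));
    }
}
```

With the cap in place, the chunk count can never exceed the number of files, 
so the worst case is one file per chunk regardless of how large 
numMaps * splitRatio becomes.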


