[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-05-06 Thread Tsz Wo Nicholas Sze (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated MAPREDUCE-5402:
---

   Resolution: Fixed
Fix Version/s: 2.5.0
   Status: Resolved  (was: Patch Available)

I have committed this.  Thanks, Tsuyoshi!

 DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE
 --

 Key: MAPREDUCE-5402
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5402
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: distcp, mrv2
Reporter: David Rosenstrauch
Assignee: Tsuyoshi OZAWA
 Fix For: 2.5.0

 Attachments: MAPREDUCE-5402.1.patch, MAPREDUCE-5402.2.patch, 
 MAPREDUCE-5402.3.patch, MAPREDUCE-5402.4-2.patch, MAPREDUCE-5402.4.patch, 
 MAPREDUCE-5402.5.patch


 In MAPREDUCE-2765, which provided the design spec for DistCpV2, the author
 describes the implementation of DynamicInputFormat, citing as one of its main
 motivations the reduction of long tails, where a few leftover mappers run
 much longer than the rest.
 However, today I ran into exactly such a long tail while using DistCpV2 with
 DynamicInputFormat.  When I tried to alleviate the problem by overriding the
 number of mappers and the split ratio used by DynamicInputFormat, I was
 blocked by the hard-coded limit imposed by the MAX_CHUNKS_TOLERABLE constant
 (currently set to 400).
 This constant is set quite low for production use.  (See the description of
 my use case below.)  And although MAPREDUCE-2765 states that this is an
 overridable maximum, reading through the code there does not appear to be any
 mechanism for overriding it.
 This should be changed: it should be possible to raise the maximum number of
 chunks beyond this arbitrary limit.
 For example, here is the situation I ran into today:
 I ran a DistCpV2 job on a cluster with 8 machines containing 128 map slots.
 The job consisted of copying ~2800 files from HDFS to Amazon S3.  I overrode
 the number of mappers for the job from the default of 20 to 128, so as to
 better parallelize the copy across the cluster.  The number of chunk
 files created was calculated as 241, and mapred.num.entries.per.chunk was 
 calculated as 12.
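 For reference, these numbers are consistent with the back-of-envelope
 reconstruction below.  This is only an illustration of the arithmetic, not
 the actual DynamicInputFormat code; the effective split ratio of 2 and the
 exact file count of 2892 are my assumptions, chosen to reproduce the
 reported 12 and 241.
{code:java}
// Back-of-envelope reconstruction of the reported numbers.  Assumes an
// effective split ratio of 2 and exactly 2892 files ("~2800" in the listing).
// Not the actual DynamicInputFormat code.
public final class ChunkMathExample {
  public static void main(String[] args) {
    int numMaps = 128;
    int splitRatio = 2;      // assumed effective value for this job
    int numFiles = 2892;     // assumed exact count behind "~2800"

    int entriesPerChunk =
        (int) Math.ceil((double) numFiles / (splitRatio * numMaps));
    int numChunks = (int) Math.ceil((double) numFiles / entriesPerChunk);

    System.out.println(entriesPerChunk);  // 12
    System.out.println(numChunks);        // 241
  }
}
{code}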
 As the job ran on, it reached a point where there were only 4 remaining map
 tasks, each of which had been running for over 2 hours.  The reason was that
 each of the 12 files those mappers were copying was quite large (several
 hundred megabytes) and took ~20 minutes to copy.  Meanwhile, all the other
 124 mappers sat idle.
 In theory, DynamicInputFormat should let me alleviate this problem.  If I
 could, say, quadruple the number of chunk files created, each chunk would
 contain only 3 files, and these large files would be distributed better
 around the cluster and copied in parallel.
 However, when I tried to do that - by overriding mapred.listing.split.ratio
 to, say, 10 - DynamicInputFormat responded with an exception ("Too many
 chunks created with splitRatio:10, numMaps:128. Reduce numMaps or decrease
 split-ratio to proceed.") - presumably because I had exceeded the
 MAX_CHUNKS_TOLERABLE value of 400.
 Is there any particular logic behind this MAX_CHUNKS_TOLERABLE limit?  I 
 can't personally see any.
 If this limit has no particular logic behind it, then it should be
 overridable - or, better yet, removed altogether.  I'm not sure I see any
 need for it.  Even if numMaps * splitRatio resulted in an extraordinarily
 large number, calculating the number of chunks as
 Math.min(numMaps * splitRatio, numFiles) would remove any need for
 MAX_CHUNKS_TOLERABLE.  In the worst case, where the product of numMaps and
 splitRatio is large, capping the number of chunks at the number of files
 (numberOfChunks = numberOfFiles) would result in 1 file per chunk - the
 maximum parallelization possible.  That may not be the best-tuned solution
 for every user, but dealing with the consequences of a poorly tuned job
 should be left up to the user.  Certainly that would be better than an
 arbitrary hard-coded limit that *prevents* proper parallelization when
 dealing with large files and/or large numbers of mappers.
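 To make the proposal concrete, here is a minimal sketch of the calculation I
 have in mind (the class, method, and variable names are illustrative only,
 not the actual DistCp code):
{code:java}
// Hypothetical sketch of the proposed chunk-count rule - not the actual
// DynamicInputFormat implementation.  Names are mine.
public final class ProposedChunkCount {

  /**
   * Let the chunk count grow with numMaps * splitRatio, but never beyond the
   * number of files, so the worst case is one file per chunk (the maximum
   * parallelization possible) and no MAX_CHUNKS_TOLERABLE is needed.
   */
  static int numberOfChunks(int numMaps, int splitRatio, int numFiles) {
    long requested = (long) numMaps * splitRatio;  // avoid int overflow
    return (int) Math.min(requested, numFiles);
  }

  public static void main(String[] args) {
    // The scenario above: 128 maps, split ratio 10, ~2800 files.
    // Today this trips the hard-coded limit of 400; under this rule it
    // would simply yield 1280 chunks, still below the file count.
    System.out.println(numberOfChunks(128, 10, 2800));  // 1280
  }
}
{code}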



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-05-05 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Attachment: MAPREDUCE-5402.5.patch



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-05-05 Thread Tsz Wo Nicholas Sze (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated MAPREDUCE-5402:
---

Hadoop Flags: Reviewed

+1, the new patch looks good.



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-05-02 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Status: Open  (was: Patch Available)



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-05-02 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Status: Patch Available  (was: Open)



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-05-02 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Attachment: MAPREDUCE-5402.4-2.patch



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-05-01 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Status: Open  (was: Patch Available)



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-05-01 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Status: Patch Available  (was: Open)



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-05-01 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Attachment: MAPREDUCE-5402.4.patch



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-05-01 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Attachment: (was: MAPREDUCE-5402.4.patch)



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-04-30 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Attachment: MAPREDUCE-5402.4.patch



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2013-07-23 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Attachment: MAPREDUCE-5402.2.patch

Updated the patch to fix the findbugs warnings.



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2013-07-23 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Attachment: MAPREDUCE-5402.3.patch

Updated the patch so that it compiles.



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2013-07-19 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Attachment: MAPREDUCE-5402.1.patch

Thanks for reporting this, David.
I attached an early patch that makes MAX_CHUNKS_TOLERABLE configurable. I think 
we should discuss a few points: should MAX_CHUNKS_IDEAL_DEFAULT, 
MIN_RECORDS_PER_CHUNK_DEFAULT, and SPLIT_RATIO_DEFAULT also be configurable, or 
be calculated from the other parameters? I also don't know why these magic 
numbers are defined with their current values.

[~mithun], what do you think?
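
For illustration, a rough sketch of what making these limits configurable could 
look like.  The helper class and property keys below are placeholders invented 
for this example (the attached patch may use different names); only the default 
of 400 comes from the issue itself.

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical sketch: read the currently hard-coded limits from the job
    // Configuration, keeping the existing values as defaults.
    class DynamicInputChunkConfig {
        static int maxChunksTolerable(Configuration conf) {
            // 400 is the hard-coded value cited in this issue.
            return conf.getInt("distcp.dynamic.max.chunks.tolerable", 400);
        }

        static int splitRatio(Configuration conf, int currentDefault) {
            return conf.getInt("distcp.dynamic.split.ratio", currentDefault);
        }
    }

Exposed this way, a user could raise the limit from the command line with the 
usual -D mechanism, e.g. hadoop distcp 
-Ddistcp.dynamic.max.chunks.tolerable=2000 -strategy dynamic <src> <dst> 
(again, the property names here are placeholders).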



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2013-07-19 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Assignee: Tsuyoshi OZAWA
  Status: Patch Available  (was: Open)



[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2013-07-18 Thread David Rosenstrauch (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Rosenstrauch updated MAPREDUCE-5402:
--

Description: 
In MAPREDUCE-2765, which provided the design spec for DistCpV2, the author 
describes the implementation of DynamicInputFormat, with one of the main 
motivations cited being to reduce the chance of long-tails where a few leftover 
mappers run much longer than the rest.

However, today I ran into a situation where I experienced exactly such a long 
tail using DistCpV2 and DynamicInputFormat.  When I tried to alleviate the 
problem by overriding the number of mappers and the split ratio used by 
DynamicInputFormat, I was prevented from doing so by the hard-coded limit set 
in the code by the MAX_CHUNKS_TOLERABLE constant (currently set to 400).

This constant is actually set quite low for production use.  (See a description 
of my use case below.)  And although MAPREDUCE-2765 states that this is an 
overridable maximum, when reading through the code there does not actually 
appear to be any mechanism available to override it.

This should be changed.  It should be possible to expand the maximum # of 
chunks beyond this arbitrary limit.


For example, here is the situation I ran into today:

I ran a distcpv2 job on a cluster with 8 machines containing 128 map slots.  
The job consisted of copying ~2800 files from HDFS to Amazon S3.  I overrode 
the number of mappers for the job from the default of 20 to 128, so as to more 
properly parallelize the copy across the cluster.  The number of chunk files 
created was calculated as 241, and mapred.num.entries.per.chunk was calculated 
as 12.
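
As a rough sanity check, those two values are consistent with a listing of 
about 241 x 12 = 2,892 entries (files plus directories) and an effective split 
ratio of 2, assuming a formula along these lines (inferred from the numbers, 
not read from the code):

    entriesPerChunk = ceil(numEntries / (numMaps * splitRatio))
                    = ceil(2892 / (128 * 2)) = 12
    numChunks       = ceil(numEntries / entriesPerChunk)
                    = ceil(2892 / 12)        = 241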

As the job ran on, it reached a point where there were only 4 remaining map 
tasks, each of which had been running for over 2 hours.  The reason was that 
each of the 12 files those mappers were copying was quite large (several 
hundred megabytes) and took ~20 minutes to copy.  Meanwhile, the other 124 
mappers sat idle.


In theory I should be able to alleviate this problem with DynamicInputFormat.  
Had I been able to, say, quadruple the number of chunk files created, each 
chunk would have contained only 3 files, and these large files would have been 
better distributed around the cluster and copied in parallel.

However, when I tried to do that, by overriding mapred.listing.split.ratio to, 
say, 10, DynamicInputFormat responded with an exception (Too many chunks 
created with splitRatio:10, numMaps:128. Reduce numMaps or decrease split-ratio 
to proceed.) - presumably because I had exceeded the MAX_CHUNKS_TOLERABLE value 
of 400.
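
Judging from that message, the guard is presumably something along these lines 
(a reconstruction for illustration only; the real method in DynamicInputFormat 
may be named and worded differently):

    // Reconstructed check implied by the error message above
    // (MAX_CHUNKS_TOLERABLE being the hard-coded 400).
    private static void validateNumChunks(int splitRatio, int numMaps)
            throws IOException {
        if (splitRatio * numMaps > MAX_CHUNKS_TOLERABLE) {
            throw new IOException("Too many chunks created with splitRatio:"
                + splitRatio + ", numMaps:" + numMaps
                + ". Reduce numMaps or decrease split-ratio to proceed.");
        }
    }

With splitRatio=10 and numMaps=128 the product is 1,280, well above 400, which 
explains the failure.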


Is there any particular logic behind this MAX_CHUNKS_TOLERABLE limit?  I can't 
personally see any.

If this limit has no particular logic behind it, then it should be overridable 
- or, better still, removed altogether.  I'm not sure I see any need for it: 
even if numMaps * splitRatio resulted in an extraordinarily large number, there 
would be no need for MAX_CHUNKS_TOLERABLE if the code were modified to 
calculate the number of chunks as Math.min(numMaps * splitRatio, numFiles).  In 
that worst case, where the product of numMaps and splitRatio is very large, 
capping the number of chunks at the number of files (numberOfChunks = 
numberOfFiles) would yield one file per chunk - the maximum parallelization 
possible.  That may not be the best-tuned configuration for every user, but it 
should be left to the user to deal with the consequences of not having tuned 
the job properly.  Certainly that would be better than an arbitrary hard-coded 
limit that *prevents* proper parallelization when dealing with large files 
and/or large numbers of mappers.

  was:
In MAPREDUCE-2765, which provided the design spec for DistCpV2, the author 
describes the implementation of DynamicInputFormat, with one of the main 
motivations cited being to reduce the chance of long-tails where a few leftover 
mappers run much longer than the rest.

However, I today ran into a situation where I experienced exactly such a long 
tail using DistCpV2 and DynamicInputFormat.  And when I tried to alleviate the 
problem by overriding the number of mappers and the split ratio used by the 
DynamicInputFormat, I was prevented from doing so by the hard-coded limit set 
in the code by the MAX_CHUNKS_TOLERABLE constant.  (Currently set to 400.)

This constant is actually set quite low for production use.  (See a description 
of my use case below.)  And although MAPREDUCE-2765 states that this is an 
overridable maximum, when reading through the code there does not actually 
appear to be any mechanism available to override it.

This should be changed.  It should be possible to expand the maximum # of 
chunks beyond this arbitrary limit.


For example, here is the situation I ran into