[jira] [Updated] (MAPREDUCE-5867) Possible NPE in KillAMPreemptionPolicy related to ProportionalCapacityPreemptionPolicy

2014-04-30 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated MAPREDUCE-5867:
---

Attachment: MapReduce-5867.2.patch

Thank you, Devaraj, for the review.
1. I have updated the patch as per the comment about local variable extraction.
2. As of now, there are no test classes available for testing the different AM 
preemption policies. Maybe creating a set of test cases for that feature can be 
tracked in another Jira. Please suggest.
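
For reference, a minimal sketch of the null guard this is about (assuming, as 
the stack trace suggests, that the NPE comes from dereferencing a missing 
contract in the PreemptionMessage; killContainer below is a hypothetical 
helper standing in for the existing kill logic, not code from the patch):

{code:java}
// Sketch only: ProportionalCapacityPreemptionPolicy may populate only one
// of the two contracts, so guard both against null before iterating.
public void preempt(Context ctxt, PreemptionMessage preemptionRequests) {
  if (preemptionRequests == null) {
    return; // nothing requested in this heartbeat
  }
  StrictPreemptionContract strict = preemptionRequests.getStrictContract();
  if (strict != null) {
    for (PreemptionContainer c : strict.getContainers()) {
      killContainer(ctxt, c); // hypothetical helper wrapping the kill path
    }
  }
  PreemptionContract contract = preemptionRequests.getContract();
  if (contract != null) {
    for (PreemptionContainer c : contract.getContainers()) {
      killContainer(ctxt, c);
    }
  }
}
{code}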

 Possible NPE in KillAMPreemptionPolicy related to 
 ProportionalCapacityPreemptionPolicy
 --

 Key: MAPREDUCE-5867
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5867
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Sunil G
Assignee: Sunil G
 Attachments: MapReduce-5867.2.patch, Yarn-1980.1.patch


 I configured KillAMPreemptionPolicy for my Application Master and tried to 
 check preemption of queues.
 In one scenario I saw the below NPE in my AM:
 2014-04-24 15:11:08,860 ERROR [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN 
 CONTACTING RM. 
 java.lang.NullPointerException
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.preemption.KillAMPreemptionPolicy.preempt(KillAMPreemptionPolicy.java:57)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:662)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:246)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:267)
   at java.lang.Thread.run(Thread.java:662)
 I was using 2.2.0 and merged MAPREDUCE-5189 to see how AM preemption works.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5638) Port Hadoop Archives document to trunk

2014-04-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985390#comment-13985390
 ] 

Hudson commented on MAPREDUCE-5638:
---

SUCCESS: Integrated in Hadoop-Yarn-trunk #556 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/556/])
MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via 
jeagles) (jeagles: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1591107)
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
* /hadoop/common/trunk/hadoop-project/src/site/site.xml


 Port Hadoop Archives document to trunk
 --

 Key: MAPREDUCE-5638
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5638
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: documentation
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
 Fix For: 3.0.0, 2.5.0

 Attachments: MAPREDUCE-5638-md.patch, MAPREDUCE-5638.patch


 The Hadoop Archives document currently exists only in branch-1. Let's port it 
 to trunk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MAPREDUCE-5103) Remove dead code QueueManager and JobEndNotifier

2014-04-30 Thread jhanver chand sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jhanver chand sharma reassigned MAPREDUCE-5103:
---

Assignee: jhanver chand sharma

 Remove dead code QueueManager and JobEndNotifier
 

 Key: MAPREDUCE-5103
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5103
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Robert Joseph Evans
Assignee: jhanver chand sharma

 There are a few classes that are dead or duplicate code at this point.  
 org/apache/hadoop/mapred/JobEndNotifier.java
 org/apache/hadoop/mapred/QueueManager.java
 org/apache/hadoop/mapred/QueueConfigurationParser.java
 org/apache/hadoop/mapred/DeprecatedQueueConfigurationParser.java
 LocalRunner is currently using the JobEndNotifier, but there is a replacement 
 for it in MRv2: org.apache.hadoop.mapreduce.v2.app.JobEndNotifier.  The two 
 should be combined and the duplicate code removed.
 There appears to be only one method called on the QueueManager, and it appears 
 to set a property that is no longer used, so it can be removed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5867) Possible NPE in KillAMPreemptionPolicy related to ProportionalCapacityPreemptionPolicy

2014-04-30 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985482#comment-13985482
 ] 

Devaraj K commented on MAPREDUCE-5867:
--

{quote}
2. As of now, there are no test classes available for testing the different AM 
preemption policies. Maybe creating a set of test cases for that feature can be 
tracked in another Jira. Please suggest.
{quote}

Can you add a new test class with test cases as part of this Jira itself? It 
may not need to be handled in a separate Jira.
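
If it helps, a bare-bones shape for such a test class could look like the 
following (a sketch under assumptions: the class name, the mocked contexts, 
and the expectation that preempt() tolerates a message with missing contracts 
are illustrative, not taken from the attached patch):

{code:java}
import static org.mockito.Mockito.*;

import org.junit.Test;
import org.apache.hadoop.mapreduce.v2.app.AppContext;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;

public class TestKillAMPreemptionPolicy {
  @Test
  public void testPreemptWithMissingContracts() {
    KillAMPreemptionPolicy policy = new KillAMPreemptionPolicy();
    policy.init(mock(AppContext.class));
    // the reported NPE scenario: a PreemptionMessage with no contracts set
    PreemptionMessage msg = mock(PreemptionMessage.class);
    when(msg.getContract()).thenReturn(null);
    when(msg.getStrictContract()).thenReturn(null);
    // must complete without throwing NullPointerException
    policy.preempt(mock(AMPreemptionPolicy.Context.class), msg);
  }
}
{code}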

 Possible NPE in KillAMPreemptionPolicy related to 
 ProportionalCapacityPreemptionPolicy
 --

 Key: MAPREDUCE-5867
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5867
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Sunil G
Assignee: Sunil G
 Attachments: MapReduce-5867.2.patch, Yarn-1980.1.patch


 I configured KillAMPreemptionPolicy for my Application Master and tried to 
 check preemption of queues.
 In one scenario I saw the below NPE in my AM:
 2014-04-24 15:11:08,860 ERROR [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN 
 CONTACTING RM. 
 java.lang.NullPointerException
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.preemption.KillAMPreemptionPolicy.preempt(KillAMPreemptionPolicy.java:57)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:662)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:246)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:267)
   at java.lang.Thread.run(Thread.java:662)
 I was using 2.2.0 and merged MAPREDUCE-5189 to see how AM preemption works.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Moved] (MAPREDUCE-5870) Support for passing Job priority through Application Submission Context in Mapreduce Side

2014-04-30 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe moved YARN-2002 to MAPREDUCE-5870:
-

Component/s: (was: api)
 (was: resourcemanager)
 client
Key: MAPREDUCE-5870  (was: YARN-2002)
Project: Hadoop Map/Reduce  (was: Hadoop YARN)

 Support for passing Job priority through Application Submission Context in 
 Mapreduce Side
 -

 Key: MAPREDUCE-5870
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5870
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Reporter: Sunil G
 Attachments: Yarn-2002.1.patch


 Job priority can be set from the client side as below [configuration and API].
   a.  JobConf.getJobPriority() and 
 Job.setPriority(JobPriority priority) 
   b.  We can also use the configuration 
 mapreduce.job.priority.
   Now this job priority can be passed in the Application Submission 
 Context from the client side.
   Here we can reuse the MRJobConfig.PRIORITY configuration. 
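
For illustration, the two halves might fit together roughly as below (a 
sketch, not the attached patch; conf and appContext are assumed to be the 
job's Configuration and the ApplicationSubmissionContext being built, and the 
switch-based numeric mapping is made up for this example):

{code:java}
// Client side: either the API or the configuration key sets the priority.
Job job = Job.getInstance(conf);
job.setPriority(JobPriority.HIGH); // same effect as mapreduce.job.priority=HIGH

// Submitter side (sketch): translate the MR enum into a YARN Priority and
// set it on the ApplicationSubmissionContext before submission.
int p;
switch (JobPriority.valueOf(conf.get(MRJobConfig.PRIORITY, "NORMAL"))) {
  case VERY_HIGH: p = 4; break; // assumed numeric mapping
  case HIGH:      p = 3; break;
  case NORMAL:    p = 2; break;
  case LOW:       p = 1; break;
  default:        p = 0; break;
}
appContext.setPriority(Priority.newInstance(p));
{code}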



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MAPREDUCE-5870) Support for passing Job priority through Application Submission Context in Mapreduce Side

2014-04-30 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned MAPREDUCE-5870:
-

Assignee: Sunil G

 Support for passing Job priority through Application Submission Context in 
 Mapreduce Side
 -

 Key: MAPREDUCE-5870
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5870
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Reporter: Sunil G
Assignee: Sunil G
 Attachments: Yarn-2002.1.patch


 Job priority can be set from the client side as below [configuration and API].
   a.  JobConf.getJobPriority() and 
 Job.setPriority(JobPriority priority) 
   b.  We can also use the configuration 
 mapreduce.job.priority.
   Now this job priority can be passed in the Application Submission 
 Context from the client side.
   Here we can reuse the MRJobConfig.PRIORITY configuration. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5638) Port Hadoop Archives document to trunk

2014-04-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985545#comment-13985545
 ] 

Hudson commented on MAPREDUCE-5638:
---

FAILURE: Integrated in Hadoop-Hdfs-trunk #1747 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1747/])
MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via 
jeagles) (jeagles: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1591107)
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
* /hadoop/common/trunk/hadoop-project/src/site/site.xml


 Port Hadoop Archives document to trunk
 --

 Key: MAPREDUCE-5638
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5638
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: documentation
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
 Fix For: 3.0.0, 2.5.0

 Attachments: MAPREDUCE-5638-md.patch, MAPREDUCE-5638.patch


 The Hadoop Archives document currently exists only in branch-1. Let's port it 
 to trunk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5638) Port Hadoop Archives document to trunk

2014-04-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985634#comment-13985634
 ] 

Hudson commented on MAPREDUCE-5638:
---

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1773 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1773/])
MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via 
jeagles) (jeagles: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1591107)
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
* /hadoop/common/trunk/hadoop-project/src/site/site.xml


 Port Hadoop Archives document to trunk
 --

 Key: MAPREDUCE-5638
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5638
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: documentation
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
 Fix For: 3.0.0, 2.5.0

 Attachments: MAPREDUCE-5638-md.patch, MAPREDUCE-5638.patch


 The Hadoop Archives document currently exists only in branch-1. Let's port it 
 to trunk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5870) Support for passing Job priority through Application Submission Context in Mapreduce Side

2014-04-30 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985729#comment-13985729
 ] 

Devaraj K commented on MAPREDUCE-5870:
--

Please refer to TestTypeConverter.java for adding tests for this.
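
For example, something in this spirit (sketch only; TypeConverter does not 
expose such a mapping today, so the toYarn overload and the expected numeric 
value here are assumptions for illustration):

{code:java}
import static org.junit.Assert.assertEquals;

import org.junit.Test;
import org.apache.hadoop.mapreduce.JobPriority;
import org.apache.hadoop.mapreduce.TypeConverter;
import org.apache.hadoop.yarn.api.records.Priority;

public class TestTypeConverterPriority {
  @Test
  public void testJobPriorityToYarnPriority() {
    // hypothetical converter added by the patch; numeric value for HIGH assumed
    assertEquals(Priority.newInstance(3),
        TypeConverter.toYarn(JobPriority.HIGH));
  }
}
{code}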

 Support for passing Job priority through Application Submission Context in 
 Mapreduce Side
 -

 Key: MAPREDUCE-5870
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5870
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: client
Reporter: Sunil G
Assignee: Sunil G
 Attachments: Yarn-2002.1.patch


 Job priority can be set from the client side as below [configuration and API].
   a.  JobConf.getJobPriority() and 
 Job.setPriority(JobPriority priority) 
   b.  We can also use the configuration 
 mapreduce.job.priority.
   Now this job priority can be passed in the Application Submission 
 Context from the client side.
   Here we can reuse the MRJobConfig.PRIORITY configuration. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Moved] (MAPREDUCE-5871) Estimate Job Endtime

2014-04-30 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh moved YARN-2006 to MAPREDUCE-5871:
--

Key: MAPREDUCE-5871  (was: YARN-2006)
Project: Hadoop Map/Reduce  (was: Hadoop YARN)

 Estimate Job Endtime
 

 Key: MAPREDUCE-5871
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5871
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Maysam Yabandeh
Assignee: Maysam Yabandeh
 Attachments: YARN-1969.patch


 YARN-1969 adds a new earliest-endtime-first policy to the fair scheduler. As 
 a prerequisite step, the AppMaster should estimate its end time and send it 
 to the RM via the heartbeat. This jira focuses on how the AppMaster performs 
 this estimation.
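
As a strawman for that estimation, the simplest thing the AM could maintain is 
remaining-work-over-observed-rate (a sketch; the method name and weighting are 
illustrative, not from the patch):

{code:java}
// Naive end-time estimate: scale elapsed time by remaining progress.
// 'progress' is the job's overall progress in [0, 1].
long estimateEndTime(long startTime, float progress) {
  long now = System.currentTimeMillis();
  if (progress <= 0.0f) {
    return -1; // no data yet; report "no estimate" to the RM
  }
  long elapsed = now - startTime;
  return now + (long) (elapsed * (1.0f - progress) / progress);
}
{code}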



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5871) Estimate Job Endtime

2014-04-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985786#comment-13985786
 ] 

Hadoop QA commented on MAPREDUCE-5871:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12642650/YARN-1969.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 3 new 
Findbugs (version 1.3.9) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4569//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4569//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4569//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4569//console

This message is automatically generated.

 Estimate Job Endtime
 

 Key: MAPREDUCE-5871
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5871
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Maysam Yabandeh
Assignee: Maysam Yabandeh
 Attachments: YARN-1969.patch


 YARN-1969 adds a new earliest-endtime-first policy to the fair scheduler. As 
 a prerequisite step, the AppMaster should estimate its end time and send it 
 to the RM via the heartbeat. This jira focuses on how the AppMaster performs 
 this estimation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5871) Estimate Job Endtime

2014-04-30 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAPREDUCE-5871:
---

Attachment: MAPREDUCE-5871.patch

Submitting a patch (MAPREDUCE-5871.patch) that resolves the issues raised by 
Findbugs.

 Estimate Job Endtime
 

 Key: MAPREDUCE-5871
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5871
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Maysam Yabandeh
Assignee: Maysam Yabandeh
 Attachments: MAPREDUCE-5871.patch, YARN-1969.patch


 YARN-1969 adds a new earliest-endtime-first policy to the fair scheduler. As 
 a prerequisite step, the AppMaster should estimate its end time and send it 
 to the RM via the heartbeat. This jira focuses on how the AppMaster performs 
 this estimation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5652) NM Recovery. ShuffleHandler should handle NM restarts

2014-04-30 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985932#comment-13985932
 ] 

Ming Ma commented on MAPREDUCE-5652:


1. Regarding a generic interface for restore/recover, I agree there is not much 
benefit in generalizing things for the sake of it. One scenario could be 
something like ShuffleHandler: some ShuffleHandlers support recovery, some 
don't. NM can ask a specific ShuffleHandler whether it supports recovery; NM 
would manage the underlying store and pass the store object to the 
ShuffleHandler, and the ShuffleHandler would manage the serialization and 
deserialization, etc. Then if NM decides to change the underlying store, the 
ShuffleHandler doesn't need to change. But at this point, it seems unnecessary.
2. If ShuffleHandler gets a DBException during recoverState as part of 
serviceStart, should ShuffleHandler ignore the exception and continue as if the 
store doesn't exist? The argument for ignoring it is that it is soft state and 
ShuffleHandler can still run without it. Or maybe this can be configurable.
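
On point 2, a configurable escape hatch could look roughly like this (sketch 
only: the property name is made up for illustration, and recoverState stands 
for whatever the recovery hook in the current patch is called):

{code:java}
// Sketch: when the (hypothetical) knob is set, treat a corrupt or
// unreadable recovery store as absent, since shuffle state is soft state.
try {
  recoverState(conf);
} catch (DBException e) {
  if (conf.getBoolean("mapreduce.shuffle.recovery.ignore-errors", false)) {
    LOG.warn("Shuffle recovery store unreadable; starting with empty state", e);
  } else {
    throw e;
  }
}
{code}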

 NM Recovery. ShuffleHandler should handle NM restarts
 -

 Key: MAPREDUCE-5652
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Jason Lowe
  Labels: shuffle
 Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch, 
 MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652-v6.patch, 
 MAPREDUCE-5652-v7.patch, MAPREDUCE-5652-v8.patch, MAPREDUCE-5652.patch


 ShuffleHandler should work across NM restarts and not require re-running 
 map tasks. Currently, on NM restart the map outputs are cleaned up, forcing 
 re-execution of map tasks; this should be avoided.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-04-30 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-5402:
--

Attachment: MAPREDUCE-5402.4.patch

 DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE
 --

 Key: MAPREDUCE-5402
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5402
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: distcp, mrv2
Reporter: David Rosenstrauch
Assignee: Tsuyoshi OZAWA
 Attachments: MAPREDUCE-5402.1.patch, MAPREDUCE-5402.2.patch, 
 MAPREDUCE-5402.3.patch, MAPREDUCE-5402.4.patch


 In MAPREDUCE-2765, which provided the design spec for DistCpV2, the author 
 describes the implementation of DynamicInputFormat, with one of the main 
 motivations cited being to reduce the chance of long-tails where a few 
 leftover mappers run much longer than the rest.
 However, today I ran into a situation where I experienced exactly such a long 
 tail using DistCpV2 and DynamicInputFormat.  And when I tried to alleviate 
 the problem by overriding the number of mappers and the split ratio used by 
 the DynamicInputFormat, I was prevented from doing so by the hard-coded limit 
 set in the code by the MAX_CHUNKS_TOLERABLE constant.  (Currently set to 400.)
 This constant is actually set quite low for production use.  (See a 
 description of my use case below.)  And although MAPREDUCE-2765 states that 
 this is an overridable maximum, when reading through the code there does 
 not actually appear to be any mechanism available to override it.
 This should be changed.  It should be possible to expand the maximum # of 
 chunks beyond this arbitrary limit.
 For example, here is the situation I ran into today:
 I ran a distcpv2 job on a cluster with 8 machines containing 128 map slots.  
 The job consisted of copying ~2800 files from HDFS to Amazon S3.  I overrode 
 the number of mappers for the job from the default of 20 to 128, so as to 
 more properly parallelize the copy across the cluster.  The number of chunk 
 files created was calculated as 241, and mapred.num.entries.per.chunk was 
 calculated as 12.
 As the job ran on, it reached a point where there were only 4 remaining map 
 tasks, which had each been running for over 2 hours.  The reason for this was 
 that each of the 12 files that those mappers were copying were quite large 
 (several hundred megabytes in size) and took ~20 minutes each.  However, 
 during this time, all the other 124 mappers sat idle.
 In theory I should be able to alleviate this problem with DynamicInputFormat. 
  If I were able to, say, quadruple the number of chunk files created, that 
 would have made each chunk contain only 3 files, and these large files would 
 have gotten distributed better around the cluster and copied in parallel.
 However, when I tried to do that - by overriding mapred.listing.split.ratio 
 to, say, 10 - DynamicInputFormat responded with an exception ("Too many 
 chunks created with splitRatio:10, numMaps:128. Reduce numMaps or decrease 
 split-ratio to proceed.") - presumably because I exceeded the 
 MAX_CHUNKS_TOLERABLE value of 400.
 Is there any particular logic behind this MAX_CHUNKS_TOLERABLE limit?  I 
 can't personally see any.
 If this limit has no particular logic behind it, then it should be 
 overridable - or even better:  removed altogether.  After all, I'm not sure I 
 see any need for it.  Even if numMaps * splitRatio resulted in an 
 extraordinarily large number, if the code were modified so that the number of 
 chunks got calculated as Math.min( numMaps * splitRatio, numFiles), then 
 there would be no need for MAX_CHUNKS_TOLERABLE.  In this worst-case scenario 
 where the product of numMaps and splitRatio is large, capping the number of 
 chunks at the number of files (numberOfChunks = numberOfFiles) would result 
 in 1 file per chunk - the maximum parallelization possible.  That may not be 
 the best-tuned solution for some users, but I would think that it should be 
 left up to the user to deal with the potential consequence of not having 
 tuned their job properly.  Certainly that would be better than having an 
 arbitrary hard-coded limit that *prevents* proper parallelization when 
 dealing with large files and/or large numbers of mappers.
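
The cap the reporter proposes is essentially a one-liner (sketch; variable 
names are illustrative):

{code:java}
// Worst case becomes one file per chunk, i.e. maximum parallelism,
// instead of failing on the MAX_CHUNKS_TOLERABLE limit.
int numChunks = Math.min(numMaps * splitRatio, numFiles);
{code}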



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

2014-04-30 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986257#comment-13986257
 ] 

Tsuyoshi OZAWA commented on MAPREDUCE-5402:
---

Updates are as follows:
* Changed to use configuration in createSplits(..)
* Changed to use configuration in getSplitRatio(..)
* Added validation in getMaxChunksTolerable, getMaxChunksIdeal and 
getMinRecordsPerChunk
* Added tests
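
A sketch of the shape such a getter takes (the property name and default 
constant here are assumptions, not necessarily the ones in the patch):

{code:java}
// Read the limit from configuration, fall back to the old constant,
// and fail fast on a nonsensical value.
private static int getMaxChunksTolerable(Configuration conf) {
  int maxChunks = conf.getInt("distcp.dynamic.max.chunks.tolerable",
      MAX_CHUNKS_TOLERABLE_DEFAULT);
  if (maxChunks <= 0) {
    throw new IllegalArgumentException(
        "distcp.dynamic.max.chunks.tolerable must be positive: " + maxChunks);
  }
  return maxChunks;
}
{code}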

 DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE
 --

 Key: MAPREDUCE-5402
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5402
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: distcp, mrv2
Reporter: David Rosenstrauch
Assignee: Tsuyoshi OZAWA
 Attachments: MAPREDUCE-5402.1.patch, MAPREDUCE-5402.2.patch, 
 MAPREDUCE-5402.3.patch, MAPREDUCE-5402.4.patch


 In MAPREDUCE-2765, which provided the design spec for DistCpV2, the author 
 describes the implementation of DynamicInputFormat, with one of the main 
 motivations cited being to reduce the chance of long-tails where a few 
 leftover mappers run much longer than the rest.
 However, today I ran into a situation where I experienced exactly such a long 
 tail using DistCpV2 and DynamicInputFormat.  And when I tried to alleviate 
 the problem by overriding the number of mappers and the split ratio used by 
 the DynamicInputFormat, I was prevented from doing so by the hard-coded limit 
 set in the code by the MAX_CHUNKS_TOLERABLE constant.  (Currently set to 400.)
 This constant is actually set quite low for production use.  (See a 
 description of my use case below.)  And although MAPREDUCE-2765 states that 
 this is an overridable maximum, when reading through the code there does 
 not actually appear to be any mechanism available to override it.
 This should be changed.  It should be possible to expand the maximum # of 
 chunks beyond this arbitrary limit.
 For example, here is the situation I ran into today:
 I ran a distcpv2 job on a cluster with 8 machines containing 128 map slots.  
 The job consisted of copying ~2800 files from HDFS to Amazon S3.  I overrode 
 the number of mappers for the job from the default of 20 to 128, so as to 
 more properly parallelize the copy across the cluster.  The number of chunk 
 files created was calculated as 241, and mapred.num.entries.per.chunk was 
 calculated as 12.
 As the job ran on, it reached a point where there were only 4 remaining map 
 tasks, which had each been running for over 2 hours.  The reason for this was 
 that each of the 12 files that those mappers were copying were quite large 
 (several hundred megabytes in size) and took ~20 minutes each.  However, 
 during this time, all the other 124 mappers sat idle.
 In theory I should be able to alleviate this problem with DynamicInputFormat. 
  If I were able to, say, quadruple the number of chunk files created, that 
 would have made each chunk contain only 3 files, and these large files would 
 have gotten distributed better around the cluster and copied in parallel.
 However, when I tried to do that - by overriding mapred.listing.split.ratio 
 to, say, 10 - DynamicInputFormat responded with an exception ("Too many 
 chunks created with splitRatio:10, numMaps:128. Reduce numMaps or decrease 
 split-ratio to proceed.") - presumably because I exceeded the 
 MAX_CHUNKS_TOLERABLE value of 400.
 Is there any particular logic behind this MAX_CHUNKS_TOLERABLE limit?  I 
 can't personally see any.
 If this limit has no particular logic behind it, then it should be 
 overridable - or even better:  removed altogether.  After all, I'm not sure I 
 see any need for it.  Even if numMaps * splitRatio resulted in an 
 extraordinarily large number, if the code were modified so that the number of 
 chunks got calculated as Math.min( numMaps * splitRatio, numFiles), then 
 there would be no need for MAX_CHUNKS_TOLERABLE.  In this worst-case scenario 
 where the product of numMaps and splitRatio is large, capping the number of 
 chunks at the number of files (numberOfChunks = numberOfFiles) would result 
 in 1 file per chunk - the maximum parallelization possible.  That may not be 
 the best-tuned solution for some users, but I would think that it should be 
 left up to the user to deal with the potential consequence of not having 
 tuned their job properly.  Certainly that would be better than having an 
 arbitrary hard-coded limit that *prevents* proper parallelization when 
 dealing with large files and/or large numbers of mappers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5862) Line records longer than 2x split size aren't handled correctly

2014-04-30 Thread bc Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bc Wong updated MAPREDUCE-5862:
---

Attachment: 0001-MAPREDUCE-5862.-Line-records-longer-than-2x-split-si.patch

Thanks! Updated patch. Added an exception to the rat config.

 Line records longer than 2x split size aren't handled correctly
 ---

 Key: MAPREDUCE-5862
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5862
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: bc Wong
Priority: Critical
 Attachments: 0001-Handle-records-larger-than-2x-split-size.1.patch, 
 0001-Handle-records-larger-than-2x-split-size.patch, 
 0001-Handle-records-larger-than-2x-split-size.patch, 
 0001-MAPREDUCE-5862.-Line-records-longer-than-2x-split-si.patch, 
 recordSpanningMultipleSplits.txt.bz2


 Suppose this split (100-200) is in the middle of a record (90-240):
 {noformat}
0         100       200       300
|  split  |  curr   |  split  |
         |--- record ---|
         90             240
 {noformat}
   
 Currently, the first split would read the entire record, up to offset 240, 
 which is good. But the 2nd split has a bug and produces a phantom record of 
 (200, 240).
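
In other words, the invariant the reader must uphold is that a record belongs 
to the split containing its first byte. Checked against the offsets in the 
diagram (helper name is illustrative, not from the patch):

{code:java}
// A record is owned by the split that contains its first byte.
static boolean ownsRecord(long splitStart, long splitEnd, long recordStart) {
  return recordStart >= splitStart && recordStart < splitEnd;
}

// ownsRecord(0, 100, 90)   -> true  : first split reads [90, 240)
// ownsRecord(100, 200, 90) -> false : "curr" split must emit nothing
// ownsRecord(200, 300, 90) -> false : third split must emit nothing
{code}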



--
This message was sent by Atlassian JIRA
(v6.2#6252)