[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters

2013-02-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589527#comment-13589527
 ] 

Hudson commented on MAPREDUCE-4892:
---

Integrated in Hadoop-Mapreduce-trunk #1358 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1358/])
MAPREDUCE-4892. Modify CombineFileInputFormat to not skew input splits' 
allocation on small clusters. Contributed by Bikas Saha. (Revision 1450912)

 Result = FAILURE
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1450912
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.java
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestCombineFileInputFormat.java


> CombineFileInputFormat node input split can be skewed on small clusters
> ---
>
> Key: MAPREDUCE-4892
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4892
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Fix For: 2.0.4-beta
>
> Attachments: MAPREDUCE-4892.1.alt.patch, MAPREDUCE-4892.1.alt.patch, 
> MAPREDUCE-4892.1.patch
>
>
> The CombineFileInputFormat split generation logic tries to group blocks by 
> node in order to create splits. It iterates through the nodes and creates 
> splits on them until there aren't enough blocks left on a node that can be 
> grouped into a valid split. If the first few nodes have a lot of blocks on 
> them then they can end up getting a disproportionately large share of the 
> total number of splits created. This can result in poor locality of maps. 
> This problem is likely to happen on small clusters where it's easier to create 
> a skew in the distribution of blocks on nodes.
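
The skew described above can be illustrated with a small, self-contained sketch. It is not the actual CombineFileInputFormat code: the class name, the method greedySplitsPerNode, and the simplified model (a split is a fixed number of blocks, and every block has a replica on every node of a three-node cluster) are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only; names and simplifications are assumptions,
// not the real CombineFileInputFormat implementation.
class GreedySplitSkewSketch {

    /**
     * Visits nodes in map order. Each node keeps forming splits from the
     * still-unassigned blocks it hosts until it cannot fill another split.
     * Returns the number of splits created on each node.
     */
    static Map<String, Integer> greedySplitsPerNode(
            Map<String, List<Integer>> blocksOnNode, int blocksPerSplit) {
        if (blocksPerSplit < 1) {
            throw new IllegalArgumentException("blocksPerSplit must be >= 1");
        }
        Set<Integer> assigned = new HashSet<>();
        Map<String, Integer> splitsPerNode = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : blocksOnNode.entrySet()) {
            List<Integer> group = new ArrayList<>();
            int created = 0;
            for (int block : entry.getValue()) {
                if (assigned.contains(block)) {
                    continue; // already placed in a split on an earlier node
                }
                group.add(block);
                if (group.size() == blocksPerSplit) {
                    assigned.addAll(group); // a full split is created here
                    group.clear();
                    created++;
                }
            }
            // Blocks left in 'group' stay unassigned for later nodes.
            splitsPerNode.put(entry.getKey(), created);
        }
        return splitsPerNode;
    }

    public static void main(String[] args) {
        // Assumed layout: 9 blocks, each replicated on all 3 nodes.
        List<Integer> allBlocks = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8);
        Map<String, List<Integer>> blocksOnNode = new LinkedHashMap<>();
        blocksOnNode.put("node1", allBlocks);
        blocksOnNode.put("node2", allBlocks);
        blocksOnNode.put("node3", allBlocks);
        System.out.println(greedySplitsPerNode(blocksOnNode, 1));
    }
}
```

Under these assumptions the first node visited claims all nine splits and the other two nodes get none, which is the kind of skew the description refers to.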

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters

2013-02-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589503#comment-13589503
 ] 

Hudson commented on MAPREDUCE-4892:
---

Integrated in Hadoop-Hdfs-trunk #1330 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1330/])
MAPREDUCE-4892. Modify CombineFileInputFormat to not skew input splits' 
allocation on small clusters. Contributed by Bikas Saha. (Revision 1450912)

 Result = SUCCESS
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1450912
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.java
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestCombineFileInputFormat.java




[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters

2013-02-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589428#comment-13589428
 ] 

Hudson commented on MAPREDUCE-4892:
---

Integrated in Hadoop-Yarn-trunk #141 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/141/])
MAPREDUCE-4892. Modify CombineFileInputFormat to not skew input splits' 
allocation on small clusters. Contributed by Bikas Saha. (Revision 1450912)

 Result = SUCCESS
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1450912
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.java
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestCombineFileInputFormat.java




[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters

2013-02-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588631#comment-13588631
 ] 

Hudson commented on MAPREDUCE-4892:
---

Integrated in Hadoop-trunk-Commit #3391 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/3391/])
MAPREDUCE-4892. Modify CombineFileInputFormat to not skew input splits' 
allocation on small clusters. Contributed by Bikas Saha. (Revision 1450912)

 Result = SUCCESS
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1450912
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.java
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestCombineFileInputFormat.java


> CombineFileInputFormat node input split can be skewed on small clusters
> ---
>
> Key: MAPREDUCE-4892
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4892
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Fix For: 3.0.0
>
> Attachments: MAPREDUCE-4892.1.alt.patch, MAPREDUCE-4892.1.alt.patch, 
> MAPREDUCE-4892.1.patch
>
>
> The CombineFileInputFormat split generation logic tries to group blocks by 
> node in order to create splits. It iterates through the nodes and creates 
> splits on them until there aren't enough blocks left on a node that can be 
> grouped into a valid split. If the first few nodes have a lot of blocks on 
> them then they can end up getting a disproportionately large share of the 
> total number of splits created. This can result in poor locality of maps. 
> This problem is likely to happen on small clusters where it's easier to create 
> a skew in the distribution of blocks on nodes.



[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters

2013-02-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588579#comment-13588579
 ] 

Hadoop QA commented on MAPREDUCE-4892:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12571219/MAPREDUCE-4892.1.alt.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 tests included appear to have a timeout.{color}

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3369//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3369//console

This message is automatically generated.



[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters

2013-01-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566218#comment-13566218
 ] 

Hadoop QA commented on MAPREDUCE-4892:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12567108/MAPREDUCE-4892.1.alt.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3293//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3293//console

This message is automatically generated.

> CombineFileInputFormat node input split can be skewed on small clusters
> ---
>
> Key: MAPREDUCE-4892
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4892
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Fix For: 3.0.0
>
> Attachments: MAPREDUCE-4892.1.alt.patch, MAPREDUCE-4892.1.patch
>
>
> The CombineFileInputFormat split generation logic tries to group blocks by 
> node in order to create splits. It iterates through the nodes and creates 
> splits on them until there aren't enough blocks left on a node that can be 
> grouped into a valid split. If the first few nodes have a lot of blocks on 
> them then they can end up getting a disproportionately large share of the 
> total number of splits created. This can result in poor locality of maps. 
> This problem is likely to happen on small clusters where it's easier to create 
> a skew in the distribution of blocks on nodes.



[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters

2013-01-29 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566182#comment-13566182
 ] 

Bikas Saha commented on MAPREDUCE-4892:
---

I meant unnecessary computations for nodes that are not going to produce more 
splits in every iteration. The code will try to group any remaining blocks and 
fail to produce a meaningful split. And it will do this again and again.

In the example, we could end up generating 5, 1, 3 or 4, 1, 4 splits by trying 
to be completely node local. Or we could go 3, 1, 3 and 2 rack-local. Upon 
scheduling the rack-local option would start 3, 3, 3 if there are 3 slots per 
node. The extra node-local ones would have to wait for nodes 1 and 3 to finish 
the first wave before they can run. Or wait for delay scheduling to kick in and 
get scheduled on node 2. Which of these will be faster is not obvious.
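
A rough way to compare these layouts is to count how many maps can start in the first wave. The sketch below is only that: the helper firstWaveStarts is hypothetical, and it assumes that a rack-local split may use any free slot on any of the three nodes and that delay scheduling is ignored entirely.

```java
// Hypothetical helper for illustration; not part of Hadoop.
class FirstWaveSketch {

    /**
     * Counts the maps that can start immediately: each node runs at most
     * slotsPerNode of its node-local splits, and rack-local splits fill
     * whatever slots remain free across the nodes (an assumption).
     */
    static int firstWaveStarts(int[] nodeLocalSplits, int rackLocalSplits,
                               int slotsPerNode) {
        if (slotsPerNode < 1) {
            throw new IllegalArgumentException("slotsPerNode must be >= 1");
        }
        int started = 0;
        for (int splits : nodeLocalSplits) {
            started += Math.min(splits, slotsPerNode);
        }
        int freeSlots = nodeLocalSplits.length * slotsPerNode - started;
        return started + Math.min(rackLocalSplits, freeSlots);
    }

    public static void main(String[] args) {
        int slots = 3;
        System.out.println(firstWaveStarts(new int[] {5, 1, 3}, 0, slots));
        System.out.println(firstWaveStarts(new int[] {4, 1, 4}, 0, slots));
        System.out.println(firstWaveStarts(new int[] {3, 1, 3}, 2, slots));
    }
}
```

With 3 slots per node, the 5, 1, 3 and 4, 1, 4 layouts each start 7 of the 9 maps in the first wave, while 3, 1, 3 plus 2 rack-local starts all 9. This count says nothing about which option finishes sooner, since that depends on the read cost of the rack-local maps.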

In any case, we can debate this endlessly. I have attached an alternative patch 
which iterates multiple times over nodes until all blocks have been grouped 
or no node is left with more than minsize of blocks, while trying to keep 
number of iterations low.
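
The multi-pass idea can be sketched as follows. This is a simplified model and not the attached patch: the class name, the method roundRobinSplitsPerNode, the rule of at most one split per node per pass, and the use of a fixed block count in place of minsize are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only; not the code from MAPREDUCE-4892.1.alt.patch.
class RoundRobinSplitSketch {

    /**
     * Repeatedly passes over the nodes. In each pass a node creates at most
     * one split from the unassigned blocks it hosts. The loop stops after a
     * pass in which no node could fill a split.
     */
    static Map<String, Integer> roundRobinSplitsPerNode(
            Map<String, List<Integer>> blocksOnNode, int blocksPerSplit) {
        if (blocksPerSplit < 1) {
            throw new IllegalArgumentException("blocksPerSplit must be >= 1");
        }
        Set<Integer> assigned = new HashSet<>();
        Map<String, Integer> splitsPerNode = new LinkedHashMap<>();
        for (String node : blocksOnNode.keySet()) {
            splitsPerNode.put(node, 0);
        }
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Map.Entry<String, List<Integer>> entry : blocksOnNode.entrySet()) {
                List<Integer> group = new ArrayList<>();
                for (int block : entry.getValue()) {
                    if (assigned.contains(block)) {
                        continue;
                    }
                    group.add(block);
                    if (group.size() == blocksPerSplit) {
                        break; // enough blocks for one split on this node
                    }
                }
                if (group.size() == blocksPerSplit) {
                    assigned.addAll(group);
                    splitsPerNode.merge(entry.getKey(), 1, Integer::sum);
                    progress = true;
                }
            }
        }
        return splitsPerNode;
    }

    public static void main(String[] args) {
        // Assumed layout: 9 blocks, each replicated on all 3 nodes.
        List<Integer> allBlocks = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8);
        Map<String, List<Integer>> blocksOnNode = new LinkedHashMap<>();
        blocksOnNode.put("node1", allBlocks);
        blocksOnNode.put("node2", allBlocks);
        blocksOnNode.put("node3", allBlocks);
        System.out.println(roundRobinSplitsPerNode(blocksOnNode, 1));
    }
}
```

For this assumed layout every node ends up with three splits. The cost is the extra passes over the node list, which is the overhead discussed in this thread.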



[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters

2013-01-29 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565792#comment-13565792
 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-4892:


In my example, the maxMapperSize is one split (the default).

bq. In general and especially in skewed distributions we will be running the 
inner for loop (that tries to group blocks) multiple times with no useful 
outcome.
I don't see how there will be iterations with no outcome. We will always try to 
allocate to some node or break if we cannot allocate any.

OTOH, we can improve on allocating one at a time by instead trying to allocate 
as much as possible but not more than the minimum across all nodes.

bq. Secondly, for large jobs with multi-wave mappers, effects of non-local 
reads or delay scheduling might be small compared to the overall cost 
(including multiple runs). So this change mainly boils down to effects on 
single mapper wave jobs or small jobs.
That's not true, locality affects large jobs too. Unless your definition of 
large means read less and process for a long time.

bq. In these cases, it is arguable that having a few rack-local mappers (which 
significantly improves the on-time start of the mapper) might be more beneficial 
than having the mapper wait on a machine because it was node-local to it. 
The likelihood of waiting is high since the machine has many mappers already 
assigned to it.
That's a general argument against node locality itself, one that is taken care 
of by other mechanisms like locality-wait/delay.


> CombineFileInputFormat node input split can be skewed on small clusters
> ---
>
> Key: MAPREDUCE-4892
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4892
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Fix For: 3.0.0
>
> Attachments: MAPREDUCE-4892.1.patch
>
>
> The CombineFileInputFormat split generation logic tries to group blocks by 
> node in order to create splits. It iterates through the nodes and creates 
> splits on them until there aren't enough blocks left on a node that can be 
> grouped into a valid split. If the first few nodes have a lot of blocks on 
> them then they can end up getting a disproportionately large share of the 
> total number of splits created. This can result in poor locality of maps. 
> This problem is likely to happen on small clusters where it's easier to create 
> a skew in the distribution of blocks on nodes.



[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters

2013-01-11 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551474#comment-13551474
 ] 

Bikas Saha commented on MAPREDUCE-4892:
---

I did consider the alternate implementation. Here are the arguments against it. 
In general, and especially in skewed distributions, we will be running the inner 
for loop (that tries to group blocks) multiple times with no useful outcome. So 
for, say, 10K machines, that might end up slowing down the grouping calculation.
Secondly, for large jobs with multi-wave mappers, the effects of non-local reads 
or delay scheduling might be small compared to the overall cost (including 
multiple runs). So this change mainly boils down to effects on single mapper 
wave jobs or small jobs. In these cases, it is arguable that having a few 
rack-local mappers (which significantly improves the on-time start of the 
mapper) might be more beneficial than having the mapper wait on a machine 
because it was node-local to it. The likelihood of waiting is high since the 
machine has many mappers already assigned to it.

That being said, I did not fully get your example. What is the maxMapperSize? 
If it's 3 blocks, then how can there be 9 splits created? If it's not 3 blocks, 
then why would the avg per node be = 3?



[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters

2013-01-03 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543377#comment-13543377
 ] 

Chris Nauroth commented on MAPREDUCE-4892:
--

+1 for the patch

I applied the patch locally and ran {{TestCombineFileInputFormat}}.  Code looks 
good.  I can't think of any other edge cases that this patch doesn't handle.




[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters

2012-12-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537523#comment-13537523
 ] 

Hadoop QA commented on MAPREDUCE-4892:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12562000/MAPREDUCE-4892.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3152//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3152//console

This message is automatically generated.
