[jira] [Commented] (MAPREDUCE-5611) CombineFileInputFormat only requests a single location per split when more could be optimal

2015-05-01 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524808#comment-14524808 ]

Hadoop QA commented on MAPREDUCE-5611:
--

| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12613866/CombineFileInputFormat-trunk.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / f1a152c |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5545/console |


This message was automatically generated.

> CombineFileInputFormat only requests a single location per split when more 
> could be optimal
> ---
>
> Key: MAPREDUCE-5611
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5611
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 1.2.1
>Reporter: Chandra Prakash Bhagtani
>Assignee: Chandra Prakash Bhagtani
> Attachments: CombineFileInputFormat-trunk.patch
>
>
> I have come across an issue with CombineFileInputFormat. I ran a Hive query 
> on approximately 1.2 GB of data with CombineHiveInputFormat, which internally 
> uses CombineFileInputFormat. My cluster has 9 datanodes and max.split.size is 
> set to 256 MB.
> When I ran this query with replication factor 9, Hive consistently created 
> all 6 tasks rack-local; with replication factor 3 it created 5 rack-local 
> tasks and 1 data-local task.
> When the replication factor is 9 (equal to the cluster size), all the tasks 
> should be data-local, since each datanode holds a replica of all the input 
> data, but that is not happening: all the tasks are rack-local.
> When I dug into the getMoreSplits method in CombineFileInputFormat.java, I 
> found the issue in the following snippet (especially in the case of a higher 
> replication factor):
> {code:title=CombineFileInputFormat.java|borderStyle=solid}
> for (Iterator<Map.Entry<String, List<OneBlockInfo>>> iter =
>        nodeToBlocks.entrySet().iterator(); iter.hasNext();) {
>   Map.Entry<String, List<OneBlockInfo>> one = iter.next();
>   nodes.add(one.getKey());
>   List<OneBlockInfo> blocksInNode = one.getValue();
>   // for each block, copy it into validBlocks. Delete it from
>   // blockToNodes so that the same block does not appear in
>   // two different splits.
>   for (OneBlockInfo oneblock : blocksInNode) {
>     if (blockToNodes.containsKey(oneblock)) {
>       validBlocks.add(oneblock);
>       blockToNodes.remove(oneblock);
>       curSplitSize += oneblock.length;
>       // if the accumulated split size exceeds the maximum, then
>       // create this split.
>       if (maxSize != 0 && curSplitSize >= maxSize) {
>         // create an input split and add it to the splits array
>         addCreatedSplit(splits, nodes, validBlocks);
>         curSplitSize = 0;
>         validBlocks.clear();
>       }
>     }
>   }
> }
> {code}
> The first node in the nodeToBlocks map holds replicas of every block of the 
> input file, so the code above creates 6 splits, each with only one location. 
> If the JT does not schedule these tasks on that node, all the tasks end up 
> rack-local, even though all the other datanodes hold the other replicas.
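The fix under discussion amounts to publishing every node that holds all of a split's blocks, not just one. The following is an illustrative sketch of that idea in isolation; the class and method names are hypothetical, not part of Hadoop:

```java
import java.util.*;

// Illustrative sketch, not the actual patch: the nodes that hold a replica
// of *every* block in a split are all equally good data-local targets.
public class SplitLocations {
    // blockLocations.get(i) lists the datanodes holding a replica of block i.
    static Set<String> commonNodes(List<List<String>> blockLocations) {
        if (blockLocations.isEmpty()) {
            return Collections.emptySet();
        }
        Set<String> common = new HashSet<>(blockLocations.get(0));
        for (List<String> locs : blockLocations) {
            common.retainAll(locs); // keep only nodes that also hold this block
        }
        return common;
    }

    public static void main(String[] args) {
        // With replication factor equal to cluster size, every datanode
        // appears in every block's location list, so every node is data-local.
        List<List<String>> rf9 = Arrays.asList(
                Arrays.asList("dn1", "dn2", "dn3"),
                Arrays.asList("dn1", "dn2", "dn3"));
        System.out.println(commonNodes(rf9)); // all three nodes survive
    }
}
```

Returning this whole set as the split's locations, instead of a single node, would let the scheduler pick any member of the set and still run the task data-local.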



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5611) CombineFileInputFormat only requests a single location per split when more could be optimal

2015-05-01 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524745#comment-14524745 ]

Hadoop QA commented on MAPREDUCE-5611:
--

| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12613866/CombineFileInputFormat-trunk.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / f1a152c |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5536/console |


This message was automatically generated.






[jira] [Commented] (MAPREDUCE-5611) CombineFileInputFormat only requests a single location per split when more could be optimal

2014-01-06 Thread Sandy Ryza (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863446#comment-13863446 ]

Sandy Ryza commented on MAPREDUCE-5611:
---

bq. What if, JT is not able to schedule tasks on this node (slot limitation 
etc). Then it will pick any random node and schedule the task (having all the 
blocks non local).
That's right. However, picking a node with a small fraction of the input data 
is not much better than picking a node with none of it. It is only useful to 
place a task on a node if the majority of the data is on that node. There may 
be more optimal approaches that take the number of bytes on each node into 
account, but I think using the intersection is a good start that we know will 
not cause performance regressions.
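The "more optimal approaches" alluded to here could, for instance, rank candidate nodes by how many bytes of the split they hold locally. A hedged sketch of that idea; the names are illustrative, not Hadoop APIs:

```java
import java.util.*;

// Hypothetical sketch: rank candidate nodes by local bytes instead of
// taking a plain set intersection of replica locations.
public class ByteWeightedLocations {
    // blockBytes[i] is the size of block i; blockNodes.get(i) its replica holders.
    static List<String> rankNodes(long[] blockBytes, List<List<String>> blockNodes) {
        Map<String, Long> bytesOnNode = new HashMap<>();
        for (int i = 0; i < blockBytes.length; i++) {
            for (String node : blockNodes.get(i)) {
                bytesOnNode.merge(node, blockBytes[i], Long::sum);
            }
        }
        List<String> ranked = new ArrayList<>(bytesOnNode.keySet());
        // most local bytes first
        ranked.sort((a, b) -> Long.compare(bytesOnNode.get(b), bytesOnNode.get(a)));
        return ranked;
    }

    public static void main(String[] args) {
        long[] sizes = {128, 64};
        List<List<String>> nodes = Arrays.asList(
                Arrays.asList("dn1", "dn2"),
                Arrays.asList("dn2", "dn3"));
        // dn2 holds 192 bytes, dn1 128, dn3 64
        System.out.println(rankNodes(sizes, nodes)); // [dn2, dn1, dn3]
    }
}
```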

bq. What if there is no intersection i.e common nodes for blocks in a split?
The change was proposed for the code path where we build splits out of the 
nodeToBlocks map. In this part of the split-creation process there will always 
be an intersection, because the blocks are all chosen from a specific node.
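This guarantee can be checked mechanically: every block placed in a split during this phase was taken from nodeToBlocks.get(node), so that node appears in each block's location list and therefore in the intersection. A small illustrative check, not Hadoop code:

```java
import java.util.*;

// Illustration of the non-empty-intersection argument: if every block in a
// split was drawn from one node's nodeToBlocks entry, that node is in each
// block's location set, so the intersection contains at least that node.
public class NonEmptyIntersection {
    static boolean nodeInAllLocations(String node,
                                      List<Set<String>> perBlockLocations) {
        for (Set<String> locs : perBlockLocations) {
            if (!locs.contains(node)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Blocks selected from dn1's entry in nodeToBlocks: dn1 holds each one.
        List<Set<String>> locations = Arrays.asList(
                new HashSet<>(Arrays.asList("dn1", "dn2")),
                new HashSet<>(Arrays.asList("dn1", "dn3")));
        System.out.println(nodeInAllLocations("dn1", locations)); // true
    }
}
```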






[jira] [Commented] (MAPREDUCE-5611) CombineFileInputFormat only requests a single location per split when more could be optimal

2014-01-03 Thread Chandra Prakash Bhagtani (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861418#comment-13861418 ]

Chandra Prakash Bhagtani commented on MAPREDUCE-5611:
-

Agreed, the common nodes (the intersection) should be placed at the beginning 
of the locations set, but I feel that having all the locations of all the 
blocks in the locations set will help scheduling and data locality. Two 
concerns:

1. In your example, one node holds all 1000 replicas. What if the JT is not 
able to schedule tasks on that node (slot limitation, etc.)? Then it will pick 
a random node and schedule the task there, with all the blocks non-local.

2. What if there is no intersection, i.e. no common nodes for the blocks in a 
split?






[jira] [Commented] (MAPREDUCE-5611) CombineFileInputFormat only requests a single location per split when more could be optimal

2013-12-13 Thread Rajesh Balamohan (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848164#comment-13848164 ]

Rajesh Balamohan commented on MAPREDUCE-5611:
-

Agreed, the ideal would be to compute the *intersection* of the nodes and 
include it in the split information. We will modify the patch to accommodate 
this and post the details.



