[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

Rui Li (JIRA) Fri, 19 Dec 2014 01:09:07 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253164#comment-14253164
 ]


Rui Li commented on HIVE-9153:
------------------------------

I looked into why {{CombineHiveInputFormat.getLocations}} might return null.
{{CombineFileInputFormat}} first tries to combine blocks within each node to 
generate splits. These splits have that single node as their preferred 
location. There is also a target size for each split 
({{mapreduce.input.fileinputformat.split.maxsize}}), and 
{{CombineFileInputFormat}} tries to make sure each split reaches that size. 
Therefore, if the blocks left on a node don't add up to that size, they may be 
further combined at the rack level:
{code}
            // haven't created any split on this machine. so its ok to add a
            // smaller one for parallelism. Otherwise group it in the rack for
            // balanced size create an input split and add it to the splits
            // array
           ......
            // Put the unplaced blocks back into the pool for later rack-allocation.
            for (OneBlockInfo oneblock : validBlocks) {
              blockToNodes.put(oneblock, oneblock.hosts);
            }
{code}
At the rack level, the preferred locations consist of all nodes in that rack. 
Since my cluster doesn't have a rack-to-node mapping, the preferred locations 
are null. Such tasks may slow down the query, but they should make up only a 
small portion of the total tasks.
I'll look at how Tez combines the blocks.
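
To make the two passes concrete, here is a simplified sketch of the behavior described above. It is not the actual Hadoop code: the names {{CombineSketch}}, {{Block}}, {{Split}} and {{combine}} are hypothetical, each block is assumed to live on a single host, the minimum-size checks are left out, and the rack pass simply emits one split per rack. An empty location list stands in for the null locations seen on a cluster without a rack-to-node mapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CombineSketch {
    // Placeholder rack name for hosts with no rack mapping (assumption).
    static final String UNKNOWN_RACK = "unknown-rack";

    /** A block with a size and a single host (replication is ignored here). */
    static final class Block {
        final String host;
        final long size;
        Block(String host, long size) { this.host = host; this.size = size; }
    }

    /** A combined split: its total size and its preferred locations. */
    static final class Split {
        final long size;
        final List<String> locations;
        Split(long size, List<String> locations) { this.size = size; this.locations = locations; }
    }

    static List<Split> combine(List<Block> blocks, long maxSize,
                               Map<String, String> nodeToRack,
                               Map<String, List<String>> rackToNodes) {
        List<Split> splits = new ArrayList<>();
        Set<Block> unplaced = new LinkedHashSet<>(blocks);

        // Pass 1 (node level): group blocks by host and emit a split each
        // time the accumulated size reaches maxSize.
        Map<String, List<Block>> byNode = new LinkedHashMap<>();
        for (Block b : blocks) {
            byNode.computeIfAbsent(b.host, k -> new ArrayList<>()).add(b);
        }
        for (Map.Entry<String, List<Block>> e : byNode.entrySet()) {
            List<Block> pending = new ArrayList<>();
            long cur = 0;
            for (Block b : e.getValue()) {
                pending.add(b);
                cur += b.size;
                if (cur >= maxSize) {
                    splits.add(new Split(cur, Collections.singletonList(e.getKey())));
                    unplaced.removeAll(pending);
                    pending.clear();
                    cur = 0;
                }
            }
            // Blocks still pending stay in 'unplaced' for the rack pass.
        }

        // Pass 2 (rack level): sum the leftovers per rack. The preferred
        // locations are whatever nodes are known for that rack, which is
        // an empty list when no mapping exists.
        Map<String, Long> leftoverByRack = new LinkedHashMap<>();
        for (Block b : unplaced) {
            String rack = nodeToRack.getOrDefault(b.host, UNKNOWN_RACK);
            leftoverByRack.merge(rack, b.size, Long::sum);
        }
        for (Map.Entry<String, Long> e : leftoverByRack.entrySet()) {
            List<String> locs =
                rackToNodes.getOrDefault(e.getKey(), Collections.<String>emptyList());
            splits.add(new Split(e.getValue(), locs));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
            new Block("node1", 64), new Block("node1", 64),
            new Block("node1", 64), new Block("node2", 64));
        // No rack mapping at all, as on my cluster.
        for (Split s : combine(blocks, 128, new HashMap<>(), new HashMap<>())) {
            System.out.println(s.size + " " + s.locations);
        }
        // prints "128 [node1]" then "128 []"
    }
}
```

In this example the first two blocks on node1 fill a split with node1 as its location, while the leftover block on node1 and the single block on node2 only reach the target size together, so they end up in a split with no preferred location.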

> Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
> ---------------------------------------------------------------------
>
>                 Key: HIVE-9153
>                 URL: https://issues.apache.org/jira/browse/HIVE-9153
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>    Affects Versions: spark-branch
>            Reporter: Brock Noland
>            Assignee: Rui Li
>         Attachments: screenshot.PNG
>
>
> The default InputFormat is {{CombineHiveInputFormat}} and thus HOS uses this. 
> However, Tez uses {{HiveInputFormat}}. Since tasks are relatively cheap in 
> Spark, it might make sense for us to use {{HiveInputFormat}} as well. We 
> should evaluate this on a query which has many input splits such as {{select 
> count(\*) from store_sales where something is not null}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
