[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251755#comment-14251755
 ] 

Xuefu Zhang commented on HIVE-9153:
-----------------------------------

Thanks for the findings, [~lirui]. I heard that the Spark snapshot we are using 
is 2x slower than the previous version, which might explain the slowness. Also, 
I think both the number of mappers and locality matter for speed, but the two 
may conflict with each other. For instance, if we have more executors than 
mappers, it's desirable to have more map tasks. However, doing so might hurt 
locality because some mappers might read remotely. On the other hand, if there 
are more mappers than executors, then fewer mappers will help the speed.
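
To make the trade-off concrete, here is a rough back-of-the-envelope sketch. 
It is not how Spark or Hive actually schedules tasks. It assumes the input work 
divides evenly across map tasks, that tasks run in full waves over the 
available executor slots, and that each task pays a fixed startup overhead. The 
function name and the parameters (total_work, task_overhead, remote_slowdown) 
are hypothetical and only there for illustration.

```python
import math


def estimated_stage_time(total_work, num_tasks, num_slots,
                         task_overhead=0.0, remote_slowdown=1.0):
    """Toy estimate of map-stage time (hypothetical model, not Spark's).

    total_work      -- total processing time if run on a single slot
    num_tasks       -- number of map tasks (splits)
    num_slots       -- number of executor slots that can run tasks in parallel
    task_overhead   -- assumed fixed cost paid by every task
    remote_slowdown -- assumed multiplier on work when reads lose locality
    """
    if num_tasks <= 0 or num_slots <= 0:
        raise ValueError("num_tasks and num_slots must be positive")
    # Tasks run in waves: each wave fills at most num_slots slots.
    waves = math.ceil(num_tasks / num_slots)
    # Work is assumed to be split evenly across tasks.
    per_task = (total_work / num_tasks) * remote_slowdown + task_overhead
    return waves * per_task


# More executors than mappers: adding map tasks shortens the stage,
# even if the extra tasks read remotely and run somewhat slower.
few_tasks = estimated_stage_time(120, num_tasks=4, num_slots=8)
more_tasks = estimated_stage_time(120, num_tasks=8, num_slots=8,
                                  remote_slowdown=1.5)

# More mappers than executors: the extra wave and per-task overhead
# make the larger task count slower in this model.
one_wave = estimated_stage_time(120, num_tasks=8, num_slots=8,
                                task_overhead=1.0)
two_waves = estimated_stage_time(120, num_tasks=16, num_slots=8,
                                 task_overhead=1.0)
```

Under these assumptions, going from 4 to 8 tasks on 8 slots still wins even with 
a 1.5x remote-read slowdown, while going from 8 to 16 tasks on 8 slots loses once 
there is any per-task overhead. Real numbers will depend on the data layout and 
the cluster.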

Anyway, it would be good to find out how Tez generates splits using 
HiveInputFormat. Also, we should fix HIVE-8722. Is there a way to disable 
Spark's delay scheduling so we can try it out?

> Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
> ---------------------------------------------------------------------
>
>                 Key: HIVE-9153
>                 URL: https://issues.apache.org/jira/browse/HIVE-9153
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>    Affects Versions: spark-branch
>            Reporter: Brock Noland
>            Assignee: Rui Li
>         Attachments: screenshot.PNG
>
>
> The default InputFormat is {{CombineHiveInputFormat}} and thus HOS uses this. 
> However, Tez uses {{HiveInputFormat}}. Since tasks are relatively cheap in 
> Spark, it might make sense for us to use {{HiveInputFormat}} as well. We 
> should evaluate this on a query which has many input splits such as {{select 
> count(\*) from store_sales where something is not null}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
