[
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251755#comment-14251755
]
Xuefu Zhang commented on HIVE-9153:
-----------------------------------
Thanks for the findings, [~lirui]. I heard that the Spark snapshot we are using
is 2X slower than the previous version, which might explain the slowness. Also, I
think both the number of mappers and locality matter for speed, but the two may
conflict with each other. For instance, if we have more executors than mappers,
it's desirable to have more map tasks. However, doing so might hurt locality
because some mappers might read remotely. On the other hand, if there are more
mappers than executors, then fewer mappers will help the speed.
Anyway, it would be good to find out how Tez generates splits using
HiveInputFormat. Also, we should fix HIVE-8722. Is there a way to disable
Spark's delay scheduling so we can try that out?
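
To make the mappers-versus-executors trade-off above concrete, here is a rough back-of-the-envelope sketch. It is a toy model, not anything taken from Hive or Spark: the function names are hypothetical, every map task is assumed to take the same time, and locality and per-task overhead are ignored entirely.

```python
import math


def waves(num_map_tasks: int, num_executor_slots: int) -> int:
    """Scheduling rounds needed if every slot runs one task at a time."""
    if num_executor_slots <= 0:
        raise ValueError("need at least one executor slot")
    return math.ceil(num_map_tasks / num_executor_slots)


def idle_slots_in_last_wave(num_map_tasks: int, num_executor_slots: int) -> int:
    """Slots left without work during the final (possibly partial) round."""
    if num_executor_slots <= 0:
        raise ValueError("need at least one executor slot")
    if num_map_tasks == 0:
        return num_executor_slots
    remainder = num_map_tasks % num_executor_slots
    return 0 if remainder == 0 else num_executor_slots - remainder


# Hypothetical cluster with 32 slots (assumed numbers, not measurements).
for tasks in (8, 64, 1000):
    print(tasks, waves(tasks, 32), idle_slots_in_last_wave(tasks, 32))
```

With fewer tasks than slots (8 versus 32) most of the cluster sits idle, so more splits would help; with many more tasks than slots (1000 versus 32) the job needs many rounds, so combining splits reduces them. On the delay scheduling question: in Spark that behaviour is governed by spark.locality.wait, and setting it to 0 makes the scheduler stop waiting for a data-local slot. Whether that helps here is something we would have to measure.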
> Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
> ---------------------------------------------------------------------
>
> Key: HIVE-9153
> URL: https://issues.apache.org/jira/browse/HIVE-9153
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Affects Versions: spark-branch
> Reporter: Brock Noland
> Assignee: Rui Li
> Attachments: screenshot.PNG
>
>
> The default InputFormat is {{CombineHiveInputFormat}} and thus HOS uses this.
> However, Tez uses {{HiveInputFormat}}. Since tasks are relatively cheap in
> Spark, it might make sense for us to use {{HiveInputFormat}} as well. We
> should evaluate this on a query which has many input splits such as {{select
> count(\*) from store_sales where something is not null}}.