[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252762#comment-14252762 ]
Rui Li commented on HIVE-9153:
------------------------------
Hi [~xuefuz] - if the Spark cluster is the same as the Hadoop cluster, i.e. each
executor is also a datanode, the Spark task scheduler usually does a good job of
making sure all mappers get some locality (provided, of course, that the
mappers specify a preferred location). In that case, having more mappers won't
hurt data locality.
bq. Is there a way to disable Spark's delayed schedule to try out?
The Spark task scheduler divides tasks into multiple lists according to locality
level and, when an executor offers resources, tries to launch tasks at the
highest locality level first. It may also wait for some time before scheduling
tasks at a lower level. I don't think there's a switch to turn it off. Actually,
I'm not 100% sure that delay scheduling is what causes the issue. If none of our
tasks has a preferred location, the delay may happen at start-up (while waiting
for the allowed locality level to drop) but not during execution. I'll look more
into this.
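To make the scheduling behaviour described above concrete, here is a minimal toy model of delay scheduling. It is a sketch under stated assumptions, not Spark's actual code: the names {{Task}}, {{ToyDelayScheduler}}, {{offer}} and {{locality_wait}} are hypothetical, only two locality levels are modelled, and time is passed in explicitly. The {{locality_wait}} parameter is loosely modelled on Spark's documented {{spark.locality.wait}} setting; whether changing that setting affects the behaviour seen in this issue would need to be verified.

```python
from dataclasses import dataclass
from typing import List, Optional

# Simplified locality levels, most local first (hypothetical subset).
NODE_LOCAL, ANY = 0, 1


@dataclass
class Task:
    task_id: int
    preferred_host: Optional[str] = None  # None means no preferred location


class ToyDelayScheduler:
    """Toy model: prefer local tasks, fall back to any task after a wait."""

    def __init__(self, tasks: List[Task], locality_wait: float):
        self.pending = list(tasks)
        self.locality_wait = locality_wait  # seconds to hold out for locality
        self.last_local_launch = 0.0        # time of the last local launch

    def allowed_level(self, now: float) -> int:
        # Once no local launch has happened for locality_wait seconds,
        # the allowed level drops from NODE_LOCAL to ANY.
        if now - self.last_local_launch >= self.locality_wait:
            return ANY
        return NODE_LOCAL

    def offer(self, host: str, now: float) -> Optional[Task]:
        """An executor on `host` offers a slot at time `now`."""
        # First choice: a pending task that prefers this host.
        for task in self.pending:
            if task.preferred_host == host:
                self.pending.remove(task)
                self.last_local_launch = now
                return task
        # Otherwise launch a non-local task only if the level has dropped.
        if self.pending and self.allowed_level(now) == ANY:
            return self.pending.pop(0)
        return None  # keep the slot idle and wait for better locality


if __name__ == "__main__":
    sched = ToyDelayScheduler([Task(0, "hostA"), Task(1, "hostB")], 3.0)
    print(sched.offer("hostC", now=1.0))  # no local task yet, still waiting
    print(sched.offer("hostA", now=1.0))  # local launch of task 0
    print(sched.offer("hostC", now=4.0))  # wait expired, task 1 runs anywhere
```

In this toy model a wait of 0 makes {{allowed_level}} return {{ANY}} immediately, so a slot is never left idle while tasks are pending.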
> Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
> ---------------------------------------------------------------------
>
> Key: HIVE-9153
> URL: https://issues.apache.org/jira/browse/HIVE-9153
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Affects Versions: spark-branch
> Reporter: Brock Noland
> Assignee: Rui Li
> Attachments: screenshot.PNG
>
>
> The default InputFormat is {{CombineHiveInputFormat}} and thus HOS uses this.
> However, Tez uses {{HiveInputFormat}}. Since tasks are relatively cheap in
> Spark, it might make sense for us to use {{HiveInputFormat}} as well. We
> should evaluate this on a query which has many input splits such as {{select
> count(\*) from store_sales where something is not null}}.