[
https://issues.apache.org/jira/browse/SPARK-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
DB Tsai updated SPARK-9926:
---------------------------
Assignee: Cheolsoo Park
> Parallelize file listing for partitioned Hive table
> ---------------------------------------------------
>
> Key: SPARK-9926
> URL: https://issues.apache.org/jira/browse/SPARK-9926
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.4.1, 1.5.0
> Reporter: Cheolsoo Park
> Assignee: Cheolsoo Park
>
> In Spark SQL, short queries like {{select * from table limit 10}} run very
> slowly against partitioned Hive tables because of file listing. In
> particular, if a large number of partitions are scanned on storage like S3,
> the queries run extremely slowly. Here are some example benchmarks in my
> environment:
> * Parquet-backed Hive table
> * Partitioned by dateint and hour
> * Stored on S3
> ||\# of partitions||\# of files||runtime||query||
> |1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit 10;|
> |24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;|
> |240|136222|1 hour|select * from nccp_log where dateint>=20150601 and dateint<=20150610 limit 10;|
> The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive
> partition path and groups them into a UnionRDD. Then, all the input files are
> listed sequentially. In other tools such as Hive and Pig, this can be solved
> by setting
> [mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml]
> high. But in Spark, since each HadoopRDD lists only one partition path,
> setting this property doesn't help.
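To illustrate the difference described above, here is a minimal Python sketch that contrasts listing partition paths one at a time with listing them from a thread pool. It is not Spark or Hadoop code: `list_partition` is a hypothetical stand-in for a slow remote listing call, the `s3://example-bucket/...` paths are made up, and the latency and thread count are assumptions chosen only for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partition paths for one day of an hourly-partitioned table.
partition_paths = [
    f"s3://example-bucket/nccp_log/dateint=20150601/hour={h}" for h in range(24)
]


def list_partition(path, latency=0.01, files_per_partition=3):
    """Stand-in for one remote listing call (assumed, not a real S3 API)."""
    time.sleep(latency)  # simulate the round trip to the object store
    return [f"{path}/part-{i:05d}.parquet" for i in range(files_per_partition)]


def list_sequentially(paths):
    """List each partition path in turn, as one listing per HadoopRDD would."""
    files = []
    for path in paths:
        files.extend(list_partition(path))
    return files


def list_in_parallel(paths, num_threads=8):
    """List all partition paths concurrently from a shared thread pool."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Executor.map yields results in input order, so the output
        # matches the sequential version file for file.
        per_partition = list(pool.map(list_partition, paths))
    return [f for files in per_partition for f in files]


if __name__ == "__main__":
    for name, fn in [("sequential", list_sequentially), ("parallel", list_in_parallel)]:
        start = time.perf_counter()
        files = fn(partition_paths)
        elapsed = time.perf_counter() - start
        print(f"{name}: {len(files)} files in {elapsed:.2f}s")
```

Because each listing call spends most of its time waiting on the remote store, a thread pool can overlap those waits. This only helps when a single component sees all the partition paths at once, which is the point of the last paragraph of the description: a per-path listing has nothing to parallelize.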