Cheolsoo Park created SPARK-9926:
------------------------------------

             Summary: Parallelize file listing for partitioned Hive table
                 Key: SPARK-9926
                 URL: https://issues.apache.org/jira/browse/SPARK-9926
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.4.1, 1.5.0
            Reporter: Cheolsoo Park


In Spark SQL, short queries like {{select * from table limit 10}} run very 
slowly against partitioned Hive tables because of file listing. The slowdown 
is most severe when a large number of partitions are scanned on storage such 
as S3. Here are some example benchmarks from my environment:

* Parquet-backed Hive table
* Partitioned by dateint and hour
* Stored on S3

||\# of partitions||\# of files||runtime||query||
|1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit 10;|
|24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;|
|240|136222|1 hour|select * from nccp_log where dateint>=20150601 and dateint<=20150610 limit 10;|
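
As a rough check on the figures in the table, the short Python sketch below divides each runtime by the number of partitions scanned. The runtimes are the rounded values reported above, so the result is only approximate, and the helper name {{seconds_per_partition}} is made up for this illustration.

```python
# Figures copied from the benchmark table: (partitions, files, runtime in seconds).
benchmarks = [(1, 972, 30), (24, 13646, 6 * 60), (240, 136222, 60 * 60)]

def seconds_per_partition(partitions, runtime_secs):
    """Average wall-clock time spent per scanned partition."""
    return runtime_secs / partitions

per_partition = [seconds_per_partition(p, t) for p, _, t in benchmarks]
print(per_partition)  # [30.0, 15.0, 15.0]
```

The per-partition cost stays roughly constant from 24 to 240 partitions, so total runtime grows about linearly with the number of partitions, which is what one would expect if the partitions are handled one after another.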

The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive 
partition path and groups them into a UnionRDD. All the input files are then 
listed sequentially, one partition path at a time. In other tools such as Hive 
and Pig, this can be solved by setting 
[mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml] 
to a high value. But in Spark, since each HadoopRDD lists only one partition 
path, setting this property doesn't help.
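
To illustrate the general idea behind the summary (listing several partition paths concurrently rather than one at a time), here is a minimal Python sketch. It is not Spark or Hadoop code: {{list_partition}} is a hypothetical stand-in that simulates the latency of one listing call, and the bucket name, file names, and thread count are all assumptions chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def list_partition(path, latency_secs=0.05):
    """Hypothetical stand-in for one file-system listing call (e.g. one S3 listing)."""
    time.sleep(latency_secs)  # simulated round-trip latency
    return [f"{path}/part-{i:05d}.parquet" for i in range(3)]

def list_sequentially(paths):
    """One listing call at a time, as when each partition path is listed on its own."""
    files = []
    for path in paths:
        files.extend(list_partition(path))
    return files

def list_in_parallel(paths, num_threads=8):
    """Issue the listing calls from a thread pool; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        per_path = pool.map(list_partition, paths)
        return [f for files in per_path for f in files]

def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical partition paths modelled on the dateint/hour layout described above.
paths = [f"s3://example-bucket/nccp_log/dateint=20150601/hour={h}" for h in range(24)]
```

Calling {{timed(list_sequentially, paths)}} and {{timed(list_in_parallel, paths)}} returns the same file list in both cases. Under the simulated latency the sequential version pays for every call in turn, while the pooled version overlaps them. Actual gains on S3 would depend on the file system client and are not measured here.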



