Hi Harsh,

Thanks a lot for your reply.
I added a predicate to my query to select a single partition of the table, and tested with the "spark.sql.hive.metastorePartitionPruning" setting both on and off: there is no difference in DataFrame creation time.

Yes, Michael's proposed workaround works. But I was under the impression that this workaround was only needed for Spark versions < 1.5. With the Hive metastore partition pruning feature from Spark 1.5, I thought there would be no more delay, so I could create DataFrames left and right.

I noticed that regardless of the setting, when I create a DataFrame with or without a predicate, I get a log message from the HadoopFsRelation class which lists HDFS file paths to ALL partitions in my table (see the logInfo call in the code
<https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L770>).
Is this expected? I am not sure what is creating this Array of file paths, but I am guessing this is the source of my delay.

Thanks,
Isabelle

On Thu, Dec 10, 2015 at 7:38 PM, Harsh J <ha...@cloudera.com> wrote:

> The option "spark.sql.hive.metastorePartitionPruning=true" will not
> work unless you have a partition-column predicate in your query. Your
> query of "select * from temp.log" does not do this. The slowdown appears
> to be due to the need to load all partition metadata.
>
> Have you also tried to see if Michael's temp-table suggestion helps you
> cache the expensive partition lookup? (re-quoted below)
>
> """
> If you run sqlContext.table("...").registerTempTable("...") that temp table
> will cache the lookup of partitions [the first time is slow, but subsequent
> lookups will be faster].
> """ - X-Ref: Permalink
> <https://issues.apache.org/jira/browse/SPARK-6910?focusedCommentId=14529666&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14529666>
>
> Also, do you absolutely need to use "select * from temp.log"?
> Adding a where clause to the query with a partition condition will help
> Spark prune the request to just the required partitions (vs. all, which
> is proving expensive).
>
> On Fri, Dec 11, 2015 at 3:59 AM Isabelle Phan <nlip...@gmail.com> wrote:
>
>> Hi Michael,
>>
>> We have just upgraded to Spark 1.5.0 (actually 1.5.0_cdh-5.5, since we
>> are on Cloudera) and Parquet-formatted tables. I turned on
>> spark.sql.hive.metastorePartitionPruning=true, but DataFrame creation
>> still takes a long time.
>> Is there any other configuration to consider?
>>
>> Thanks a lot for your help,
>>
>> Isabelle
>>
>> On Fri, Sep 4, 2015 at 1:42 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> If you run sqlContext.table("...").registerTempTable("...") that
>>> temp table will cache the lookup of partitions.
>>>
>>> On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> Thanks a lot for your reply.
>>>>
>>>> This table is stored as a text file with tab-delimited columns.
>>>>
>>>> You are correct: the problem is that my table has too many
>>>> partitions (1825 in total). Since I am on Spark 1.4, I think I am
>>>> hitting SPARK-6984 <https://issues.apache.org/jira/browse/SPARK-6984>.
>>>>
>>>> Not sure when my company can move to 1.5. Would you know of a
>>>> workaround for this bug?
>>>> If I cannot find a workaround, we will have to change our schema
>>>> design to reduce the number of partitions.
>>>>
>>>> Thanks,
>>>>
>>>> Isabelle
>>>>
>>>> On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> Also, do you mean two partitions or two partition columns? If there
>>>>> are many partitions it can be much slower. In Spark 1.5 I'd consider
>>>>> setting spark.sql.hive.metastorePartitionPruning=true if you have
>>>>> predicates over the partition columns.
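Putting the quoted suggestions together, a spark-shell session (Spark 1.5.x, Hive-enabled sqlContext) might look like the sketch below. The table and partition columns (temp.log, dt, test_id) come from the thread; the partition values are made-up examples:

```scala
// Pruning only helps when the query has a predicate on a partition column.
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
val oneDay = sqlContext.sql(
  "select * from temp.log where dt = '2015-09-04' and test_id = 1")

// Michael's workaround: registering a temp table caches the expensive
// partition lookup, so only the first access is slow.
sqlContext.table("temp.log").registerTempTable("log_cached")
val logs = sqlContext.sql("select * from log_cached")
```

This is a sketch against a live Hive metastore, not something runnable standalone; the temp-table name "log_cached" is arbitrary.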
>>>>>
>>>>> On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <
>>>>> mich...@databricks.com> wrote:
>>>>>
>>>>>> What format is this table? For Parquet and other optimized formats
>>>>>> we cache a bunch of file metadata on first access to make interactive
>>>>>> queries faster.
>>>>>>
>>>>>> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan <nlip...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am using Spark SQL to query some Hive tables. Most of the time,
>>>>>>> when I create a DataFrame using the sqlContext.sql("select * from
>>>>>>> table") command, DataFrame creation takes less than 0.5 seconds.
>>>>>>> But I have this one table with which it takes almost 12 seconds!
>>>>>>>
>>>>>>> scala> val start = scala.compat.Platform.currentTime; val logs =
>>>>>>> sqlContext.sql("select * from temp.log"); val execution =
>>>>>>> scala.compat.Platform.currentTime - start
>>>>>>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
>>>>>>> temp.log
>>>>>>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>>>>>>> start: Long = 1441336022731
>>>>>>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option:
>>>>>>> int, log_time: string, tag: string, dt: string, test_id: int]
>>>>>>> execution: Long = *11567*
>>>>>>>
>>>>>>> This table has 3.6 B rows, and 2 partitions (on dt and test_id
>>>>>>> columns).
>>>>>>> I have created DataFrames on even larger tables and do not see such
>>>>>>> delay.
>>>>>>> So my questions are:
>>>>>>> - What can impact DataFrame creation time?
>>>>>>> - Is it related to the table partitions?
>>>>>>>
>>>>>>> Thanks much for your help!
>>>>>>>
>>>>>>> Isabelle
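As an aside, the one-liner timing pattern used in the quoted session can be factored into a small helper. This is plain Scala with no Spark dependency; the name `timed` and the stand-in computation are my own:

```scala
// Returns a block's result together with its wall-clock time in millis,
// mirroring the start/stop measurement in the spark-shell session above.
def timed[T](body: => T): (T, Long) = {
  val start = System.currentTimeMillis()
  val result = body
  (result, System.currentTimeMillis() - start)
}

// Usage: wrap any expression, e.g. a stand-in computation.
val (sum, elapsedMs) = timed { (1 to 1000).sum }
println(s"result=$sum, took ${elapsedMs}ms")
```

In the thread's setting, `timed { sqlContext.sql("select * from temp.log") }` would measure DataFrame creation the same way.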