Re: DataFrame creation delay?

2015-12-11 Thread Isabelle Phan
Hi Harsh, Thanks a lot for your reply. I added a predicate to my query to select a single partition in the table, and tested with the "spark.sql.hive.metastorePartitionPruning" setting both on and off; there is no difference in DataFrame creation time. Yes, Michael's proposed workaround works.

Re: DataFrame creation delay?

2015-12-10 Thread Isabelle Phan
Hi Michael, We have just upgraded to Spark 1.5.0 (actually 1.5.0_cdh-5.5 since we are on Cloudera), and Parquet-formatted tables. I turned on spark.sql.hive.metastorePartitionPruning=true, but DataFrame creation still takes a long time. Is there any other configuration to consider? Thanks a

Re: DataFrame creation delay?

2015-09-04 Thread Michael Armbrust
Also, do you mean two partitions or two partition columns? If there are many partitions it can be much slower. In Spark 1.5 I'd consider setting spark.sql.hive.metastorePartitionPruning=true if you have predicates over the partition columns. On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust
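A minimal sketch of the suggestion above, assuming a Hive-enabled `SQLContext` on Spark 1.5 (for example the `sqlContext` that spark-shell provides). The table name `my_table`, the partition column `dt`, and the helper `queryOnePartition` are hypothetical and only illustrate a predicate over a partition column:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper: enable metastore partition pruning, then query with
// a predicate on a partition column.
def queryOnePartition(sqlContext: SQLContext): DataFrame = {
  // Setting named in this thread.
  sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")

  // `my_table` and `dt` are placeholder names, not from the thread.
  sqlContext.sql("SELECT * FROM my_table WHERE dt = '2015-09-01'")
}
```

The setting can presumably also be passed at launch time with `--conf spark.sql.hive.metastorePartitionPruning=true`; whether it helps depends on the query having predicates over the partition columns, as noted above.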

Re: DataFrame creation delay?

2015-09-04 Thread Michael Armbrust
What format is this table? For parquet and other optimized formats we cache a bunch of file metadata on first access to make interactive queries faster. On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan wrote: > Hello, > > I am using SparkSQL to query some Hive tables. Most of

Re: DataFrame creation delay?

2015-09-04 Thread Isabelle Phan
Hi Michael, Thanks a lot for your reply. This table is stored as a text file with tab-delimited columns. You are correct, the problem is that my table has too many partitions (1825 in total). Since I am on Spark 1.4, I think I am hitting bug 6984

Re: DataFrame creation delay?

2015-09-04 Thread Michael Armbrust
If you run sqlContext.table("...").registerTempTable("...") that temp table will cache the lookup of partitions. On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan wrote: > Hi Michael, > > Thanks a lot for your reply. > > This table is stored as text file with tab delimited
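The workaround above might look like the following sketch, assuming the Spark 1.x `SQLContext` API. The names `my_table`, `my_table_tmp`, and `queryViaTempTable` are hypothetical placeholders for the elided arguments in the message:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper: register the Hive table as a temp table once, then
// run later queries against the temp table name.
def queryViaTempTable(sqlContext: SQLContext): DataFrame = {
  // The partition lookup happens when the table is resolved here.
  sqlContext.table("my_table").registerTempTable("my_table_tmp")

  // Queries against the temp table reuse the already-resolved DataFrame.
  sqlContext.sql("SELECT * FROM my_table_tmp")
}
```

The first call still pays the partition lookup cost; the benefit would show up on later queries that go through the temp table rather than the Hive table name.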

DataFrame creation delay?

2015-09-03 Thread Isabelle Phan
Hello, I am using SparkSQL to query some Hive tables. Most of the time, when I create a DataFrame using the sqlContext.sql("select * from table") command, DataFrame creation takes less than 0.5 seconds. But I have this one table with which it takes almost 12 seconds! scala> val start =