If you run sqlContext.table("...").registerTempTable("..."), that temp table will cache the partition lookup.
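A minimal sketch of that workaround, assuming a Spark 1.4 application with a HiveContext bound to `sqlContext` and the partitioned table `temp.log` from the thread below (the temp-table name `log_cached` is made up for illustration):

```scala
// Sketch only: assumes a running Spark application with a HiveContext
// available as `sqlContext`, and a partitioned Hive table `temp.log`.

// First access resolves the table's partitions against the metastore
// (the slow step for tables with many partitions) and registers the
// result under a temporary name.
sqlContext.table("temp.log").registerTempTable("log_cached")

// Subsequent queries go through the temp table, so the partition
// lookup is not repeated.
val logs = sqlContext.sql("select * from log_cached where dt = '2015-09-04'")
```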
On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan <nlip...@gmail.com> wrote:

> Hi Michael,
>
> Thanks a lot for your reply.
>
> This table is stored as a text file with tab-delimited columns.
>
> You are correct, the problem is that my table has too many partitions
> (1825 in total). Since I am on Spark 1.4, I think I am hitting bug
> SPARK-6984 <https://issues.apache.org/jira/browse/SPARK-6984>.
>
> Not sure when my company can move to 1.5. Would you know of a workaround
> for this bug? If I cannot find one, we will have to change our schema
> design to reduce the number of partitions.
>
> Thanks,
>
> Isabelle
>
> On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
>> Also, do you mean two partitions or two partition columns? If there are
>> many partitions it can be much slower. In Spark 1.5 I'd consider setting
>> spark.sql.hive.metastorePartitionPruning=true
>> if you have predicates over the partition columns.
>>
>> On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> What format is this table? For Parquet and other optimized formats we
>>> cache a bunch of file metadata on first access to make interactive
>>> queries faster.
>>>
>>> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am using SparkSQL to query some Hive tables. Most of the time, when I
>>>> create a DataFrame using an sqlContext.sql("select * from table")
>>>> command, DataFrame creation takes less than 0.5 seconds.
>>>> But with this one table it takes almost 12 seconds!
>>>>
>>>> scala> val start = scala.compat.Platform.currentTime; val logs =
>>>> sqlContext.sql("select * from temp.log"); val execution =
>>>> scala.compat.Platform.currentTime - start
>>>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from temp.log
>>>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>>>> start: Long = 1441336022731
>>>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
>>>> log_time: string, tag: string, dt: string, test_id: int]
>>>> execution: Long = *11567*
>>>>
>>>> This table has 3.6 B rows, and 2 partitions (on the dt and test_id
>>>> columns). I have created DataFrames on even larger tables and do not
>>>> see such delay.
>>>> So my questions are:
>>>> - What can impact DataFrame creation time?
>>>> - Is it related to the table partitions?
>>>>
>>>> Thanks much for your help!
>>>>
>>>> Isabelle
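For reference, the Spark 1.5 setting Michael mentions in the thread can be applied from code as well as from the command line; a sketch, assuming a HiveContext bound to `sqlContext` and the table from this thread:

```scala
// Sketch only: Spark 1.5+. With this setting on, partition predicates
// are pushed down to the Hive metastore, so only the matching partitions
// are fetched instead of all of them.
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")

// This only helps when the query has predicates over the partition
// columns (dt and test_id for the table in this thread):
val recent = sqlContext.sql("select * from temp.log where dt = '2015-09-04'")
```

The same setting can be passed at launch time, e.g. `spark-shell --conf spark.sql.hive.metastorePartitionPruning=true`.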