Hi Harsh,
Thanks a lot for your reply.
I added a predicate to my query to select a single partition in the table,
and tested with the "spark.sql.hive.metastorePartitionPruning" setting both
on and off; there is no difference in DataFrame creation time.
Yes, Michael's proposed workaround works.
Hi Michael,
We have just upgraded to Spark 1.5.0 (actually 1.5.0_cdh-5.5, since we are
on Cloudera) and to Parquet-formatted tables. I turned on
spark.sql.hive.metastorePartitionPruning=true, but DataFrame creation still
takes a long time.
Is there any other configuration to consider?
Thanks a lot,
Also, do you mean two partitions or two partition columns? If there are
many partitions, it can be much slower. In Spark 1.5, I'd consider setting
spark.sql.hive.metastorePartitionPruning=true if you have predicates over
the partition columns.
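For reference, the setting Michael mentions can be applied directly on the
SQLContext; here is a minimal sketch (the table name "my_table" and the
partition column "ds" are placeholders, not from the thread):

```scala
// Enable metastore-level partition pruning (Spark 1.5+), so only the
// partitions matched by the predicate are fetched from the Hive metastore.
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")

// A predicate on the partition column lets Spark prune partitions
// before listing their files.
val df = sqlContext.sql("select * from my_table where ds = '2015-09-04'")
```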
On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust wrote:
What format is this table? For Parquet and other optimized formats, we
cache a bunch of file metadata on first access to make interactive queries
faster.
On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan wrote:
> Hello,
>
> I am using SparkSQL to query some Hive tables. Most of
Hi Michael,
Thanks a lot for your reply.
This table is stored as a text file with tab-delimited columns.
You are correct: the problem is that my table has too many partitions
(1825 in total). Since I am on Spark 1.4, I think I am hitting bug 6984.
If you run sqlContext.table("...").registerTempTable("..."), that temp
table will cache the lookup of partitions.
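The workaround Michael describes can be sketched like this (both table
names are placeholders for illustration):

```scala
// Registering the table as a temp table caches the partition lookup,
// so later queries reuse it instead of repeating the slow metastore scan.
sqlContext.table("my_table").registerTempTable("my_table_cached")

// Subsequent queries go through the temp table.
val df = sqlContext.sql("select * from my_table_cached")
```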
On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan wrote:
> Hi Michael,
>
> Thanks a lot for your reply.
>
> This table is stored as text file with tab delimited
Hello,
I am using SparkSQL to query some Hive tables. Most of the time, when I
create a DataFrame using the sqlContext.sql("select * from table") command,
DataFrame creation takes less than 0.5 seconds.
But with this one table, it takes almost 12 seconds!
scala> val start =
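The snippet above is cut off; a minimal way to time DataFrame creation in
the spark-shell, under the assumption that the original measured wall-clock
time around the sql() call (the table name is a placeholder), is:

```scala
// Time only the DataFrame creation, not any action on it.
val start = System.nanoTime
val df = sqlContext.sql("select * from my_table")
val elapsedMs = (System.nanoTime - start) / 1000000
println(s"DataFrame creation took $elapsedMs ms")
```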