Re: storing query object

2016-01-22 Thread Ted Yu
There have been optimizations in this area, such as https://issues.apache.org/jira/browse/SPARK-8125. You can also look at the parent issue. Which Spark release are you using?
> On Jan 22, 2016, at 1:08 AM, Gourav Sengupta wrote:
>
> Hi,
>
> I have a SPARK

Re: storing query object

2016-01-22 Thread Gourav Sengupta
Hi Ted, I am using Spark 1.5.2, as currently available in AWS EMR 4.x. The data is in TSV format. I do not see any effect of the work already done on this for data stored in Hive: it takes around 50 minutes just to collect the table metadata on a 40-node cluster, and the time is much the same

Fwd: storing query object

2016-01-22 Thread Gourav Sengupta
Hi, I have a Spark table (created from hiveContext) with a couple of hundred partitions and a few thousand files. When I run a query on the table, Spark spends a lot of time (as seen in the pyspark output) collecting these files from the partitions. After this the query starts running. Is

Re: storing query object

2016-01-22 Thread Ted Yu
In SQLConf.scala, I found this:
  val PARALLEL_PARTITION_DISCOVERY_THRESHOLD = intConf(
    key = "spark.sql.sources.parallelPartitionDiscovery.threshold",
    defaultValue = Some(32),
    doc = "The degree of parallelism for schema merging and partition discovery of " +
      "Parquet data
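As a rough illustration of how the setting quoted above could be passed to a job, here is a minimal Python sketch that renders it as spark-submit `--conf` arguments. The helper name `spark_submit_conf_flags` and the value 16 are hypothetical and chosen only for illustration; only the key and its default of 32 come from the message. The quoted doc string mentions Parquet data, so whether this setting helps with TSV-backed Hive tables is not established in this thread.

```python
# Key quoted from SQLConf.scala in the message above.
THRESHOLD_KEY = "spark.sql.sources.parallelPartitionDiscovery.threshold"


def spark_submit_conf_flags(settings):
    """Render a dict of Spark settings as spark-submit `--conf` arguments.

    Hypothetical helper for illustration; it is not part of Spark.
    """
    flags = []
    for key, value in sorted(settings.items()):
        flags.extend(["--conf", "{}={}".format(key, value)])
    return flags


# 16 is an arbitrary example value, not a recommendation
# (the default quoted in the message is 32).
flags = spark_submit_conf_flags({THRESHOLD_KEY: 16})
print(" ".join(flags))

# Inside a running PySpark session, the same key could instead be set on
# the SQL context, for example:
#   sqlContext.setConf(THRESHOLD_KEY, "16")
```

The resulting flags would be appended to a spark-submit command line. Setting the key on the context at runtime is the alternative when the session is already running.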

storing query object

2016-01-19 Thread Gourav Sengupta
Hi, I have a Spark table (created from hiveContext) with a couple of hundred partitions and a few thousand files. When I run a query on the table, Spark spends a lot of time (as seen in the pyspark output) collecting these files from the partitions. After this the query starts running. Is