In SQLConf.scala, I found this:

  val PARALLEL_PARTITION_DISCOVERY_THRESHOLD = intConf(
    key = "spark.sql.sources.parallelPartitionDiscovery.threshold",
    defaultValue = Some(32),
    doc = "The degree of parallelism for schema merging and partition discovery of " +
      "Parquet data sources.")
Hi Ted,
I am using Spark 1.5.2 as currently available in AWS EMR 4.x. The data is in
TSV format.
I do not see any effect of the work already done on this for the data
stored in Hive, as it takes around 50 minutes just to collect the table
metadata over a 40-node cluster, and the time is much the same f
There have been optimizations in this area, such as:
https://issues.apache.org/jira/browse/SPARK-8125
You can also look at the parent issue.
Which Spark release are you using?
> On Jan 22, 2016, at 1:08 AM, Gourav Sengupta wrote:
>
> Hi,
>
> I have a Spark table (created from hiveContext)
Hi,
I have a Spark table (created from hiveContext) with a couple of hundred
partitions and a few thousand files.
When I run a query on the table, Spark spends a lot of time (as seen in
the pyspark output) collecting these files from the partitions.
Only after this does the query start running.
Is t