Very high latency to initialize a DataFrame from partitioned parquet database
Do you think it might be faster to put all the files in one directory but
still partitioned the same way? I don't actually need to filter on the
values of the partition keys, but I need to rely on there being no overlap
in the values of the keys between any two parquet files.
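A minimal sketch of that flat-layout idea, not from the thread: keep a and b
as ordinary columns and write everything into a single directory, so there is
no 40,000-directory tree to discover. The /data/flat path is hypothetical, and
nothing in this layout enforces the no-overlap guarantee; the writer has to
ensure it.

// Keys a and b stay as regular columns inside each file instead of being
// encoded in directory names, so reading back needs no partition discovery.
df.write.parquet("/data/flat")          // no .partitionBy("a", "b")

// Reading back is a single directory listing; a and b are ordinary columns.
val flat = sqlContext.read.parquet("/data/flat")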
On Fri, Aug 7, 2015 at
: Re: Very high latency to initialize a DataFrame from partitioned
parquet database.
Do you think it might be faster to put all the files in one directory but still
partitioned the same way? I don't actually need to filter on the values of the
partition keys, but I need to rely
However, it's weird that the partition discovery job only spawns 2
tasks. It should use the default parallelism, which is probably 8
according to the logs of the next Parquet reading job. Partition
discovery is already done in a distributed manner via a Spark job. But
the parallelism is
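A quick way to check what that default actually resolves to on a given
cluster, should anyone want to verify it; the override value shown here is
only an example:

// Print the parallelism Spark will use for jobs without an explicit setting.
println(sc.defaultParallelism)
// It can be raised at submit time, e.g.:
//   spark-submit --conf spark.default.parallelism=64 ...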
Hi Philip,
Thanks for providing the log file. It seems that most of the time is
spent on partition discovery. The code snippet you provided actually
issues two jobs. The first one is for listing the input directories to
find out all leaf directories (and this actually requires listing all
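This is not Spark's actual listing code, just a minimal sketch of what
finding leaf directories with the Hadoop FileSystem API involves; with
~40,000 leaves it amounts to roughly that many listStatus calls:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Collect directories that contain no subdirectories ("leaves").
// Each level of recursion issues one listStatus call per directory.
def leafDirs(fs: FileSystem, dir: Path): Seq[Path] = {
  val subDirs = fs.listStatus(dir).filter(_.isDirectory).map(_.getPath)
  if (subDirs.isEmpty) Seq(dir) else subDirs.flatMap(d => leafDirs(fs, d))
}

val fs = FileSystem.get(new Configuration())
val leaves = leafDirs(fs, new Path("/data/partitioned"))  // hypothetical path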
Thanks, I also confirmed that the partition discovery is slow by writing a
non-Spark application that uses the parquet library directly to load the
partitions.
It's so slow that my colleague's Python application can read the entire
contents of all the parquet data files faster than my
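That non-Spark application isn't shown in the thread; a minimal sketch of
reading a single file's metadata directly with parquet-mr, with an
illustrative path, would look something like:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Read just the footer (schema + row-group metadata) of one parquet file.
val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("/data/partitioned/a=1/b=1/part-r-00000.parquet"))
println(footer.getFileMetaData.getSchema)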
With DEBUG, the log output was over 10MB, so I opted for just INFO output.
The (sanitized) log is attached.
The driver is essentially this code:
info(...)
val t = System.currentTimeMillis
val df = sqlContext.read.parquet(dir).select(...).cache  // read.parquet(dir) is where partition discovery happens
val elapsed = System.currentTimeMillis - t               // initialization latency in ms
I built spark from the v1.5.0-snapshot-20150803 tag in the repo and tried
again.
The initialization time is about 1 minute now, which is still pretty
terrible.
On Wed, Aug 5, 2015 at 9:08 PM, Philip Weaver philip.wea...@gmail.com wrote:
Absolutely, thanks!
Would you mind providing the driver log?
On Wed, Aug 5, 2015 at 9:07 PM, Cheng Lian lian.cs@gmail.com wrote:
We've fixed this issue in 1.5 https://github.com/apache/spark/pull/7396
Could you give it a shot to see whether it helps in your case? We've
observed ~50x performance boost with schema merging turned on.
Cheng
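For reference, schema merging can also be toggled per read through the
parquet data source's mergeSchema option; the path below is illustrative:

// mergeSchema = "false" skips schema merging when every part-file already
// has the same schema; "true" merges schemas across files.
val df = sqlContext.read
  .option("mergeSchema", "false")
  .parquet("/data/partitioned")   // hypothetical path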
On 8/6/15 8:26 AM, Philip Weaver wrote:
I have a parquet directory that was produced by partitioning by two keys,
e.g. like this:
df.write.partitionBy("a", "b").parquet("asdf")
There are 35 values of a, and about 1100-1200 values of b for each
value of a, for a total of over 40,000 partitions.
Before running any transformations or actions
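On disk, that partitionBy call produces one leaf directory per (a, b) pair,
roughly 35 × 1150 ≈ 40,000 in total; with illustrative key values:

asdf/a=1/b=1/part-r-00000.parquet
asdf/a=1/b=2/part-r-00000.parquet
...
asdf/a=35/b=1153/part-r-00000.parquet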