I built Spark from the v1.5.0-snapshot-20150803 tag in the repo and tried again.
The initialization time is about 1 minute now, which is still pretty terrible. (A minimal timing sketch, with schema merging explicitly disabled, is included after the quoted thread below.)

On Wed, Aug 5, 2015 at 9:08 PM, Philip Weaver <philip.wea...@gmail.com> wrote:

> Absolutely, thanks!
>
> On Wed, Aug 5, 2015 at 9:07 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> We've fixed this issue in 1.5: https://github.com/apache/spark/pull/7396
>>
>> Could you give it a shot to see whether it helps in your case? We've
>> observed a ~50x performance boost with schema merging turned on.
>>
>> Cheng
>>
>> On 8/6/15 8:26 AM, Philip Weaver wrote:
>>
>> I have a Parquet directory that was produced by partitioning by two keys,
>> e.g. like this:
>>
>>     df.write.partitionBy("a", "b").parquet("asdf")
>>
>> There are 35 values of "a" and about 1100-1200 values of "b" for each
>> value of "a", for a total of over 40,000 partitions.
>>
>> Before running any transformations or actions on the DataFrame, just
>> initializing it like this takes *2 minutes*:
>>
>>     val df = sqlContext.read.parquet("asdf")
>>
>> Is this normal? Is this because it is doing some bookkeeping to discover
>> all the partitions? Is it perhaps having to merge the schema from each
>> partition? Would you expect it to get better or worse if I subpartition by
>> another key?
>>
>> - Philip
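For anyone following along, here is a minimal sketch (not what was actually run) of how the initialization time could be measured with Parquet schema merging explicitly disabled, so that partition discovery does not have to read a footer from every one of the 40,000+ partitions. It assumes a spark-shell session on the 1.5 snapshot, where sqlContext is predefined, and reuses the same "asdf" path from the quoted message:

    // Time just the DataFrame initialization, with schema merging turned off
    // via the Parquet data source's per-read "mergeSchema" option.
    val start = System.nanoTime()
    val df = sqlContext.read
      .option("mergeSchema", "false")
      .parquet("asdf")
    println(s"Initialization took ${(System.nanoTime() - start) / 1e9} seconds")

    // The same thing can be set globally. I believe merging is already off by
    // default starting in 1.5, but being explicit makes the comparison unambiguous.
    sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")

Comparing this against a read with mergeSchema set to "true" should make it clear how much of the remaining minute is schema merging versus plain partition discovery.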