I built Spark from the v1.5.0-snapshot-20150803 tag in the repo and tried again.
The initialization time is about 1 minute now, which is still pretty terrible. (A minimal timing sketch, with schema merging explicitly disabled, is included after the quoted thread below.)

On Wed, Aug 5, 2015 at 9:08 PM, Philip Weaver <philip.wea...@gmail.com> wrote:

> Absolutely, thanks!
>
> On Wed, Aug 5, 2015 at 9:07 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> We've fixed this issue in 1.5: https://github.com/apache/spark/pull/7396
>>
>> Could you give it a shot to see whether it helps in your case? We've
>> observed a ~50x performance boost with schema merging turned on.
>>
>> Cheng
>>
>> On 8/6/15 8:26 AM, Philip Weaver wrote:
>>
>> I have a Parquet directory that was produced by partitioning by two keys,
>> e.g. like this:
>>
>>     df.write.partitionBy("a", "b").parquet("asdf")
>>
>> There are 35 values of "a" and about 1100-1200 values of "b" for each
>> value of "a", for a total of over 40,000 partitions.
>>
>> Before running any transformations or actions on the DataFrame, just
>> initializing it like this takes *2 minutes*:
>>
>>     val df = sqlContext.read.parquet("asdf")
>>
>> Is this normal? Is this because it is doing some bookkeeping to discover
>> all the partitions? Is it perhaps having to merge the schema from each
>> partition? Would you expect it to get better or worse if I subpartition by
>> another key?
>>
>> - Philip
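For anyone following along, here is a minimal sketch (not what was actually run) of how the initialization time could be measured with Parquet schema merging explicitly disabled, so that partition discovery does not have to read a footer from every one of the 40,000+ partitions. It assumes a spark-shell session on the 1.5 snapshot, where sqlContext is predefined, and reuses the same "asdf" path from the quoted message:

    // Time just the DataFrame initialization, with schema merging turned off
    // via the Parquet data source's per-read "mergeSchema" option.
    val start = System.nanoTime()
    val df = sqlContext.read
      .option("mergeSchema", "false")
      .parquet("asdf")
    println(s"Initialization took ${(System.nanoTime() - start) / 1e9} seconds")

    // The same thing can be set globally. I believe merging is already off by
    // default starting in 1.5, but being explicit makes the comparison unambiguous.
    sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")

Comparing this against a read with mergeSchema set to "true" should make it clear how much of the remaining minute is schema merging versus plain partition discovery.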