I guess the question is why does spark have to do partition discovery with all partitions when the query only needs to look at one partition? Is there a conf flag to turn this off?
On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver <philip.wea...@gmail.com> wrote: > I've had the same problem. It turns out that Spark (specifically parquet) > is very slow at partition discovery. It got better in 1.5 (not yet > released), but was still unacceptably slow. Sadly, we ended up reading > parquet files manually in Python (via C++) and had to abandon Spark SQL > because of this problem. > > On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang <jerrickho...@gmail.com> > wrote: > >> Hi all, >> >> I did a simple experiment with Spark SQL. I created a partitioned parquet >> table with only one partition (date=20140701). A simple `select count(*) >> from table where date=20140701` would run very fast (0.1 seconds). However, >> as I added more partitions the query takes longer and longer. When I added >> about 10,000 partitions, the query took way too long. I feel like querying >> for a single partition should not be affected by having more partitions. Is >> this a known behaviour? What does spark try to do here? >> >> Thanks, >> Jerrick >> > >