I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before, and I couldn't find much information about it online. What does it mean exactly to disable it? Are there any negative consequences to disabling it?
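For reference, the setting being discussed is an ordinary Spark SQL conf key. A sketch of how it would typically be set (the flag name is taken from this thread and applies to 1.5-era Spark; I haven't verified it against other releases):

```
# spark-defaults.conf, or pass via --conf on spark-submit
spark.sql.sources.partitionDiscovery.enabled   false
```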
On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> Can you make some more profiling? I am wondering if the driver is busy
> with scanning the HDFS / S3.
>
> Like: jstack <pid of driver process>
>
> And also, it will be great if you can paste the physical plan for the
> simple query.
>
> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
> *Sent:* Thursday, August 20, 2015 1:46 PM
> *To:* Cheng, Hao
> *Cc:* Philip Weaver; user
> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of partitions
>
> I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple of
> CLs trying to speed up Spark SQL on tables with a huge number of partitions;
> I've made sure those CLs are included, but it's still very slow.
>
> On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
> Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to
> false.
>
> BTW, which version are you using?
>
> Hao
>
> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
> *Sent:* Thursday, August 20, 2015 12:16 PM
> *To:* Philip Weaver
> *Cc:* user
> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of partitions
>
> I guess the question is: why does Spark have to do partition discovery on
> all partitions when the query only needs to look at one partition? Is there
> a conf flag to turn this off?
>
> On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver <philip.wea...@gmail.com> wrote:
>
> I've had the same problem. It turns out that Spark (specifically Parquet)
> is very slow at partition discovery. It got better in 1.5 (not yet
> released), but was still unacceptably slow. Sadly, we ended up reading
> Parquet files manually in Python (via C++) and had to abandon Spark SQL
> because of this problem.
>
> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang <jerrickho...@gmail.com> wrote:
>
> Hi all,
>
> I did a simple experiment with Spark SQL. I created a partitioned Parquet
> table with only one partition (date=20140701). A simple `select count(*)
> from table where date=20140701` would run very fast (0.1 seconds). However,
> as I added more partitions, the query took longer and longer. When I added
> about 10,000 partitions, the query took way too long. I feel like querying
> a single partition should not be affected by having more partitions. Is
> this a known behaviour? What does Spark try to do here?
>
> Thanks,
> Jerrick
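To make the scaling concrete, here is a small stdlib-only Python sketch (not Spark code; the helper names are made up for illustration) of why a query that only needs one partition can still pay for all of them: full partition discovery enumerates every `date=...` directory to build the table's partition list, so its cost grows linearly with the partition count, while a direct lookup of one known partition touches a single path. On HDFS or S3 each listing is a remote call, which makes the linear cost far more painful than on a local filesystem.

```python
import os
import tempfile

def make_partitions(root, n):
    """Create n empty date=... partition directories under root."""
    for i in range(n):
        os.makedirs(os.path.join(root, "date=%08d" % i))

def discover_all(root):
    """Full partition discovery: enumerate every date=* directory."""
    return [d for d in os.listdir(root) if d.startswith("date=")]

def lookup_one(root, partition):
    """Direct lookup of a single known partition: one path check, no listing."""
    return os.path.isdir(os.path.join(root, partition))

root = tempfile.mkdtemp()
make_partitions(root, 10000)

# Discovery must visit all 10,000 directories even if the query
# ultimately reads only one of them...
print(len(discover_all(root)))            # → 10000

# ...whereas partition pruning against a known layout needs one lookup.
print(lookup_one(root, "date=00000042"))  # → True
```

This mirrors the behaviour described above: the query predicate selects one partition, but the discovery step that precedes it still walks the whole table layout.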