Does anybody have any suggestions?

On Fri, Aug 21, 2015 at 3:14 PM, Jerrick Hoang <jerrickho...@gmail.com> wrote:
> Is there a workaround without updating Hadoop? I would really appreciate it
> if someone could explain what Spark is trying to do here and what is an easy
> way to turn this off. Thanks all!
>
> On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey
> <raghavendra.pan...@gmail.com> wrote:
>
>> Did you try with Hadoop version 2.7.1? It is known that s3a works
>> really well with Parquet, and it is available in 2.7. They fixed a lot of
>> issues related to metadata reading there...
>>
>> On Aug 21, 2015 11:24 PM, "Jerrick Hoang" <jerrickho...@gmail.com> wrote:
>>
>>> @Cheng, Hao: Physical plans show that it got stuck on scanning S3!
>>>
>>> (table is partitioned by date_prefix and hour)
>>> explain select count(*) from test_table where date_prefix='20150819' and hour='00';
>>>
>>> TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
>>>   TungstenExchange SinglePartition
>>>     TungstenAggregate(key=[], value=[(count(1),mode=Partial,isDistinct=false)]
>>>       Scan ParquetRelation[ .. <about 1000 partition paths go here> ]
>>>
>>> Why does Spark have to scan all partitions when the query only concerns
>>> 1 partition? Doesn't it defeat the purpose of partitioning?
>>>
>>> Thanks!
>>>
>>> On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver <philip.wea...@gmail.com> wrote:
>>>
>>>> I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before,
>>>> and I couldn't find much information about it online. What does it mean
>>>> exactly to disable it? Are there any negative consequences to disabling it?
>>>>
>>>> On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>>>
>>>>> Can you do some more profiling? I am wondering if the driver is busy
>>>>> scanning HDFS / S3.
>>>>>
>>>>> Like: jstack <pid of driver process>
>>>>>
>>>>> Also, it would be great if you could paste the physical plan for
>>>>> the simple query.
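For reference, the flag mentioned in this thread can be toggled per session from Spark SQL; a minimal sketch, assuming a Spark 1.4/1.5-era build where the data-source API honors this property:

```sql
-- Disable driver-side partition discovery for data-source tables
-- (property name as quoted in this thread; the default is true)
SET spark.sql.sources.partitionDiscovery.enabled=false;
```

The same property can be passed at launch via `--conf`, e.g. `spark-sql --conf spark.sql.sources.partitionDiscovery.enabled=false`.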
>>>>>
>>>>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
>>>>> *Sent:* Thursday, August 20, 2015 1:46 PM
>>>>> *To:* Cheng, Hao
>>>>> *Cc:* Philip Weaver; user
>>>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of partitions
>>>>>
>>>>> I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple
>>>>> of CLs trying to speed up Spark SQL with tables with a huge number of
>>>>> partitions; I've made sure that those CLs are included, but it's still very slow.
>>>>>
>>>>> On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>>>>
>>>>> Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled
>>>>> to false.
>>>>>
>>>>> BTW, which version are you using?
>>>>>
>>>>> Hao
>>>>>
>>>>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
>>>>> *Sent:* Thursday, August 20, 2015 12:16 PM
>>>>> *To:* Philip Weaver
>>>>> *Cc:* user
>>>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of partitions
>>>>>
>>>>> I guess the question is why does Spark have to do partition discovery
>>>>> with all partitions when the query only needs to look at one partition? Is
>>>>> there a conf flag to turn this off?
>>>>>
>>>>> On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver <philip.wea...@gmail.com> wrote:
>>>>>
>>>>> I've had the same problem. It turns out that Spark (specifically
>>>>> Parquet) is very slow at partition discovery. It got better in 1.5 (not yet
>>>>> released), but was still unacceptably slow. Sadly, we ended up reading
>>>>> Parquet files manually in Python (via C++) and had to abandon Spark SQL
>>>>> because of this problem.
>>>>>
>>>>> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang <jerrickho...@gmail.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I did a simple experiment with Spark SQL.
>>>>> I created a partitioned
>>>>> parquet table with only one partition (date=20140701). A simple `select
>>>>> count(*) from table where date=20140701` would run very fast (0.1 seconds).
>>>>> However, as I added more partitions, the query took longer and longer. When
>>>>> I added about 10,000 partitions, the query took way too long. I feel like
>>>>> querying for a single partition should not be affected by having more
>>>>> partitions. Is this a known behaviour? What does Spark try to do here?
>>>>>
>>>>> Thanks,
>>>>> Jerrick
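The slowdown described in this thread can be sketched in plain Python, standing in for the driver's directory listing (the function names and layout are illustrative, not Spark APIs): without pruning, the driver lists every Hive-style partition directory before planning, so cost grows with the table; with pruning, only the directory named in the WHERE clause is touched.

```python
import os
import tempfile

def make_partitions(root, n):
    # Mimic a Hive-style layout like the one in this thread:
    # root/date_prefix=.../hour=00
    for i in range(n):
        os.makedirs(os.path.join(root, "date_prefix=%08d" % i, "hour=00"))

def full_discovery(root):
    # What eager partition discovery effectively does:
    # enumerate every partition directory, regardless of the filter.
    return os.listdir(root)

def pruned_discovery(root, date_prefix):
    # What a pruned scan needs: only the partition matching the predicate.
    path = os.path.join(root, "date_prefix=%s" % date_prefix)
    return [path] if os.path.isdir(path) else []

root = tempfile.mkdtemp()
make_partitions(root, 1000)
print(len(full_discovery(root)))                # grows with partition count
print(len(pruned_discovery(root, "00000042")))  # constant: one directory
```

Against a local filesystem this is cheap either way; against S3, where each listing is a remote call, the full enumeration is what makes a one-partition query take minutes on a 10,000-partition table.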