Did you try with Hadoop version 2.7.1? It is known that s3a, which is
available in 2.7, works really well with Parquet. They fixed a lot of issues
related to metadata reading there...
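
(For what it's worth, a minimal sketch of switching to s3a; the fs.s3a.*
keys are the standard Hadoop 2.7 ones, but the placeholder values and the
assumption that the hadoop-aws jar is on the classpath are mine:)

  # spark-defaults.conf, or --conf key=value on spark-submit
  spark.hadoop.fs.s3a.impl        org.apache.hadoop.fs.s3a.S3AFileSystem
  spark.hadoop.fs.s3a.access.key  <your-access-key>
  spark.hadoop.fs.s3a.secret.key  <your-secret-key>

Then read the same table through an s3a:// URL, e.g.
sqlContext.read.parquet("s3a://<bucket>/test_table").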
On Aug 21, 2015 11:24 PM, "Jerrick Hoang" <jerrickho...@gmail.com> wrote:

> @Cheng, Hao: the physical plan shows that it got stuck scanning S3!
>
> (table is partitioned by date_prefix and hour)
> explain select count(*) from test_table where date_prefix='20150819' and
> hour='00';
>
> TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], value=[(count(1),mode=Partial,isDistinct=false)])
>    Scan ParquetRelation[ .. <about 1000 partition paths go here> ]
>
> Why does Spark have to scan all partitions when the query concerns only
> one partition? Doesn't that defeat the purpose of partitioning?
>
> Thanks!
>
> On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver <philip.wea...@gmail.com>
> wrote:
>
>> I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before,
>> and I couldn't find much information about it online. What does it mean
>> exactly to disable it? Are there any negative consequences to disabling it?
>>
>> On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>
>>> Can you do some more profiling? I am wondering if the driver is busy
>>> scanning HDFS / S3.
>>>
>>> For example: jstack <pid of driver process>
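>>>
>>> (A sketch of what I mean; jps and jstack ship with the JDK, and the grep
>>> pattern assumes the driver was launched through spark-submit:)
>>>
>>>   jps -lm | grep SparkSubmit          # find the driver pid
>>>   jstack <driver-pid> > stack-1.txt
>>>   sleep 5
>>>   jstack <driver-pid> > stack-2.txt   # second sample, to compare
>>>
>>> If both dumps show the main thread inside S3 / Parquet listing code, the
>>> driver is spending its time on metadata rather than on the query itself.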
>>>
>>>
>>>
>>> Also, it would be great if you could paste the physical plan for the
>>> simple query.
>>>
>>>
>>>
>>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
>>> *Sent:* Thursday, August 20, 2015 1:46 PM
>>> *To:* Cheng, Hao
>>> *Cc:* Philip Weaver; user
>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
>>> partitions
>>>
>>>
>>>
>>> I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple
>>> of CLs trying to speed up Spark SQL on tables with a huge number of
>>> partitions; I've made sure that those CLs are included, but it's still
>>> very slow.
>>>
>>>
>>>
>>> On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao <hao.ch...@intel.com>
>>> wrote:
>>>
>>> Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to
>>> false.
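>>>
>>> For example (a sketch; either form should work):
>>>
>>>   sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled",
>>>                      "false")
>>>
>>> or, at submit time:
>>>
>>>   spark-submit --conf spark.sql.sources.partitionDiscovery.enabled=false ...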
>>>
>>>
>>>
>>> BTW, which version are you using?
>>>
>>>
>>>
>>> Hao
>>>
>>>
>>>
>>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
>>> *Sent:* Thursday, August 20, 2015 12:16 PM
>>> *To:* Philip Weaver
>>> *Cc:* user
>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
>>> partitions
>>>
>>>
>>>
>>> I guess the question is: why does Spark have to do partition discovery
>>> across all partitions when the query only needs to look at one? Is there
>>> a conf flag to turn this off?
>>>
>>>
>>>
>>> On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver <philip.wea...@gmail.com>
>>> wrote:
>>>
>>> I've had the same problem. It turns out that Spark (specifically the
>>> Parquet data source) is very slow at partition discovery. It got better
>>> in 1.5 (not yet released), but was still unacceptably slow. Sadly, we
>>> ended up reading the Parquet files manually in Python (via C++) and had
>>> to abandon Spark SQL because of this problem.
>>>
>>>
>>>
>>> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang <jerrickho...@gmail.com>
>>> wrote:
>>>
>>> Hi all,
>>>
>>>
>>>
>>> I did a simple experiment with Spark SQL. I created a partitioned
>>> Parquet table with only one partition (date=20140701). A simple `select
>>> count(*) from table where date=20140701` ran very fast (0.1 seconds).
>>> However, as I added more partitions, the query took longer and longer.
>>> When I added about 10,000 partitions, the query took way too long. I
>>> feel like querying a single partition should not be affected by having
>>> more partitions. Is this a known behaviour? What does Spark try to do
>>> here?
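>>>
>>> (For concreteness, a rough sketch of the experiment; the path and the
>>> names here are illustrative, not my exact code:)
>>>
>>>   df.write.partitionBy("date").parquet("/tmp/test_table")
>>>   sqlContext.read.parquet("/tmp/test_table").registerTempTable("t")
>>>   sqlContext.sql("select count(*) from t where date=20140701").collect()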
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Jerrick
>>>
>>
>>
>
