Re: Spark Sql behaves strangely with tables with a lot of partitions

Philip Weaver Sun, 23 Aug 2015 19:39:12 -0700

1 minute to discover 1000s of partitions -- yes, that is what I have
observed. And I would assert that is very slow.


On Sun, Aug 23, 2015 at 7:16 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> We should not be actually scanning all of the data of all of the
> partitions, but we do need to at least list all of the available
> directories so that we can apply your predicates to the actual values that
> are present when we are deciding which files need to be read in a given
> spark job.  While this is a somewhat expensive operation, we do it in
> parallel and we cache this information when you access the same relation
> more than once.
>
> Can you provide a little more detail about how exactly you are accessing
> the parquet data (are you using sqlContext.read or creating persistent
> tables in the metastore?), and how long it is taking?  It would also be
> good to know how many partitions we are talking about and how much data is
> in each.  Finally, I'd like to see the stacktrace where it is hanging to
> make sure my above assertions are correct.
>
> We have several tables internally that have 1000s of partitions and while
> it takes ~1 minute initially to discover the metadata, after that we are
> able to query the data interactively.
>
>
>
> On Sun, Aug 23, 2015 at 2:00 AM, Jerrick Hoang <jerrickho...@gmail.com>
> wrote:
>
>> Does anybody have any suggestions?
>>
>> On Fri, Aug 21, 2015 at 3:14 PM, Jerrick Hoang <jerrickho...@gmail.com>
>> wrote:
>>
>>> Is there a workaround that does not require updating Hadoop? I would
>>> really appreciate it if someone could explain what Spark is trying to do
>>> here and whether there is an easy way to turn this off. Thanks all!
>>>
>>> On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey <
>>> raghavendra.pan...@gmail.com> wrote:
>>>
>>>> Did you try Hadoop version 2.7.1? It is known that s3a, which is
>>>> available in 2.7, works really well with Parquet. They fixed a lot of
>>>> issues related to metadata reading there.
>>>> On Aug 21, 2015 11:24 PM, "Jerrick Hoang" <jerrickho...@gmail.com>
>>>> wrote:
>>>>
>>>>> @Cheng, Hao : Physical plans show that it got stuck on scanning S3!
>>>>>
>>>>> (table is partitioned by date_prefix and hour)
>>>>> explain select count(*) from test_table where date_prefix='20150819'
>>>>> and hour='00';
>>>>>
>>>>> TungstenAggregate(key=[],
>>>>> value=[(count(1),mode=Final,isDistinct=false)]
>>>>>  TungstenExchange SinglePartition
>>>>>   TungstenAggregate(key=[],
>>>>> value=[(count(1),mode=Partial,isDistinct=false)]
>>>>>    Scan ParquetRelation[ .. <about 1000 partition paths go here> ]
>>>>>
>>>>> Why does Spark have to scan all partitions when the query only
>>>>> concerns one partition? Doesn't that defeat the purpose of partitioning?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver <
>>>>> philip.wea...@gmail.com> wrote:
>>>>>
>>>>>> I hadn't heard of spark.sql.sources.partitionDiscovery.enabled
>>>>>> before, and I couldn't find much information about it online. What
>>>>>> does it mean exactly to disable it? Are there any negative
>>>>>> consequences to disabling it?
>>>>>>
>>>>>> On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao <hao.ch...@intel.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Can you do some more profiling? I am wondering whether the driver
>>>>>>> is busy scanning HDFS / S3.
>>>>>>>
>>>>>>> For example: jstack <pid of driver process>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Also, it would be great if you could paste the physical plan for
>>>>>>> the simple query.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
>>>>>>> *Sent:* Thursday, August 20, 2015 1:46 PM
>>>>>>> *To:* Cheng, Hao
>>>>>>> *Cc:* Philip Weaver; user
>>>>>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot
>>>>>>> of partitions
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I cloned from TOT after the 1.5.0 cut-off. I noticed there were a
>>>>>>> couple of CLs trying to speed up Spark SQL on tables with a huge
>>>>>>> number of partitions. I've made sure that those CLs are included,
>>>>>>> but it's still very slow.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao <hao.ch...@intel.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Yes, you can try setting
>>>>>>> spark.sql.sources.partitionDiscovery.enabled to false.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> BTW, which version are you using?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hao
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
>>>>>>> *Sent:* Thursday, August 20, 2015 12:16 PM
>>>>>>> *To:* Philip Weaver
>>>>>>> *Cc:* user
>>>>>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot
>>>>>>> of partitions
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I guess the question is: why does Spark have to do partition
>>>>>>> discovery on all partitions when the query only needs to look at one
>>>>>>> partition? Is there a conf flag to turn this off?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver <
>>>>>>> philip.wea...@gmail.com> wrote:
>>>>>>>
>>>>>>> I've had the same problem. It turns out that Spark (specifically
>>>>>>> parquet) is very slow at partition discovery. It got better in 1.5
>>>>>>> (not yet released), but was still unacceptably slow. Sadly, we ended
>>>>>>> up reading parquet files manually in Python (via C++) and had to
>>>>>>> abandon Spark SQL because of this problem.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang <
>>>>>>> jerrickho...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I did a simple experiment with Spark SQL. I created a partitioned
>>>>>>> parquet table with only one partition (date=20140701). A simple
>>>>>>> `select count(*) from table where date=20140701` would run very fast
>>>>>>> (0.1 seconds). However, as I added more partitions the query takes
>>>>>>> longer and longer. When I added about 10,000 partitions, the query
>>>>>>> took way too long. I feel like querying for a single partition should
>>>>>>> not be affected by having more partitions. Is this a known behaviour?
>>>>>>> What does spark try to do here?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jerrick
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>
>
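
For readers following the thread, here is a minimal Python sketch of the
behaviour Michael describes: every partition directory is listed first, and
only then is the predicate applied to the partition values found in the
paths. It is not Spark's implementation. The function names
(discover_partitions, prune, build_demo_table) and the temporary directory
layout are made up for illustration; only the date_prefix/hour partition
columns come from the thread.

```python
import os
import tempfile


def discover_partitions(table_root):
    """List every leaf directory and parse its key=value path segments.

    Hypothetical helper: this walks the whole tree, so its cost grows
    with the total number of partitions, not with the number selected.
    """
    partitions = []
    for dirpath, dirnames, _filenames in os.walk(table_root):
        if dirnames:
            continue  # not a leaf directory, keep descending
        rel = os.path.relpath(dirpath, table_root)
        if rel == ".":
            continue  # empty table root, nothing to record
        values = dict(
            seg.split("=", 1) for seg in rel.split(os.sep) if "=" in seg
        )
        partitions.append((dirpath, values))
    return partitions


def prune(partitions, **wanted):
    """Keep only the paths whose partition values match the predicate."""
    return [
        path
        for path, values in partitions
        if all(values.get(key) == val for key, val in wanted.items())
    ]


def build_demo_table(root, days, hours):
    """Create an empty date_prefix=.../hour=... layout (assumed for the demo)."""
    for day in days:
        for hour in hours:
            os.makedirs(
                os.path.join(root, "date_prefix=%s" % day, "hour=%s" % hour)
            )


# Small demo: list everything, then select a single partition.
demo_root = tempfile.mkdtemp()
build_demo_table(demo_root, ["20150818", "20150819"], ["00", "01"])
all_parts = discover_partitions(demo_root)
selected = prune(all_parts, date_prefix="20150819", hour="00")
print(len(all_parts), "partitions listed;", len(selected), "selected")
```

The listing step touches every directory regardless of the predicate, which
is why its cost grows with the total number of partitions even when the query
selects only one. The sketch leaves out the parallel listing and the caching
that Michael mentions.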
