@Michael: would listStatus calls read the actual parquet footers within the
folders?

On Mon, Aug 24, 2015 at 11:36 AM, Sereday, Scott <scott.sere...@nielsen.com>
wrote:

> Can you please remove me from this distribution list?
>
>
>
> (Filling up my inbox too fast)
>
>
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* Monday, August 24, 2015 2:13 PM
> *To:* Philip Weaver <philip.wea...@gmail.com>
> *Cc:* Jerrick Hoang <jerrickho...@gmail.com>; Raghavendra Pandey <
> raghavendra.pan...@gmail.com>; User <user@spark.apache.org>; Cheng, Hao <
> hao.ch...@intel.com>
>
> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
> partitions
>
>
>
> I think we are mostly bottlenecked at this point by how fast we can make
> listStatus calls to discover the folders.  That said, we are happy to
> accept suggestions or PRs to make this faster.  Perhaps you can describe
> how your home-grown partitioning works?
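>
> To be clear, listStatus only returns file metadata (path, length,
> modification time) and does not open the files, so no Parquet footers are
> read during discovery. As a rough sketch of what one directory listing
> looks like against the Hadoop FileSystem API (the path here is
> hypothetical):
>
>   import org.apache.hadoop.conf.Configuration
>   import org.apache.hadoop.fs.{FileSystem, Path}
>
>   val dir = new Path("s3a://bucket/table/date_prefix=20150819/hour=00")
>   val fs = FileSystem.get(dir.toUri, new Configuration())
>   // FileStatus carries only metadata; no file contents are read here.
>   fs.listStatus(dir).foreach(s => println(s.getPath + " " + s.getLen))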
>
>
>
> On Sun, Aug 23, 2015 at 7:38 PM, Philip Weaver <philip.wea...@gmail.com>
> wrote:
>
> 1 minute to discover 1000s of partitions -- yes, that is what I have
> observed. And I would assert that is very slow.
>
>
>
> On Sun, Aug 23, 2015 at 7:16 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
> We should not actually be scanning the data of all of the partitions, but
> we do need to at least list all of the available directories so that we can
> apply your predicates to the actual values that are present when deciding
> which files need to be read in a given Spark job. While this is a somewhat
> expensive operation, we do it in
> parallel and we cache this information when you access the same relation
> more than once.
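>
> As a toy illustration of the idea (not Spark's actual code path),
> partition pruning amounts to parsing the key=value pairs out of each
> discovered directory and filtering on them before any file is read; the
> paths below are hypothetical:
>
>   val partitionDirs = Seq(
>     "s3a://bucket/table/date_prefix=20150819/hour=00",
>     "s3a://bucket/table/date_prefix=20150819/hour=01",
>     "s3a://bucket/table/date_prefix=20150820/hour=00")
>
>   // Extract the key=value partition columns encoded in a path.
>   def partitionValues(dir: String): Map[String, String] =
>     dir.split("/").filter(_.contains("=")).map { kv =>
>       val Array(k, v) = kv.split("=", 2)
>       k -> v
>     }.toMap
>
>   // Only directories surviving the predicate are ever scanned.
>   val pruned = partitionDirs.filter { dir =>
>     val vs = partitionValues(dir)
>     vs.get("date_prefix").exists(_ == "20150819") &&
>       vs.get("hour").exists(_ == "00")
>   }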
>
>
>
> Can you provide a little more detail about how exactly you are accessing
> the parquet data (are you using sqlContext.read or creating persistent
> tables in the metastore?), and how long it is taking?  It would also be
> good to know how many partitions we are talking about and how much data is
> in each.  Finally, I'd like to see the stacktrace where it is hanging to
> make sure my above assertions are correct.
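>
> For reference, the two access paths I mean look roughly like this (the
> path and table name are placeholders):
>
>   // Path-based access: partition discovery lists under the root path.
>   val df = sqlContext.read.parquet("s3a://bucket/table")
>
>   // Persistent metastore table: registered once, then queried by name.
>   val df2 = sqlContext.table("test_table")
>   sqlContext.sql("select count(*) from test_table").show()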
>
>
>
> We have several tables internally that have 1000s of partitions and while
> it takes ~1 minute initially to discover the metadata, after that we are
> able to query the data interactively.
>
>
>
>
>
>
>
> On Sun, Aug 23, 2015 at 2:00 AM, Jerrick Hoang <jerrickho...@gmail.com>
> wrote:
>
> Does anybody have any suggestions?
>
>
>
> On Fri, Aug 21, 2015 at 3:14 PM, Jerrick Hoang <jerrickho...@gmail.com>
> wrote:
>
> Is there a workaround without updating Hadoop? I would really appreciate it
> if someone could explain what Spark is trying to do here and what an easy
> way to turn it off would be. Thanks all!
>
>
>
> On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
> Did you try with Hadoop version 2.7.1? It is known that s3a, which is
> available in 2.7, works really well with Parquet. They fixed a lot of
> issues related to metadata reading there...
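>
> If you do try s3a, a rough sketch of wiring it up from Spark (the bucket
> and credentials are placeholders; the property names are standard Hadoop
> 2.7 s3a settings):
>
>   sc.hadoopConfiguration.set("fs.s3a.impl",
>     "org.apache.hadoop.fs.s3a.S3AFileSystem")
>   sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
>   sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
>   // Then point reads at an s3a:// URI.
>   val df = sqlContext.read.parquet("s3a://your-bucket/table")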
>
> On Aug 21, 2015 11:24 PM, "Jerrick Hoang" <jerrickho...@gmail.com> wrote:
>
> @Cheng, Hao: The physical plan shows that it got stuck scanning S3!
>
>
>
> (table is partitioned by date_prefix and hour)
>
> explain select count(*) from test_table where date_prefix='20150819' and
> hour='00';
>
>
>
> TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], value=[(count(1),mode=Partial,isDistinct=false)])
>    Scan ParquetRelation[ .. <about 1000 partition paths go here> ]
>
>
>
> Why does Spark have to scan all partitions when the query only concerns
> one partition? Doesn't that defeat the purpose of partitioning?
>
>
>
> Thanks!
>
>
>
> On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver <philip.wea...@gmail.com>
> wrote:
>
> I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before, and
> I couldn't find much information about it online. What does it mean exactly
> to disable it? Are there any negative consequences to disabling it?
>
>
>
> On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
> Can you do some more profiling? I am wondering if the driver is busy
> scanning HDFS / S3.
>
> For example: jstack <pid of driver process>
>
>
>
> Also, it would be great if you could paste the physical plan for the
> simple query.
>
>
>
> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
> *Sent:* Thursday, August 20, 2015 1:46 PM
> *To:* Cheng, Hao
> *Cc:* Philip Weaver; user
> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
> partitions
>
>
>
> I cloned from TOT after the 1.5.0 cutoff. I noticed there were a couple of
> CLs trying to speed up Spark SQL for tables with a huge number of
> partitions. I've made sure that those CLs are included, but it's still very
> slow.
>
>
>
> On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
> Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to
> false.
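>
> For example, either at runtime:
>
>   sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")
>
> or at submit time with
> --conf spark.sql.sources.partitionDiscovery.enabled=false.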
>
>
>
> BTW, which version are you using?
>
>
>
> Hao
>
>
>
> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
> *Sent:* Thursday, August 20, 2015 12:16 PM
> *To:* Philip Weaver
> *Cc:* user
> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
> partitions
>
>
>
> I guess the question is: why does Spark have to do partition discovery on
> all partitions when the query only needs to look at one partition? Is there
> a conf flag to turn this off?
>
>
>
> On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver <philip.wea...@gmail.com>
> wrote:
>
> I've had the same problem. It turns out that Spark (specifically Parquet)
> is very slow at partition discovery. It got better in 1.5 (not yet
> released), but was still unacceptably slow. Sadly, we ended up reading
> Parquet files manually in Python (via C++) and had to abandon Spark SQL
> because of this problem.
>
>
>
> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang <jerrickho...@gmail.com>
> wrote:
>
> Hi all,
>
>
>
> I did a simple experiment with Spark SQL. I created a partitioned Parquet
> table with only one partition (date=20140701). A simple `select count(*)
> from table where date=20140701` ran very fast (0.1 seconds). However, as I
> added more partitions the query took longer and longer. When I added about
> 10,000 partitions, the query took way too long. I feel like querying a
> single partition should not be affected by having more partitions. Is this
> a known behaviour? What does Spark try to do here?
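>
> A sketch of the kind of setup I mean, with hypothetical paths:
>
>   // Write a table partitioned by date, then time a single-partition count.
>   val data = sqlContext.range(0, 1000).selectExpr("id", "20140701 as date")
>   data.write.partitionBy("date").parquet("/tmp/test_table")
>
>   val table = sqlContext.read.parquet("/tmp/test_table")
>   table.registerTempTable("test_table")
>   sqlContext.sql("select count(*) from test_table where date=20140701").show()
>
> Repeating the write with more date values grows the number of partition
> directories.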
>
>
>
> Thanks,
>
> Jerrick
>