RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Sereday, Scott
Can you please remove me from this distribution list? (Filling up my inbox too fast) From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Monday, August 24, 2015 2:13 PM To: Philip Weaver philip.wea...@gmail.com Cc: Jerrick Hoang jerrickho...@gmail.com; Raghavendra Pandey

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Jerrick Hoang
@Michael: would the listStatus calls read the actual Parquet footers within the folders? On Mon, Aug 24, 2015 at 11:36 AM, Sereday, Scott scott.sere...@nielsen.com wrote: Can you please remove me from this distribution list? (Filling up my inbox too fast) *From:* Michael Armbrust

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Michael Armbrust
No, starting with Spark 1.5 we should by default only be reading the footers on the executor side (that is, unless schema merging has been explicitly turned on). On Mon, Aug 24, 2015 at 12:20 PM, Jerrick Hoang jerrickho...@gmail.com wrote: @Michael: would the listStatus calls read the actual Parquet
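For reference, a minimal sketch (spark-shell style; the table path is a hypothetical placeholder) of how Parquet schema merging is controlled in the 1.5-era API. With merging off, which is the default, the driver should not need to read footers:

    // In spark-shell (SQLContext predefined); the path is hypothetical.
    // Schema merging forces footer reads, so keep it off unless needed:
    sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")

    // Or toggle it per read via the data source option:
    val df = sqlContext.read
      .option("mergeSchema", "false")
      .parquet("s3a://bucket/test_table")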

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Michael Armbrust
Follow the directions here: http://spark.apache.org/community.html On Mon, Aug 24, 2015 at 11:36 AM, Sereday, Scott scott.sere...@nielsen.com wrote: Can you please remove me from this distribution list? (Filling up my inbox too fast) *From:* Michael Armbrust

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Jerrick Hoang
Does anybody have any suggestions? On Fri, Aug 21, 2015 at 3:14 PM, Jerrick Hoang jerrickho...@gmail.com wrote: Is there a workaround without updating Hadoop? I would really appreciate it if someone could explain what Spark is trying to do here and what an easy way to turn it off is. Thanks all! On

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Philip Weaver
1 minute to discover 1000s of partitions -- yes, that is what I have observed. And I would assert that is very slow. On Sun, Aug 23, 2015 at 7:16 PM, Michael Armbrust mich...@databricks.com wrote: We should not be actually scanning all of the data of all of the partitions, but we do need to at

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Michael Armbrust
We should not actually be scanning all of the data in all of the partitions, but we do need to at least list all of the available directories so that we can apply your predicates to the actual values that are present when we are deciding which files need to be read in a given Spark job. While
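To make the listing-versus-scanning distinction concrete, here is a minimal sketch (paths and layout are hypothetical, matching the partitioning scheme described later in the thread): the predicate prunes the actual reads down to one partition, but the directory listing still has to enumerate every partition first.

    // In spark-shell. Hypothetical layout, one directory per partition:
    //   s3a://bucket/test_table/date_prefix=20150819/hour=00/part-*.parquet
    //   s3a://bucket/test_table/date_prefix=20150819/hour=01/part-*.parquet
    //   ...
    // Discovery lists all of these directories up front; the predicate
    // then restricts the actual Parquet reads to a single partition.
    val count = sqlContext.read.parquet("s3a://bucket/test_table")
      .where("date_prefix = '20150819' AND hour = '00'")
      .count()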

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-21 Thread Jerrick Hoang
@Cheng, Hao: The physical plan shows that it got stuck scanning S3! (The table is partitioned by date_prefix and hour.)
explain select count(*) from test_table where date_prefix='20150819' and hour='00';
TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
TungstenExchange

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-21 Thread Raghavendra Pandey
Did you try with Hadoop version 2.7.1? It is known that s3a, which is available in 2.7, works really well with Parquet; they fixed a lot of issues related to metadata reading there... On Aug 21, 2015 11:24 PM, Jerrick Hoang jerrickho...@gmail.com wrote: @Cheng, Hao: The physical plan shows that it
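A minimal sketch (bucket name and credentials are placeholders) of switching a job onto the s3a filesystem once a Hadoop 2.7+ hadoop-aws jar and the AWS SDK are on the classpath; these are the standard Hadoop s3a configuration keys:

    // In spark-shell; bucket and credentials are hypothetical placeholders.
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

    // Read the same table through s3a instead of s3n:
    val df = sqlContext.read.parquet("s3a://bucket/test_table")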

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-21 Thread Jerrick Hoang
Is there a workaround without updating Hadoop? I would really appreciate it if someone could explain what Spark is trying to do here and what an easy way to turn it off is. Thanks all! On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: Did you try with Hadoop

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Cheng, Hao
Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to false. BTW, which version are you using? Hao From: Jerrick Hoang [mailto:jerrickho...@gmail.com] Sent: Thursday, August 20, 2015 12:16 PM To: Philip Weaver Cc: user Subject: Re: Spark Sql behaves strangely with tables with
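A minimal sketch (spark-shell style) of applying the suggested flag; one caveat, not stated in the thread, is that disabling discovery means partition columns are no longer inferred from the directory layout:

    // In spark-shell. Skips the up-front scan of partition directories;
    // note that partition columns (e.g. date_prefix, hour) will no longer
    // be inferred from the path structure.
    sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")

    // Or at submit time:
    //   spark-submit --conf spark.sql.sources.partitionDiscovery.enabled=false ...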

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
I cloned from ToT (tip of tree) after the 1.5.0 cutoff. I noticed there were a couple of CLs trying to speed up Spark SQL with tables with a huge number of partitions; I've made sure that those CLs are included, but it's still very slow. On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao hao.ch...@intel.com wrote: Yes,

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Cheng, Hao
Can you do some more profiling? I am wondering if the driver is busy scanning HDFS / S3; e.g., run jstack against the pid of the driver process. Also, it would be great if you could paste the physical plan for the simple query. From: Jerrick Hoang [mailto:jerrickho...@gmail.com] Sent: Thursday, August
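A minimal sketch (spark-shell style, reusing the query shown above in the thread) of producing the physical plan being asked for; explain(true) prints the parsed, analyzed, optimized, and physical plans:

    // In spark-shell; prints the full set of plans to stdout.
    sqlContext.sql(
      "select count(*) from test_table where date_prefix='20150819' and hour='00'"
    ).explain(true)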

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
I guess the question is: why does Spark have to do partition discovery on all partitions when the query only needs to look at one partition? Is there a conf flag to turn this off? On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver philip.wea...@gmail.com wrote: I've had the same problem. It turns

Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
Hi all, I did a simple experiment with Spark SQL. I created a partitioned Parquet table with only one partition (date=20140701). A simple `select count(*) from table where date=20140701` would run very fast (0.1 seconds). However, as I added more partitions, the query took longer and longer. When
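A minimal sketch of reproducing the described setup (spark-shell style; df, the output path, and the temp-table name test_table are assumptions for illustration):

    // In spark-shell; df is a hypothetical DataFrame with a date column.
    // partitionBy writes one directory per date value, e.g.:
    //   /tmp/test_table/date=20140701/part-*.parquet
    df.write.partitionBy("date").parquet("/tmp/test_table")

    val t = sqlContext.read.parquet("/tmp/test_table")
    t.registerTempTable("test_table")
    sqlContext.sql("select count(*) from test_table where date=20140701").show()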

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Philip Weaver
I've had the same problem. It turns out that Spark (specifically Parquet) is very slow at partition discovery. It got better in 1.5 (not yet released), but was still unacceptably slow. Sadly, we ended up reading Parquet files manually in Python (via C++) and had to abandon Spark SQL because of