1 minute to discover 1000s of partitions -- yes, that is what I have observed. And I would assert that is very slow.
On Sun, Aug 23, 2015 at 7:16 PM, Michael Armbrust <mich...@databricks.com> wrote: > We should not be actually scanning all of the data of all of the > partitions, but we do need to at least list all of the available > directories so that we can apply your predicates to the actual values that > are present when we are deciding which files need to be read in a given > spark job. While this is a somewhat expensive operation, we do it in > parallel and we cache this information when you access the same relation > more than once. > > Can you provide a little more detail about how exactly you are accessing > the parquet data (are you using sqlContext.read or creating persistent > tables in the metastore?), and how long it is taking? It would also be > good to know how many partitions we are talking about and how much data is > in each. Finally, I'd like to see the stacktrace where it is hanging to > make sure my above assertions are correct. > > We have several tables internally that have 1000s of partitions and while > it takes ~1 minute initially to discover the metadata, after that we are > able to query the data interactively. > > > > On Sun, Aug 23, 2015 at 2:00 AM, Jerrick Hoang <jerrickho...@gmail.com> > wrote: > >> anybody has any suggestions? >> >> On Fri, Aug 21, 2015 at 3:14 PM, Jerrick Hoang <jerrickho...@gmail.com> >> wrote: >> >>> Is there a workaround without updating Hadoop? Would really appreciate >>> if someone can explain what spark is trying to do here and what is an easy >>> way to turn this off. Thanks all! >>> >>> On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey < >>> raghavendra.pan...@gmail.com> wrote: >>> >>>> Did you try with hadoop version 2.7.1 .. It is known that s3a works >>>> really well with parquet which is available in 2.7. They fixed lot of >>>> issues related to metadata reading there... >>>> On Aug 21, 2015 11:24 PM, "Jerrick Hoang" <jerrickho...@gmail.com> >>>> wrote: >>>> >>>>> @Cheng, Hao : Physical plans show that it got stuck on scanning S3! >>>>> >>>>> (table is partitioned by date_prefix and hour) >>>>> explain select count(*) from test_table where date_prefix='20150819' >>>>> and hour='00'; >>>>> >>>>> TungstenAggregate(key=[], >>>>> value=[(count(1),mode=Final,isDistinct=false)] >>>>> TungstenExchange SinglePartition >>>>> TungstenAggregate(key=[], >>>>> value=[(count(1),mode=Partial,isDistinct=false)] >>>>> Scan ParquetRelation[ .. <about 1000 partition paths go here> ] >>>>> >>>>> Why does spark have to scan all partitions when the query only >>>>> concerns with 1 partitions? Doesn't it defeat the purpose of partitioning? >>>>> >>>>> Thanks! >>>>> >>>>> On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver < >>>>> philip.wea...@gmail.com> wrote: >>>>> >>>>>> I hadn't heard of spark.sql.sources.partitionDiscovery.enabled >>>>>> before, and I couldn't find much information about it online. What does >>>>>> it >>>>>> mean exactly to disable it? Are there any negative consequences to >>>>>> disabling it? >>>>>> >>>>>> On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao <hao.ch...@intel.com> >>>>>> wrote: >>>>>> >>>>>>> Can you make some more profiling? I am wondering if the driver is >>>>>>> busy with scanning the HDFS / S3. >>>>>>> >>>>>>> Like jstack <pid of driver process> >>>>>>> >>>>>>> >>>>>>> >>>>>>> And also, it’s will be great if you can paste the physical plan for >>>>>>> the simple query. >>>>>>> >>>>>>> >>>>>>> >>>>>>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com] >>>>>>> *Sent:* Thursday, August 20, 2015 1:46 PM >>>>>>> *To:* Cheng, Hao >>>>>>> *Cc:* Philip Weaver; user >>>>>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot >>>>>>> of partitions >>>>>>> >>>>>>> >>>>>>> >>>>>>> I cloned from TOT after 1.5.0 cut off. I noticed there were a couple >>>>>>> of CLs trying to speed up spark sql with tables with a huge number of >>>>>>> partitions, I've made sure that those CLs are included but it's still >>>>>>> very >>>>>>> slow >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao <hao.ch...@intel.com> >>>>>>> wrote: >>>>>>> >>>>>>> Yes, you can try set the >>>>>>> spark.sql.sources.partitionDiscovery.enabled to false. >>>>>>> >>>>>>> >>>>>>> >>>>>>> BTW, which version are you using? >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hao >>>>>>> >>>>>>> >>>>>>> >>>>>>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com] >>>>>>> *Sent:* Thursday, August 20, 2015 12:16 PM >>>>>>> *To:* Philip Weaver >>>>>>> *Cc:* user >>>>>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot >>>>>>> of partitions >>>>>>> >>>>>>> >>>>>>> >>>>>>> I guess the question is why does spark have to do partition >>>>>>> discovery with all partitions when the query only needs to look at one >>>>>>> partition? Is there a conf flag to turn this off? >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver < >>>>>>> philip.wea...@gmail.com> wrote: >>>>>>> >>>>>>> I've had the same problem. It turns out that Spark (specifically >>>>>>> parquet) is very slow at partition discovery. It got better in 1.5 (not >>>>>>> yet >>>>>>> released), but was still unacceptably slow. Sadly, we ended up reading >>>>>>> parquet files manually in Python (via C++) and had to abandon Spark SQL >>>>>>> because of this problem. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang < >>>>>>> jerrickho...@gmail.com> wrote: >>>>>>> >>>>>>> Hi all, >>>>>>> >>>>>>> >>>>>>> >>>>>>> I did a simple experiment with Spark SQL. I created a partitioned >>>>>>> parquet table with only one partition (date=20140701). A simple `select >>>>>>> count(*) from table where date=20140701` would run very fast (0.1 >>>>>>> seconds). >>>>>>> However, as I added more partitions the query takes longer and longer. >>>>>>> When >>>>>>> I added about 10,000 partitions, the query took way too long. I feel >>>>>>> like >>>>>>> querying for a single partition should not be affected by having more >>>>>>> partitions. Is this a known behaviour? What does spark try to do here? >>>>>>> >>>>>>> >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Jerrick >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>> >> >