Re: Spark DataFrames use too many partitions
Hi, could you explain how you coalesce the partitions down to one to improve performance? Thanks.

On 2015-08-11 23:31, Al M wrote:
> Why do DataFrames expand out the partitions so much?
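For reference, a minimal PySpark sketch of the coalesce-after-join pattern Al M describes elsewhere in the thread. This requires a live Spark runtime; the file paths, join key, and app name are illustrative assumptions, not from the thread:

```python
# Hypothetical sketch: join two small single-partition DataFrames, then
# coalesce the post-shuffle result (spark.sql.shuffle.partitions wide,
# default 200) back down to a single partition.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="coalesce-after-join-sketch")
sqlContext = SQLContext(sc)

left = sqlContext.read.json("small_left.json")    # tiny file -> 1 partition
right = sqlContext.read.json("small_right.json")  # tiny file -> 1 partition

joined = left.join(right, left["id"] == right["id"])
# coalesce() narrows the partition count without triggering another shuffle:
joined = joined.coalesce(1)
```

Note that coalesce(1) serializes downstream work onto one task, which is only a win when the data is genuinely tiny, as in the poster's case.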
Re: Spark DataFrames use too many partitions
DataFrame parallelism is currently controlled through the configuration option spark.sql.shuffle.partitions; the default value is 200.

I have raised an improvement JIRA to make it possible to specify the number of partitions: https://issues.apache.org/jira/browse/SPARK-9872
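The option can also be set when launching the job; a hedged example (the value 8 and the script name are placeholders):

```shell
# Config fragment: lower the post-shuffle partition count at submit time.
spark-submit --conf spark.sql.shuffle.partitions=8 my_app.py
```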
RE: Spark DataFrames use too many partitions
Thank you Hao; that was a fantastic response. I have raised SPARK-9782 for this. I also would love to have dynamic partitioning; I mentioned it in the JIRA.

On 12 Aug 2015 02:19, "Cheng, Hao" wrote:
> That's a good question; we don't support reading small files in a single partition yet, but it's definitely an issue we need to optimize. Do you mind creating a JIRA issue for this?
Re: Spark DataFrames use too many partitions
Thanks Silvio!

On 11 Aug 2015 17:44, "Silvio Fiorito" wrote:
> You need to configure the spark.sql.shuffle.partitions parameter to a different value. It defaults to 200.
RE: Spark DataFrames use too many partitions
That's a good question; we don't support reading small files in a single partition yet, but it's definitely an issue we need to optimize. Do you mind creating a JIRA issue for this? Hopefully we can merge that into the 1.6 release.

200 is the default number of partitions for parallel tasks after a data shuffle, and we have to change that value according to the file size, cluster size, etc.

Ideally, this number would be set dynamically and automatically; however, Spark SQL doesn't support a complex cost-based model yet, particularly for multi-stage jobs (https://issues.apache.org/jira/browse/SPARK-4630).

Hao
Re: Spark DataFrames use too many partitions
You need to configure the spark.sql.shuffle.partitions parameter to a different value. It defaults to 200.

On 8/11/15, 11:31 AM, "Al M" wrote:
> Why do DataFrames expand out the partitions so much?
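Silvio's suggestion can also be applied programmatically, before any joins run. A sketch assuming a SQLContext is already in scope (as in a pyspark shell); the value 8 and the DataFrame names are arbitrary examples:

```python
# Lower the post-shuffle partition count for this session.
# `sqlContext`, `df1`, and `df2` are assumed to exist already.
sqlContext.setConf("spark.sql.shuffle.partitions", "8")

# Joins after this point produce 8 partitions instead of 200.
joined = df1.join(df2, df1["id"] == df2["id"])
```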
Spark DataFrames use too many partitions
I am using DataFrames with Spark 1.4.1. I really like DataFrames, but the partitioning makes no sense to me.

I am loading lots of very small files and joining them together. Every file is loaded by Spark with just one partition. Each time I join two small files, the partition count increases to 200. This makes my application take 10x as long as if I coalesce everything to 1 partition after each join.

With normal RDDs it would not expand out the partitions to 200 after joining two files with one partition each. It would either keep it at one or expand it to two.

Why do DataFrames expand out the partitions so much?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrames-uses-too-many-partition-tp24214.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
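One way to see the behaviour described above is to print the partition count at each step. A sketch requiring a live Spark runtime; the paths and join key are made up:

```python
# Inspect partition counts before and after a DataFrame join.
# `sqlContext` is assumed to exist, as in a pyspark shell session.
left = sqlContext.read.json("a.json")
right = sqlContext.read.json("b.json")

print(left.rdd.getNumPartitions())    # typically 1 for a tiny file

joined = left.join(right, left["id"] == right["id"])
print(joined.rdd.getNumPartitions())  # spark.sql.shuffle.partitions (default 200)
```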