Maciej: You can refer to the doc in sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala for these parameters.
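For reference, the logic behind those two parameters can be sketched without Spark: in the 2.0 preview, the file-scan planner derives a target split size from spark.sql.files.maxPartitionBytes, spark.sql.files.openCostInBytes, and the cluster's default parallelism. The object and method names below are illustrative, not Spark's API; the arithmetic follows the planner's approach as I understand it:

```scala
object SplitSizing {
  // Sketch (not Spark's API) of how the 2.0 file-scan planner derives a
  // target split size from the two configs referenced above.
  def maxSplitBytes(
      totalBytes: Long,        // total size of the input files
      numFiles: Long,          // how many files are being read
      defaultParallelism: Int, // available cores (sc.defaultParallelism)
      maxPartitionBytes: Long = 128L * 1024 * 1024, // spark.sql.files.maxPartitionBytes default
      openCostInBytes: Long = 4L * 1024 * 1024      // spark.sql.files.openCostInBytes default
  ): Long = {
    // Every file is "charged" an open cost so that many small files do
    // not explode into a huge number of tiny partitions.
    val bytesPerCore = (totalBytes + numFiles * openCostInBytes) / defaultParallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  def main(args: Array[String]): Unit = {
    // A single 1 GiB file on 200 cores: the target split shrinks below
    // 128 MB so that every core gets some work.
    println(maxSplitBytes(totalBytes = 1L << 30, numFiles = 1, defaultParallelism = 200))
  }
}
```

Raising maxPartitionBytes yields fewer, larger partitions; raising openCostInBytes packs more small files into each partition.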
On Tue, May 31, 2016 at 7:27 AM, Takeshi Yamamuro <linguin....@gmail.com> wrote:

> If you don't mind using the newest version, you can try v2.0-preview:
> http://spark.apache.org/news/spark-2.0.0-preview.html
>
> There, you can control the number of input partitions without shuffles
> via the two parameters below:
>   spark.sql.files.maxPartitionBytes
>   spark.sql.files.openCostInBytes
> (Not documented yet, though, ....)
>
> // maropu
>
> On Tue, May 31, 2016 at 11:08 PM, Maciej Sokołowski <matemac...@gmail.com> wrote:
>
>> After setting shuffle to true I get the expected 128 partitions, but I'm
>> worried about the performance of such a solution, especially since I can
>> see that some shuffling is done, because the partition sizes change:
>>
>> scala> sc.textFile("hdfs:///proj/dFAB_test/testdata/perf_test1.csv",
>>     minPartitions=128).coalesce(128, true).mapPartitions{rows =>
>>     Iterator(rows.length)}.collect()
>> res3: Array[Int] = Array(768, 768, 768, 768, 768, 768, 768, 768, 768,
>> 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768,
>> 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768,
>> 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768,
>> 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768,
>> 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768,
>> 768, 768, 768, 768, 768, 768, 768, 768, 768, 828, 896, 896, 896, 896, 896,
>> 896, 896, 896, 896, 896, 896, 896, 850, 786, 768, 768, 768, 768, 768, 768,
>> 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768)
>>
>> I use Spark 1.6.0.
>>
>> On 31 May 2016 at 16:02, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> The value of shuffle is false by default.
>>>
>>> Have you tried setting it to true?
>>>
>>> Which Spark release are you using?
>>>
>>> On Tue, May 31, 2016 at 6:13 AM, Maciej Sokołowski <matemac...@gmail.com> wrote:
>>>
>>>> Hello Spark users and developers.
>>>>
>>>> I read a file and want to ensure that it has an exact number of
>>>> partitions, for example 128.
>>>>
>>>> In the documentation I found:
>>>>
>>>> def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
>>>>
>>>> But the argument here is the minimal number of partitions, so I use
>>>> coalesce to ensure the desired number of partitions:
>>>>
>>>> def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]
>>>> // Return a new RDD that is reduced into numPartitions partitions.
>>>>
>>>> So I combine them and get a number of partitions lower than expected:
>>>>
>>>> scala> sc.textFile("perf_test1.csv",
>>>>     minPartitions=128).coalesce(128).getNumPartitions
>>>> res14: Int = 126
>>>>
>>>> Is this expected behaviour? The file contains 100000 lines; these are
>>>> the partition sizes before and after coalesce:
>>>>
>>>> scala> sc.textFile("perf_test1.csv",
>>>>     minPartitions=128).mapPartitions{rows => Iterator(rows.length)}.collect()
>>>> res16: Array[Int] = Array(782, 781, 782, 781, 781, 782, 781, 781, 781,
>>>> 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781,
>>>> 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782,
>>>> 781, 781, 781, 782, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782,
>>>> 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781,
>>>> 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781,
>>>> 781, 782, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781,
>>>> 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 782, 781, 781, 781,
>>>> 781, 782, 781, 781, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781)
>>>>
>>>> scala> sc.textFile("perf_test1.csv",
>>>>     minPartitions=128).coalesce(128).mapPartitions{rows =>
>>>>     Iterator(rows.length)}.collect()
>>>> res15: Array[Int] = Array(1563, 781, 781, 781, 782, 781, 781, 781, 781,
>>>> 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 781, 782, 781, 781, 781,
>>>> 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781,
>>>> 781, 782, 781, 781, 781, 781, 1563, 782, 781, 781, 782, 781, 781, 781, 781,
>>>> 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781,
>>>> 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781,
>>>> 781, 781, 782, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781,
>>>> 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 782,
>>>> 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782)
>>>>
>>>> So two partitions are double the size. Is this expected behaviour or is
>>>> it some kind of bug?
>>>>
>>>> Thanks,
>>>> Maciej Sokołowski
>
> --
> ---
> Takeshi Yamamuro
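To summarize the thread: coalesce(128) with shuffle = false only merges parent partitions (it never splits one), so its locality-aware grouping may produce fewer than the requested 128 partitions, each merged group appearing as one double-sized partition. repartition(128), which is equivalent to coalesce(128, shuffle = true), guarantees the exact count at the cost of a shuffle. The arithmetic can be sketched without Spark; the grouping below is hypothetical, chosen only to reproduce the observed outcome of 126 partitions with two 1563-line merges:

```scala
object CoalesceArithmetic {
  // 100000 lines over 128 input splits: 100000 / 128 = 781.25, which
  // matches the observed mix of 781- and 782-line partitions.
  val parents: Vector[Int] = Vector.fill(32)(782) ++ Vector.fill(96)(781)

  // Hypothetical locality grouping: the coalescer merges two pairs of
  // parents and leaves the rest as singleton groups.
  val groups: Vector[Vector[Int]] =
    Vector(Vector(0, 32), Vector(1, 33)) ++
      ((2 until 32) ++ (34 until 128)).toVector.map(i => Vector(i))

  // Without a shuffle, coalesce just concatenates the parents in each
  // group, so fewer than the requested number of partitions can result.
  val sizes: Vector[Int] = groups.map(g => g.map(parents).sum)

  def main(args: Array[String]): Unit = {
    println(sizes.length)            // 126 partitions, not 128
    println(sizes.count(_ == 1563))  // two double-sized partitions
    println(sizes.sum)               // 100000: nothing is lost, only merged
  }
}
```

This also explains why the shuffle = false result keeps the 781/782 sizes intact everywhere except the merged groups, whereas shuffle = true redistributes rows and produces the more uneven 768/896 mix seen above.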