So do you get 2171 as the output for that command? That command tells you
how many partitions your RDD has, so it’s good to first confirm that rdd1
has as many partitions as you think it has.
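
For what it's worth, the usual way to keep an upstream stage at full parallelism while still writing only a few output files is to make the final coalesce a shuffling one. A sketch against the Spark RDD API (paths, key/value types, and the map body are placeholders, not from your job):

```scala
// Passing shuffle = true to the final coalesce inserts a stage boundary,
// so the map() still runs as 100 tasks; only afterwards are the results
// shuffled down to 2 partitions for the save.
// coalesce(2, shuffle = true) is equivalent to repartition(2).
val rdd1 = sc.sequenceFile[String, String]("s3n://bucket/input/*") // placeholder path/types
val rdd2 = rdd1.coalesce(100)
val rdd3 = rdd2.map { case (k, v) => s"$k\t$v" }                   // placeholder transform
rdd3.coalesce(2, shuffle = true).saveAsTextFile("s3n://bucket/output")
```

The trade-off is that the shuffle moves all the map output across the network, whereas a plain coalesce(2) avoids that by collapsing the whole chain into 2 tasks, which is exactly the behavior you're seeing.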


On Tue, Jun 24, 2014 at 4:22 PM, Alex Boisvert <alex.boisv...@gmail.com>
wrote:

> It's actually a set of 2171 S3 files, with an average size of about 18MB.
>
>
> On Tue, Jun 24, 2014 at 1:13 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> What do you get for rdd1._jrdd.splits().size()? You might think you’re
>> getting > 100 partitions, but it may not be happening.
>>
>>
>>
>> On Tue, Jun 24, 2014 at 3:50 PM, Alex Boisvert <alex.boisv...@gmail.com>
>> wrote:
>>
>>> With the following pseudo-code,
>>>
>>> val rdd1 = sc.sequenceFile(...) // has > 100 partitions
>>> val rdd2 = rdd1.coalesce(100)
>>> val rdd3 = rdd2 map { ... }
>>> val rdd4 = rdd3.coalesce(2)
>>> rdd4.saveAsTextFile(...) // want only two output files
>>>
>>> I would expect the parallelism of the map() operation to be 100
>>> concurrent tasks, and the parallelism of the save() operation to be 2.
>>>
>>> However, it appears the parallelism of the entire chain is 2 -- I only
>>> see two tasks created for the save() operation and those tasks appear to
>>> execute the map() operation as well.
>>>
>>> Assuming what I'm seeing is as-specified (meaning, how things are meant
>>> to be), what's the recommended way to force a parallelism of 100 on the
>>> map() operation?
>>>
>>> thanks!
>>>
>>>
>>>
>>
>
