I am not familiar with your use case; is it possible to perform the randomized combination operation on a subset of the rows in rdd0? That way you can increase the parallelism.
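For example, if the combination only needs to see the rows of one partition at a time, something like the following sketch would keep the parallelism. This is an assumption about your use case, not your actual code: `rdd0` stands for your input RDD and `combine` is a placeholder for your real combination function.

```scala
import scala.util.Random

// Hypothetical sketch: combine rows randomly *within* each partition via
// mapPartitions, so the work stays spread across tasks instead of
// collapsing rdd0 to a single partition.
val rdd1 = rdd0.mapPartitions { iter =>
  val rows = iter.toArray                    // only the rows local to this partition
  val rng  = new Random()
  // emit as many combined rows as there were input rows in this partition
  Iterator.fill(rows.length) {
    val a = rows(rng.nextInt(rows.length))   // pick two local rows at random
    val b = rows(rng.nextInt(rows.length))
    combine(a, b)                            // placeholder for your combination logic
  }
}
```

Whether this is acceptable depends on whether combinations must be able to pair any two rows of the whole RDD, or only rows that happen to share a partition.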
Cheers

On Mon, Dec 21, 2015 at 9:40 AM, Zhiliang Zhu <zchl.j...@yahoo.com> wrote:

> Hi Ted,
>
> Thanks a lot for your kind reply.
>
> I need to convert this rdd0 into another rdd1, whose rows are generated by
> randomly combining rows of rdd0. From that perspective, rdd0 would need to
> have a single partition so the operation can see all of its rows at once,
> but that would also lose the parallelism benefit of Spark.
>
> Best wishes!
> Zhiliang
>
>
> On Monday, December 21, 2015 11:17 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> Have you tried the following method?
>
>  * Note: With shuffle = true, you can actually coalesce to a larger number
>  * of partitions. This is useful if you have a small number of partitions,
>  * say 100, potentially with a few partitions being abnormally large. Calling
>  * coalesce(1000, shuffle = true) will result in 1000 partitions with the
>  * data distributed using a hash partitioner.
>  */
> def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
>
> Cheers
>
> On Mon, Dec 21, 2015 at 2:47 AM, Zhiliang Zhu <zchl.j...@yahoo.com.invalid> wrote:
>
>> Dear All,
>>
>> For an RDD with just one partition, every operation on it runs as a single
>> task, so the RDD loses all the parallelism benefit of the Spark system...
>>
>> Is that exactly the case?
>>
>> Thanks very much in advance!
>> Zhiliang
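The `coalesce` variant quoted in the thread can be sketched as follows. This is a minimal illustration, assuming `rdd0` is some existing RDD and that the chosen partition count (8 here) is arbitrary:

```scala
// Sketch of coalesce with shuffle = true, per the Scaladoc quoted above:
// starting from a single-partition RDD, passing shuffle = true lets
// coalesce *increase* the partition count, redistributing the data with
// a hash partitioner so downstream stages run in parallel again.
val singlePartition = rdd0.coalesce(1)                    // all rows in one task
val respread = singlePartition.coalesce(8, shuffle = true) // back to 8 partitions
```

Without `shuffle = true`, `coalesce` can only reduce the number of partitions, so the second call would silently stay at one partition.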