I am not familiar with your use case; is it possible to perform the randomized combination operation on a subset of the rows in rdd0? That way you can increase the parallelism.
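For example, if the combination only needs to see the rows of one partition at a time, something like the following sketch would keep the parallelism. This is an assumption about your use case, not your actual code: `rdd0` stands for your input RDD and `combine` is a placeholder for your real combination function.

```scala
import scala.util.Random

// Hypothetical sketch: combine rows randomly *within* each partition via
// mapPartitions, so the work stays spread across tasks instead of
// collapsing rdd0 to a single partition.
val rdd1 = rdd0.mapPartitions { iter =>
  val rows = iter.toArray                    // only the rows local to this partition
  val rng  = new Random()
  // emit as many combined rows as there were input rows in this partition
  Iterator.fill(rows.length) {
    val a = rows(rng.nextInt(rows.length))   // pick two local rows at random
    val b = rows(rng.nextInt(rows.length))
    combine(a, b)                            // placeholder for your combination logic
  }
}
```

Whether this is acceptable depends on whether combinations must be able to pair any two rows of the whole RDD, or only rows that happen to share a partition.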
Cheers

On Mon, Dec 21, 2015 at 9:40 AM, Zhiliang Zhu <zchl.j...@yahoo.com> wrote:

> Hi Ted,
>
> Thanks a lot for your kind reply.
>
> I need to convert this rdd0 into another rdd1, whose rows are generated by
> randomly combining rows of rdd0. From that perspective, rdd0 would need to
> have a single partition so the operation can see all of its rows at once,
> but that would also lose the parallelism benefit of Spark.
>
> Best wishes!
> Zhiliang
>
>
> On Monday, December 21, 2015 11:17 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> Have you tried the following method?
>
>  * Note: With shuffle = true, you can actually coalesce to a larger number
>  * of partitions. This is useful if you have a small number of partitions,
>  * say 100, potentially with a few partitions being abnormally large. Calling
>  * coalesce(1000, shuffle = true) will result in 1000 partitions with the
>  * data distributed using a hash partitioner.
>  */
> def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
>
> Cheers
>
> On Mon, Dec 21, 2015 at 2:47 AM, Zhiliang Zhu <zchl.j...@yahoo.com.invalid> wrote:
>
>> Dear All,
>>
>> For an RDD with just one partition, every operation on it runs as a single
>> task, so the RDD loses all the parallelism benefit of the Spark system...
>>
>> Is that exactly the case?
>>
>> Thanks very much in advance!
>> Zhiliang
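The `coalesce` variant quoted in the thread can be sketched as follows. This is a minimal illustration, assuming `rdd0` is some existing RDD and that the chosen partition count (8 here) is arbitrary:

```scala
// Sketch of coalesce with shuffle = true, per the Scaladoc quoted above:
// starting from a single-partition RDD, passing shuffle = true lets
// coalesce *increase* the partition count, redistributing the data with
// a hash partitioner so downstream stages run in parallel again.
val singlePartition = rdd0.coalesce(1)                    // all rows in one task
val respread = singlePartition.coalesce(8, shuffle = true) // back to 8 partitions
```

Without `shuffle = true`, `coalesce` can only reduce the number of partitions, so the second call would silently stay at one partition.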