For the record, I tried this, and it worked.
On Wed, Mar 26, 2014 at 10:51 AM, Walrus theCat walrusthe...@gmail.com wrote:
Oh, so if I had something more reasonable, like RDDs full of tuples of,
say, (Int,Set,Set), I could expect a more uniform distribution?
Thanks
On Mon, Mar 24, 2014 at 11:11 PM, Matei Zaharia
matei.zaha...@gmail.com wrote:
This happened because the values were all integers equal to 0 mod 5, and we
used the default hashCode implementation for integers (the value itself), so
with 5 partitions they all map to partition 0. There's no API method that will
look at the resulting partition sizes and rebalance them, but you could use
another hash function.
Matei
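To make the explanation above concrete, here is a minimal plain-Scala sketch (no Spark needed) of the bucket arithmetic. It assumes a partition is chosen as hashCode mod the number of partitions, as described above. The names PartitionSketch, nonNegativeMod, byHashCode and byPosition are made up for illustration and are not Spark APIs.

```scala
object PartitionSketch {
  // Non-negative remainder, so negative hash codes still map to a valid bucket.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Assumed default behaviour: bucket = hashCode mod numPartitions.
  // For an Int, hashCode is the value itself.
  def byHashCode(values: Seq[Int], numPartitions: Int): Map[Int, Seq[Int]] =
    values.groupBy(v => nonNegativeMod(v.hashCode, numPartitions))

  // One possible alternative "hash": use each value's position in the
  // sequence as its key, which spreads the values round-robin.
  def byPosition(values: Seq[Int], numPartitions: Int): Map[Int, Seq[Int]] =
    values.zipWithIndex
      .groupBy { case (_, idx) => nonNegativeMod(idx, numPartitions) }
      .map { case (bucket, pairs) => bucket -> pairs.map(_._1) }

  def main(args: Array[String]): Unit = {
    val multiples = (0 until 100).filter(_ % 20 == 0) // 0, 20, 40, 60, 80
    println(byHashCode(multiples, 5)) // every value lands in bucket 0
    println(byPosition(multiples, 5)) // one value per bucket
  }
}
```

In Spark itself, the analogous move would presumably be to key each element by something that spreads out and partition on that key, for example rdd.map(x => (x / 20, x)).partitionBy(new HashPartitioner(5)).values. That line is a suggestion rather than something tested in this thread.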
On Mar 24, 2014, at 5:20 PM, Walrus theCat walrusthe...@gmail.com
wrote:
Hi,
sc.parallelize(Array.tabulate(100)(i => i)).filter(_ % 20 == 0)
.coalesce(5, true).glom.collect yields
Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(),
Array(), Array())
How do I get something more like:
Array(Array(0), Array(20), Array(40), Array(60), Array(80))
Thanks