Re: coalescing RDD into equally sized partitions

2014-03-26 Thread Walrus theCat
For the record, I tried this, and it worked.


On Wed, Mar 26, 2014 at 10:51 AM, Walrus theCat walrusthe...@gmail.com wrote:

 Oh, so if I had something more reasonable, like RDDs full of tuples of,
 say, (Int, Set, Set), I could expect a more uniform distribution?

 Thanks


 On Mon, Mar 24, 2014 at 11:11 PM, Matei Zaharia
 matei.zaha...@gmail.com wrote:

 This happened because they were integers equal to 0 mod 5, and we used
 the default hashCode implementation for integers, which will map them all
 to 0. There's no API method that will look at the resulting partition sizes
 and rebalance them, but you could use another hash function.

 Matei

 On Mar 24, 2014, at 5:20 PM, Walrus theCat walrusthe...@gmail.com
 wrote:

  Hi,
 
  sc.parallelize(Array.tabulate(100)(i => i)).filter(_ % 20 == 0
 ).coalesce(5, true).glom.collect yields
 
  Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(),
 Array(), Array())
 
  How do I get something more like:
 
   Array(Array(0), Array(20), Array(40), Array(60), Array(80))
 
  Thanks
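
The cause Matei describes — Int's hashCode modulo the partition count sending every multiple of 20 to bucket 0 — can be sketched in plain Scala, no Spark required. The `nonNegativeMod` helper below is an assumption of this sketch, intended to mirror the modulo that Spark's HashPartitioner applies to `key.hashCode`:

```scala
// Plain-Scala sketch of the hashing behavior described above.
// nonNegativeMod mirrors the modulo Spark's HashPartitioner applies
// to key.hashCode (an assumption of this sketch, not code from the thread).
object HashPartitionDemo {
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val r = x % mod
    if (r < 0) r + mod else r
  }

  def main(args: Array[String]): Unit = {
    val keys = Array(0, 20, 40, 60, 80)
    // Int.hashCode is the value itself, and every multiple of 20 is
    // 0 mod 5, so all five keys collapse into partition 0.
    println(keys.map(k => nonNegativeMod(k.hashCode, 5)).mkString(","))  // 0,0,0,0,0
    // Keying by position instead spreads them across all 5 partitions.
    println(keys.indices.map(i => nonNegativeMod(i, 5)).mkString(","))   // 0,1,2,3,4
  }
}
```

This also shows why tuples like (Int, Set, Set) tend to distribute better: their hashCodes mix several fields rather than being the raw integer.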





coalescing RDD into equally sized partitions

2014-03-24 Thread Walrus theCat
Hi,

sc.parallelize(Array.tabulate(100)(i => i)).filter(_ % 20 == 0
).coalesce(5, true).glom.collect yields

Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(),
Array(), Array())

How do I get something more like:

 Array(Array(0), Array(20), Array(40), Array(60), Array(80))

Thanks
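
One possible workaround (a sketch of my own, not something proposed in the thread): give each element a distinct key, for example via zipWithIndex, and partition on that key instead of the value — in Spark, roughly rdd.zipWithIndex.map(_.swap).partitionBy(new HashPartitioner(5)).values. The round-robin placement that produces can be illustrated in plain Scala:

```scala
// Plain-Scala illustration of round-robin placement by element index,
// simulating zipWithIndex + partition-by-key (a sketch, not thread code).
object RoundRobinDemo {
  def roundRobin(values: Seq[Int], numPartitions: Int): Array[List[Int]] = {
    val parts = Array.fill(numPartitions)(List.newBuilder[Int])
    values.zipWithIndex.foreach { case (v, i) =>
      parts(i % numPartitions) += v  // index mod n picks the partition
    }
    parts.map(_.result())
  }

  def main(args: Array[String]): Unit = {
    val values = (0 until 100).filter(_ % 20 == 0)  // 0, 20, 40, 60, 80
    println(roundRobin(values, 5)
      .map(_.mkString("Array(", ", ", ")"))
      .mkString("Array(", ", ", ")"))
    // Array(Array(0), Array(20), Array(40), Array(60), Array(80))
  }
}
```

Because the keys 0 through 4 each take a distinct value mod 5, each element lands in its own partition, matching the desired output.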