I think it would be much easier to do this with the DataFrame API, by doing a local sort within each partition using randn() as the sort key. For example, in Spark 2.0:

val df = spark.range(100)
val shuffled = df.repartition($"id" % 10).sortWithinPartitions(randn(42))

Replace df with a DataFrame wrapping your RDD, and $"id" % 10 with the key you group by; then you can get the RDD back from shuffled and carry out the remaining operations you need.
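
In concrete terms it might look roughly like this (just a sketch; Sample, model_id, and my_input_rdd stand in for your own types and names):

import org.apache.spark.sql.functions.randn
import spark.implicits._

// Assuming Sample is a case class with a model_id field and my_input_rdd is an RDD[Sample].
val df = spark.createDataFrame(my_input_rdd)

// Co-locate each model's rows, then randomly order the rows within each partition.
val shuffled = df.repartition($"model_id").sortWithinPartitions(randn(42))

// Back to an RDD for the per-model SGD step.
val shuffledRdd = shuffled.as[Sample].rdd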

Cheng


On 10/20/16 10:53 AM, Yang wrote:
In my application, I group training samples by their model_id (the input table contains training samples for 100k different models); each group ends up with about 1 million training samples.

Then I feed each group of samples to a small Logistic Regression solver (SGD). SGD requires the input data to be randomly shuffled (so that positive and negative samples are evenly distributed), so I currently do something like:

my_input_rdd.groupBy(x => x.model_id).map { x =>
  val (model_id, group_of_rows) = x
  (model_id, scala.util.Random.shuffle(group_of_rows.toSeq))
}.map(x => (x._1, train_sgd(x._2)))


The issue is that on the 3rd line above I have to explicitly call toSeq on group_of_rows (which is an Iterable, not a Seq) in order to shuffle it. That forces the entire 1 million rows into memory, and in practice I've seen my tasks OOM and GC time go crazy (about 50% of total run time). I suspect this toSeq is the reason, since a simple count() on the groupBy() result works fine.

I am planning to shuffle my_input_rdd first, then groupBy(), and drop the toSeq/shuffle step. Intuitively the input RDD is already shuffled, so UNLESS groupBy() does some sorting, the rows in each group SHOULD remain shuffled? But overall this approach feels rather flimsy.
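
Roughly what I have in mind (just a sketch, assuming train_sgd can take each group's rows directly):

import scala.util.Random

// Pre-shuffle the whole input by sorting on a random key, then group;
// the hope is that row order within each group stays random.
val preShuffled = my_input_rdd.sortBy(_ => Random.nextDouble())
val trained = preShuffled.groupBy(_.model_id).map { case (modelId, rows) => (modelId, train_sgd(rows)) }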

Any ideas on how to do this more reliably?

thanks!


