subject:"RDD groupBy then random sort each group \?"

Re: RDD groupBy() then random sort each group ?

2016-10-23 Thread Yang

Thanks, this direction seems to be in line with what I want. What I really want is groupBy() and then, for the rows in each group, to get an Iterator and run each element from the iterator through a local function (specifically SGD). Right now the Dataset API provides this, but it's literally an
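For readers following along, here is a minimal sketch of the "group, then stream an Iterator through a local function" pattern using `groupByKey` and `mapGroups` from the Dataset API. The `Sample` case class, its field names, and `sgdPass` are hypothetical and exist only for illustration; `sgdPass` is a one-dimensional least-squares stand-in, not the logistic regression solver from the thread. Note that `mapGroups` makes no promise about the order of rows inside a group, so this alone does not give a random order.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; the field names are assumptions for this sketch.
case class Sample(modelId: Int, x: Double, y: Double)

object MapGroupsSketch {
  // Stand-in for one SGD pass: fits y ~ w * x with a fixed learning rate,
  // consuming the iterator once without buffering it.
  def sgdPass(rows: Iterator[Sample], lr: Double = 0.01): Double =
    rows.foldLeft(0.0) { (w, s) => w - lr * 2.0 * (w * s.x - s.y) * s.x }

  // Groups samples by model id and runs each group's iterator through sgdPass.
  def fit(spark: SparkSession, data: Seq[Sample]): Map[Int, Double] = {
    import spark.implicits._
    data.toDS()
      .groupByKey(_.modelId)
      .mapGroups((modelId: Int, rows: Iterator[Sample]) => (modelId, sgdPass(rows)))
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("map-groups-sketch")
      .getOrCreate()
    // Synthetic data: for model m the true weight is m.
    val data = for (m <- 1 to 3; i <- 1 to 200)
      yield Sample(m, (i % 10) / 10.0, m * (i % 10) / 10.0)
    fit(spark, data).toSeq.sortBy(_._1).foreach(println)
    spark.stop()
  }
}
```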

Re: RDD groupBy() then random sort each group ?

2016-10-23 Thread Yang

Thanks. This is exactly what I ended up doing. Though it seemed to work, there seems to be no guarantee that the randomness after sortWithinPartitions() would be preserved after I do a further groupBy. On Fri, Oct 21, 2016 at 3:55 PM, Cheng Lian wrote: > I think

Re: RDD groupBy() then random sort each group ?

2016-10-22 Thread Koert Kuipers

groupBy always materializes the entire group (on disk or in memory) which is why you should avoid it for large groups. The key is to never materialize the grouped and shuffled data. To see one approach to do this take a look at https://github.com/tresata/spark-sorted It's basically a
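One way to avoid materializing each group, sketched here with Spark's built-in `repartitionAndSortWithinPartitions` rather than the spark-sorted API: key each record by (group id, random number), partition on the group id alone, and let the shuffle sort by the composite key. Each partition then yields every group as a contiguous, randomly ordered run that can be consumed as a stream. `GroupPartitioner`, `shuffledRunningMean`, and the running mean (a placeholder for a real SGD update) are hypothetical names introduced for this illustration.

```scala
import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.util.Random

// Partitions on the group id only, ignoring the random part of the composite key.
class GroupPartitioner(parts: Int) extends Partitioner {
  private val delegate = new HashPartitioner(parts)
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = key match {
    case (groupId, _) => delegate.getPartition(groupId)
  }
}

object SecondarySortSketch {
  // Returns one (modelId, running mean) pair per group, visiting each group's
  // samples in random order and never holding a whole group in memory.
  def shuffledRunningMean(samples: RDD[(Int, Double)], parts: Int, seed: Long): RDD[(Int, Double)] = {
    // Attach a random tiebreaker to every record; seed differs per input partition.
    val keyed = samples.mapPartitionsWithIndex { (idx, it) =>
      val rng = new Random(seed + idx)
      it.map { case (modelId, sample) => ((modelId, rng.nextLong()), sample) }
    }
    // Tuple ordering sorts by modelId first, then by the random tiebreaker.
    val sorted = keyed.repartitionAndSortWithinPartitions(new GroupPartitioner(parts))
    sorted.mapPartitions { it =>
      val rows = it.buffered
      new Iterator[(Int, Double)] {
        def hasNext: Boolean = rows.hasNext
        def next(): (Int, Double) = {
          val modelId = rows.head._1._1
          var mean = 0.0
          var n = 0L
          // Consume the contiguous run of rows that belong to this model.
          while (rows.hasNext && rows.head._1._1 == modelId) {
            n += 1
            mean += (rows.next()._2 - mean) / n
          }
          (modelId, mean)
        }
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("secondary-sort-sketch")
      .getOrCreate()
    val samples = spark.sparkContext.parallelize(
      for (m <- 0 until 5; s <- 0 until 20) yield (m, s.toDouble))
    shuffledRunningMean(samples, parts = 4, seed = 42L).collect().sorted.foreach(println)
    spark.stop()
  }
}
```

The per-group state here is two local variables, so memory use per task does not grow with group size; a real solver would replace the running mean with its own weight update.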

Re: RDD groupBy() then random sort each group ?

2016-10-21 Thread Cheng Lian

I think it would be much easier to use the DataFrame API to do this by doing a local sort using randn() as the key. For example, in Spark 2.0:

    val df = spark.range(100)
    val shuffled = df.repartition($"id" % 10).sortWithinPartitions(randn(42))

Replace df with a DataFrame wrapping your RDD, and $"id" % 10
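A self-contained variant of this suggestion is sketched below. It differs from the snippet in the message in one respect, which is my own assumption about what is wanted: the partition-local sort uses the group key first and `randn` second, so that when several keys land in the same partition the rows of each group stay contiguous while still being randomly ordered within the group. The helper name `shuffleWithinGroups` and the `group` column are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, randn}

object ShuffleWithinGroupsSketch {
  // Co-locates rows that share keyCol, then sorts each partition by
  // (keyCol, seeded random value) so every group is contiguous and shuffled.
  def shuffleWithinGroups(df: DataFrame, keyCol: String, seed: Long): DataFrame =
    df.repartition(col(keyCol))
      .sortWithinPartitions(col(keyCol), randn(seed))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("shuffle-within-groups-sketch")
      .getOrCreate()
    // Stand-in data: 100 ids spread over 10 groups.
    val df = spark.range(100).withColumn("group", col("id") % 10)
    shuffleWithinGroups(df, "group", 42L).show(20)
    spark.stop()
  }
}
```

Any later operation that shuffles the data again, such as a further groupBy, is free to reorder rows, so the group should be consumed directly from the sorted partitions (for example with mapPartitions).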

RDD groupBy() then random sort each group ?

2016-10-20 Thread Yang

In my application, I group training samples by their model_id (the input table contains training samples for 100k different models), so each group ends up having about 1 million training samples. I then feed that group of samples to a small logistic regression solver (SGD), but SGD

5 matches
