Re: GroupByKey causing problem

2015-02-26 Thread Imran Rashid
Hi Tushar, the most scalable option is probably to do some approximation. E.g., sample the data first to come up with the bucket boundaries. Then you can assign data points to buckets without needing to do a full groupByKey. You could even have more passes which correct any
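
A minimal sketch of the sample-then-assign idea described above, in plain Python rather than Spark so it runs on its own. The helper names (`bucket_boundaries`, `assign_bucket`), the equal-frequency bucketing rule, the sample size, and the data are all assumptions for illustration; the thread does not say which binning rule is wanted.

```python
import bisect
import random


def bucket_boundaries(sample, num_buckets):
    """Estimate equal-frequency cut points from a sample (num_buckets - 1 values)."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_buckets] for i in range(1, num_buckets)]


def assign_bucket(value, boundaries):
    """Return the index of the bucket that value falls into, via binary search."""
    return bisect.bisect_right(boundaries, value)


rng = random.Random(0)
# Made-up values standing in for one column of the CSV.
column = [rng.uniform(0.0, 100.0) for _ in range(10_000)]

# Pass 1: a small random sample stands in for the full column.
sample = rng.sample(column, 500)
boundaries = bucket_boundaries(sample, num_buckets=4)

# Pass 2: each value is bucketed independently of the others,
# so nothing has to be grouped onto a single node.
counts = [0] * 4
for v in column:
    counts[assign_bucket(v, boundaries)] += 1
```

In Spark, the first pass would roughly correspond to sampling the RDD and collecting the small sample to the driver, and the second to a plain `map` over the data using the (broadcast) boundaries. The bucket sizes are only approximately equal, which is the trade-off for avoiding the shuffle.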

GroupByKey causing problem

2015-02-26 Thread Tushar Sharma
Hi, I am trying to apply binning to a large CSV dataset. Here are the steps I am taking: 1. Emit each value of the CSV as (ColIndex, (RowIndex, value)). 2. groupByKey on the key (here ColIndex) to bring all values of a particular column index to one node, as I have to work on the collection of all values. 3. I