Hi Tushar,
The most scalable option is probably to use some form of approximation.
E.g., sample the data first to come up with the bucket boundaries. Then
you can assign data points to buckets without needing to do a full
groupByKey. You could even have more passes that correct any
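
To make the sampling idea concrete, here is a rough sketch in plain
Python (deliberately not Spark code, so it runs anywhere). The column
data, the sample size of 1,000, the choice of 4 equal-frequency buckets
and the helper names bucket_boundaries / assign_bucket are all
assumptions made up for illustration. In Spark you would presumably take
the sample with RDD.sample, broadcast the boundaries, and do the
assignment in a map, but treat that mapping as a suggestion rather than
tested code.

```python
import bisect
import random


def bucket_boundaries(sample, num_buckets):
    """Estimate equal-frequency cut points from a sample.

    Returns num_buckets - 1 sorted boundaries (hypothetical helper).
    """
    ordered = sorted(sample)
    n = len(ordered)
    return [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]


def assign_bucket(value, boundaries):
    """Return the bucket index for a value using binary search.

    Each value is handled independently, so no grouping or shuffle of
    the whole column is needed for this step.
    """
    return bisect.bisect_right(boundaries, value)


# Hypothetical stand-in for one numeric CSV column.
random.seed(0)
column = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Pass 1: look only at a small sample to get approximate boundaries.
sample = random.sample(column, 1_000)
boundaries = bucket_boundaries(sample, num_buckets=4)

# Pass 2: assign every value to a bucket, one value at a time.
counts = [0] * 4
for v in column:
    counts[assign_bucket(v, boundaries)] += 1

print(boundaries)
print(counts)
```

Because the boundaries come from a sample, the bucket counts will only
be roughly equal; a larger sample narrows the gap at the cost of a more
expensive first pass.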
Hi,
I am trying to apply binning to a large CSV dataset. Here are the steps I
am taking:
1. Emit each value of the CSV as (ColIndex,(RowIndex,value))
2. Then I groupByKey (the key here is ColIndex) to get all values of a
particular column onto one node, as I have to work on the collection of
all values
3. I