If your input data is line-delimited text, you can randomize it using sort.

    sort -R, --random-sort
        sort by random hash of keys

Use gsort on *bsd.

-R

On Thu, Mar 12, 2009 at 12:33 PM, stack <[email protected]> wrote:

> On Thu, Mar 12, 2009 at 8:05 AM, Mat Hofschen <[email protected]> wrote:
>
> > ...
> >
> > I can see that our hardware will not be sufficient. For now this is a
> > testlab setup and will have to be upgraded.
>
> Unless you do as Jon suggests and split your cluster. Apart from Jon's
> suggestion, you might also consider running the regionservers and datanodes
> together and tasktrackers elsewhere.
>
> > One more question to understand the scenario better:
> > I have 120 reduce jobs running on all nodes and there is only one node that
> > hosts the initial region. Then all 120 reduce jobs are trying to write to
> > this one machine?
>
> Yes.
>
> > What happens then if the region is split? Do some of the
> > Reduce Jobs notice that write ops go to a new region, or are they still
> > writing to the first region which then redirects traffic?
>
> All reducers notice the split and will write to the appropriate region. For
> example, on first split, assuming your MR job sorted rows, half the load
> should go to the first region and the other half to the second.
>
> St.Ack
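
If sort -R is not available, the randomization suggested at the top of this message can be approximated in a few lines of Python. This is only a rough sketch, not what sort does internally: it uses an in-memory uniform shuffle (random.shuffle) rather than sort's ordering by a random hash of keys, so unlike sort -R it does not keep identical lines next to each other. The function name shuffle_lines and the seed parameter are made up for illustration, and the whole input is assumed to fit in memory.

```python
import random


def shuffle_lines(lines, seed=None):
    """Return the given lines in random order (uniform in-memory shuffle).

    Hypothetical helper for illustration: assumes the whole input fits in
    memory, and is not a reimplementation of `sort -R`.
    """
    rng = random.Random(seed)  # optional seed gives a repeatable order
    shuffled = list(lines)     # copy so the caller's list is left untouched
    rng.shuffle(shuffled)      # in-place shuffle of the copy
    return shuffled


# Small demo: ten sorted row keys come back in random order.
rows = ["row-%04d" % i for i in range(10)]
print("\n".join(shuffle_lines(rows)))
```

To use it as a filter, pass it sys.stdin.read().splitlines() and write the result back out one line at a time. The point is the same as with sort: rows that arrive in random order are less likely to all land in one key range at once.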
