If your input data is line-delimited text, you can randomize it using sort.
From the GNU sort man page:

    -R, --random-sort    sort by random hash of keys

Use gsort on *bsd.
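
A minimal sketch of what that looks like, assuming GNU sort is on the PATH
(substitute gsort on *bsd). The file names and sample rows are made up for
illustration. Note that because -R orders by a hash of the key, identical
lines end up next to each other, so this is not a true shuffle if the input
contains duplicates.

```shell
# Build a small, sorted sample input (hypothetical file name and rows).
printf '%s\n' row-001 row-002 row-003 row-004 row-005 > sorted_keys.txt

# GNU sort: order the lines by a random hash of each key.
sort -R sorted_keys.txt > shuffled_keys.txt

# Sanity check: the shuffled file holds the same lines, only reordered.
sort shuffled_keys.txt | cmp -s - sorted_keys.txt && echo "same lines"
```

The shuffled file can then be used as the job input in place of the sorted one.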

 -R




On Thu, Mar 12, 2009 at 12:33 PM, stack <[email protected]> wrote:

> On Thu, Mar 12, 2009 at 8:05 AM, Mat Hofschen <[email protected]> wrote:
>
> > ...
> >
> > I can see that our hardware will not be sufficient. For now this is a
> > testlab setup and will have to be upgraded.
> >
>
> Unless you do as Jon suggests and split your cluster.  Apart from Jon's
> suggestion, you might also consider running the regionservers and datanodes
> together and tasktrackers elsewhere.
>
>
>
> > One more question to understand the scenario better:
> > I have 120 reduce jobs running on all nodes and there is only one node
> > that hosts the initial region. Then all 120 reduce jobs are trying to
> > write to this one machine?
>
>
> Yes.
>
>
> > What happens then if the region is split? Do some of the
> > Reduce Jobs notice that write ops go to a new region, or are they still
> > writing to the first region which then redirects traffic?
>
>
> All reducers notice the split and will write to the appropriate region.
> For example, on first split, assuming your MR job sorted rows, half the
> load should go to the first region and the other half to the second.
>
> St.Ack
>
