Re: Sorting ...

Robert Evans Thu, 26 May 2011 08:34:59 -0700

Also, if you want something that is fairly fast and a lot less dev work to get
going, you might want to look at Pig. It can do a distributed ORDER BY that
is fairly good.

--Bobby Evans

On 5/26/11 2:45 AM, "Luca Pireddu" <pire...@crs4.it> wrote:

On May 25, 2011 22:15:50 Mark question wrote:
> I'm using SequenceFileInputFormat, but then what to write in my mappers?
>
>   Each mapper takes a split of the sequence file input and then sorts only
> its own split?! I don't want that.
>
> Thanks,
> Mark
>
> On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu <pire...@crs4.it> wrote:
> > On May 25, 2011 01:43:22 Mark question wrote:
> > > Thanks Luca, but what other way to sort a directory of sequence files?
> > >
> > > I don't plan to write a sorting algorithm in mappers/reducers, but I'm
> > > hoping to use SequenceFile.Sorter instead.
> > >
> > > Any ideas?
> > >
> > > Mark
> >

If you want to achieve a global sort, then look at how TeraSort does it:

http://sortbenchmark.org/YahooHadoop.pdf

The idea is to partition the data so that all keys in part[i] are < all keys
in part[i+1].  Each partition is individually sorted, so to read the data in
globally sorted order you simply have to traverse it starting from the first
partition and working your way to the last one.
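
To make the partitioning idea concrete, here is a minimal sketch in plain
Python. It is not Hadoop code and not how TeraSort is actually implemented:
the function names (pick_split_points, partition_for, global_sort) and the
sample size are assumptions chosen for illustration. It samples the keys to
pick boundary keys, routes each key to the partition whose range contains it,
and sorts each partition independently, the way each reducer would.

```python
import bisect
import random


def pick_split_points(sample_keys, num_partitions):
    """Pick num_partitions - 1 boundary keys from a sample of the input."""
    ordered = sorted(sample_keys)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    # Evenly spaced positions in the sorted sample become the boundaries.
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Return the index of the partition whose key range contains key.

    Keys equal to a boundary go to the higher partition, so every key in
    partition i is strictly smaller than every key in partition i + 1.
    """
    return bisect.bisect_right(split_points, key)


def global_sort(keys, num_partitions, sample_size=100, seed=0):
    """Range-partition keys, then sort each partition on its own."""
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    split_points = pick_split_points(sample, num_partitions)

    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[partition_for(key, split_points)].append(key)

    # Hypothetical stand-in for the per-reducer sort.
    return [sorted(part) for part in partitions]


if __name__ == "__main__":
    data = [random.randint(0, 1000) for _ in range(50)]
    parts = global_sort(data, num_partitions=4)
    # Reading the partitions in order yields the globally sorted data.
    merged = [key for part in parts for key in part]
    print(merged == sorted(data))
```

Reading the partitions from first to last gives the data in globally sorted
order without any final merge step. In Hadoop itself, I believe the
corresponding pieces are InputSampler and TotalOrderPartitioner, but check the
docs for your version.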

If your keys are already what you want to sort by, then you don't even need a
mapper (just use the default identity map).

--
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel:  +39 0709250452
