Also, if you want something that is fairly fast and takes a lot less dev work to get going, you might want to look at Pig. It can do a distributed ORDER BY that is fairly good.
--Bobby Evans

On 5/26/11 2:45 AM, "Luca Pireddu" <pire...@crs4.it> wrote:

On May 25, 2011 22:15:50 Mark question wrote:
> I'm using SequenceFileInputFormat, but then what to write in my mappers?
>
> each mapper is taking a split from the SequenceInputFile then sort its
> split ?! I don't want that..
>
> Thanks,
> Mark
>
> On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu <pire...@crs4.it> wrote:
> > On May 25, 2011 01:43:22 Mark question wrote:
> > > Thanks Luca, but what other way to sort a directory of sequence files?
> > >
> > > I don't plan to write a sorting algorithm in mappers/reducers, but
> > > hoping to use the sequenceFile.sorter instead.
> > >
> > > Any ideas?
> > >
> > > Mark

If you want to achieve a global sort, then look at how TeraSort does it:

http://sortbenchmark.org/YahooHadoop.pdf

The idea is to partition the data so that all keys in part[i] are < all keys in part[i+1]. Each partition is individually sorted, so to read the data in globally sorted order you simply have to traverse it starting from the first partition and working your way to the last one.

If your keys are already what you want to sort by, then you don't even need a mapper (just use the default identity map).

--
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel: +39 0709250452
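
To make the partitioning idea concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and not how TeraSort is actually implemented; it only illustrates the invariant that every key in part[i] is < every key in part[i+1]. The function names (choose_split_points, partition_index, range_partition_and_sort) and the sample_size and seed parameters are made up for this illustration. In a real job the per-partition sort would be done by the framework's shuffle on each reducer.

```python
import bisect
import random


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys from a sample of the input keys."""
    ordered = sorted(sample)
    if not ordered:
        return []
    # Take evenly spaced keys from the sorted sample as partition boundaries.
    return [ordered[(len(ordered) * i) // num_partitions]
            for i in range(1, num_partitions)]


def partition_index(key, split_points):
    """Map a key to its partition: keys below split_points[i] land in part[0..i]."""
    return bisect.bisect_right(split_points, key)


def range_partition_and_sort(keys, num_partitions, sample_size=100, seed=0):
    """Range-partition keys, then sort each partition independently."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = rng.sample(keys, min(sample_size, len(keys)))
    split_points = choose_split_points(sample, num_partitions)

    parts = [[] for _ in range(num_partitions)]
    for key in keys:
        parts[partition_index(key, split_points)].append(key)

    # Stand-in for the per-reducer sort: each partition is sorted on its own.
    return [sorted(part) for part in parts]
```

Reading the partitions back in order, for example with [k for part in parts for k in part], gives the keys in globally sorted order without any merge step. How evenly the partitions fill up depends on how well the sample represents the key distribution.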