Take a look at Sort.java in hadoop examples code. You should be able to use it for your case with minimal coding effort.
Typically when merging 2 sorted data-sets in hadoop you load the smaller one in the memory of each map-task and do a merge in the map() API. The job is setup (by the user) to run as map-only and is very efficient since it does not require copy/shuffle/reduce by the framework. Obviously this has the precondition that smaller one should fit in memory and also requires coding effort on your part :-) You have the option of using SequenceFileFormat for storing your data which is a compressed binary format and takes less storage space. But when using your data outside hadoop you will have to do the necessary SequenceFile to Text conversion. Necessary classes for doing so are available in hadoop but still that's more work. Regards -...@nkur ----- Original Message ----- From: "Erik Forsberg" <forsb...@opera.com> To: common-user@hadoop.apache.org Sent: Thursday, September 3, 2009 1:24:02 AM GMT +05:30 Chennai, Kolkata, Mumbai, New Delhi Subject: Efficient sort -u + merge, in Hadoop M/R? Hi! Let's assume an example use case where Apache's mod_usertrack generates randomly selected user id's that are stored in a cookie and written to the log file. I want to keep track of the number of daily, weekly, monthly, quarterly, yearly users, as well as the total number of users since the application was launched. I can do this rather fast on the Linux commandline by creating a file for each day listing the unique UIDs, one per line, then use sort -u to get the list of unique such users. To get a weekly count, I can create a list of weekly users, then merge in each day's users with "sort -u -m", which works well if both input files are already sorted. This of course only works up to a rather small amount of data before the runtime becomes a problem. Now I want to do this using Hadoop and Map/Reduce. I'm sure this problem have been solved before, and I'm now looking for hints from experienced people. Is there perhaps already freely available java code that can do the sorting/merging for me? Should I use some trick to take advantage of the fact that the weekly/monthly/etc files are already sorted? Should I store the weekly/monthly/etc files in some hadoop:ish format for better performance instead of keeping the textoutputformat? Thanks, \EF -- Erik Forsberg <forsb...@opera.com> Developer, Opera Software, http://www.opera.com/