Take a look at Sort.java in the Hadoop examples code. You should be able to use
it for your case with minimal coding effort.

Typically, when merging two sorted data-sets in Hadoop, you load the smaller one
into the memory of each map task and do the merge in the map() API. The job is
set up (by the user) to run as map-only and is very efficient, since it does not
require copy/shuffle/reduce by the framework. Obviously this has the
precondition that the smaller one fits in memory, and it also requires coding
effort on your part :-)
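
To make the idea concrete, here is a rough sketch of the merge logic in plain Java. It deliberately uses no Hadoop classes, so it is not the actual Mapper API: the class name MapSideMerge and the methods map(), close() and output() are hypothetical stand-ins for a mapper's per-record call, its cleanup hook and its output collector. It also assumes a single task sees the whole larger data-set in sorted order; with several map tasks, each one would have to restrict the in-memory set to the key range of its own split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Plain-Java sketch of the merge a map task could perform.
 * All names are hypothetical; no Hadoop classes are involved.
 */
public class MapSideMerge {
    private final List<String> small;                    // smaller sorted data-set, held in memory
    private final List<String> out = new ArrayList<>();  // stands in for the output collector
    private int pos = 0;                                  // next unread entry of the small set
    private String last = null;                           // last key emitted, used to drop duplicates

    public MapSideMerge(List<String> sortedSmall) {
        this.small = sortedSmall;
    }

    /** Called once per record of the larger sorted data-set, in order. */
    public void map(String key) {
        // First emit in-memory entries that sort before or equal to this record.
        while (pos < small.size() && small.get(pos).compareTo(key) <= 0) {
            emit(small.get(pos++));
        }
        emit(key);
    }

    /** Called after the last record: flush what is left of the in-memory set. */
    public void close() {
        while (pos < small.size()) {
            emit(small.get(pos++));
        }
    }

    /** Emit a key unless it equals the previously emitted one (the "-u" part). */
    private void emit(String key) {
        if (!key.equals(last)) {
            out.add(key);
            last = key;
        }
    }

    public List<String> output() {
        return out;
    }

    public static void main(String[] args) {
        MapSideMerge m = new MapSideMerge(Arrays.asList("b", "c", "x"));
        for (String k : Arrays.asList("a", "b", "d")) {
            m.map(k);
        }
        m.close();
        System.out.println(m.output()); // prints [a, b, c, d, x]
    }
}
```

The output stays sorted and duplicate-free, which is roughly what "sort -u -m" gives you on two already sorted files.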

You have the option of storing your data in the SequenceFile format, which is a
binary format with optional compression and usually takes less storage space.
But when using your data outside Hadoop you will have to do the necessary
SequenceFile-to-text conversion. The classes needed for that are available in
Hadoop, but it is still more work.

Regards
-...@nkur

----- Original Message -----
From: "Erik Forsberg" <forsb...@opera.com>
To: common-user@hadoop.apache.org
Sent: Thursday, September 3, 2009 1:24:02 AM GMT +05:30 Chennai, Kolkata, 
Mumbai, New Delhi
Subject: Efficient sort -u + merge, in Hadoop M/R?

Hi!

Let's assume an example use case where Apache's mod_usertrack 
generates randomly selected user IDs that are stored in a cookie and
written to the log file.

I want to keep track of the number of daily, weekly, monthly,
quarterly, yearly users, as well as the total number of users since the
application was launched. 

I can do this rather fast on the Linux commandline by creating a file
for each day listing the unique UIDs, one per line, then use sort -u to
get the list of unique such users. To get a weekly count, I can create
a list of weekly users, then merge in each day's users with "sort -u
-m", which works well if both input files are already sorted.

This of course only works up to a rather small amount of data
before the runtime becomes a problem. Now I want to do this using
Hadoop and Map/Reduce. I'm sure this problem has been solved before,
and I'm now looking for hints from experienced people.

Is there perhaps already freely available java code that can do the
sorting/merging for me? 

Should I use some trick to take advantage of the fact that the
weekly/monthly/etc files are already sorted? 

Should I store the weekly/monthly/etc files in some hadoop:ish format
for better performance instead of keeping the TextOutputFormat?

Thanks,
\EF
-- 
Erik Forsberg <forsb...@opera.com>
Developer, Opera Software, http://www.opera.com/
