Hi!

Let's assume an example use case where Apache's mod_usertrack
generates random user IDs that are stored in a cookie and
written to the log file.

I want to keep track of the number of daily, weekly, monthly,
quarterly, and yearly users, as well as the total number of users
since the application was launched.

I can do this rather quickly on the Linux command line by creating a
file for each day that lists the UIDs seen that day, one per line, and
running it through sort -u to get the unique users. To get a weekly
count, I keep a list of weekly users and merge in each day's users
with "sort -u -m", which works well as long as both input files are
already sorted.
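
For concreteness, here is a minimal sketch of that daily/weekly
procedure. The file names and sample UIDs are made up for
illustration, and I'm assuming the UIDs have already been extracted
from the log, one per line. LC_ALL=C is set so that every sort run uses
the same collation, which "sort -m" relies on.

```shell
#!/bin/sh
# Sketch only: file names and sample UIDs below are hypothetical.
export LC_ALL=C                 # same byte-wise ordering for every sort
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up raw UIDs for two days, with repeats, as pulled from the logs.
printf 'u3\nu1\nu3\nu2\n' > day1.raw
printf 'u4\nu2\nu4\n'     > day2.raw

# Daily step: one sorted file of unique UIDs per day.
sort -u day1.raw > day1.uids
sort -u day2.raw > day2.uids

# Weekly step: merge each sorted daily file into the sorted weekly
# file. -m only merges (no full re-sort); -u drops the duplicates.
: > week.uids
for day in day1.uids day2.uids; do
    sort -u -m week.uids "$day" > week.tmp
    mv week.tmp week.uids
done

# The number of users is simply the number of lines.
weekly_users=$(wc -l < week.uids | tr -d ' ')
echo "weekly users: $weekly_users"
```

The monthly, quarterly, yearly and total lists would be maintained the
same way, by merging the shorter-period lists into the longer ones.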

This of course only works up to a rather small amount of data
before the runtime becomes a problem. Now I want to do this using
Hadoop and Map/Reduce. I'm sure this problem has been solved before,
and I'm looking for hints from experienced people.

Is there perhaps freely available Java code that can already do the
sorting/merging for me?

Should I use some trick to take advantage of the fact that the
weekly/monthly/etc files are already sorted? 

Should I store the weekly/monthly/etc files in some Hadoop-specific
format for better performance instead of keeping TextOutputFormat?

Thanks,
\EF
-- 
Erik Forsberg <forsb...@opera.com>
Developer, Opera Software, http://www.opera.com/
