Hi! Let's assume an example use case where Apache's mod_usertrack generates randomly selected user IDs that are stored in a cookie and written to the log file.
I want to keep track of the number of daily, weekly, monthly, quarterly, and yearly users, as well as the total number of users since the application was launched.

I can do this fairly quickly on the Linux command line by creating a file for each day that lists the UIDs seen that day, one per line, and running sort -u on it to get the list of unique users. To get a weekly count, I keep a list of weekly users and merge in each day's users with "sort -u -m", which works well as long as both input files are already sorted. This of course only works up to a rather small amount of data before the runtime becomes a problem.

Now I want to do this using Hadoop and Map/Reduce. I'm sure this problem has been solved before, so I'm looking for hints from experienced people:

Is there perhaps already freely available Java code that can do the sorting/merging for me?

Should I use some trick to take advantage of the fact that the weekly/monthly/etc. files are already sorted?

Should I store the weekly/monthly/etc. files in some Hadoop-specific format for better performance, instead of keeping the TextOutputFormat output?

Thanks,
\EF
--
Erik Forsberg <forsb...@opera.com>
Developer, Opera Software, http://www.opera.com/
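
P.S. For reference, the merge step described above ("sort -u -m" on two already-sorted files) could be sketched in Python roughly as follows. This is only an illustration of the idea, not the actual scripts in use: the function name merge_unique and the sample UIDs are made up, and it assumes every input is already sorted in the same order that Python uses to compare strings (for sort, that would mean running it with LC_ALL=C).

```python
import heapq
from itertools import groupby


def merge_unique(*sorted_inputs):
    """Merge already-sorted iterables of UIDs into one sorted,
    duplicate-free stream, without re-sorting the data.

    heapq.merge walks all inputs in parallel, so only one pending
    item per input is held in memory at a time.
    """
    merged = heapq.merge(*sorted_inputs)
    # Equal UIDs are adjacent in the merged stream; keep one of each.
    for uid, _group in groupby(merged):
        yield uid


# Hypothetical sample data: the weekly list so far, plus one day's users.
weekly = ["uid-a", "uid-c", "uid-f"]
monday = ["uid-b", "uid-c", "uid-g"]

new_weekly = list(merge_unique(weekly, monday))
print(len(new_weekly))  # number of unique users for the week so far
```

The same function works with open file objects instead of lists, since files iterate line by line. The cost is linear in the total number of lines, which is why the pre-sorted inputs matter.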