This is a common web log analysis situation. The raw web logs are written every hour on multiple servers. We would like the parsed log results to be saved as one file per hour. How can this be done?
In our MR job, the input is a directory containing many files that span many hours, let's say 4X files covering X hours. If there are, e.g., 10 reducers, then all of the results are partitioned into 10 files, each of which contains results from every hour. We would like the results to be saved in X files, each of which contains the results for exactly one hour. Since the set of input files can change from run to run, I can't simply hard-code the number of reducers to X in the program.
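To make the goal concrete, here is a minimal sketch in plain Java, with no Hadoop dependency, that contrasts the two placements. It assumes each record starts with a timestamp like `2013-05-01 14:23:11`. The class name, the `hourOf` helper, the `hour-` file prefix, and the sample records are all hypothetical. `hashPartition` mirrors the hash-modulo placement that Hadoop's default `HashPartitioner` uses, while `groupByHour` derives the output name from the data, so the number of files follows the hours present rather than a fixed reducer count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class HourlyOutputSketch {

    // Hypothetical helper: assumes the record starts with "yyyy-MM-dd HH:mm:ss"
    // and returns the hour bucket as "yyyy-MM-dd-HH".
    public static String hourOf(String record) {
        return record.substring(0, 10) + "-" + record.substring(11, 13);
    }

    // Hash-modulo placement: the target file depends on the key's hash and the
    // reducer count, not on the hour the record belongs to.
    public static int hashPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Desired placement: one logical output per hour, named after the hour,
    // so X distinct hours in the input yield X outputs.
    public static Map<String, List<String>> groupByHour(List<String> records) {
        Map<String, List<String>> files = new TreeMap<>();
        for (String record : records) {
            String name = "hour-" + hourOf(record);
            files.computeIfAbsent(name, k -> new ArrayList<>()).add(record);
        }
        return files;
    }

    public static void main(String[] args) {
        // Made-up sample records, for illustration only.
        List<String> records = Arrays.asList(
            "2013-05-01 14:23:11 GET /a",
            "2013-05-01 14:59:59 GET /b",
            "2013-05-01 15:00:00 GET /c");

        // With a fixed reducer count, each record lands wherever its hash says.
        for (String record : records) {
            System.out.println("part-" + hashPartition(record, 10) + " <- " + record);
        }

        // Grouped by hour, the output count follows the data.
        groupByHour(records).forEach((name, lines) ->
            System.out.println(name + " <- " + lines.size() + " record(s)"));
    }
}
```

In an actual job, a facility such as Hadoop's `MultipleOutputs` might play the role of the map in `groupByHour`, but that wiring is not shown here and would need to be checked against the Hadoop version in use.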