Hi, I have a setup where logs are periodically bundled up and dumped into Hadoop DFS as large sequence files.
It works fine for all my MapReduce jobs. Now I need to handle ad hoc queries that pull out logs by user and time range. I really don't need a full indexer (like Lucene) for this purpose. My first thought is to run a periodic MapReduce job to generate a large text file sorted by user ID. Each entry in the text file will hold a (sequence file name, offset) pair used to retrieve the logs. I am guessing many of you have run into similar requirements. Any suggestions on doing this better?
ishwar
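
P.S. To make the idea concrete, here is a rough sketch of the lookup side only. It assumes each index line is tab-separated as user ID, integer timestamp, sequence file name, offset, and that the index fits in memory. That line layout, the sample values, and the names `load_index` and `lookup` are made up for illustration; the post above only fixes the (sequence file name, offset) part.

```python
import bisect

# Hypothetical index lines, as the periodic MapReduce job might emit them:
# user_id <TAB> timestamp <TAB> sequence_file_name <TAB> offset
INDEX_LINES = [
    "bob\t1500\tlogs-0001.seq\t2048",
    "alice\t1000\tlogs-0001.seq\t0",
    "alice\t3000\tlogs-0002.seq\t128",
    "alice\t2000\tlogs-0001.seq\t4096",
    "bob\t2500\tlogs-0002.seq\t0",
]


def load_index(lines):
    """Parse index lines into tuples sorted by (user_id, timestamp)."""
    entries = []
    for line in lines:
        user, ts, seq_file, offset = line.rstrip("\n").split("\t")
        entries.append((user, int(ts), seq_file, int(offset)))
    entries.sort()  # a no-op if the job already wrote the file sorted
    return entries


def lookup(entries, user, start_ts, end_ts):
    """Return (sequence_file_name, offset) pairs for one user,
    with start_ts <= timestamp <= end_ts, in timestamp order."""
    # A 2-tuple prefix sorts before any 4-tuple that starts with it,
    # so these two bisects bracket exactly the wanted range.
    lo = bisect.bisect_left(entries, (user, start_ts))
    hi = bisect.bisect_left(entries, (user, end_ts + 1))
    return [(seq_file, offset) for _, _, seq_file, offset in entries[lo:hi]]


entries = load_index(INDEX_LINES)
hits = lookup(entries, "alice", 1500, 3000)
```

Each returned pair would then be used to open the named sequence file and position a reader at the stored offset. Sorting on timestamp as a second key is an addition to the plan above, there so that the time range can be found with a binary search as well.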