I've been reading up on Hadoop for a while now and I'm excited that I'm finally getting my feet wet with the examples + my own variations. If anyone could answer any of the following questions, I'd greatly appreciate it.
1. I'm processing document collections, with the number of documents ranging from 10,000 - 10,000,000. What is the best way to store this data for effective processing? - The bodies of the documents usually range from 1K-100KB in size, but some outliers can be as big as 4-5GB. - I will also need to store some metadata for each document which I figure could be stored as JSON or XML. - I'll typically filter on the metadata and then doing standard operations on the bodies, like word frequency and searching. Is there a canned FileInputFormat that makes sense? Should I roll my own? How can I access the bodies as streams so I don't have to read them into RAM all at once? Am I right in thinking that I should treat each document as a record and map across them, or do I need to be more creative in what I'm mapping across? 2. Some of the tasks I want to run are pure map operations (no reduction), where I'm calculating new metadata fields on each document. To end up with a good result set, I'll need to copy the entire input record + new fields into another set of output files. Is there a better way? I haven't wanted to go down the HBase road because it can't handle very large values (for the bodies) and it seems to make the most sense to keep the document bodies together with the metadata, to allow for the greatest locality of reference on the datanodes. 3. I'm sure this is not a new idea, but I haven't seen anything regarding it... I'll need to run several MR jobs as a pipeline... is there any way for the map tasks in a subsequent stage to begin processing data from previous stage's reduce task before that reducer has fully finished? Whatever insight folks could lend me would be a big help in crossing the chasm from the Word Count and associated examples to something more "real". A whole heap of thanks in advance, John