Hello, I have been posting on the forums for a couple of weeks now, and I really appreciate all the help that I've been receiving. I am fairly new to Java, and even newer to the Hadoop framework. While I am suitably impressed with Hadoop, quite a bit of the underlying functionality is hidden from the user (which, while I understand is the point of a MapReduce framework, can be a touch frustrating for someone still learning their way around), and the documentation is sometimes difficult to navigate. I have thus far been unable to find a satisfactory answer to this question on my own.
My goal is to implement a fairly simple map-reduce algorithm. My question is: "Is Hadoop really the right framework to use for this algorithm?"

I have one very large file containing multiple lines of text. I want to assign a mapper job to each line, and each mapper needs to know which line it is processing. Thinking about this in terms of the Word Count example, suppose we modify it so that we want to see where each word came from, rather than just the count of the words. For this example, we have the file:

Hello World
Hello Hadoop
Goodbye Hadoop

I want to assign a mapper to each line. Each mapper will emit a word and its corresponding line number. For this example, we would have three mappers (call them m1, m2, and m3), which emit the following:

m1 emits: <"Hello", 1> <"World", 1>
m2 emits: <"Hello", 2> <"Hadoop", 2>
m3 emits: <"Goodbye", 3> <"Hadoop", 3>

My reduce function will then count each word based on the -instances- of line numbers it arrives with, which is necessary because I wish to use the line numbers for another purpose as well. (A rough Java sketch of the mapper and reducer I have in mind is in the P.S. below.)

I have tried Hadoop Pipes and the Hadoop Python interface. I am now looking at the Java interface, and I am still puzzled as to how to implement this, mainly because I don't see how to assign mappers to lines of a file, rather than to files themselves. From what I can see in the documentation, Hadoop seems to be more suited to applications that deal with multiple files rather than multiple lines. I want it to be able to spawn, for any input file, a number of mappers corresponding to the number of lines. There could be a cap on the number of mappers spawned (e.g. 128), so that if the number of lines exceeds the number of mappers, the mappers can concurrently process lines until all lines are exhausted. I can't see a straightforward way to do this using the Hadoop framework.

Please keep in mind that I cannot put each line in its own separate file; the number of lines in my file is large enough that this is really not a good idea.

Given this information, is Hadoop really the right framework to use? If not, could you please suggest alternative frameworks? I am currently looking at Skynet and Erlang, though I am not too familiar with either. I would appreciate any feedback.

Thank you for your time.

Sincerely,
-SM
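P.S. For concreteness, here is a rough sketch (in Java, against the org.apache.hadoop.mapreduce API) of the kind of mapper and reducer I have in mind. It is only an illustration of the algorithm, not working code for my actual problem: the class names are made up, and I am using the byte offset that TextInputFormat passes to the mapper as a stand-in for a per-line identifier, since it is not a true 1-based line number.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LineTaggedWordCount {

    // Emits <word, lineId> for every word on a line. With TextInputFormat
    // the input key is the byte offset of the line, which I am treating
    // here as a unique per-line identifier rather than a real line number.
    public static class LineTagMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {

        private final Text word = new Text();

        @Override
        protected void map(LongWritable lineId, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, lineId);  // tag the word with its line
            }
        }
    }

    // Counts each word by the number of distinct line identifiers it was
    // tagged with, so the line information stays available for other uses.
    public static class DistinctLineReducer
            extends Reducer<Text, LongWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text word, Iterable<LongWritable> lineIds,
                Context context) throws IOException, InterruptedException {
            Set<Long> distinct = new HashSet<Long>();
            for (LongWritable id : lineIds) {
                distinct.add(id.get());
            }
            context.write(word, new IntWritable(distinct.size()));
        }
    }
}

Whether I can get something like this to run one mapper per line (or at least give each mapper the real line number) is exactly the part I am unsure about.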