Hello,

I have been posting on the forums for a couple of weeks now, and I really
appreciate all the help that I've been receiving. I am fairly new to Java,
and even newer to the Hadoop framework. While I am suitably impressed with
Hadoop, quite a bit of the underlying functionality is hidden from the user
(which, while I understand is the point of a MapReduce framework, can be a
touch frustrating for someone who is still trying to learn their way
around), and the documentation is sometimes difficult to navigate. So far I
have been unable to find a satisfactory answer to this question on my own.

My goal is to implement a fairly simple MapReduce algorithm. My question
is, "Is Hadoop really the right framework to use for this algorithm?"

I have one very large file containing multiple lines of text. I want to
assign a mapper job to each line, and the mapper needs to know which line
it is processing. Think of it as a modification of the Word Count example:
instead of just counting the words, I want to see which lines the words
came from.


For this example, we have the file:

Hello World
Hello Hadoop
Goodbye Hadoop


I want to assign a mapper to each line. Each mapper will emit a word and its
corresponding line number. For this example, we would have three mappers
(call them m1, m2, and m3), each emitting the following:

m1 emits:
<"Hello", 1> <"World", 1>

m2 emits:
<"Hello", 2> <"Hadoop", 2>

m3 emits:
<"Goodbye",3> <"Hadoop", 3>


My reduce function will count each word based on the -instances- of line
numbers it was emitted with; this matters because I wish to use the line
numbers themselves for another purpose.
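
The reducer would be along these lines (again only a sketch, with my own
placeholder names); it counts the occurrences of each word while holding on
to the line numbers so they can be used later:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For each word, count how many <word, lineNumber> pairs were emitted,
// while also collecting the line numbers themselves for later use.
public class LineWordReducer
        extends Reducer<Text, LongWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<LongWritable> lineNumbers,
            Context context) throws IOException, InterruptedException {
        int count = 0;
        Set<Long> lines = new HashSet<Long>(); // the "other purpose" would use these
        for (LongWritable line : lineNumbers) {
            count++;
            lines.add(line.get());
        }
        context.write(word, new IntWritable(count));
    }
}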


I have tried Hadoop Pipes and the Hadoop Python interface. I am now looking
at the Java interface, and am still puzzled about how to implement this,
mainly because I don't see how to assign mappers to lines of files rather
than to the files themselves. From what I can see in the documentation,
Hadoop seems more suited to applications that deal with multiple files
rather than with multiple lines. For any given input file, I want it to
spawn a number of mappers corresponding to the number of lines. There could
be a cap on the number of mappers spawned (e.g. 128), so that if the number
of lines exceeds the number of mappers, the mappers concurrently process
lines until all lines are exhausted. I can't see a straightforward way to do
this using the Hadoop framework.
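
For reference, this is roughly how far I have gotten with the driver. It is
a plain word-count-style job setup (class names and paths are placeholders),
and I do not see where the "one mapper per line, capped at 128" behaviour
would be expressed; as far as I can tell, the splits come from the input
format and are tied to the file's blocks rather than to its lines.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver for the mapper/reducer sketched above. Nothing here says
// "one mapper per line"; the number of map tasks is decided by the
// input format's splits, not by me.
public class LineWordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "line word count");
        job.setJarByClass(LineWordCount.class);

        job.setMapperClass(LineWordMapper.class);
        job.setReducerClass(LineWordReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}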

Please keep in mind that I cannot put each line in its own separate file;
the number of lines in my file is large enough that this is really not a
good idea.


Given this information, is Hadoop really the right framework to use? If not,
could you please suggest alternative frameworks? I am currently looking at
Skynet and Erlang, though I am not too familiar with either.

I would appreciate any feedback. Thank you for your time.

Sincerely,

-SM
