Re: Is Hadoop Really the right framework for me?

Sandy Thu, 10 Jul 2008 15:46:06 -0700

Thanks for the responses.

Lohit and Mahadev: this sounds fantastic; however, where may I get Hadoop
0.18? I went to http://hadoop.apache.org/core/releases.html

but did not see a link for Hadoop 0.18. After a brief search on Google, it
does not seem that Hadoop 0.18 has been officially released yet. If this is
indeed the case, when is the release scheduled? In the meantime, could you
please point me to where I can acquire it? Or is it a better idea for me to
wait for the release?

Thank you kindly.

-SM

On Thu, Jul 10, 2008 at 5:18 PM, lohit <[EMAIL PROTECTED]> wrote:

> Hello Sandy,
>
> If you are using Hadoop 0.18, you can use the NLineInputFormat input format
> to get your job done. What it does is give exactly one line to each mapper.
> In your mapper you might have to encode your keys as something like
> <word:linenumber>
> So the output from your mapper would be a key/value pair such as
> <word:linenumber>,1
> The reducer would sum up all word:linenumber keys, and in your reduce
> function you would have to extract the word, the line number, and its count.
> The delimiter ':' should not be part of your word, though.
>
> You might want to take a look at the example usage of NLineInputFormat from
> this test src/test/org/apache/hadoop/mapred/lib/TestLineInputFormat.java
>
> HTH,
> Lohit
> ----- Original Message ----
> From: Sandy <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, July 10, 2008 2:47:21 PM
> Subject: Is Hadoop Really the right framework for me?
>
> Hello,
>
> I have been posting on the forums for a couple of weeks now, and I really
> appreciate all the help that I've been receiving. I am fairly new to Java,
> and even newer to the Hadoop framework. While I am sufficiently impressed
> with Hadoop, quite a bit of the underlying functionality is hidden from
> the user (which, while I understand is the point of a Map Reduce framework,
> can be a touch frustrating for someone who is still trying to learn their
> way around), and the documentation is sometimes difficult to navigate. I
> have thus far been unable to find a satisfactory answer to this question on
> my own.
>
> My goal is to implement a fairly simple map reduce algorithm. My question
> is, "Is Hadoop really the right framework to use for this algorithm?"
>
> I have one very large file containing multiple lines of text. I want to
> assign a mapper job to each line. Furthermore, the mapper needs to be able
> to know what line it is processing. If we were thinking about this in terms
> of the Word Count example, let's say we have a modification where we want
> to see where the words came from, rather than just the count of the words.
>
>
> For this example, we have the file:
>
> Hello World
> Hello Hadoop
> Goodbye Hadoop
>
>
> I want to assign a mapper to each line. Each mapper will emit a word and
> its corresponding line number. For this example, we would have three
> mappers (call them m1, m2, and m3). Each mapper will emit the following:
>
> m1 emits:
> <"Hello", 1> <"World", 1>
>
> m2 emits:
> <"Hello", 2> <"Hadoop", 2>
>
> m3 emits:
> <"Goodbye", 3> <"Hadoop", 3>
>
>
> My reduce function will count the number of words based on the -instances-
> of line numbers they have, which is necessary, because I wish to use the
> line numbers for another purpose.
>
>
> I have tried Hadoop Pipes and the Hadoop Python interface. I am now
> looking at the Java interface, and am still puzzled about quite how to
> implement this, mainly because I don't see how to assign mappers to lines
> of files, rather than to files themselves. From what I can see in the
> documentation, Hadoop seems to be more suitable for applications that deal
> with multiple files rather than multiple lines. I want it to be able to
> spawn, for any input file, a number of mappers corresponding to the number
> of lines. There can be a cap on the number of mappers spawned (e.g. 128) so
> that if the number of lines exceeds the number of mappers, the mappers can
> concurrently process lines until all lines are exhausted. I can't see a
> straightforward way to do this using the Hadoop framework.
>
> Please keep in mind that I cannot put each line in its own separate file;
> the number of lines in my file is sufficiently large that this is really
> not a good idea.
>
>
> Given this information, is Hadoop really the right framework to use? If
> not, could you please suggest alternative frameworks? I am currently
> looking at Skynet and Erlang, though I am not too familiar with either.
>
> I would appreciate any feedback. Thank you for your time.
>
> Sincerely,
>
> -SM
>
>
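
For reference, the word:linenumber scheme lohit describes above can be tried
out without a cluster. What follows is only a minimal sketch in plain Java,
not Hadoop code: the class name LineWordSketch and its methods are made up
for illustration, no Hadoop classes are used, and it assumes the line number
of each line is already known when the "map" step runs. How the line number
reaches a real mapper is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, plain-Java simulation of the word:linenumber scheme.
// No Hadoop classes are involved; the names here are illustrative only.
class LineWordSketch {

    // "Map" step: called once per input line. Emits one "word:linenumber"
    // key per word; the implied value for each key is 1.
    static List<String> map(int lineNumber, String line) {
        List<String> keys = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                keys.add(word + ":" + lineNumber);
            }
        }
        return keys;
    }

    // "Reduce" step: sums the 1s emitted for each distinct key.
    static Map<String, Integer> reduce(List<String> emittedKeys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : emittedKeys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Recover the word from a key by splitting on the last ':'.
    static String wordOf(String key) {
        return key.substring(0, key.lastIndexOf(':'));
    }

    // Recover the line number from a key by splitting on the last ':'.
    static int lineOf(String key) {
        return Integer.parseInt(key.substring(key.lastIndexOf(':') + 1));
    }

    public static void main(String[] args) {
        // The three-line example file from the original question.
        String[] lines = {"Hello World", "Hello Hadoop", "Goodbye Hadoop"};

        List<String> emitted = new ArrayList<>();
        for (int i = 0; i < lines.length; i++) {
            emitted.addAll(map(i + 1, lines[i])); // line numbers start at 1
        }

        // Print word, line number, and count, separated by tabs.
        for (Map.Entry<String, Integer> e : reduce(emitted).entrySet()) {
            System.out.println(wordOf(e.getKey()) + "\t"
                    + lineOf(e.getKey()) + "\t" + e.getValue());
        }
    }
}
```

On the three-line example every word/line pair comes out with a count of 1,
since no word repeats within a line. Splitting on the last ':' is a design
choice of this sketch: because the line number is purely numeric, a word that
itself contains ':' would still be recovered intact.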
