I think src/mapred/org/apache/hadoop/mapred/lib/NLineInputFormat.java is what you want.
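With the old mapred API you'd point your job at it with something like this (a rough sketch -- the config key and setup are from memory, so double-check them against your Hadoop version; MyJob is just a placeholder for your job class):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;

    JobConf conf = new JobConf(MyJob.class);        // MyJob is a placeholder
    conf.setInputFormat(NLineInputFormat.class);
    // each split (and therefore each map task) gets exactly this many lines
    conf.setInt("mapred.line.input.format.linespermap", 1);

The keys handed to the mapper are still byte offsets into the file rather than line numbers -- there's a mapper sketch at the bottom of this message that works with that.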
Mahadev

> -----Original Message-----
> From: Michael Bieniosek [mailto:[EMAIL PROTECTED]
> Sent: Thursday, July 10, 2008 3:09 PM
> To: core-user@hadoop.apache.org; Sandy
> Subject: Re: Is Hadoop Really the right framework for me?
>
> My understanding is that Hadoop doesn't know where the line breaks are when it divides up your file, so each mapper will get some equally-sized chunk of file containing some number of lines. It then does some patching so that you get only whole lines for each mapper, but this does mean that 1) you can't guarantee that each map task will contain exactly one line (though you can set the number of mappers high enough so that all mappers get zero or one lines), and 2) you can't get the line numbers back.
>
> -Michael
>
> On 7/10/08 2:47 PM, "Sandy" <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> I have been posting on the forums for a couple of weeks now, and I really appreciate all the help that I've been receiving. I am fairly new to Java, and even newer to the Hadoop framework. While I am quite impressed with Hadoop, much of the underlying functionality is masked from the user (which, while I understand is the point of a MapReduce framework, can be a touch frustrating for someone who is still learning their way around), and the documentation is sometimes difficult to navigate. I have thus far been unable to find an answer to this question on my own.
>
> My goal is to implement a fairly simple map reduce algorithm. My question is, "Is Hadoop really the right framework to use for this algorithm?"
>
> I have one very large file containing multiple lines of text. I want to assign a mapper job to each line. Furthermore, the mapper needs to know which line it is processing. If we were thinking about this in terms of the Word Count example, let's say we have a modification where we want to see where the words came from, rather than just the count of the words.
>
> For this example, we have the file:
>
> Hello World
> Hello Hadoop
> Goodbye Hadoop
>
> I want to assign a mapper to each line. Each mapper will emit a word and its corresponding line number. For this example, we would have three mappers (call them m1, m2, and m3). Each mapper will emit the following:
>
> m1 emits:
> <"Hello", 1> <"World", 1>
>
> m2 emits:
> <"Hello", 2> <"Hadoop", 2>
>
> m3 emits:
> <"Goodbye", 3> <"Hadoop", 3>
>
> My reduce function will count the number of words based on the -instances- of line numbers they have, which is necessary, because I wish to use the line numbers for another purpose.
>
> I have tried Hadoop Pipes and the Hadoop Python interface. I am now looking at the Java interface, and am still puzzled about how to implement this, mainly because I don't see how to assign mappers to lines of files, rather than to files themselves. From what I can see in the documentation, Hadoop seems to be more suitable for applications that deal with multiple files rather than multiple lines. I want it to be able to spawn, for any input file, a number of mappers corresponding to the number of lines. There can be a cap on the number of mappers spawned (e.g. 128) so that if the number of lines exceeds the number of mappers, the mappers can concurrently process lines until all lines are exhausted. I can't see a straightforward way to do this using the Hadoop framework.
>
> Please keep in mind that I cannot put each line in its own separate file; the number of lines in my file is sufficiently large that this is really not a good idea.
>
> Given this information, is Hadoop really the right framework to use? If not, could you please suggest alternative frameworks? I am currently looking at Skynet and Erlang, though I am not too familiar with either.
>
> I would appreciate any feedback. Thank you for your time.
>
> Sincerely,
>
> -SM
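Following up on Michael's point that the line numbers themselves aren't available: the key the framework passes to each map call is the byte offset of the line within the file, which is at least unique per line. If that is enough of a "line identifier" for the reducer, a mapper along these lines would do it (a rough sketch against the old mapred API; the class name is just for illustration):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Emits <word, byte offset of the line> pairs. The offset is unique per
    // line, so it can stand in for a line number when the ordinal isn't needed.
    public class LineTagMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private final Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, key);  // key = byte offset of this line in the file
        }
      }
    }

If you really do need ordinal line numbers (1, 2, 3, ...), the usual workaround is a one-off preprocessing pass that prepends the line number to each line before the job runs.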