Many thanks! I am going to take a look. In the meantime, I think there is a workaround.
From what I can tell, at least from running Hadoop locally, the map
function is handed one line of a single input file at a time by default
(as in the WordCount.java example). If one just cares about establishing
the uniqueness of each line, without needing the specific numbering
(line 1, 2, etc. vs. 3332, 34234, 42323), one can simply use the key
passed to the mapper, since it gives the byte offset of the line. The
offset is always increasing, and since each map call is attached to one
line, there is no worry about uniqueness.

Of course, this depends on a map call always being attached to exactly
one line. Since it works (or appears to work) on a local run of Hadoop,
I suspect that a mapper will map to a single line in a distributed run
of Hadoop as well, though all of this is grey-box speculation on my
part. However, considering my limited understanding of how Hadoop
actually works, I wonder if this is a guarantee I can safely make. It is
the "guarantee" that I am curious about: even if it works on files of a
certain size, will it work on files of arbitrarily large size? I would
love to hear the insight of some of the more experienced users on this
matter.

Thanks again,

-SM

On Thu, Jul 10, 2008 at 6:50 PM, lohit <[EMAIL PROTECTED]> wrote:

> It's not released yet. There are two options:
>
> 1. Download the unreleased 0.18 branch from
> http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18
>
>     svn co http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 branch-0.18
>
> 2. Get NLineInputFormat.java from
> http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18/src/mapred/org/apache/hadoop/mapred/lib/NLineInputFormat.java
> and copy it into your .....mapred/lib directory, rebuild everything,
> and try it out. I assume it should work, but I haven't tried it yet.
>
> Thanks,
> Lohit
>
> ----- Original Message ----
> From: Sandy <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, July 10, 2008 3:45:34 PM
> Subject: Re: Is Hadoop Really the right framework for me?
>
> Thanks for the responses.
>
> Lohit and Mahadev: this sounds fantastic; however, where may I get
> Hadoop 0.18? I went to http://hadoop.apache.org/core/releases.html but
> did not see a link for it. After a brief search on Google, it does not
> appear that Hadoop 0.18 has been officially released yet. If this is
> indeed the case, when is the release scheduled? In the meantime, could
> you please point me toward where to acquire it? Or is it a better idea
> for me to wait for the release?
>
> Thank you kindly.
>
> -SM
>
> On Thu, Jul 10, 2008 at 5:18 PM, lohit <[EMAIL PROTECTED]> wrote:
>
> > Hello Sandy,
> >
> > If you are using Hadoop 0.18, you can use the NLineInputFormat input
> > format to get your job done. It hands exactly one line to each
> > mapper. In your mapper you might have to encode your keys as
> > something like <word:linenumber>, so the output from your mapper
> > would be key/value pairs of the form <word:linenumber>, 1. The
> > reducer would sum up the values for each word:linenumber, and in
> > your reduce function you would extract the word, the line number,
> > and its count. The delimiter ':' should not be part of your word,
> > though.
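For illustration, a minimal sketch of the mapper lohit describes,
written against the old org.apache.hadoop.mapred API from the 0.18
branch; the class and field names are only illustrative, not from the
thread. It uses the byte-offset key as the line id, per the workaround
discussed above:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch: NLineInputFormat hands each map call one line, keyed by
    // the line's byte offset in the file. The offset is unique within
    // the file, so it can stand in for a line number when only
    // uniqueness matters.
    public class LineWordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private final Text compositeKey = new Text();

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
          // Encode <word:lineid>; ':' must not occur inside the words.
          compositeKey.set(tokens.nextToken() + ":" + offset.get());
          output.collect(compositeKey, one);
        }
      }
    }

StringTokenizer splits on whitespace here, matching the stock WordCount
example.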
> > You might want to take a look at the example usage of
> > NLineInputFormat in this test:
> > src/test/org/apache/hadoop/mapred/lib/TestLineInputFormat.java
> >
> > HTH,
> > Lohit
> >
> > ----- Original Message ----
> > From: Sandy <[EMAIL PROTECTED]>
> > To: core-user@hadoop.apache.org
> > Sent: Thursday, July 10, 2008 2:47:21 PM
> > Subject: Is Hadoop Really the right framework for me?
> >
> > Hello,
> >
> > I have been posting on the forums for a couple of weeks now, and I
> > really appreciate all the help that I've been receiving. I am fairly
> > new to Java, and even newer to the Hadoop framework. While I am
> > sufficiently impressed with Hadoop, quite a bit of the underlying
> > functionality is masked from the user (which, while I understand is
> > the point of a MapReduce framework, can be a touch frustrating for
> > someone who is still learning their way around), and the
> > documentation is sometimes difficult to navigate. I have thus far
> > been unable to find an answer to this question on my own.
> >
> > My goal is to implement a fairly simple MapReduce algorithm. My
> > question is: "Is Hadoop really the right framework to use for this
> > algorithm?"
> >
> > I have one very large file containing multiple lines of text. I want
> > to assign a mapper to each line. Furthermore, the mapper needs to
> > know which line it is processing. Thinking of this in terms of the
> > word count example, suppose we have a modification where we want to
> > see where the words came from, rather than just their counts.
> >
> > For this example, we have the file:
> >
> > Hello World
> > Hello Hadoop
> > Goodbye Hadoop
> >
> > I want to assign a mapper to each line. Each mapper will emit a word
> > and its corresponding line number. For this example, we would have
> > three mappers (call them m1, m2, and m3), which emit the following:
> >
> > m1 emits: <"Hello", 1> <"World", 1>
> >
> > m2 emits: <"Hello", 2> <"Hadoop", 2>
> >
> > m3 emits: <"Goodbye", 3> <"Hadoop", 3>
> >
> > My reduce function will count the words based on the -instances- of
> > line numbers they have, which is necessary because I wish to use the
> > line numbers for another purpose.
> >
> > I have tried Hadoop Pipes and the Hadoop Python interface. I am now
> > looking at the Java interface, and am still puzzled about how to
> > implement this, mainly because I don't see how to assign mappers to
> > lines of files rather than to files themselves. From what I can see
> > in the documentation, Hadoop seems more suited to applications that
> > deal with multiple files rather than multiple lines. I want it to
> > spawn, for any input file, a number of mappers corresponding to the
> > number of lines. There can be a cap on the number of mappers spawned
> > (e.g. 128), so that if the number of lines exceeds the number of
> > mappers, the mappers can concurrently process lines until all lines
> > are exhausted. I can't see a straightforward way to do this using
> > the Hadoop framework.
> >
> > Please keep in mind that I cannot put each line in its own separate
> > file; the number of lines in my file is large enough that this is
> > really not a good idea.
> >
> > Given this information, is Hadoop really the right framework to use?
> > If not, could you please suggest alternative frameworks?
> > I am currently looking at Skynet and Erlang, though I am not too
> > familiar with either.
> >
> > I would appreciate any feedback. Thank you for your time.
> >
> > Sincerely,
> >
> > -SM
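Under the same assumptions as the mapper sketch above (old mapred API,
illustrative names, words containing no ':'), the reduce side lohit
outlines might look roughly like this: sum the 1s for each
<word:lineid> key, then split the composite key back apart:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch: each key is "word:lineid" and each value is a 1 from the
    // mapper, so the sum is the count of that word on that line.
    public class LineWordReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text compositeKey, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        // Recover the word and line id; safe because ':' is not
        // allowed inside the words themselves.
        String key = compositeKey.toString();
        int sep = key.lastIndexOf(':');
        String word = key.substring(0, sep);
        String lineId = key.substring(sep + 1);
        output.collect(new Text(word + "\t" + lineId),
                       new IntWritable(sum));
      }
    }

The driver would then request one line per map call with something like
conf.setInputFormat(NLineInputFormat.class) on its JobConf, assuming
NLineInputFormat has been copied in and built as described earlier in
the thread.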