Unless you have a gigantic number of items with the same id, this is straightforward. Have a mapper emit items of the form:

key = id, value = type,timestamp

Your reducer will then see all values that share the same id together. It is then a simple matter to process all items with the same id; for example, you could simply read them into a list and work on them in any manner you see fit.

(Note that Hadoop is perfectly fine at dealing with multi-line items. All you need to do is make sure that the items you want to process together all share the same key.)

Miles

2008/7/18 Elia Mazzawi <[EMAIL PROTECTED]>:
> well here is the problem I'm trying to solve,
>
> I have a data set that looks like this:
>
> ID type Timestamp
>
> A1 X 1215647404
> A2 X 1215647405
> A3 X 1215647406
> A1 Y 1215647409
>
> I want to count how many A1 Y, show up within 5 seconds of an A1 X
>
> I was planning to have the data sorted by ID then timestamp,
> then read it backwards, (or have it sorted by reverse timestamp)
>
> go through it caching all Y's for the same ID for 5 seconds to either find
> a matching X or not.
>
> the results don't need to be 100% accurate.
>
> so if hadoop gives the same file with the same lines in order then this
> will work.
>
> seems hadoop is really good at solving problems that depend on 1 line at a
> time? but not multi lines?
>
> hadoop has to get data in order, and be able to work on multi lines,
> otherwise how can it be setting records in data sorts.
>
> I'd appreciate other suggestions to go about doing this.
>
> Jim R. Wilson wrote:
>
>>> does wordcount get the lines in order? or are they random? can i have
>>> hadoop return them in reverse order?
>>
>> You can't really depend on the order that the lines are given - it's
>> best to think of them as random. The purpose of MapReduce/Hadoop is
>> to distribute a problem among a number of cooperating nodes.
>>
>> The idea is that any given line can be interpreted separately,
>> completely independent of any other line. So in wordcount, this makes
>> sense. For example, say you and I are nodes.
>> Each of us gets half the lines in a file and we can count the words we
>> see and report on them - it doesn't matter what order we're given the
>> lines, or which lines we're given, or even whether we get the same
>> number of lines (if you're faster at it, or maybe you get shorter
>> lines, you may get more lines to process in the interest of saving
>> time).
>>
>> So if the project you're working on requires getting the lines in a
>> particular order, then you probably need to rethink your approach. It
>> may be that hadoop isn't right for your problem, or maybe that the
>> problem just needs to be attacked in a different way. Without knowing
>> more about what you're trying to achieve, I can't offer any specifics.
>>
>> Good luck!
>>
>> -- Jim
>>
>> On Thu, Jul 17, 2008 at 4:41 PM, Elia Mazzawi
>> <[EMAIL PROTECTED]> wrote:
>>
>>> I have a program based on wordcount.java
>>> and I have files that are smaller than 64mb files (so i believe each
>>> file is one task )
>>>
>>> does wordcount get the lines in order? or are they random? can i have
>>> hadoop return them in reverse order?
>>>
>>> Jim R. Wilson wrote:
>>>
>>>> It sounds to me like you're talking about hadoop streaming (correct me
>>>> if I'm wrong there). In that case, there's really no "order" to the
>>>> lines being doled out as I understand it. Any given line could be
>>>> handed to any given mapper task running on any given node.
>>>>
>>>> I may be wrong, of course, someone closer to the project could give
>>>> you the right answer in that case.
>>>>
>>>> -- Jim R. Wilson (jimbojw)
>>>>
>>>> On Thu, Jul 17, 2008 at 4:06 PM, Elia Mazzawi
>>>> <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> is there a way to have hadoop hand over the lines of a file backwards
>>>>> to my mapper ?
>>>>>
>>>>> as in give the last line first.

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
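
P.S. For anyone who wants to try the key-by-id idea from the reply at the top of this message, here is a rough local sketch in Python. It is not Hadoop code and uses no Hadoop API: it only imitates the map, group-by-key and reduce steps in one process. The function names (map_record, reduce_id, run_job) are made up for illustration, and it assumes that "within 5 seconds" means a Y that follows an X with the same id by at most 5 seconds, inclusive.

```python
from collections import defaultdict

# Assumption: "within 5 seconds" is inclusive and the Y must come after the X.
WINDOW_SECONDS = 5


def map_record(line):
    """Map step: turn 'ID type timestamp' into (id, (type, timestamp))."""
    item_id, item_type, timestamp = line.split()
    return item_id, (item_type, int(timestamp))


def reduce_id(item_id, values):
    """Reduce step: all values for one id arrive together, in no set order.

    Read them into a list, sort by timestamp, and count each Y that
    follows the most recent X by at most WINDOW_SECONDS.
    """
    # Sort by timestamp; on ties "X" sorts before "Y".
    events = sorted(values, key=lambda v: (v[1], v[0]))
    last_x = None
    matches = 0
    for item_type, timestamp in events:
        if item_type == "X":
            last_x = timestamp
        elif item_type == "Y" and last_x is not None:
            if timestamp - last_x <= WINDOW_SECONDS:
                matches += 1
    return item_id, matches


def run_job(lines):
    """Imitate the shuffle locally: group mapped values by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        if line.strip():
            key, value = map_record(line)
            grouped[key].append(value)
    return dict(reduce_id(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = [
        "A1 X 1215647404",
        "A2 X 1215647405",
        "A3 X 1215647406",
        "A1 Y 1215647409",
    ]
    print(run_job(sample))
```

Because each id is sorted inside the reduce step, the order in which lines reach the mappers does not matter. The caveat from the reply still applies: this only works if the values for any single id fit comfortably in memory.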