unless you have a gigantic number of items with the same id, this is
straightforward.  have a mapper emit items of the form:

key = id, value = type,timestamp

and your reducer will then see all values that have the same id together.
it is then a simple matter to process all items with the same id.  for
example, you could simply read them into a list and work on them in any
manner you see fit.

(note that hadoop is perfectly fine at dealing with multi-line items.  all
you need do is make sure that the items you want to process together all
share the same key)
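
to make that concrete, here is a minimal sketch in plain python rather than
real hadoop code.  the names map_line, reduce_id and run_job are made up for
illustration, and run_job only imitates the grouping that the shuffle does
for you.  i am also assuming that "within 5 seconds" means the Y arrives at
or after some X with the same id, and no more than 5 seconds later; adjust
the test if you meant something else.

```python
from collections import defaultdict

WINDOW_SECONDS = 5  # the window from the original question


def map_line(line):
    """Mapper: turn 'ID type timestamp' into (id, (type, timestamp))."""
    item_id, item_type, timestamp = line.split()
    return item_id, (item_type, int(timestamp))


def reduce_id(item_id, values, window=WINDOW_SECONDS):
    """Reducer: count the Y items that follow some X by at most `window` seconds.

    Assumption: a Y matches if 0 <= y_time - x_time <= window for any X
    with the same id.  The values may arrive in any order.
    """
    values = list(values)  # read everything for this id into a list
    x_times = [ts for kind, ts in values if kind == "X"]
    count = 0
    for kind, ts in values:
        if kind == "Y" and any(0 <= ts - x <= window for x in x_times):
            count += 1
    return item_id, count


def run_job(lines):
    """Stand-in for the shuffle: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        key, value = map_line(line)
        groups[key].append(value)
    return dict(reduce_id(key, values) for key, values in groups.items())
```

since the reducer holds every value for one id in memory, this only works
while no single id has a gigantic number of items, which is the caveat at the
top of this mail.  note also that nothing here depends on the order in which
the lines reach the mapper.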

Miles

2008/7/18 Elia Mazzawi <[EMAIL PROTECTED]>:

> well here is the problem I'm trying to solve,
>
> I have a data set that looks like this:
>
> ID    type   Timestamp
>
> A1    X   1215647404
> A2    X   1215647405
> A3    X   1215647406
> A1   Y   1215647409
>
> I want to count how many A1 Y, show up within 5 seconds of an A1 X
>
> I was planning to have the data sorted by ID then timestamp,
> then read it backwards,  (or have it sorted by reverse timestamp)
>
> go through it caching all Y's for the same ID for 5 seconds to either find
> a matching X or not.
>
> the results don't need to be 100% accurate.
>
> so if hadoop gives the same file with the same lines in order then this
> will work.
>
> seems hadoop is really good at solving problems that depend on 1 line at a
> time? but not multi lines?
>
> hadoop has to get data in order, and be able to work on multi lines,
> otherwise how can it be setting records in data sorts.
>
> I'd appreciate other suggestions to go about doing this.
>
> Jim R. Wilson wrote:
>
>>> does wordcount get the lines in order? or are they random? can i have
>>> hadoop return them in reverse order?
>>>
>>>
>>
>> You can't really depend on the order that the lines are given - it's
>> best to think of them as random.  The purpose of MapReduce/Hadoop is
>> to distribute a problem among a number of cooperating nodes.
>>
>> The idea is that any given line can be interpreted separately,
>> completely independent of any other line.  So in wordcount, this makes
>> sense.  For example, say you and I are nodes. Each of us gets half the
>> lines in a file and we can count the words we see and report on them -
>> it doesn't matter what order we're given the lines, or which lines
>> we're given, or even whether we get the same number of lines (if
>> you're faster at it, or maybe you get shorter lines, you may get more
>> lines to process in the interest of saving time).
>>
>> So if the project you're working on requires getting the lines in a
>> particular order, then you probably need to rethink your approach. It
>> may be that hadoop isn't right for your problem, or maybe that the
>> problem just needs to be attacked in a different way.  Without knowing
>> more about what you're trying to achieve, I can't offer any specifics.
>>
>> Good luck!
>>
>> -- Jim
>>
>> On Thu, Jul 17, 2008 at 4:41 PM, Elia Mazzawi
>> <[EMAIL PROTECTED]> wrote:
>>
>>
>>> I have a program based on wordcount.java
>>> and I have files that are smaller than 64 MB (so I believe each file is
>>> one task)
>>>
>>> does wordcount get the lines in order? or are they random? can i have
>>> hadoop return them in reverse order?
>>>
>>> Jim R. Wilson wrote:
>>>
>>>
>>>> It sounds to me like you're talking about hadoop streaming (correct me
>>>> if I'm wrong there).  In that case, there's really no "order" to the
>>>> lines being doled out as I understand it.  Any given line could be
>>>> handed to any given mapper task running on any given node.
>>>>
>>>> I may be wrong, of course, someone closer to the project could give
>>>> you the right answer in that case.
>>>>
>>>> -- Jim R. Wilson (jimbojw)
>>>>
>>>> On Thu, Jul 17, 2008 at 4:06 PM, Elia Mazzawi
>>>> <[EMAIL PROTECTED]> wrote:
>>>>
>>>>
>>>>
>>>>> is there a way to have hadoop hand over the lines of a file backwards to
>>>>> my mapper?
>>>>>
>>>>> as in give the last line first.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
