On Fri, Jul 18, 2008 at 2:06 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:
> unless you have a gigantic number of items with the same id, this is
> straightforward.  have a mapper emit items of the form:
>
> key=id, value = type,timestamp

Or if you do have a large (by Hadoop standards) number of items with
the same id, use timestamp + id as the key, emit one row for each
second from timestamp through timestamp + 5, and put a unique
identifier in each row.  I think you can get a guaranteed-unique id
from mapred.task.id (but check me on that), and just append a counter
to it:

ID    Type   Timestamp
A1    X      1215647404
A1    Y      1215647408

becomes

1215647404/a1, x, uniqueidX
1215647405/a1, x, uniqueidX
1215647406/a1, x, uniqueidX
1215647407/a1, x, uniqueidX
1215647408/a1, x, uniqueidX
1215647408/a1, y, uniqueidY
1215647409/a1, y, uniqueidY
1215647410/a1, y, uniqueidY
etc
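
A rough sketch of a mapper doing that expansion (assuming input lines
of the form "id,type,timestamp" and the org.apache.hadoop.mapreduce
API; the class name, field layout, and five-row window are just
illustrative):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Expands each input record into one row per second of a five-second
// window, keyed by timestamp/id, so later events land in the same
// reduce group as the earlier events they might match.
public class ExpandByTimestampMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private long counter = 0;  // appended to the task attempt id for uniqueness

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assumed input layout: "id,type,timestamp"
    String[] fields = line.toString().split(",");
    String id = fields[0].toLowerCase();
    String type = fields[1].toLowerCase();
    long ts = Long.parseLong(fields[2]);

    // Task attempt ids are unique per attempt, so attemptId + counter
    // should be unique across the job (check me on that, as above).
    String uniqueId = context.getTaskAttemptID().toString() + "-" + (counter++);

    // One row per second from ts through ts + 4, matching the rows above.
    for (long t = ts; t < ts + 5; t++) {
      context.write(new Text(t + "/" + id), new Text(type + "," + uniqueId));
    }
  }
}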

In the reducer, if a key's group contains a uniqueidX, write out all
of its uniqueidYs.  The problem then just becomes WordCount as a
second pass.  (Someone more clever than I am can probably do this in
one pass...)
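
The matching reduce side might look roughly like this (again a
sketch, with the class name made up, and assuming the lowercase x/y
type values from the rows above; its (uniqueidY, 1) output feeds the
WordCount-style second job):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For each timestamp/id group: if any "x" row is present, emit every
// "y" row's unique id with a count of 1.  A WordCount-style second job
// can then sum (or dedupe) those ids.
public class MatchXYReducer extends Reducer<Text, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    boolean sawX = false;
    List<String> yIds = new ArrayList<String>();

    for (Text value : values) {
      String[] parts = value.toString().split(",", 2);  // "type,uniqueId"
      if (parts[0].equals("x")) {
        sawX = true;
      } else {
        yIds.add(parts[1]);
      }
    }

    if (sawX) {
      for (String yId : yIds) {
        context.write(new Text(yId), ONE);
      }
    }
  }
}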

Your mapper ends up spitting out 5x as many rows, but because the key
is timestamp/id rather than just id, each reduce group has far fewer
rows to keep in memory.  At Hadoop scale, that can matter.

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
