On Fri, Jul 18, 2008 at 2:06 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:
> unless you have a gigantic number of items with the same id, this is
> straightforward.  have a mapper emit items of the form:
>
> key=id, value = type,timestamp
Or if you do have a large (by Hadoop standards) number of items with the
same id, use timestamp + id for the key, emit one row for each of the five
seconds from timestamp through timestamp + 4, and put a unique identifier
in the row.  I think you can get a guaranteed-unique id from mapred.task.id
(but check me on that), and just add a counter to that:

ID    type    Timestamp
A1    X       1215647404
A1    Y       1215647408

becomes

1215647404/a1, x, uniqueidX
1215647405/a1, x, uniqueidX
1215647406/a1, x, uniqueidX
1215647407/a1, x, uniqueidX
1215647408/a1, x, uniqueidX
1215647408/a1, y, uniqueidY
1215647409/a1, y, uniqueidY
1215647410/a1, y, uniqueidY
etc.

If a key's values include a uniqueX, write out all of its uniqueYs.  The
problem then just becomes WordCount as a second pass.  (Someone more clever
than I am can probably do this in one pass...)

Your mapper ends up spitting out 5x more rows, but your reducer has many
fewer rows to keep in memory.  At Hadoop scale, that can matter.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
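For concreteness, here is a minimal sketch of the first pass as plain Hadoop
Java MapReduce (new org.apache.hadoop.mapreduce API).  The tab-separated
input layout (id, type, timestamp), the literal type names "X" and "Y", and
the class names are illustrative assumptions, not from the thread; the
unique record id is built from the task attempt id (the new-API equivalent
of the mapred.task.id property mentioned above) plus a per-mapper counter.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WindowCoOccurrence {

  // Expands each record into one row per second of the window,
  // keyed by second/id, exactly as in the example above.
  public static class WindowMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private static final int WINDOW = 5;  // five rows per record
    private long counter = 0;             // per-mapper counter for unique ids

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");  // assumed: id, type, timestamp
      String id = f[0], type = f[1];
      long ts = Long.parseLong(f[2]);
      // Task attempt ids are unique across the job, so attemptId + counter
      // gives a guaranteed-unique record id.
      String uid = ctx.getTaskAttemptID().toString() + "-" + counter++;
      for (long t = ts; t < ts + WINDOW; t++) {
        ctx.write(new Text(t + "/" + id), new Text(type + "," + uid));
      }
    }
  }

  // For each second/id bucket: if any X landed here, emit every Y that
  // landed here too.  Deduplicating the emitted Y ids is the
  // WordCount-style second pass described above.
  public static class CoOccurrenceReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      boolean sawX = false;
      List<String> ys = new ArrayList<String>();
      for (Text v : values) {
        String[] f = v.toString().split(",", 2);  // type, uniqueid
        if (f[0].equals("X")) sawX = true;
        else if (f[0].equals("Y")) ys.add(f[1]);
      }
      if (sawX) {
        for (String y : ys) {
          ctx.write(key, new Text(y));
        }
      }
    }
  }
}

Note that a Y whose window overlaps an X in several seconds is emitted once
per shared second, which is why the rows carry unique ids: the second pass
collapses those duplicates by counting distinct uniqueY values.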