This really seems like the kind of query that doesn't lend itself to
MapReduce very well... I'd probably do some kind of MapReduce abuse
(using MapReduce for distributed computation, but for nothing else).
map:
    tokens = set of tokens in first line
    for each line:
        make set of tokens in this line
        tokens = intersection(tokens, tokens_this_line)
    print tokens

combiner: same as map

reduce: cat

This way each mapper will reduce all of its input to one line of tokens
found in all of its lines. Then you can re-run this on the output until
you get a small enough set that you can run the last job on one box.
(Rough streaming sketches of both plans follow the quoted message below.)

-mike

On Sun, Nov 29, 2009 at 8:37 PM, Owen O'Malley <omal...@apache.org> wrote:
>
> On Nov 29, 2009, at 5:00 PM, James R. Leek wrote:
>
>> I want to use hadoop to discover if there is any token that appears
>> in every line of a file.
>
> What I would do:
>
> map:
>     generate a sorted list of tokens (dropping duplicates) for the
>     current line
>     if this is the first record:
>         previous = current token list
>     else:
>         iterate through both lists, deleting any tokens from previous
>         that aren't in the current list
>
> in map close:
>     for each token:
>         emit token, 1
>
> in reduce:
>     if there are M values for the key, where M is the number of maps:
>         emit token, null
>
> Memory in map is limited to roughly double the size of each line, which
> in most non-insane data sets is totally fine. Processing for each line
> is N lg N in the number of tokens in that line. Everything else is
> linear in the size of the answer.
>
> -- Owen
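
Here is a rough Hadoop Streaming sketch of the intersect-and-rerun plan
above. It is untested, assumes whitespace-delimited tokens, and the
script name intersect.py is made up:

    #!/usr/bin/env python
    # intersect.py -- streaming mapper (and combiner): keep a running
    # intersection of the token sets of every line this task sees, then
    # print the survivors as a single line. The reducer is just /bin/cat.
    import sys

    common = None  # running intersection; None means no lines seen yet
    for line in sys.stdin:
        tokens = set(line.split())
        common = tokens if common is None else common & tokens

    if common is not None:
        # Print even when the intersection is empty: the empty line
        # forces the intersection to empty in the next round, which is
        # the correct answer to propagate.
        print(" ".join(sorted(common)))

You would run it with something like:

    hadoop jar hadoop-streaming.jar -input in -output out \
        -mapper intersect.py -combiner intersect.py -reducer /bin/cat \
        -file intersect.py

and keep re-running it on its own output until the result fits on one
box. (If your Hadoop version won't take a streaming command as a
combiner, just drop it; each mapper already collapses its input to one
line anyway.)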
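
And a streaming sketch of Owen's single-pass plan, with a set
intersection standing in for his merge of two sorted lists. The file
names are made up, and the reducer assumes Hadoop Streaming exports job
configuration as environment variables (so mapred.map.tasks shows up as
mapred_map_tasks) and that that value matches the actual number of map
tasks, which you may need to force:

    #!/usr/bin/env python
    # every_line_map.py -- keep only the tokens that have appeared in
    # every line so far; at "map close" (end of input) emit each
    # survivor once, as token <tab> 1.
    import sys

    survivors = None
    for line in sys.stdin:
        tokens = set(line.split())
        survivors = tokens if survivors is None else survivors & tokens

    for token in sorted(survivors or []):
        print("%s\t1" % token)

    #!/usr/bin/env python
    # every_line_reduce.py -- a token was in every line of the file iff
    # every map task emitted it, i.e. iff its key has M values, where M
    # is the number of maps. Streaming hands the reducer its input
    # sorted by key, so groupby works here.
    import os
    import sys
    from itertools import groupby

    M = int(os.environ["mapred_map_tasks"])  # assumed set by streaming

    for token, lines in groupby(sys.stdin,
                                key=lambda kv: kv.split("\t", 1)[0]):
        if sum(1 for _ in lines) == M:
            print(token)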