This really seems like the kind of query that doesn't lend itself to
MapReduce very well... I'd probably do some kind of MapReduce abuse
(using MapReduce for distributed computation, but for nothing else).
map:
    tokens = set of tokens in first line
    for each line:
        make set of tokens in this line
        tokens = intersection(tokens, tokens_this_line)
    print tokens

combiner: same as map

reduce: cat

This way each mapper will reduce all of its input to one line of tokens
found in all of its lines. Then you can re-run this on the output until
you get a small enough set that you can run the last job on one box.
(Rough streaming sketches of both plans follow the quoted message below.)

-mike

On Sun, Nov 29, 2009 at 8:37 PM, Owen O'Malley <omal...@apache.org> wrote:
>
> On Nov 29, 2009, at 5:00 PM, James R. Leek wrote:
>
>> I want to use hadoop to discover if there is any token that appears
>> in every line of a file.
>
> What I would do:
>
> map:
>     generate a sorted list of tokens (dropping duplicates) for the
>     current line
>     if this is the first record:
>         previous = current token list
>     else:
>         iterate through both lists, deleting any tokens from previous
>         that aren't in the current list
>
> in map close:
>     for each token:
>         emit token, 1
>
> in reduce:
>     if there are M values for the key, where M is the number of maps:
>         emit token, null
>
> Memory in map is limited to roughly double the size of each line, which
> in most non-insane data sets is totally fine. Processing for each line
> is N lg N in the number of tokens in that line. Everything else is
> linear in the size of the answer.
>
> -- Owen
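
Here is a rough Hadoop Streaming sketch of the intersect-and-rerun plan
above. It is untested, assumes whitespace-delimited tokens, and the
script name intersect.py is made up:

    #!/usr/bin/env python
    # intersect.py -- streaming mapper (and combiner): keep a running
    # intersection of the token sets of every line this task sees, then
    # print the survivors as a single line. The reducer is just /bin/cat.
    import sys

    common = None  # running intersection; None means no lines seen yet
    for line in sys.stdin:
        tokens = set(line.split())
        common = tokens if common is None else common & tokens

    if common is not None:
        # Print even when the intersection is empty: the empty line
        # forces the intersection to empty in the next round, which is
        # the correct answer to propagate.
        print(" ".join(sorted(common)))

You would run it with something like:

    hadoop jar hadoop-streaming.jar -input in -output out \
        -mapper intersect.py -combiner intersect.py -reducer /bin/cat \
        -file intersect.py

and keep re-running it on its own output until the result fits on one
box. (If your Hadoop version won't take a streaming command as a
combiner, just drop it; each mapper already collapses its input to one
line anyway.)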
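
And a streaming sketch of Owen's single-pass plan, with a set
intersection standing in for his merge of two sorted lists. The file
names are made up, and the reducer assumes Hadoop Streaming exports job
configuration as environment variables (so mapred.map.tasks shows up as
mapred_map_tasks) and that that value matches the actual number of map
tasks, which you may need to force:

    #!/usr/bin/env python
    # every_line_map.py -- keep only the tokens that have appeared in
    # every line so far; at "map close" (end of input) emit each
    # survivor once, as token <tab> 1.
    import sys

    survivors = None
    for line in sys.stdin:
        tokens = set(line.split())
        survivors = tokens if survivors is None else survivors & tokens

    for token in sorted(survivors or []):
        print("%s\t1" % token)

    #!/usr/bin/env python
    # every_line_reduce.py -- a token was in every line of the file iff
    # every map task emitted it, i.e. iff its key has M values, where M
    # is the number of maps. Streaming hands the reducer its input
    # sorted by key, so groupby works here.
    import os
    import sys
    from itertools import groupby

    M = int(os.environ["mapred_map_tasks"])  # assumed set by streaming

    for token, lines in groupby(sys.stdin,
                                key=lambda kv: kv.split("\t", 1)[0]):
        if sum(1 for _ in lines) == M:
            print(token)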