On Fri, 2009-02-20 at 13:07 +0000, Steve Loughran wrote:
> I've been doing MapReduce work over small in-memory datasets 
> using Erlang,  which works very well in such a context.

I've got some (mainly Python) scripts (which will probably be run with
Hadoop Streaming eventually) that I run over multiple CPUs/cores on a
single machine, by opening the appropriate number of named pipes and
using tee and awk to split the workload.

Something like:

> mkfifo mypipe1
> mkfifo mypipe2
> awk '0 == NR % 2' < mypipe1 | ./mapper | sort > map_out_1 &     # even-numbered lines
> awk '0 == (NR+1) % 2' < mypipe2 | ./mapper | sort > map_out_2 & # odd-numbered lines
> ./get_lots_of_data | tee mypipe1 > mypipe2
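
For concreteness, ./mapper here can be any script that just reads lines
on stdin and writes key/value lines on stdout. A minimal, hypothetical
word-count-style mapper in Python (the real scripts could do anything)
might look like:

#!/usr/bin/env python
import sys

# Hypothetical mapper: emit one "word<tab>1" line per whitespace token.
# Any stdin-to-stdout filter can stand in its place.
for line in sys.stdin:
    for word in line.split():
        sys.stdout.write("%s\t1\n" % word)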

(then wait until everything is done, e.g. with the shell's wait builtin,
or have the "get_lots_of_data" process send a signal on completion if
it's a cron job)

> sort -m map_out* | ./reducer > reduce_out
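
sort -m merges the already-sorted partial outputs, so equal keys arrive
adjacent and the reducer can aggregate in a single streaming pass. A
matching hypothetical reducer sketch for the word-count example above:

#!/usr/bin/env python
import sys

# Hypothetical reducer: input is sorted by key (thanks to sort -m), so
# equal keys are adjacent and can be summed in one pass.
current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current:
        if current is not None:
            sys.stdout.write("%s\t%d\n" % (current, total))
        current, total = key, 0
    total += int(value)
if current is not None:
    sys.stdout.write("%s\t%d\n" % (current, total))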

This works around the global interpreter lock in Python quite nicely,
and it doesn't require the people who write the scripts (who may not be
programmers) to understand multiple processes etc., just stdin and stdout.

Tim Wintle
