Tim Wintle wrote:
On Fri, 2009-02-20 at 13:07 +0000, Steve Loughran wrote:
I've been doing MapReduce work over small in-memory datasets
using Erlang, which works very well in such a context.
I've got some (mainly Python) scripts (which will probably be run with
Hadoop Streaming eventually) that I run across multiple CPUs/cores on a
single machine, by opening the appropriate number of named pipes and
using tee and awk to split the workload.
Something like:
mkfifo mypipe1
mkfifo mypipe2

# worker 1 maps the even-numbered input lines, worker 2 the odd ones;
# both run in the background and sort their own output
awk '0 == NR % 2'       < mypipe1 | ./mapper | sort > map_out_1 &
awk '0 == (NR + 1) % 2' < mypipe2 | ./mapper | sort > map_out_2 &

# duplicate the input stream into both pipes
./get_lots_of_data | tee mypipe1 > mypipe2
(wait until it's done... or have the "get_lots_of_data" process send a
signal on completion if it's a cron job)
# make sure both background mappers have finished, then
# merge their already-sorted outputs and reduce
wait
sort -m map_out* | ./reducer > reduce_out
This works around the global interpreter lock in Python quite nicely, and
the people who write the scripts (who may not be programmers) don't need
to understand multiple processes and so on, just stdin and stdout.
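To make that stdin/stdout contract concrete, here is a minimal sketch of what ./mapper and ./reducer could look like for a word count. The actual scripts aren't shown in the thread, so this is a hypothetical pair; it only assumes the convention the pipeline relies on, namely tab-separated key/value lines, with sort grouping identical keys between the two stages.

```python
#!/usr/bin/env python
# Hypothetical word-count mapper/reducer pair for the pipeline above.
# Each stage reads lines on stdin and writes "key<TAB>value" lines on
# stdout; the sort between the stages puts identical keys on adjacent
# lines, which is all the reducer depends on.
import sys
from itertools import groupby

def mapper(lines):
    # emit "word\t1" for every word seen
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(lines):
    # input is sorted, so identical keys are adjacent: sum their counts
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

if __name__ == "__main__":
    # run as "wordcount.py map" or "wordcount.py reduce"
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = mapper if stage == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Because each stage is just a line filter, the same two functions work unchanged whether they are driven by the tee/awk pipeline, by a plain `cat input | ./mapper | sort | ./reducer`, or later by Hadoop Streaming.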
Dumbo provides Python support under Hadoop:
http://wiki.github.com/klbostee/dumbo
https://issues.apache.org/jira/browse/HADOOP-4304
As well as that, given Hadoop requires Java 1.6+, there's no reason why it
couldn't support the javax.script engine, with JavaScript working
without extra JAR files, and Groovy and Jython once their JARs were put on
the classpath. Some work would probably be needed to make these languages
easier to use, and then there are the tests...