Hadoop's use of the term "map/reduce" would suggest, to a J
programmer, something like:

  v/ u"0

Or maybe:

  v/ u"1
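
For instance, with a hypothetical u of *: (square) and v of + , that
reading would just be the ordinary "apply u to each item, then insert
v between the results" pattern:

   +/ *:"0 ] 1 2 3 4
30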

But that's actually not a very good representation of what it's doing.

This is closer to what it's doing:

   v;.2 ; /:~ <;.2 u;.2
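
Here's a toy, purely in-memory sketch of that shape (with placeholder
verbs standing in for u and v, a made-up stand-in for stdin, and ;._2
rather than ;.2 so that the line terminators get dropped):

   lf     =: 10 { a.                     NB. ascii 10, the line terminator
   in     =: 'b 2',lf,'a 1',lf,'b 3',lf  NB. stand-in for what arrives on stdin
   u      =: ]                           NB. placeholder "map" verb, applied per line
   v      =: ]                           NB. placeholder "reduce" verb
   mapped =: u;._2 in                    NB. apply u to each lf-terminated line
   sorted =: /:~ mapped                  NB. the framework sorts between the phases
   v"1 sorted                            NB. apply v to each line of the sorted result
a 1
b 2
b 3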

(Oh, and early releases of hadoop did not have the "reduce" step;
they were just a mechanism to farm jobs out across a bunch of
different machines.)

More specifically, it's set up to handle a unix pipeline (reading from
stdin, writing to stdout, implicitly operating on lines terminated by
ascii character 10), farming the job out across multiple machines for
the "map" operation, then sorting the result and farming that out
again across multiple machines for the "reduce" operation.
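
As a toy illustration of that split (again purely in-memory J, with
none of the actual hadoop plumbing, and with the sample text and the
(word;1) pair format made up for the example), here is the classic
word-count job:

   words  =: ;: 'the quick fox the lazy fox the'
   NB. "map": emit a (word;1) pair per word
   NB. "sort": order the pairs so that equal keys end up adjacent
   pairs  =: /:~ words ,. <1
   NB. "reduce": total the counts within each run of equal keys
   keys   =: {."1 pairs
   counts =: keys +//. ; {:"1 pairs
   (~. keys) ,. <"0 counts

which tallies fox 2, lazy 1, quick 1, the 3. (J's /. key adverb does
not actually need the sort; it's there to mirror the way hadoop hands
each reducer its keys in sorted runs.)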

Also, "u" and "v" are expected to be java classes (and you can specify
which *.jar file contains those classes), though there are prebuilt
implementations which let you run arbitrary unix commands (for
example, hadoop-streaming.jar).

(Of course, there are a lot of ornate details, most of which usually
get ignored (they have to do with tuning the environment and/or
catering to the somewhat arbitrary constraints of somewhat arbitrary
programs). Also, there are numerous versions and flavors, all
slightly incompatible.)

But what if the data you are processing is not "lines" but something
else? Files? Directories extracted from archives? Etc...?

Well, for that case you should probably set things up so that a
"line" is a descriptor of work to be done, and have the code being
run download/process/upload the real work, with stdout becoming lines
that briefly summarize the work which was done. (This assumes you are
only "mapping".) Alternatively, if you are "reducing", then the lines
output from the mapper would describe the next phase of work to be
done (and this assumes that sorting is a useful mechanism in this
context). There are also resource constraints which you have to avoid
running afoul of: use too much time, too much memory, or too much
disk and a process can get shut down. There's some variation in what
those limits are, but they are relatively small by default (maybe a
gigabyte of memory and 10 minutes of processing time, for example).

(On the flip side, once you are running code in a hadoop cluster,
you pretty much have unrestricted access to the machine, as long as
you stay within your resource constraints; the machine instance gets
discarded when you are done with it.)

Finally, speed... when running a hadoop job, it takes several
seconds to set up a small cluster and minutes to set up a larger one;
I've not pushed the boundaries enough to know what a really large job
looks like. Also, the good documentation seems to be mostly for early
versions of hadoop (capable of handling only 4000 machines in a
cluster), but I have noticed that some features documented for those
early versions are no longer present [or, at least, are syntactically
different] in more recent versions.

But I guess that's pretty normal for all software projects. Makes it a
pain to use, though, when old documentation just plain doesn't work
and you can't find the relevant newer documentation. (And, sadly,
search engines seem to be getting worse and worse at locating relevant
documentation, as opposed to auto-generated spam... though perhaps
that also has to do with my idea of relevance?)

Anyways, enough musing for now, time to get back to debugging...

Thanks,

-- 
Raul