Hi Jake

> Well let me see what we would imagine is going on: your
> data lives all over HDFS, because it's nice and big.  The
> algorithm wants to run over the set in a big streamy fashion.

Yes, that sums it up nicely. The important part is to be able to
stream over the data several times.

> At any given point if it's done processing local stuff, it can
> output 0.5GB of state and pick up that somewhere else to
> continue, is that correct?

Yes, with some mild constraints on "somewhere" that could be handled
by writing my own InputFormat and InputSplit classes.

> You clearly don't want to move your multi-TB dataset
> around, but moving the 0.5GB model state around is
> ok, yes?

Yes! That is an awesome idea to minimize the I/O on the system.
However, it does not yet address the issue of making multiple passes over
the data. But that can easily be done by handing the 0.5GB around one
more time for each pass.

I could store the 0.5GB on HDFS, where it is read by the mapper in
setup(), updated in map() over the data, and stored again in cleanup(),
together with some housekeeping data about which InputSplit was
processed. The driver program could then orchestrate the different
mappers and manage the global state of this procedure. Now I only need
to figure out how to do this ;-)
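For what it's worth, here is a minimal sketch of that load/update/save
pattern in plain Java. It is not Hadoop code: local files stand in for
HDFS, a small double[] stands in for the 0.5GB model state, and the
setup()/map()/cleanup() method names only mirror the Mapper lifecycle;
the class name and the trivial "add the record to every weight" update
are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch of one pass: load shared state, fold the local split into it,
// persist the state so the next mapper (or the next pass) can pick it up.
public class StatefulPass {
    private final Path statePath; // stands in for the HDFS location
    private double[] state;       // stands in for the 0.5GB model state

    public StatefulPass(Path statePath) {
        this.statePath = statePath;
    }

    // setup(): read the state written by the previous mapper / pass.
    public void setup() throws IOException {
        List<String> lines = Files.readAllLines(statePath);
        state = lines.stream().mapToDouble(Double::parseDouble).toArray();
    }

    // map(): update the state with one record from the local split
    // (toy update rule -- the real algorithm goes here).
    public void map(double record) {
        for (int i = 0; i < state.length; i++) {
            state[i] += record;
        }
    }

    // cleanup(): write the updated state back for the next mapper.
    public void cleanup() throws IOException {
        List<String> out = new ArrayList<>();
        for (double v : state) {
            out.add(Double.toString(v));
        }
        Files.write(statePath, out);
    }
}
```

A driver would then run this cycle once per InputSplit, and repeat the
whole round for each pass over the data, moving only the small state
file rather than the multi-TB dataset.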

Thanks again,

Markus
