Hi Jake,

> Well, let me see what we would imagine is going on: your
> data lives all over HDFS, because it's nice and big. The
> algorithm wants to run over the set in a big streamy fashion.
Yes, that sums it up nicely. The important part is being able to stream over the data several times.

> At any given point, if it's done processing local stuff, it can
> output 0.5GB of state and pick that up somewhere else to
> continue, is that correct?

Yes, with some mild constraints on "somewhere" that could be handled by writing my own InputFormat and InputSplit classes.

> You clearly don't want to move your multi-TB dataset
> around, but moving the 0.5GB model state around is
> ok, yes?

Yes! That is an awesome idea for minimizing the I/O on the system. However, it does not yet address the issue of multiple passes over the data, but that can easily be done by handing the 0.5GB around one more time per pass. I could store the 0.5GB on HDFS, where it is read by the mapper in setup(), updated in map() while streaming over the data, and stored again in cleanup(), together with some housekeeping data about which InputSplit was processed. The driver program could then orchestrate the different mappers and manage the global state of this procedure.

Now I only need to figure out how to do this ;-) As a very rough first sketch, I imagine the mapper side looking something like the code below (ModelState and the "model.path" configuration key are placeholder names I just made up):
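import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ModelPassMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  // Toy stand-in for the real 0.5GB model state.
  public static class ModelState implements Writable {
    private long count;
    public void update(Text record) { count++; }
    public void write(DataOutput out) throws IOException { out.writeLong(count); }
    public void readFields(DataInput in) throws IOException { count = in.readLong(); }
  }

  private ModelState model;
  private Path modelPath;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    modelPath = new Path(conf.get("model.path"));
    FileSystem fs = modelPath.getFileSystem(conf);
    model = new ModelState();
    try (FSDataInputStream in = fs.open(modelPath)) {
      model.readFields(in);  // load the current global state
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    model.update(value);     // stream one record into the model
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    // Write the updated state to a per-split file so that parallel
    // mappers don't clobber each other; the split name and offset serve
    // as the housekeeping data recording which part of the data was seen.
    FileSplit split = (FileSplit) context.getInputSplit();
    Path out = new Path(modelPath.getParent(),
        "state-" + split.getPath().getName() + "-" + split.getStart());
    FileSystem fs = out.getFileSystem(conf);
    try (FSDataOutputStream os = fs.create(out, true)) {
      model.write(os);
    }
  }
}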
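The driver would then just run one map-only job per pass, feeding the current model file into each round. Again only a sketch: how the per-split state files get merged back into a single model between passes depends entirely on the algorithm, so I left that step out.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ModelDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);   // the multi-TB dataset
    Path model = new Path(args[1]);   // the 0.5GB state file
    int passes = Integer.parseInt(args[2]);

    for (int pass = 0; pass < passes; pass++) {
      Configuration conf = new Configuration();
      conf.set("model.path", model.toString());
      Job job = Job.getInstance(conf, "model pass " + pass);
      job.setJarByClass(ModelDriver.class);
      job.setMapperClass(ModelPassMapper.class);
      job.setNumReduceTasks(0);                    // map-only pass
      job.setOutputFormatClass(NullOutputFormat.class);
      job.setOutputKeyClass(NullWritable.class);
      job.setOutputValueClass(NullWritable.class);
      FileInputFormat.addInputPath(job, input);
      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }
      // Between passes: merge the per-split state files written in
      // cleanup() back into the single model file (algorithm-specific).
    }
  }
}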
Thanks again,
Markus