On Thu, Jan 28, 2010 at 1:37 PM, Markus Weimer <[email protected]> wrote:
>
> >
> > How does network bandwidth come into play in a "local" solution?
>
> Data may not fit on one disk and must be streamed through the network
> to the learning algorithm. If the data does indeed fit onto one disk,
> the algorithm becomes disk bandwidth bound.
>

Ok, I understand this part now.


> There is no parallelism to be exploited: I'm doing SGD-style learning.
> As the parallelization thereof is a largely unsolved problem, the
> learning is strictly sequential.  The desire to run it on a Hadoop
> cluster stems from the fact that data preprocessing and the
> application of the learned model is a perfect fit for it. It would be
> neat if the actual learning could be done on the cluster as well, if
> only on a single, carefully chosen node close to the data.
>

Well, let me see if I understand what's going on: your
data lives all over HDFS, because it's nice and big, and
the algorithm wants to run over the whole set in one big
streaming pass.

At any given point, once it's done processing the local
data, it can output 0.5GB of state and pick that up
somewhere else to continue, is that correct?

You clearly don't want to move your multi-TB dataset
around, but moving the 0.5GB model state around is
ok, yes?

It seems like what you'd want to do is pass that state
info around your cluster, using one node at a time,
sequentially, to process chunks of your data set; I'm
just not sure what sort of non-hacky way there is to do
this in Hadoop.  Simple hack: manually split your set
into a bunch of smaller (small enough for one disk),
non-splittable files, and then run the same job over and
over again (with different input sources).  Each time a
job finishes, it writes its state to HDFS, and each time
the next one starts, the mapper slurps that state back
down from HDFS.  This latter mini-shuffle is a little
inefficient (probably two remote copies are done), but
it's a fairly small amount of data being transferred,
and hopefully IO would no longer be the bottleneck.
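
For concreteness, here's roughly what that hack could look
like against the (newer, Hadoop 2 style) mapreduce API.
Treat it as an untested sketch: the Model class, its
update()/read()/write() methods, and the "sgd.state.path"
config key are made-up placeholders for whatever your
learner actually does; the only real mechanics are the
setup()/cleanup() state slurp and the driver loop that
re-runs the same map-only job once per chunk.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SequentialSgdDriver {

  // Toy stand-in for the real learner: a dense weight vector and a
  // placeholder SGD step.  Only here so the sketch compiles.
  public static class Model {
    private final float[] weights = new float[1 << 20];

    void update(String record) {
      // real code would parse the record and do one SGD step here
    }

    static Model read(DataInput in) throws IOException {
      Model m = new Model();
      for (int i = 0; i < m.weights.length; i++) m.weights[i] = in.readFloat();
      return m;
    }

    void write(DataOutput out) throws IOException {
      for (float w : weights) out.writeFloat(w);
    }
  }

  // Map-only task: slurp the model state down from HDFS, stream over
  // the local chunk, then write the updated state back to HDFS.
  public static class SgdMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private Model model;
    private Path statePath;

    @Override
    protected void setup(Context context) throws IOException {
      Configuration conf = context.getConfiguration();
      statePath = new Path(conf.get("sgd.state.path"));
      FileSystem fs = FileSystem.get(conf);
      if (fs.exists(statePath)) {
        FSDataInputStream in = fs.open(statePath);   // ~0.5GB remote read
        model = Model.read(in);
        in.close();
      } else {
        model = new Model();                         // very first chunk
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
      model.update(value.toString());                // strictly sequential
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      FileSystem fs = FileSystem.get(context.getConfiguration());
      FSDataOutputStream out = fs.create(statePath, true);  // overwrite
      model.write(out);                              // state for the next job
      out.close();
    }
  }

  public static void main(String[] args) throws Exception {
    // usage: SequentialSgdDriver <hdfs state file> <chunk1> <chunk2> ...
    Configuration conf = new Configuration();
    conf.set("sgd.state.path", args[0]);
    // single map task per job, so don't let a speculative copy race
    // against it when overwriting the state file
    conf.setBoolean("mapreduce.map.speculative", false);

    for (int i = 1; i < args.length; i++) {
      Job job = Job.getInstance(conf, "sgd-chunk-" + i);
      job.setJarByClass(SequentialSgdDriver.class);
      job.setMapperClass(SgdMapper.class);
      job.setNumReduceTasks(0);                      // map-only
      job.setOutputFormatClass(NullOutputFormat.class);
      // each chunk should be a single non-splittable file (e.g. one .gz)
      // so exactly one mapper sees it and the pass stays sequential
      FileInputFormat.addInputPath(job, new Path(args[i]));
      if (!job.waitForCompletion(true)) {
        throw new RuntimeException("chunk " + i + " failed");
      }
    }
  }
}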

  -jake




> Thanks,
>
> Markus
>
