I would probably agree that it's typically not a good idea to add states to
distributed systems. Additionally, from a purist's perspective, this would
be a bit of hacking to the paradigm.

However, from a practical point of view, I think that it's a reasonable
trade-off between efficiency and complexity. It's not too difficult to have
a small set of mutable states being kept in-between iterations. And I think
a laarge number of iterative algorithms could benefit from this.

For the time being, we're thinking something like this:

RDD[Array[Double]] appended with an extra column that initializes to some
default value.

If the extra column in an iteration has the default value, it means either
something failed or it's the very first iteration, so we compute things
inefficiently. Otherwise, it has intermediate computational value, so we
can do efficient computation.


On Mon, Apr 21, 2014 at 11:15 AM, Marcelo Vanzin <van...@cloudera.com>wrote:

> Hi Sung,
>
> On Mon, Apr 21, 2014 at 10:52 AM, Sung Hwan Chung
> <coded...@cs.stanford.edu> wrote:
> > The goal is to keep an intermediate value per row in memory, which would
> > allow faster subsequent computations. I.e., computeSomething would
> depend on
> > the previous value from the previous computation.
>
> I think the fundamental problem here is that there is no "in memory
> state" of the sort you mention when you're talking about
> map/reduce-style workloads. There are three kinds of data that you can
> use to communicate between sub-tasks:
>
> - RDD input / output, i.e. the arguments and return values of the
> closures you pass to transformations
> - Broadcast variables
> - Accumulators
>
> In general, distributed algorithms should strive to be stateless,
> exactly because of issues like reliability and having to re-run
> computations (and communication/coordination in general being
> expensive). The last two in the list above are not generally targeted
> at the kind of state-keeping that you seem to be talking about.
>
> So if you make the result of "computeSomething()" the output of your
> map task, then you'll have access to it in the operations downstream
> from that map task. But you can't "store it in a variable in memory"
> and access it later, because that's not how the system works.
>
> In any case, I'm really not familiar with ML algorithms, but maybe you
> should take a look at MLLib.
>
>
> --
> Marcelo
>

Reply via email to