I promise to review this by next Monday. I looked briefly and had some
suggestions; I think it might be OK. My only concern is what I have
already said -- once we start mapping aggregate, there's no reason not to
map other engine-specific capabilities, which are vast. At that point the
dilemma is that no matter what we do, we lose coherence: if we map it all,
then other engines will have trouble supporting all of it. If we don't map
it all, then we are forcing a capability reduction compared to what the
engine can actually do.

It is obvious to me that an all-reduce aggregate will make a lot of sense --
even if it means a math checkpoint. But then where do we stop in mapping
those primitives? E.g., do we do fold? cartesian? And what is the true
reason for remapping everything if it is already natively available? Etc.,
etc. For myself, I still haven't figured out a good answer to those.
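
To make the "where do we stop" point concrete, this is roughly the surface
we would end up specifying (purely hypothetical names and signatures, not a
proposal -- and every backend would then have to implement each of them):

    import org.apache.mahout.math.Matrix
    import org.apache.mahout.math.drm.DrmLike

    // hypothetical: the DSL surface if we keep mirroring engine primitives
    trait QuasiAlgebraicOps {
      // Spark-style aggregate over the vertical blocks of a DRM
      def aggregateBlocks[U](drm: DrmLike[Int])(zero: U)
                            (seqOp: (U, Matrix) => U, combOp: (U, U) => U): U
      // fold over blocks...
      def foldBlocks(drm: DrmLike[Int])(zero: Matrix)
                    (op: (Matrix, Matrix) => Matrix): Matrix
      // ...cartesian, and so on -- where does it end?
      def cartesian(a: DrmLike[Int], b: DrmLike[Int]): DrmLike[(Int, Int)]
    }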

On Tue, Nov 11, 2014 at 1:18 PM, Gokhan Capan <gkhn...@gmail.com> wrote:

> So the alternatives are:
>
> 1- mapBlock to a matrix in which all rows but the first are empty, then
> aggregate
> 2- depend on a backend
>
> 1 is obviously OK.
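>
> To spell out 1, a minimal sketch using only existing DSL ops (assumes the
> usual math-scala imports with RLikeOps/RLikeDrmOps; localSGD is a
> hypothetical in-core routine returning a vector of length block.ncol):
>
>     // run SGD per block, park the resulting parameter vector in row 0
>     // of an otherwise empty block, then let plain algebra aggregate
>     val paramDrm = drmX.mapBlock() { case (keys, block) =>
>       val w = localSGD(block)                  // hypothetical in-core SGD
>       val out = new SparseRowMatrix(block.nrow, block.ncol)
>       out(0, ::) := w
>       keys -> out
>     }
>     // sum of the per-partition vectors; divide by the number of
>     // partitions for Zinkevich-style averaging
>     val wSum = paramDrm.colSums()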
>
> I don't like the idea of depending on a backend, since SGD is a generic
> loss minimization on which other algorithms will possibly depend.
>
> In this context, client-side aggregation is not an overhead -- and even if
> it turns out to be one, it doesn't have to be a client-side aggregate at
> all.
>
> As an alternative to 1, I am thinking of at least having an aggregation
> operation, which would return an accumulated value anyway and shouldn't
> affect algebra optimizations.
>
> I quickly implemented a naive one (supporting only Spark -- I know I said
> that I don't like depending on a backend, but at least the interface is
> consistent across backends, and as a client, I still don't have to deal
> with Spark primitives directly).
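>
> For reference, the shape of it is roughly this (illustrative only -- the
> branch linked below is the actual code; this assumes the row-wise
> RDD[(K, Vector)] layout of a checkpointed Spark DRM):
>
>     // Spark-only: fold each partition's rows into one vector, then
>     // reduce across partitions -- the caller never touches Spark types
>     def accumulateBlocks[K](drm: CheckpointedDrm[K])
>                            (f: Iterator[(K, Vector)] => Vector): Vector =
>       drm.asInstanceOf[CheckpointedDrmSpark[K]].rdd
>         .mapPartitions(it => Iterator(f(it)))
>         .reduce(_ += _)                        // Vector += from RLikeOps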
>
> Is this nice enough? Is it too bad to have in the DSL?
> https://github.com/gcapan/mahout/compare/accumulateblocks
>
> Best
>
> Gokhan
>
> On Tue, Nov 11, 2014 at 10:45 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > Oh, the algorithm actually collects the vectors and runs another cycle on
> > the client!
> >
> > Still, technically, you can collect the almost-empty blocks to the client
> > (since they are mostly empty, it won't cause THAT huge an overhead
> > compared to collecting single vectors -- after all, how many partitions
> > are we talking about? 1000?).
> >
> > On Tue, Nov 11, 2014 at 12:41 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> >>
> >>
> >> On Sat, Nov 8, 2014 at 12:42 PM, Gokhan Capan <gkhn...@gmail.com>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Based on Zinkevich et al.'s Parallelized Stochastic Gradient Descent
> >>> paper (http://martin.zinkevich.org/publications/nips2010.pdf), I tried
> >>> to implement SGD, and a regularized least squares solution for linear
> >>> regression (it can easily be extended to other GLMs, too).
> >>>
> >>> How the algorithm works is as follows (a sketch of the update in 2.2.1
> >>> follows the steps):
> >>> 1. Split data into partitions of T examples
> >>> 2. in parallel, for each partition:
> >>>    2.0. shuffle partition
> >>>    2.1. initialize parameter vector
> >>>    2.2. for each example in the shuffled partition
> >>>        2.2.1 update the parameter vector
> >>> 3. Aggregate all the parameter vectors and return
> >>>
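> >>> Here, the per-example update (2.2.1) for L2-regularized squared loss
> >>> is, roughly (illustrative names; assumes org.apache.mahout.math types
> >>> and the RLikeOps scala bindings):
> >>>
> >>>     // one sequential SGD pass over a shuffled partition;
> >>>     // update: w -= eta * ((x dot w - y) * x + lambda * w)
> >>>     def sgdPass(examples: Seq[(Vector, Double)], d: Int,
> >>>                 eta: Double, lambda: Double): Vector = {
> >>>       val w: Vector = new DenseVector(d)
> >>>       for ((x, y) <- examples) {
> >>>         val err = (x dot w) - y
> >>>         w -= (x * err + w * lambda) * eta
> >>>       }
> >>>       w
> >>>     }
> >>>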
> >>
> >> I guess technically it is possible (transform each block to a
> >> SparseRowMatrix or SparseMatrix with only the first row valid) and then
> >> invoke colSums() or colMeans() (whatever "aggregate" means).
> >>
> >> However, I am not sure it is worth the ugliness. Isn't it easier to
> >> declare these things quasi-algebraic and just do direct Spark calls on
> >> the matrix RDD (map, aggregate)?
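> >>
> >> E.g., something like this on the Spark backend (illustrative; localSGD
> >> is a hypothetical per-partition routine returning a parameter vector):
> >>
> >>     // quasi-algebraic: use the row-wise matrix RDD and Spark
> >>     // primitives directly -- no new DSL operator
> >>     val w = drmX.rdd
> >>       .mapPartitions(rows => Iterator(localSGD(rows)))
> >>       .reduce(_ += _)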
> >>
> >> The real danger is introducing non-algebra things into the algebra, so
> >> that the rest of the algebra no longer optimizes.
> >>
> >>
> >
>
