So you are following #2, which is good. #1 seems a bit like a hack. For a long time to come we will have to add things to the DSL if it is to be kept engine independent. Yours looks pretty general and simple.
Are you familiar with the existing Mahout aggregate methods? They show up in SGDHelper.java and other places in the legacy code. I don't know much about them, but they seem to be a pre-functional-programming attempt at this kind of thing. It looks like you are proposing a replacement for those based on rdd.aggregate; if so, that would be very useful. For one thing, it looks like the old aggregate was not parallel, and rdd.aggregate is.

On Nov 11, 2014, at 1:18 PM, Gokhan Capan <gkhn...@gmail.com> wrote:

So the alternatives are:
1- mapBlock to a matrix in which all rows but the first are empty, then aggregate
2- depend on a backend

1 is obviously OK. I don't like the idea of depending on a backend, since SGD is a generic loss minimization on which other algorithms will possibly depend. In this context, client-side aggregation is not an overhead, but even if it turns out to be, it doesn't have to be a client-side aggregate at all.

As an alternative to 1, I am thinking of at least having an aggregation operation, which will return an accumulated value anyway and shouldn't affect algebra optimizations. I quickly implemented a naive one (supporting only Spark; I know I said that I don't like depending on a backend, but at least the backends-wide interface is consistent, and as a client I still don't have to deal with Spark primitives directly).

Is this nice enough? Is it too bad to have in the DSL?

https://github.com/gcapan/mahout/compare/accumulateblocks

Best
Gokhan

On Tue, Nov 11, 2014 at 10:45 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> Oh. The algorithm actually collects the vectors and runs another cycle on
> the client!
>
> Still, technically, you can collect almost-empty blocks to the client
> (since they are mostly empty, it won't cause THAT huge an overhead
> compared to collecting single vectors; after all, how many partitions are
> we talking about? 1000?).
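[Editor's note: a minimal sketch of the seqOp/combOp contract that rdd.aggregate provides, emulated on plain Java collections with no Spark dependency. It is meant only to illustrate why the per-partition fold plus cross-partition merge is parallelizable, unlike a purely client-side aggregate; the partitioning and the (sum, count) accumulator are illustrative assumptions, not Mahout code.]

```java
import java.util.Arrays;
import java.util.List;

// rdd.aggregate(zero)(seqOp, combOp) folds within each partition using
// seqOp, then merges per-partition results with combOp, so the reduction
// can run on the executors rather than on the client.
public class AggregateSketch {
    // accumulator is a (sum, count) pair, held in a double[2]
    static double[] seqOp(double[] acc, double x) {
        return new double[] { acc[0] + x, acc[1] + 1 };
    }

    static double[] combOp(double[] a, double[] b) {
        return new double[] { a[0] + b[0], a[1] + b[1] };
    }

    public static void main(String[] args) {
        // pretend each inner list is one RDD partition
        List<List<Double>> partitions = Arrays.asList(
            Arrays.asList(1.0, 2.0, 3.0),
            Arrays.asList(4.0, 5.0),
            Arrays.asList(6.0, 7.0, 8.0, 9.0, 10.0));

        double[] acc = new double[] { 0.0, 0.0 };
        for (List<Double> partition : partitions) {
            double[] partial = new double[] { 0.0, 0.0 };
            for (double x : partition) partial = seqOp(partial, x); // per-partition fold
            acc = combOp(acc, partial);                             // cross-partition merge
        }
        System.out.println(acc[0] / acc[1]); // prints the mean of 1..10, i.e. 5.5
    }
}
```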
>
> On Tue, Nov 11, 2014 at 12:41 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
>> On Sat, Nov 8, 2014 at 12:42 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Based on Zinkevich et al.'s Parallelized Stochastic Gradient paper (
>>> http://martin.zinkevich.org/publications/nips2010.pdf), I tried to
>>> implement SGD, and a regularized least squares solution for linear
>>> regression (it can easily be extended to other GLMs, too).
>>>
>>> The algorithm works as follows:
>>> 1. Split the data into partitions of T examples
>>> 2. In parallel, for each partition:
>>>    2.0. Shuffle the partition
>>>    2.1. Initialize the parameter vector
>>>    2.2. For each example in the shuffled partition:
>>>         2.2.1. Update the parameter vector
>>> 3. Aggregate all the parameter vectors and return
>>
>> I guess technically it is possible (transform each block to a
>> SparseRowMatrix or SparseMatrix with only the first row valid) and then
>> invoke colSums() or colMeans() (whatever "aggregate" means here).
>>
>> However, I am not sure it is worth the ugliness. Isn't it easier to
>> declare these things quasi-algebraic and just do direct Spark calls on
>> the matrix RDD (map, aggregate)?
>>
>> The real danger is introducing non-algebra things into algebra, so that
>> the rest of the algebra no longer optimizes.
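[Editor's note: the quoted steps 1-3 can be sketched in plain Java, with no Mahout or Spark types. Each partition runs an independent SGD pass for squared loss, and step 3 averages the per-partition parameter vectors. The learning rate, the zero initialization, and the toy data are illustrative assumptions, not part of the actual patch, and regularization is omitted.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ParallelSgdSketch {
    static final double ETA = 0.05; // assumed fixed learning rate

    // Steps 2.0-2.2: one sequential SGD pass over a partition.
    // Each example is a double[] whose last entry is the target y.
    static double[] sgdPass(List<double[]> examples) {
        int d = examples.get(0).length - 1;
        double[] w = new double[d];      // step 2.1: initialize to zeros
        Collections.shuffle(examples);   // step 2.0: shuffle the partition
        for (double[] ex : examples) {   // step 2.2: per-example update
            double y = ex[d];
            double pred = 0.0;
            for (int j = 0; j < d; j++) pred += w[j] * ex[j];
            double err = pred - y;       // squared-loss gradient is err * x
            for (int j = 0; j < d; j++) w[j] -= ETA * err * ex[j];
        }
        return w;
    }

    // Step 3: average the per-partition parameter vectors.
    static double[] average(List<double[]> ws) {
        double[] out = new double[ws.get(0).length];
        for (double[] w : ws)
            for (int j = 0; j < out.length; j++) out[j] += w[j] / ws.size();
        return out;
    }

    public static void main(String[] args) {
        // toy data for y = 2*x; rows are {x, y}, split into two "partitions"
        List<double[]> p1 = new ArrayList<>(Arrays.asList(
            new double[]{1, 2}, new double[]{2, 4}));
        List<double[]> p2 = new ArrayList<>(Arrays.asList(
            new double[]{3, 6}, new double[]{4, 8}));
        double[] w = average(Arrays.asList(sgdPass(p1), sgdPass(p2)));
        // a single pass moves w[0] from 0 toward the true slope 2.0
        System.out.println(Arrays.toString(w));
    }
}
```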