So the alternatives are:

1- mapBlock each block to a matrix in which all rows but the first are empty, then aggregate (a rough sketch follows below)
2- depend on a backend
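For concreteness, here is a rough sketch of what 1 could look like with the existing mapBlock and colSums() in math-scala. This is only an illustration, not what is in the branch linked below; parallelSGD is a made-up name, and localSGD, which would run the sequential per-partition updates, is a made-up placeholder:

    import org.apache.mahout.math._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // made-up placeholder: sequential SGD over the rows of one block,
    // returning that partition's parameter vector of length k
    def localSGD(block: Matrix, k: Int): Vector = ???

    def parallelSGD(drmData: DrmLike[Int], k: Int): Vector = {
      // map every block to a same-height sparse matrix whose rows are all
      // empty except the first, which holds that partition's parameter
      // vector plus a trailing 1 used to count contributing blocks
      val drmParams = drmData.mapBlock(ncol = k + 1) { case (keys, block) =>
        val out = new SparseRowMatrix(block.nrow, k + 1)
        val w = localSGD(block, k)
        for (j <- 0 until k) out(0, j) = w(j)
        out(0, k) = 1.0
        keys -> out
      }
      // a single (k+1)-vector comes back to the client: the sum of the
      // per-partition parameter vectors and the number of blocks
      val sums = drmParams.colSums()
      val wAvg = new DenseVector(k)
      for (j <- 0 until k) wAvg(j) = sums(j) / sums(k)
      wAvg
    }

Since every block contributes exactly one non-empty row, colSums() over the mapped matrix is just the sum of the per-partition parameter vectors, and the extra trailing column counts the blocks so the client can take the average that step 3 of the algorithm asks for, all without leaving the algebra.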
Option 1 is obviously OK. I don't like the idea of depending on a backend,
since SGD is a generic loss minimization on which other algorithms will
possibly depend. In this context, client-side aggregation is not an
overhead, but even if it happens to be, it doesn't have to be a client-side
aggregate at all.

As an alternative to 1, I am thinking of at least having an aggregation
operation, which will return an accumulated value anyway and shouldn't
affect algebra optimizations. I quickly implemented a naive one (supporting
only Spark; I know I said that I don't like depending on a backend, but at
least the interface is consistent across backends, and as a client I still
don't have to deal with Spark primitives directly).

Is this nice enough? Is it too bad to have in the DSL?

https://github.com/gcapan/mahout/compare/accumulateblocks

Best

Gokhan

On Tue, Nov 11, 2014 at 10:45 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> Oh. The algorithm actually collects the vectors and runs another cycle in
> the client!
>
> Still, technically, you can collect almost-empty blocks to the client
> (since they are mostly empty, it won't cause THAT huge an overhead
> compared to collecting single vectors; after all, how many partitions are
> we talking about? 1000?).
>
> On Tue, Nov 11, 2014 at 12:41 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
>> On Sat, Nov 8, 2014 at 12:42 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Based on Zinkevich et al.'s Parallelized Stochastic Gradient Descent
>>> paper (http://martin.zinkevich.org/publications/nips2010.pdf), I tried
>>> to implement SGD, and a regularized least squares solution for linear
>>> regression (which can easily be extended to other GLMs, too).
>>>
>>> How the algorithm works is as follows:
>>> 1. Split the data into partitions of T examples
>>> 2. In parallel, for each partition:
>>>    2.0. shuffle the partition
>>>    2.1. initialize the parameter vector
>>>    2.2. for each example in the shuffled partition:
>>>         2.2.1. update the parameter vector
>>> 3. Aggregate all the parameter vectors and return
>>
>> I guess technically it is possible (transform each block to a
>> SparseRowMatrix or SparseMatrix with only the first row valid, then
>> invoke colSums() or colMeans(), whatever "aggregate" means here).
>>
>> However, I am not sure it is worth the ugliness. Isn't it easier to
>> declare these things quasi-algebraic and just do direct Spark calls on
>> the matrix RDD (map, aggregate)?
>>
>> The real danger is introducing non-algebra things into the algebra, so
>> that the rest of the algebra doesn't optimize any more.
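For reference, here is a rough sketch of the sequential per-partition loop (steps 2.0-2.2 quoted above) for squared loss with L2 regularization. Again, this is only an illustration, not the code in the branch; the step size eta, the regularization weight lambda, and the convention that the last column of a block holds the target are all assumptions:

    import org.apache.mahout.math._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import scala.util.Random

    // one pass of sequential SGD over a single partition's block;
    // columns 0..k-1 are assumed to be features, column k the target
    def localSGD(block: Matrix, k: Int,
                 eta: Double = 0.01, lambda: Double = 1e-3): Vector = {
      val w = new DenseVector(k)                              // 2.1 initialize
      val order = Random.shuffle((0 until block.nrow).toList) // 2.0 shuffle
      for (i <- order) {                                      // 2.2 each example
        val y = block(i, k)
        var err = -y
        for (j <- 0 until k) err += w(j) * block(i, j)        // prediction error
        // 2.2.1 gradient step for (1/2)(w.x - y)^2 + (lambda/2)|w|^2
        for (j <- 0 until k) w(j) -= eta * (err * block(i, j) + lambda * w(j))
      }
      w
    }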