Do you need a reduce, or could you use an accumulator? Neither is really supported in the DSL, but clearly these are required for certain algos. Broadcast vals are supported, but they are read-only.
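For reference, a minimal sketch of how a read-only broadcast value is consumed inside mapBlock today. drmX and w0 are hypothetical placeholders, and the drmBroadcast/mapBlock usage is assumed to follow the math-scala bindings, with an implicit DistributedContext in scope:

import org.apache.mahout.math._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// Hypothetical inputs: drmX is a DrmLike[Int] of examples, w0 an n-dimensional Vector.
val wBcast = drmBroadcast(w0)                      // read-only; visible inside every block

val scores = drmX.mapBlock(ncol = 1) { case (keys, block) =>
  val w = wBcast.value                             // each block reads the broadcast copy...
  val out = new DenseMatrix(block.numRows(), 1)
  for (r <- 0 until block.numRows())
    out.setQuick(r, 0, block.viewRow(r).dot(w))    // ...but writes to w are never propagated back
  keys -> out
}

Note that only ncol may change here; the emitted block must have the same number of rows as the input block, which is the constraint Gokhan runs into below.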
On Nov 8, 2014, at 12:42 PM, Gokhan Capan <gkhn...@gmail.com> wrote:

Hi,

Based on Zinkevich et al.'s Parallelized Stochastic Gradient Descent paper (http://martin.zinkevich.org/publications/nips2010.pdf), I tried to implement SGD and a regularized least-squares solution for linear regression (which can easily be extended to other GLMs, too). The algorithm works as follows:

1. Split the data into partitions of T examples
2. In parallel, for each partition:
   2.0. Shuffle the partition
   2.1. Initialize the parameter vector
   2.2. For each example in the shuffled partition:
        2.2.1. Update the parameter vector
3. Aggregate all the parameter vectors and return

Here is an initial implementation to illustrate where I am stuck:
https://github.com/gcapan/mahout/compare/optimization
(See the TODO in SGD.minimizeWithSgd[K])

I was thinking that with a blockified matrix of training instances, step 2 of the algorithm could run on blocks, and the results could be aggregated on the client side. However, the only operator I know of in the DSL is mapBlock, and it requires the BlockMapFunction to map a block to another block with the same number of rows. In this context, I want to map a block (numRows x n) to a parameter vector of size n.

The questions are:
1. Is it possible to easily implement the above algorithm using the DSL's current functionality? Could you tell me what I'm missing?
2. If there is no easier way than the currently non-existent mapBlock-like method, shall we add such an operator?

Best,
Gokhan
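A possible sketch of what steps 2.0-2.2.1 could look like on one in-core block, plus one workaround for step 3 using only the existing mapBlock: write the locally learned vector into row 0 of a same-shaped, otherwise-zero block and sum the rows on the driver. The names localSgd and drmX, the learning rate eta, the column-0-is-the-target layout, and the squared-loss update are illustrative assumptions, not what is in the linked branch:

import scala.util.Random
import org.apache.mahout.math._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// Steps 2.0-2.2.1 on a single block. Assumes column 0 holds the target y and
// columns 1..d-1 the features; squared loss with a fixed learning rate eta.
// Entry 0 of the returned vector is left at zero so the result has the same
// width as the block and can be written back into one of its rows.
def localSgd(block: Matrix, eta: Double = 0.01): Vector = {
  val d = block.numCols()
  val w = new DenseVector(d)                                      // 2.1 initialize parameters
  for (r <- Random.shuffle((0 until block.numRows()).toList)) {   // 2.0 shuffle the partition
    val row = block.viewRow(r)
    var err = -row.get(0)                                         // err = x . w - y
    for (j <- 1 until d) err += row.get(j) * w.get(j)
    for (j <- 1 until d)                                          // 2.2.1 gradient step
      w.setQuick(j, w.get(j) - eta * err * row.get(j))
  }
  w
}

// Step 3 as a workaround with the current DSL: emit a block of the same shape
// whose row 0 carries the local parameter vector (other rows zero), then sum
// the rows on the driver. Averaging still needs the partition count, which is
// exactly why a block -> vector (mapBlock-like) operator would be cleaner.
// drmX is a hypothetical DrmLike[Int] of training instances.
val partials = drmX.mapBlock() { case (keys, block) =>
  val out = block.like()                                          // zero block, same shape
  out.assignRow(0, localSgd(block))
  keys -> out
}
val wSum = partials.colSums()                                     // sum of per-block vectors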