Well, in that specific case, I will accumulate on the client side; the collection of intermediate parameters is not that big (numBlocks x X.ncol). What I need is just a way to map (keys, block) to a vector (currently, mapBlock has to map the block to a new block).
From a general perspective, you are right, this is an accumulation.

Gokhan

On Mon, Nov 10, 2014 at 8:26 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Do you need a reduce or could you use an accumulator? Either is not really
> supported in the DSL but clearly these are required for certain algos.
> Broadcast vals supported but are read only.
>
> On Nov 8, 2014, at 12:42 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
>
> Hi,
>
> Based on Zinkevich et al.'s Parallelized Stochastic Gradient paper (
> http://martin.zinkevich.org/publications/nips2010.pdf), I tried to
> implement SGD, and a regularized least squares solution for linear
> regression (can easily be extended to other GLMs, too).
>
> How the algorithm works is as follows:
> 1. Split data into partitions of T examples
> 2. in parallel, for each partition:
>    2.0. shuffle partition
>    2.1. initialize parameter vector
>    2.2. for each example in the shuffled partition
>         2.2.1 update the parameter vector
> 3. Aggregate all the parameter vectors and return
>
> Here is an initial implementation to illustrate where I am stuck:
> https://github.com/gcapan/mahout/compare/optimization
>
> (See TODO in SGD.minimizeWithSgd[K])
>
> I was thinking that using a blockified matrix of training instances, step 2
> of the algorithm can run on blocks, and they can be aggregated in
> client-side. However, the only operator that I know in the DSL is mapBlock,
> and it requires the BlockMapFunction to map a block to another block of the
> same row size. In this context, I want to map a block (numRows x n) to the
> parameter vector of size n.
>
> The question is:
> 1- Is it possible to easily implement the above algorithm using DSL's
> current functionality? Could you tell me what I'm missing?
> 2- If there is not an easy way other than using the currently-non-existing
> mapBlock-like method, shall we add such an operator?
>
> Best,
>
> Gokhan
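
For illustration only, the three steps of the algorithm quoted above can be sketched in plain Python. This is not the Mahout DSL and not the code in the linked branch. The names sgd_on_block and parallel_sgd, and the parameters step, lam, block_size and seed, are hypothetical. The sketch also assumes a squared-error loss with an L2 penalty, a constant step size, zero initialization, and plain averaging as the aggregation in step 3. The point it tries to show is the shape of the per-block function: it takes a block of numRows rows and returns a single vector of size n, and the client then combines numBlocks such vectors.

```python
import random


def sgd_on_block(block, n, step=0.5, lam=0.0, seed=0):
    """Map one block of (features, target) rows to a parameter vector of size n."""
    rng = random.Random(seed)
    rows = list(block)
    rng.shuffle(rows)                      # step 2.0: shuffle the partition
    w = [0.0] * n                          # step 2.1: initialize the parameters
    for x, y in rows:                      # step 2.2: one update per example
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        # Gradient of 0.5 * err^2 + 0.5 * lam * ||w||^2 with respect to w.
        w = [wi - step * (err * xi + lam * wi) for wi, xi in zip(w, x)]
    return w


def parallel_sgd(data, n, block_size, **kwargs):
    """Split data into blocks, run SGD per block, average the results."""
    # Step 1: split the data into partitions of block_size examples.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Step 2: the calls are independent, so this loop stands in for the
    # parallel map over blocks.
    params = [sgd_on_block(b, n, seed=k, **kwargs) for k, b in enumerate(blocks)]
    # Step 3: client-side aggregation of numBlocks vectors of size n.
    return [sum(col) / len(params) for col in zip(*params)]


if __name__ == "__main__":
    gen = random.Random(42)
    xs = [[gen.uniform(-1, 1), gen.uniform(-1, 1)] for _ in range(400)]
    data = [(x, 2.0 * x[0] - 1.0 * x[1]) for x in xs]
    print(parallel_sgd(data, n=2, block_size=100))
```

The intermediate result held on the client is only numBlocks x n values, which is why accumulating there is cheap in this case.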