Do you need a reduce, or could you use an accumulator? Neither is really supported in the DSL, but clearly these are required for certain algos. Broadcast vals are supported, but they are read-only.
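For reference, a minimal sketch of how a read-only broadcast value is consumed inside mapBlock today. drmX and w0 are hypothetical placeholders, and the drmBroadcast/mapBlock usage is assumed to follow the math-scala bindings, with an implicit DistributedContext in scope:

import org.apache.mahout.math._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// Hypothetical inputs: drmX is a DrmLike[Int] of examples, w0 an n-dimensional Vector.
val wBcast = drmBroadcast(w0)                      // read-only; visible inside every block

val scores = drmX.mapBlock(ncol = 1) { case (keys, block) =>
  val w = wBcast.value                             // each block reads the broadcast copy...
  val out = new DenseMatrix(block.numRows(), 1)
  for (r <- 0 until block.numRows())
    out.setQuick(r, 0, block.viewRow(r).dot(w))    // ...but writes to w are never propagated back
  keys -> out
}

Note that only ncol may change here; the emitted block must have the same number of rows as the input block, which is the constraint Gokhan runs into below.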
On Nov 8, 2014, at 12:42 PM, Gokhan Capan <gkhn...@gmail.com> wrote:

Hi,

Based on Zinkevich et al.'s Parallelized Stochastic Gradient Descent paper (http://martin.zinkevich.org/publications/nips2010.pdf), I tried to implement SGD and a regularized least-squares solution for linear regression (which can easily be extended to other GLMs, too). The algorithm works as follows:

1. Split the data into partitions of T examples
2. In parallel, for each partition:
   2.0. Shuffle the partition
   2.1. Initialize the parameter vector
   2.2. For each example in the shuffled partition:
        2.2.1. Update the parameter vector
3. Aggregate all the parameter vectors and return

Here is an initial implementation to illustrate where I am stuck:
https://github.com/gcapan/mahout/compare/optimization
(See the TODO in SGD.minimizeWithSgd[K])

I was thinking that with a blockified matrix of training instances, step 2 of the algorithm could run on blocks, and the results could be aggregated on the client side. However, the only operator I know of in the DSL is mapBlock, and it requires the BlockMapFunction to map a block to another block with the same number of rows. In this context, I want to map a block (numRows x n) to a parameter vector of size n.

The questions are:
1. Is it possible to easily implement the above algorithm using the DSL's current functionality? Could you tell me what I'm missing?
2. If there is no easier way than the currently non-existent mapBlock-like method, shall we add such an operator?

Best,
Gokhan
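A possible sketch of what steps 2.0-2.2.1 could look like on one in-core block, plus one workaround for step 3 using only the existing mapBlock: write the locally learned vector into row 0 of a same-shaped, otherwise-zero block and sum the rows on the driver. The names localSgd and drmX, the learning rate eta, the column-0-is-the-target layout, and the squared-loss update are illustrative assumptions, not what is in the linked branch:

import scala.util.Random
import org.apache.mahout.math._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// Steps 2.0-2.2.1 on a single block. Assumes column 0 holds the target y and
// columns 1..d-1 the features; squared loss with a fixed learning rate eta.
// Entry 0 of the returned vector is left at zero so the result has the same
// width as the block and can be written back into one of its rows.
def localSgd(block: Matrix, eta: Double = 0.01): Vector = {
  val d = block.numCols()
  val w = new DenseVector(d)                                      // 2.1 initialize parameters
  for (r <- Random.shuffle((0 until block.numRows()).toList)) {   // 2.0 shuffle the partition
    val row = block.viewRow(r)
    var err = -row.get(0)                                         // err = x . w - y
    for (j <- 1 until d) err += row.get(j) * w.get(j)
    for (j <- 1 until d)                                          // 2.2.1 gradient step
      w.setQuick(j, w.get(j) - eta * err * row.get(j))
  }
  w
}

// Step 3 as a workaround with the current DSL: emit a block of the same shape
// whose row 0 carries the local parameter vector (other rows zero), then sum
// the rows on the driver. Averaging still needs the partition count, which is
// exactly why a block -> vector (mapBlock-like) operator would be cleaner.
// drmX is a hypothetical DrmLike[Int] of training instances.
val partials = drmX.mapBlock() { case (keys, block) =>
  val out = block.like()                                          // zero block, same shape
  out.assignRow(0, localSgd(block))
  keys -> out
}
val wSum = partials.colSums()                                     // sum of per-block vectors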