Hi,
Based on Zinkevich et al.'s Parallelized Stochastic Gradient Descent paper
(http://martin.zinkevich.org/publications/nips2010.pdf), I tried to
implement SGD and a regularized least squares solution for linear
regression (it can easily be extended to other GLMs, too).
The algorithm works as follows (a rough per-partition sketch follows the list):
1. Split the data into partitions of T examples
2. In parallel, for each partition:
   2.1. Shuffle the partition
   2.2. Initialize the parameter vector
   2.3. For each example in the shuffled partition:
        2.3.1. Update the parameter vector
3. Aggregate all the parameter vectors and return the result
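
To make step 2 concrete, here is a minimal per-partition sketch for
L2-regularized least squares against Mahout's in-core scala bindings.
The names sgdOnBlock, eta (learning rate) and lambda (regularization
weight) are just placeholders, and I assume the last column of each
block holds the target:

import org.apache.mahout.math.{DenseVector, Matrix, Vector}
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import scala.util.Random

// Per-partition SGD for L2-regularized least squares (step 2 above).
// Assumes the last column of `block` is the target.
def sgdOnBlock(block: Matrix, eta: Double, lambda: Double): Vector = {
  val n = block.ncol - 1                                    // number of features
  val order = Random.shuffle((0 until block.nrow).toList)   // 2.1: shuffle the partition
  var theta: Vector = new DenseVector(n)                    // 2.2: initialize the parameter vector
  for (r <- order) {                                        // 2.3: one pass over the examples
    val row = block(r, ::)
    val x = row.viewPart(0, n)                              // features
    val y = row.get(n)                                      // target
    val err = (theta dot x) - y
    theta = theta - (x * err + theta * lambda) * eta        // 2.3.1: gradient step
  }
  theta
}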
Here is an initial implementation to illustrate where I am stuck:
https://github.com/gcapan/mahout/compare/optimization
(see the TODO in SGD.minimizeWithSgd[K])
I was thinking that, given a blockified matrix of training instances, step 2
of the algorithm could run on the blocks, and the per-block results could
then be aggregated on the client side. However, the only operator I know of
in the DSL is mapBlock, and it requires the BlockMapFunction to map a block
to another block with the same number of rows. In this context, I want to
map a block (numRows x n) to the parameter vector of size n; a possible
workaround is sketched below.
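
One workaround I considered, though I am not sure it is the intended use of
mapBlock: keep the same-row-count contract by writing each block's learned
parameter vector into every row of the output block, then average on the
client side. This reuses the sgdOnBlock sketch above; minimizeOverBlocks and
the last-column-target layout are my own placeholders:

import scala.reflect.ClassTag
import org.apache.mahout.math.Vector
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

// drmX: blockified training matrix, last column = target (same assumption as above).
def minimizeOverBlocks[K: ClassTag](drmX: DrmLike[K], eta: Double, lambda: Double): Vector = {
  val m = drmX.ncol - 1                                     // length of the parameter vector
  val thetaDrm = drmX.mapBlock(ncol = m) {
    case (keys, block) =>
      val theta = sgdOnBlock(block, eta, lambda)            // step 2 on this block
      val out = block.like(block.nrow, m)                   // zero block with the same row count
      for (r <- 0 until out.nrow) out(r, ::) := theta       // every row carries this block's theta
      keys -> out
  }
  // Step 3 on the client: column means of the collected matrix average the
  // per-partition vectors, implicitly weighted by partition size.
  thetaDrm.collect.colMeans()
}

This keeps mapBlock's contract, but it materializes a full (numRows x m)
block just to carry one vector per partition, which is what prompted
question 2 below.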
My questions are:
1. Is it possible to implement the above algorithm easily with the DSL's
current functionality? If so, could you tell me what I'm missing?
2. If there is no easy way other than the currently non-existent
mapBlock-like method, shall we add such an operator? (A rough sketch of what
I have in mind follows.)
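
To illustrate what I mean by a mapBlock-like operator, here is a rough
interface sketch; the names are placeholders, not an existing API:

import org.apache.mahout.math.{Matrix, Vector}
import org.apache.mahout.math.drm.DrmLike

trait BlockAggregateOps[K] {
  // Map each (numRows x n) block to a single vector of size n and bring the
  // per-block vectors to the client...
  def mapBlockToVector(drm: DrmLike[K])(bmf: Matrix => Vector): Seq[Vector]
  // ...or fold them directly into one result with a user-supplied aggregator.
  def aggregateBlocks(drm: DrmLike[K])(bmf: Matrix => Vector)(agg: (Vector, Vector) => Vector): Vector
}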
Best,
Gokhan