Yeah - I don't think this can be easily and efficiently implemented in the DSL
as is.  You'd have to iteratively rowBind all of the vectors returned by
minimizePartial(...) onto a (k x n) matrix within your mapBlock(Matrix(m x n))
bmf, where k is the number of blocks and m is the total number of
observations.  mapBlock(...) requires that you return a matrix with the same
number of rows, so with k != m this is difficult to do in a straightforward way.
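One common trick around the same-row-count constraint is to keep the block's
shape and smuggle the partial parameter vector out in row 0, zeroing the other
rows; summing all rows client-side then recovers the k partial vectors. A
minimal sketch in plain Scala (not the actual Mahout DSL; minimizePartial here
is a stand-in that just computes column means):

```scala
object MapBlockPadding {
  type Matrix = Array[Array[Double]]

  // Stand-in for a per-block solver: here simply the column means of the block.
  def minimizePartial(block: Matrix): Array[Double] = {
    val n = block(0).length
    val sums = new Array[Double](n)
    for (row <- block; j <- 0 until n) sums(j) += row(j)
    sums.map(_ / block.length)
  }

  // Emulates the mapBlock constraint: output must have the same number of rows.
  def mapBlockSameRows(block: Matrix): Matrix = {
    val out = Array.fill(block.length, block(0).length)(0.0)
    out(0) = minimizePartial(block) // partial parameter vector hides in row 0
    out
  }

  // Client side: sum the rows of every mapped block (only row 0 is nonzero),
  // then average over the k blocks.
  def aggregate(blocks: Seq[Matrix]): Array[Double] = {
    val n = blocks.head(0).length
    val acc = new Array[Double](n)
    for (b <- blocks.map(mapBlockSameRows); row <- b; j <- 0 until n)
      acc(j) += row(j)
    acc.map(_ / blocks.length)
  }
}
```

This wastes the zeroed rows, of course, which is exactly why a mapBlock-like
operator returning a vector per block would be cleaner.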

What about implementing it sequentially in the pure DSL, then extending and
overriding the higher-order function (the one that calculates the bVector at
each iteration) in the spark module, using the Spark operations as Pat
suggested?
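To illustrate what that sequential pure implementation might look like, here is
a hedged sketch in plain Scala of the Zinkevich-style recipe from the quoted
message below (shuffle each partition, run SGD per partition, average the
resulting parameter vectors). The names (minimize, sgdPass) are illustrative,
not an existing Mahout API; an engine-specific module would override minimize
to run the per-partition loop distributed:

```scala
import scala.util.Random

object ParallelSgdSketch {
  type Example = (Array[Double], Double) // (features x, target y)

  // Steps 2.0-2.2: shuffle one partition, init w, and run SGD passes for
  // least squares (update: w -= eta * (w.x - y) * x).
  def sgdPass(part: Seq[Example], eta: Double, epochs: Int): Array[Double] = {
    val n = part.head._1.length
    val w = new Array[Double](n)                 // 2.1: init parameter vector
    val shuffled = Random.shuffle(part.toVector) // 2.0: shuffle partition
    for (_ <- 0 until epochs; (x, y) <- shuffled) {
      val err = (0 until n).map(j => w(j) * x(j)).sum - y
      for (j <- 0 until n) w(j) -= eta * err * x(j) // 2.2.1: update
    }
    w
  }

  // Steps 1 and 3: split into roughly k partitions, solve each sequentially,
  // then average the k parameter vectors.
  def minimize(data: Seq[Example], k: Int,
               eta: Double, epochs: Int): Array[Double] = {
    val parts = data.grouped(math.max(1, data.length / k)).toSeq
    val vectors = parts.map(sgdPass(_, eta, epochs))
    val n = vectors.head.length
    (0 until n).map(j => vectors.map(_(j)).sum / vectors.length).toArray
  }
}
```

The only piece that needs the engine is the parts.map(...) over partitions;
everything else stays in the common layer.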

> Subject: Re: SGD Implementation and Questions for mapBlock like functionality
> From: p...@occamsmachete.com
> Date: Tue, 11 Nov 2014 09:54:52 -0800
> To: dev@mahout.apache.org
> 
> Still not sure what you need but if mapBlock and broadcast vals aren’t enough 
> you’ll have to look at Spark’s available operations like join, reduce, etc. 
> As well as the Spark accumulators. None of these have been made generic 
> enough for the DSL yet AFAIK. I use accumulators in Spark specific code but 
> that doesn’t need to be reflected in the DSL. You’ll have to decide if the 
> new ops you need are worth putting in the DSL or just leaving in your 
> engine-specific implementation.
>  
> On Nov 10, 2014, at 10:47 AM, Gokhan Capan <gkhn...@gmail.com> wrote:
> 
> Well, in that specific case, I will accumulate in the client side,
> collection of the intermediate parameters is not that big (numBlocks x
> X.ncol). What I need is just mapping (keys, block) to a vector (currently,
> a mapBlock has to map the block to the new block)
> 
> From a general perspective, you are right, this is an accumulation.
> 
> Gokhan
> 
> On Mon, Nov 10, 2014 at 8:26 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> 
> > Do you need a reduce or could you use an accumulator? Either is not really
> > supported in the DSL but clearly these are required for certain algos.
> > Broadcast vals supported but are read only.
> > 
> > On Nov 8, 2014, at 12:42 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
> > 
> > Hi,
> > 
> > Based on Zinkevich et al.'s Parallelized Stochastic Gradient paper (
> > http://martin.zinkevich.org/publications/nips2010.pdf), I tried to
> > implement SGD, and a regularized least squares solution for linear
> > regression (can easily be extended to other GLMs, too).
> > 
> > How the algorithm works is as follows:
> > 1. Split data into partitions of T examples
> > 2. in parallel, for each partition:
> >  2.0. shuffle partition
> >  2.1. initialize parameter vector
> >  2.2. for each example in the shuffled partition
> >      2.2.1 update the parameter vector
> > 3. Aggregate all the parameter vectors and return
> > 
> > Here is an initial implementation to illustrate where I am stuck:
> > https://github.com/gcapan/mahout/compare/optimization
> > 
> > (See TODO in SGD.minimizeWithSgd[K])
> > 
> > I was thinking that using a blockified matrix of training instances, step 2
> > of the algorithm can run on blocks, and they can be aggregated in
> > client-side. However, the only operator that I know in the DSL is mapBlock,
> > and it requires the BlockMapFunction to map a block to another block of the
> > same row size. In this context, I want to map a block (numRows x n) to the
> > parameter vector of size n.
> > 
> > The question is:
> > 1- Is it possible to easily implement the above algorithm using DSL's
> > current functionality? Could you tell me what I'm missing?
> > 2- If there is not an easy way other than using the currently-non-existing
> > mapBlock-like method, shall we add such an operator?
> > 
> > Best,
> > 
> > Gokhan
> > 
> > 
> 