Still not sure what you need, but if mapBlock and broadcast vals aren't enough,
you'll have to look at Spark's available operations (join, reduce, etc.), as
well as Spark's accumulators. None of these has been made generic enough for
the DSL yet, AFAIK. I use accumulators in Spark-specific code, but that doesn't
need to be reflected in the DSL. You'll have to decide whether the new ops you
need are worth putting in the DSL or just leaving in your engine-specific
implementation.
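
For illustration, a minimal sketch of the engine-specific accumulator use I
mean, in plain Spark (sc is an existing SparkContext and lines an
RDD[String]; illustrative names, nothing from Mahout):

  val badLines = sc.accumulator(0, "bad lines")
  val parsed = lines.flatMap { s =>
    try Some(s.toDouble)
    catch { case _: NumberFormatException => badLines += 1; None }
  }
  parsed.count()          // an action forces evaluation
  println(badLines.value) // read the total on the driver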
 
On Nov 10, 2014, at 10:47 AM, Gokhan Capan <gkhn...@gmail.com> wrote:

Well, in that specific case I will accumulate on the client side; the
collection of intermediate parameters is not that big (numBlocks x
X.ncol). What I need is just a way to map a (keys, block) pair to a vector
(currently, mapBlock has to map the block to another block).

From a general perspective, you are right, this is an accumulation.

Gokhan

On Mon, Nov 10, 2014 at 8:26 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Do you need a reduce, or could you use an accumulator? Neither is really
> supported in the DSL, but clearly these are required for certain algos.
> Broadcast vals are supported but are read-only.
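> 
> The read-only pattern, as a minimal sketch (assuming the usual math-scala
> imports, a DRM drmX, and an in-core Vector v on the driver):
> 
>   val bcastV = drmBroadcast(v)
>   val centered = drmX.mapBlock() { case (keys, block) =>
>     // workers can read bcastV but cannot write back to it
>     for (r <- 0 until block.nrow) block(r, ::) -= bcastV.value
>     keys -> block
>   }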
> 
> On Nov 8, 2014, at 12:42 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
> 
> Hi,
> 
> Based on Zinkevich et al.'s Parallelized Stochastic Gradient Descent paper
> (http://martin.zinkevich.org/publications/nips2010.pdf), I tried to
> implement SGD and a regularized least squares solution for linear
> regression (it can easily be extended to other GLMs, too).
> 
> The algorithm works as follows:
> 1. Split data into partitions of T examples
> 2. in parallel, for each partition:
>  2.0. shuffle partition
>  2.1. initialize parameter vector
>  2.2. for each example in the shuffled partition
>      2.2.1 update the parameter vector
> 3. Aggregate all the parameter vectors and return (see the sketch below)
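> 
> Here is a minimal local sketch of steps 1-3 (plain Scala, squared loss,
> no regularization; eta is a hypothetical learning rate):
> 
>   def sgdOnPartition(part: Seq[(Array[Double], Double)], n: Int,
>                      eta: Double): Array[Double] = {
>     val w = new Array[Double](n)                      // 2.1 initialize
>     for ((x, y) <- scala.util.Random.shuffle(part)) { // 2.0 + 2.2
>       val err = (0 until n).map(i => w(i) * x(i)).sum - y // prediction error
>       for (i <- 0 until n) w(i) -= eta * err * x(i)   // 2.2.1 update
>     }
>     w
>   }
> 
>   // 3. aggregate: average the per-partition parameter vectors
>   def aggregate(ws: Seq[Array[Double]]): Array[Double] =
>     ws.transpose.map(_.sum / ws.size).toArray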
> 
> Here is an initial implementation to illustrate where I am stuck:
> https://github.com/gcapan/mahout/compare/optimization
> 
> (See TODO in SGD.minimizeWithSgd[K])
> 
> I was thinking that, with a blockified matrix of training instances, step 2
> of the algorithm could run on blocks, with the results aggregated on the
> client side. However, the only operator I know of in the DSL is mapBlock,
> and it requires the BlockMapFunction to map a block to another block of the
> same row size. In this context, I want to map a block (numRows x n) to a
> parameter vector of size n.
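> 
> One possible workaround with today's mapBlock (just a sketch, assuming the
> usual math-scala imports; sgdOnBlock is a hypothetical per-block SGD that
> returns the parameter Vector of size n): keep the block's row count, but
> return a zero block carrying the per-block parameter vector in row 0.
> 
>   import org.apache.mahout.math.DenseMatrix
> 
>   val drmW = drmX.mapBlock() { case (keys, block) =>
>     val w = sgdOnBlock(block)
>     val out = new DenseMatrix(block.nrow, block.ncol) // all zeros
>     out(0, ::) := w                                   // park w in row 0
>     keys -> out
>   }
> 
>   // each block contributes exactly one non-zero row, so a client-side
>   // column sum recovers the sum of the per-block parameter vectors
>   val wSum = drmW.colSums()
> 
> It works, but it ships a full zero block per partition just to carry one
> vector, hence the questions below.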
> 
> The questions are:
> 1- Is it possible to implement the above algorithm easily with the DSL's
> current functionality? If so, what am I missing?
> 2- If there is no easy way short of the currently non-existent
> mapBlock-like method, shall we add such an operator? (A hypothetical
> signature is sketched below.)
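> 
> Roughly what I have in mind, as a purely hypothetical signature (the name
> and placement are up for discussion): a block-wise map that produces one
> in-core Vector per block, plus a combiner, instead of a new DRM:
> 
>   import org.apache.mahout.math.{Matrix, Vector}
>   import org.apache.mahout.math.drm.DrmLike
> 
>   trait BlockAggregationProposal {
>     def aggregateBlocks[K](drm: DrmLike[K])
>         (bmf: (Array[K], Matrix) => Vector)
>         (combine: (Vector, Vector) => Vector): Vector
>   }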
> 
> Best,
> 
> Gokhan
> 
> 
