So you are following #2, which is good. #1 seems a bit like a hack. For a long
time to come we will have to add things to the DSL if it is to be kept
engine-independent. Yours looks pretty general and simple.

Are you familiar with the existing Mahout aggregate methods? They show up in
SGDHelper.java and other places in the legacy code. I don't know much about
them, but they seem to be a pre-functional-programming attempt at this kind of
thing. It looks like you are proposing a replacement for those based on
rdd.aggregate; if so, that would be very useful. For one thing, the old
aggregate does not appear to be parallel, while rdd.aggregate is.
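For reference, rdd.aggregate combines a per-partition sequential op with an associative/commutative merge op, so partial parameter vectors get combined on the executors rather than serially on the client. A minimal sketch of that contract in plain Scala (dense Array[Double] standing in for Mahout vectors; names here are illustrative, not Mahout API):

```scala
object AggregateSketch {
  // seqOp: fold one example's parameter vector into the partition
  // accumulator (running sum plus a count).
  def seqOp(acc: (Array[Double], Long), v: Array[Double]): (Array[Double], Long) =
    (acc._1.zip(v).map { case (a, b) => a + b }, acc._2 + 1)

  // combOp: merge two partition accumulators; must be associative and
  // commutative, exactly as rdd.aggregate requires.
  def combOp(a: (Array[Double], Long), b: (Array[Double], Long)): (Array[Double], Long) =
    (a._1.zip(b._1).map { case (x, y) => x + y }, a._2 + b._2)

  // Average parameter vectors, mimicking rdd.aggregate(zero)(seqOp, combOp):
  // fold each partition with seqOp, then merge the partials with combOp.
  def averageVectors(partitions: Seq[Seq[Array[Double]]], dim: Int): Array[Double] = {
    val zero = (Array.fill(dim)(0.0), 0L)
    val (sum, n) = partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)
    sum.map(_ / n)
  }
}
```

The same seqOp/combOp pair would drop straight into a real `rdd.aggregate` call, which is what makes the merge step parallel.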


On Nov 11, 2014, at 1:18 PM, Gokhan Capan <gkhn...@gmail.com> wrote:

So the alternatives are:

1- mapBlock to a matrix in which all rows but the first are empty, then
aggregate
2- depend on a backend

1 is obviously OK.

I don't like the idea of depending on a backend, since SGD is a generic
loss-minimization routine on which other algorithms will possibly depend.

In this context, client-side aggregation is not much of an overhead, but even
if it turns out to be one, the aggregation doesn't have to happen on the
client at all.

As an alternative to 1, I am thinking of at least having an aggregation
operation, which would return an accumulated value anyway and shouldn't
affect algebra optimizations.
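One possible shape for such an operation, sketched in plain Scala (the name, signature, and in-memory stand-in for the block-distributed matrix are all hypothetical, not the actual patch):

```scala
object BlockAggregateSketch {
  // In-memory stand-in for a blockified distributed matrix: each element
  // of the Seq is one vertical block of rows.
  type Block = Array[Array[Double]]

  // Hypothetical aggregateBlocks: seqOp folds one block into an accumulator,
  // combOp merges the per-block accumulators, mirroring rdd.aggregate.
  // The zero value must be a neutral element and must not be mutated.
  def aggregateBlocks[A](blocks: Seq[Block], zero: A)(
      seqOp: (A, Block) => A)(combOp: (A, A) => A): A =
    blocks.map(b => seqOp(zero, b)).foldLeft(zero)(combOp)
}
```

Since the accumulated value is opaque to the optimizer, an operation like this terminates the algebraic expression rather than participating in it, which is why it shouldn't interfere with optimization.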

I quickly implemented a naive one (supporting only Spark; I know I said that
I don't like depending on a backend, but at least the backends-wide interface
is consistent, and as a client I still don't have to deal with Spark
primitives directly).

Is this nice enough? Is it too bad to have in the DSL?
https://github.com/gcapan/mahout/compare/accumulateblocks

Best

Gokhan

On Tue, Nov 11, 2014 at 10:45 PM, Dmitriy Lyubimov <dlie...@gmail.com>
wrote:

> Oh, the algorithm actually collects the vectors and runs another cycle on
> the client!
> 
> Still, technically, you can collect almost-empty blocks to the client
> (since they are mostly empty, it won't cause THAT huge an overhead compared
> to collecting single vectors; after all, how many partitions are we talking
> about? 1000?).
> 
> On Tue, Nov 11, 2014 at 12:41 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
> 
>> 
>> 
>> On Sat, Nov 8, 2014 at 12:42 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> Based on Zinkevich et al.'s Parallelized Stochastic Gradient paper (
>>> http://martin.zinkevich.org/publications/nips2010.pdf), I tried to
>>> implement SGD, and a regularized least squares solution for linear
>>> regression (can easily be extended to other GLMs, too).
>>> 
>>> How the algorithm works is as follows:
>>> 1. Split the data into partitions of T examples
>>> 2. In parallel, for each partition:
>>>   2.0. shuffle the partition
>>>   2.1. initialize the parameter vector
>>>   2.2. for each example in the shuffled partition:
>>>       2.2.1. update the parameter vector
>>> 3. Aggregate all the parameter vectors and return
>>> 
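The quoted steps can be sketched in plain Scala as follows (squared loss for a linear model and a fixed learning rate are illustrative assumptions, not the paper's exact schedule; all names are hypothetical):

```scala
import scala.util.Random

object ParallelSgdSketch {
  // Steps 2.0-2.2: shuffle one partition, then run sequential SGD over it.
  def sgdOnPartition(examples: Seq[(Array[Double], Double)],
                     dim: Int, eta: Double, seed: Long): Array[Double] = {
    val w = Array.fill(dim)(0.0)                       // 2.1 initialize
    for ((x, y) <- new Random(seed).shuffle(examples)) {
      // Squared-loss residual for a linear model: (w . x) - y
      val err = w.zip(x).map { case (wi, xi) => wi * xi }.sum - y
      for (i <- 0 until dim) w(i) -= eta * err * x(i)  // 2.2.1 update
    }
    w
  }

  // Step 3: average the per-partition parameter vectors.
  def aggregate(ws: Seq[Array[Double]]): Array[Double] =
    ws.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / ws.size)
}
```

In a real implementation, sgdOnPartition would run inside the per-partition map and aggregate would be the client-side (or rdd.aggregate-based) combine.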
>> 
>> I guess technically it is possible (transform each block into a
>> SparseRowMatrix or SparseMatrix with only the first row valid) and then
>> invoke colSums() or colMeans() (whatever "aggregate" means here).
>> 
>> However, I am not sure it is worth the ugliness. Isn't it easier to
>> declare these things quasi-algebraic and just do direct Spark calls on the
>> matrix RDD (map, aggregate)?
>> 
>> The real danger is introducing non-algebra things into the algebra, so
>> that the rest of the algebra no longer optimizes.
>> 
>> 
> 
