So the alternatives are:

1- mapBlock to a matrix in which all rows but the first are empty, then
aggregate
2- depend on a backend

Option 1 is obviously OK.
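For the record, option 1 can be sketched in plain Python (illustrative only; the helper names are mine, not the Mahout DSL):

```python
# Option 1 sketch: write each partition's local result vector into row 0 of
# an otherwise all-zero block, so a plain column-sum over all blocks
# recovers the aggregate. A plain-Python stand-in, not the actual mapBlock.

def map_block_first_row(block, local_result):
    """Return a block of the same shape whose only non-zero row is row 0."""
    n_rows, n_cols = len(block), len(block[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    out[0] = list(local_result)
    return out

def col_sums(blocks):
    """Aggregate by summing columns across all blocks (like colSums)."""
    n_cols = len(blocks[0][0])
    total = [0.0] * n_cols
    for block in blocks:
        for row in block:
            for j, v in enumerate(row):
                total[j] += v
    return total
```

Since every row but the first is zero, summing columns over the mapped blocks yields exactly the sum of the per-partition vectors.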

I don't like the idea of depending on a backend, since SGD is a generic
loss-minimization routine on which other algorithms may later depend.

In this context, client-side aggregation is not much of an overhead, and even
if it turns out to be, the aggregation doesn't have to happen on the client
side at all.

As an alternative to 1, I am thinking of at least having an aggregation
operation, which would return an accumulated value anyway and shouldn't
affect algebra optimizations.
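To make the idea concrete, here is a minimal sketch (plain Python; the names `aggregate_blocks`, `block_fn`, and `combine` are hypothetical, not the proposed operator) of what such a backend-agnostic aggregation could look like:

```python
from functools import reduce

def aggregate_blocks(blocks, block_fn, combine):
    """Apply block_fn to each block (a real backend would run these in
    parallel), then fold the partial results with an associative combiner."""
    partials = [block_fn(b) for b in blocks]
    return reduce(combine, partials)
```

For example, counting rows across blocks is `aggregate_blocks(blocks, len, lambda a, b: a + b)`. Because the combiner is associative, a backend is free to combine partials in any order without affecting the result.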

I quickly implemented a naive one (supporting only Spark; I know I said I
don't like depending on a backend, but at least the interface is consistent
across backends, and as a client I still don't have to deal with Spark
primitives directly).

Is this nice enough? Is it too bad to have in the DSL?
https://github.com/gcapan/mahout/compare/accumulateblocks

Best

Gokhan

On Tue, Nov 11, 2014 at 10:45 PM, Dmitriy Lyubimov <dlie...@gmail.com>
wrote:

> Oh, the algorithm actually collects the vectors and runs another cycle on
> the client!
>
> Still, technically, you can collect the almost-empty blocks to the client
> (since they are mostly empty, it won't cause THAT much overhead compared to
> collecting single vectors; after all, how many partitions are we talking
> about? 1000?).
>
> On Tue, Nov 11, 2014 at 12:41 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
>>
>>
>> On Sat, Nov 8, 2014 at 12:42 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Based on Zinkevich et al.'s Parallelized Stochastic Gradient paper (
>>> http://martin.zinkevich.org/publications/nips2010.pdf), I tried to
>>> implement SGD, and a regularized least squares solution for linear
>>> regression (can easily be extended to other GLMs, too).
>>>
>>> How the algorithm works is as follows:
>>> 1. Split data into partitions of T examples
>>> 2. in parallel, for each partition:
>>>    2.0. shuffle partition
>>>    2.1. initialize parameter vector
>>>    2.2. for each example in the shuffled partition
>>>        2.2.1 update the parameter vector
>>> 3. Aggregate all the parameter vectors and return
>>>
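The quoted steps can be sketched as follows (plain Python, illustrative only; the learning rate, regularization, and averaging choices here are placeholders, not what the patch or the paper prescribes):

```python
import random

def sgd_partition(examples, n_features, lr=0.1, lam=0.0, seed=0):
    """Steps 2.0-2.2: shuffle, initialize, one SGD pass (least squares)."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)   # 2.0 shuffle the partition
    w = [0.0] * n_features                  # 2.1 initialize parameter vector
    for x, y in examples:                   # 2.2 pass over the examples
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        # 2.2.1 regularized least-squares update of the parameter vector
        w = [wi - lr * (err * xi + lam * wi) for wi, xi in zip(w, x)]
    return w

def parallel_sgd(partitions, n_features):
    # Step 2 runs per partition (distributed on a real backend);
    # step 3 averages the resulting parameter vectors.
    ws = [sgd_partition(p, n_features, seed=i)
          for i, p in enumerate(partitions)]
    return [sum(col) / len(ws) for col in zip(*ws)]
```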
>>
>> I guess technically it is possible (transform each block to a
>> SparseRowMatrix or SparseMatrix with only the first row valid) and then
>> invoke colSums() or colMeans() (whatever the aggregate means).
>>
>> However, I am not sure it is worth the ugliness. Isn't it easier to
>> declare these things quasi-algebraic and just do direct Spark calls on the
>> matrix RDD (map, aggregate)?
>>
>> The real danger is introducing non-algebra things into the algebra, so
>> that the rest of the algebra no longer optimizes.
>>
>>
>
