[jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2015-02-05 Thread Sebastiano Vigna (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308162#comment-14308162
 ] 

Sebastiano Vigna commented on MAHOUT-1640:
--

There is now a pull request, as requested:

https://github.com/apache/mahout/pull/73

> Better collections would significantly improve vector-operation speed
> -
>
> Key: MAHOUT-1640
> URL: https://issues.apache.org/jira/browse/MAHOUT-1640
> Project: Mahout
>  Issue Type: Improvement
>  Components: collections
> Environment: Darwin lithium.local 14.1.0 Darwin Kernel Version 
> 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 
> x86_64 i386 MacBookPro10,1 Darwin
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
>Reporter: Sebastiano Vigna
> Attachments: fastutil.patch, speed-fastutil, speed-std
>
>
> The collections currently used by Mahout to implement sparse vectors are 
> extremely slow. The proposed patch (localized to RandomAccessSparseVector) 
> uses fastutil's maps and the speed improvements in vector benchmarks are very 
> significant. It would be interesting to see whether these improvements 
> percolate to high-level classes using sparse vectors.
> I had to patch two unit tests (an off-by-one bug and an overfitting bug; both 
> were exposed by the different order in which key/values were returned by 
> iterators).
> The included files speed-std and speed-fastutil show the speed improvement. 
> Some more speed might be gained by using the standard 
> java.util.Map.Entry interface everywhere instead of Element.
> DISCLAIMER: The "Times" set of tests has been run multiplying two identical 
> vectors. The standard tests multiply two random vectors, so in fact they just 
> test the speed of the underlying map remove() method, as almost all products 
> are zero. This is not very realistic and was heavily penalizing fastutil's 
> "true deletions". Better tests, with a typical overlap of nonzero entries, 
> would be even more realistic.
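
To illustrate the disclaimer, here is a minimal sketch of why multiplying two random sparse vectors mostly exercises remove(). It assumes an in-place element-wise product that drops entries whose product is zero, as the description implies. java.util.HashMap is used only as a stand-in; the class and method names are hypothetical and are neither Mahout's nor fastutil's API.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Random;

public class TimesOverlapSketch {

    /** Builds a sparse vector with nonZeros random entries among dim indices. */
    static Map<Integer, Double> randomSparse(int dim, int nonZeros, Random rnd) {
        Map<Integer, Double> v = new HashMap<>();
        while (v.size() < nonZeros) {
            // Values lie in [1, 2), so they are never zero.
            v.put(rnd.nextInt(dim), 1.0 + rnd.nextDouble());
        }
        return v;
    }

    /**
     * In-place element-wise product a[i] *= b[i]. Entries whose product is
     * zero are removed from the map. Returns the number of removals.
     */
    static int timesInPlace(Map<Integer, Double> a, Map<Integer, Double> b) {
        int removals = 0;
        Iterator<Map.Entry<Integer, Double>> it = a.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Double> e = it.next();
            double product = e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            if (product == 0.0) {
                it.remove();      // the operation the random-vector test ends up timing
                removals++;
            } else {
                e.setValue(product);
            }
        }
        return removals;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int dim = 1_000_000;
        int nnz = 1_000;
        Map<Integer, Double> x = randomSparse(dim, nnz, rnd);
        Map<Integer, Double> y = randomSparse(dim, nnz, rnd);
        Map<Integer, Double> xCopy = new HashMap<>(x);
        // Random supports rarely overlap, so nearly every entry is removed.
        System.out.println("random pair removals:    " + timesInPlace(new HashMap<>(x), y));
        // Identical supports overlap completely, so nothing is removed.
        System.out.println("identical pair removals: " + timesInPlace(x, xCopy));
    }
}
```

A benchmark with a tunable overlap between the two supports would sit between these two extremes.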



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Faster collections for a faster Mahout

2015-02-05 Thread Sebastiano Vigna

> On 5 Feb 2015, at 22:24, Dmitriy Lyubimov  wrote:
> 
> thank you very much.
> 
> Github pull request is what we use these days. Do you think you could put
> one up ?

I did it, and I hope it is fine. Can you check? Sorry, it's my first fork/pull...

Ciao,

seba



Re: Codebase refactoring proposal

2015-02-05 Thread Dmitriy Lyubimov
On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan  wrote:

> What I am saying is that for certain algorithms including both
> engine-specific (such as aggregation) and DSL stuff, what is the best way
> of handling them?
>
> i) should we add the distributed operations to Mahout codebase as it is
> proposed in #62?
>

Imo this can't go very well and very far (because of the engine specifics)
but i'd be willing to see an experiment with simple things like map and
reduce.

Bigger questions are where exactly we'll have to stop (we can't abstract
all capabilities out there because of "common denominator" issues), and
what percentage of methods it will truly allow us to migrate to full backend
portability.

And if after doing all this, we will still find ourselves writing engine
specific mixes, why bother. Wouldn't it be better to find a good,
easy-to-replicate, incrementally-developed pattern to register and apply
engine-specific strategies for every method?
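
To make the idea a bit more concrete, a registry of engine-specific strategies could look roughly like the sketch below. Everything in it is hypothetical (the class name, the string engine keys, the "in-memory" engine); it is not an existing Mahout API, only one possible shape for such a pattern.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical registry: one instance per method, one strategy per engine. */
public class EngineStrategyRegistry<I, O> {

    private final Map<String, Function<I, O>> strategies = new HashMap<>();

    /** Registers the engine-specific implementation of this method. */
    public void register(String engine, Function<I, O> strategy) {
        strategies.put(engine, strategy);
    }

    /** Applies the strategy registered for the given engine, or fails loudly. */
    public O apply(String engine, I input) {
        Function<I, O> strategy = strategies.get(engine);
        if (strategy == null) {
            throw new UnsupportedOperationException(
                "No strategy registered for engine: " + engine);
        }
        return strategy.apply(input);
    }

    public static void main(String[] args) {
        // A toy "sum" method with a single, local strategy registered.
        EngineStrategyRegistry<double[], Double> sum = new EngineStrategyRegistry<>();
        sum.register("in-memory", xs -> Arrays.stream(xs).sum());
        System.out.println(sum.apply("in-memory", new double[] {1.0, 2.0, 3.0}));
    }
}
```

An engine that has no strategy for a method fails at lookup time, which keeps the gaps in backend portability explicit instead of forcing a common denominator.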


>
> ii) should we have [engine]-ml modules (like spark-bindings and
> h2o-bindings) where we can mix the DSL and engine-specific stuff?
>

This is not quite what i am proposing. Rather, engine-ml modules holding
engine-specific _parts_ of algorithm.

However, this really needs a POC over a guinea pig (similarly to how we
POC'd algebra in the first place with ssvd and spca).


>
>


Re: Faster collections for a faster Mahout

2015-02-05 Thread Dmitriy Lyubimov
thank you very much.

Github pull request is what we use these days. Do you think you could put
one up ?

thanks.
-d

On Thu, Feb 5, 2015 at 1:17 AM, Sebastiano Vigna  wrote:

>
> > On 19 Jan 2015, at 22:26, Robin Anil  wrote:
> >
> > @Sebastiano, sounds like an easy win. Can you file a JIRA ticket with a
> > patch and before/after dump from benchmarks. See an example ticket
> > 
> >
> > Robin
>
> Done: https://issues.apache.org/jira/browse/MAHOUT-1640
>
> My first git patch, hope to not have messed up! :)
>
> It would be very interesting to see whether there are detectable speed
> changes in higher-level classes using sparse vectors.
>
> Ciao,
>
> seba
>
>


Jenkins build is back to normal : Mahout-Examples-Cluster-Reuters-II #1090

2015-02-05 Thread Apache Jenkins Server
See 



Re: Codebase refactoring proposal

2015-02-05 Thread Pat Ferrel
From my own perspective:

I’m not aware of any rule to make all operations agnostic. In fact several 
engine specific exceptions are discussed in this long email. We’ve talked about 
reduce or join operations that would be difficult to make agnostic without a 
lot of knowledge of ALL other engines. Unless or until we get contributors from 
those engines reviewing commits, why put this burden on all of us?

An agnostic DSL was for linear algebra ops, not all distributed computation 
methods. We aren’t doing a generic engine, only engine-agnostic algebra. 

You have added stubs in H2O for the distributed aggregations. This seems fine 
but I wouldn’t vote to require that. If GSGD requires further use of Spark 
specific operations, so be it. This means that GSGD may live in the Spark 
module with any algebra bits required  added to math-scala. Does anyone have a 
problem with that?

My vote on #62—ship it.

On the point of interoperability with MLlib, we still need to talk about that, 
but in another email.


On Feb 5, 2015, at 1:14 AM, Gokhan Capan  wrote:

What I am saying is that for certain algorithms including both
engine-specific (such as aggregation) and DSL stuff, what is the best way
of handling them?

i) should we add the distributed operations to Mahout codebase as it is
proposed in #62?

ii) should we have [engine]-ml modules (like spark-bindings and
h2o-bindings) where we can mix the DSL and engine-specific stuff?

Picking i. has the advantage of writing an ML-algorithm once and then it
can be run on alternative engines, but it requires wrapping/duplicating
existing distributed operations.

Picking ii. has the advantage of avoiding writing distributed operations,
but since we're mixing the DSL and the engine-specific stuff, an
ML-algorithm written for an engine would not be available for the others.

I just wanted to hear some opinions.

Gokhan

On Thu, Feb 5, 2015 at 4:11 AM, Dmitriy Lyubimov  wrote:

> I took it Gokhan had objections himself, based on his comments. if we are
> talking about #62.
> 
> He also expressed concerns about computing GSGD but i suspect it can still
> be algebraically computed.
> 
> On Wed, Feb 4, 2015 at 5:52 PM, Pat Ferrel  wrote:
> 
>> BTW Ted and Andrew have both expressed interest in the distributed
>> aggregation stuff. It sounds like we are agreeing that
>> non-algebra—computation method type things can be engine specific.
>> 
>> So does anyone have an objection to Gokhan pushing his PR?
>> 
>> On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov  wrote:
>> 
>> On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo 
> wrote:
>> 
>>> 
>>> 
>>> 
>>> My thought was not to bring primitive engine specific aggregators,
>>> combiners,  etc. into math-scala.
>>> 
>> 
>> Yeah. +1. I would like to support that as an experiment, see where it
> goes.
>> Clearly some distributed use cases are simple enough while also pervasive
>> enough.
>> 
>> 
> 



Re: Faster collections for a faster Mahout

2015-02-05 Thread Sebastiano Vigna

> On 19 Jan 2015, at 22:26, Robin Anil  wrote:
> 
> @Sebastiano, sounds like an easy win. Can you file a JIRA ticket with a
> patch and before/after dump from benchmarks. See an example ticket
> 
> 
> Robin

Done: https://issues.apache.org/jira/browse/MAHOUT-1640

My first git patch, hope to not have messed up! :)

It would be very interesting to see whether there are detectable speed changes 
in higher-level classes using sparse vectors.

Ciao,

seba



[jira] [Updated] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2015-02-05 Thread Sebastiano Vigna (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastiano Vigna updated MAHOUT-1640:
-
Attachment: speed-std
speed-fastutil
fastutil.patch






Re: Codebase refactoring proposal

2015-02-05 Thread Gokhan Capan
What I am saying is that for certain algorithms including both
engine-specific (such as aggregation) and DSL stuff, what is the best way
of handling them?

i) should we add the distributed operations to Mahout codebase as it is
proposed in #62?

ii) should we have [engine]-ml modules (like spark-bindings and
h2o-bindings) where we can mix the DSL and engine-specific stuff?

Picking i. has the advantage of writing an ML-algorithm once and then it
can be run on alternative engines, but it requires wrapping/duplicating
existing distributed operations.

Picking ii. has the advantage of avoiding writing distributed operations,
but since we're mixing the DSL and the engine-specific stuff, an
ML-algorithm written for an engine would not be available for the others.

I just wanted to hear some opinions.

Gokhan

On Thu, Feb 5, 2015 at 4:11 AM, Dmitriy Lyubimov  wrote:

> I took it Gokhan had objections himself, based on his comments. if we are
> talking about #62.
>
> He also expressed concerns about computing GSGD but i suspect it can still
> be algebraically computed.
>
> On Wed, Feb 4, 2015 at 5:52 PM, Pat Ferrel  wrote:
>
> > BTW Ted and Andrew have both expressed interest in the distributed
> > aggregation stuff. It sounds like we are agreeing that
> > non-algebra—computation method type things can be engine specific.
> >
> > So does anyone have an objection to Gokhan pushing his PR?
> >
> > On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov  wrote:
> >
> > On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo 
> wrote:
> >
> > >
> > >
> > >
> > > My thought was not to bring primitive engine specific aggregators,
> > > combiners,  etc. into math-scala.
> > >
> >
> > Yeah. +1. I would like to support that as an experiment, see where it
> goes.
> > Clearly some distributed use cases are simple enough while also pervasive
> > enough.
> >
> >
>


[jira] [Commented] (MAHOUT-1626) Support for required quasi-algebraic operations and starting with aggregating rows/blocks

2015-02-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306888#comment-14306888
 ] 

ASF GitHub Bot commented on MAHOUT-1626:


Github user gcapan commented on the pull request:

https://github.com/apache/mahout/pull/62#issuecomment-73015659
  
* The first and simplest will be Zinkevich et al.'s Parallelized Stochastic 
Gradient Descent [1]:  The algorithm is basically running multiple local SGD's 
in parallel, then averaging them. For implementation, I was thinking of running 
SGD's locally in blocks of rows and averaging them.

* Further, I hope to implement distributed stratified SGD for matrix 
factorization (Gemulla et al.)[2]: The algorithm is forming strata (where each 
stratum consists of a set of blocks that do not share any rows or columns), 
then for each stratum, performing SGD updates in parallel.
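
The stratum structure in the second item can be sketched as follows. This assumes the matrix is cut into a d x d grid of blocks and uses a cyclic shift to form the strata, which is one simple construction satisfying the "no shared rows or columns" condition; it is an illustration and not necessarily the scheme the paper or the PR would use. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class StrataSketch {

    /**
     * Returns d strata for a d x d block grid. Stratum s holds the blocks
     * (i, (i + s) mod d), so no two blocks in it share a row block or a
     * column block, and the d strata together cover all d * d blocks.
     */
    static List<List<int[]>> strata(int d) {
        List<List<int[]>> result = new ArrayList<>();
        for (int s = 0; s < d; s++) {
            List<int[]> stratum = new ArrayList<>();
            for (int i = 0; i < d; i++) {
                stratum.add(new int[] {i, (i + s) % d}); // {rowBlock, colBlock}
            }
            result.add(stratum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<int[]>> all = strata(3);
        for (int s = 0; s < all.size(); s++) {
            StringBuilder line = new StringBuilder("stratum " + s + ":");
            for (int[] block : all.get(s)) {
                line.append(" (").append(block[0]).append(",").append(block[1]).append(")");
            }
            System.out.println(line);
        }
    }
}
```

Because the blocks of one stratum touch disjoint row and column factors, their SGD updates can run in parallel without conflicting writes.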

I am not yet sure if the latter would require additional non-DSL stuff. I 
will raise my concerns once I get to it.
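
For the first item, a minimal local sketch of "run SGD on blocks of rows, then average" might look like this. It assumes a plain least-squares objective and uses a Java parallel stream in place of a distributed engine; the class name, method names, learning rate and epoch count are all illustrative assumptions and not part of the PR.

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelSgdSketch {

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int j = 0; j < a.length; j++) {
            s += a[j] * b[j];
        }
        return s;
    }

    /** Plain SGD for least squares over rows [from, to), starting from zero weights. */
    static double[] localSgd(double[][] x, double[] y, int from, int to,
                             double rate, int epochs) {
        double[] w = new double[x[0].length];
        for (int e = 0; e < epochs; e++) {
            for (int r = from; r < to; r++) {
                double err = dot(w, x[r]) - y[r];
                for (int j = 0; j < w.length; j++) {
                    w[j] -= rate * err * x[r][j];
                }
            }
        }
        return w;
    }

    /** Runs one independent SGD per block of rows, then averages the weights. */
    static double[] averagedSgd(double[][] x, double[] y, int blocks,
                                double rate, int epochs) {
        int n = x.length;
        int dim = x[0].length;
        List<double[]> local = IntStream.range(0, blocks)
            .parallel()
            .mapToObj(b -> localSgd(x, y, b * n / blocks, (b + 1) * n / blocks, rate, epochs))
            .collect(Collectors.toList());
        double[] avg = new double[dim];
        for (double[] w : local) {
            for (int j = 0; j < dim; j++) {
                avg[j] += w[j] / blocks;
            }
        }
        return avg;
    }

    public static void main(String[] args) {
        // Synthetic noise-free data generated from y = 2 * x0 - 3 * x1.
        Random rnd = new Random(1);
        double[][] x = new double[200][2];
        double[] y = new double[200];
        for (int i = 0; i < x.length; i++) {
            x[i][0] = rnd.nextDouble() * 2 - 1;
            x[i][1] = rnd.nextDouble() * 2 - 1;
            y[i] = 2 * x[i][0] - 3 * x[i][1];
        }
        double[] w = averagedSgd(x, y, 4, 0.1, 100);
        System.out.println("averaged weights: " + w[0] + ", " + w[1]);
    }
}
```

The only cross-block step is the final average, which is why this variant needs little more than an aggregation over row blocks.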

[1] http://martin.zinkevich.org/publications/nips2010.pdf
[2] http://dl.acm.org/citation.cfm?id=2020426

 


> Support for required quasi-algebraic operations and starting with aggregating 
> rows/blocks
> -
>
> Key: MAHOUT-1626
> URL: https://issues.apache.org/jira/browse/MAHOUT-1626
> Project: Mahout
>  Issue Type: New Feature
>  Components: Math
>Affects Versions: 1.0
>Reporter: Gokhan Capan
> Fix For: 1.0
>
>






[jira] [Created] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2015-02-05 Thread Sebastiano Vigna (JIRA)
Sebastiano Vigna created MAHOUT-1640:


 Summary: Better collections would significantly improve 
vector-operation speed
 Key: MAHOUT-1640
 URL: https://issues.apache.org/jira/browse/MAHOUT-1640
 Project: Mahout
  Issue Type: Improvement
  Components: collections
 Environment: Darwin lithium.local 14.1.0 Darwin Kernel Version 14.1.0: 
Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 x86_64 i386 
MacBookPro10,1 Darwin

java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)

Reporter: Sebastiano Vigna


The collections currently used by Mahout to implement sparse vectors are 
extremely slow. The proposed patch (localized to RandomAccessSparseVector) uses 
fastutil's maps and the speed improvements in vector benchmarks are very 
significant. It would be interesting to see whether these improvements 
percolate to high-level classes using sparse vectors.

I had to patch two unit tests (an off-by-one bug and an overfitting bug; both 
were exposed by the different order in which key/values were returned by 
iterators).

The included files speed-std and speed-fastutil show the speed improvement. 
Some more speed might be gained by using the standard 
java.util.Map.Entry interface everywhere instead of Element.

DISCLAIMER: The "Times" set of tests has been run multiplying two identical 
vectors. The standard tests multiply two random vectors, so in fact they just 
test the speed of the underlying map remove() method, as almost all products 
are zero. This is not very realistic and was heavily penalizing fastutil's 
"true deletions". Better tests, with a typical overlap of nonzero entries, 
would be even more realistic.


