JBlas gave roughly a 5x-7x speedup for solving the dense linear systems in ALS when I integrated it into a prototype of Mahout's ALS for a research paper.

Unfortunately, there are some caveats:

- it requires certain Fortran libraries to be installed on every machine of the cluster

- its jar is really large, so it would blow up the size of "uber-jars" built from Mahout

- AFAIK it's also a problem to ship it license-wise, as the required libraries are not Apache-licensed

See this discussion from the Spark community for details:

https://github.com/apache/incubator-spark/pull/575
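For context: the dense systems in question are the small k x k normal equations that ALS solves per user and per item (k = number of latent features), which is exactly the kernel a native BLAS/LAPACK backend accelerates. A self-contained plain-Java sketch of one such solve (Gaussian elimination with partial pivoting; illustrative code, not Mahout's or jblas's actual implementation):

```java
import java.util.Arrays;

public class DenseSolve {
    // Solve A x = b in place via Gaussian elimination with partial pivoting.
    // ALS repeats many such small k x k solves, one per user/item, which is
    // why swapping in a native backend pays off so much.
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            // pick the pivot row with the largest absolute value in this column
            int pivot = col;
            for (int row = col + 1; row < n; row++) {
                if (Math.abs(a[row][col]) > Math.abs(a[pivot][col])) pivot = row;
            }
            double[] tmpRow = a[col]; a[col] = a[pivot]; a[pivot] = tmpRow;
            double tmp = b[col]; b[col] = b[pivot]; b[pivot] = tmp;
            // eliminate entries below the pivot
            for (int row = col + 1; row < n; row++) {
                double f = a[row][col] / a[col][col];
                for (int j = col; j < n; j++) a[row][j] -= f * a[col][j];
                b[row] -= f * b[col];
            }
        }
        // back substitution on the upper-triangular system
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double s = b[i];
            for (int j = i + 1; j < n; j++) s -= a[i][j] * x[j];
            x[i] = s / a[i][i];
        }
        return x;
    }

    public static void main(String[] args) {
        // 4x + y = 1, x + 3y = 2  ->  x = 1/11, y = 7/11
        double[][] a = {{4, 1}, {1, 3}};
        double[] b = {1, 2};
        System.out.println(Arrays.toString(solve(a, b)));
    }
}
```

A native LAPACK routine replaces exactly this loop nest with optimized, cache-blocked code, which is where the measured speedup comes from.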


Best,
Sebastian

On 03/04/2014 11:17 PM, Suneel Marthi wrote:
There's JBlas, which is used by Spark, Deeplearning.org and other ML projects.
IIRC, there was some prototyping done in the past using JBlas for Mahout -
Sebastian or Sean can speak better to that. It definitely has better
performance than Mahout-Math.

Managing the native Fortran dependencies could be challenging with JBlas, not 
to mention that JBlas may not support sparse matrices (someone correct me here).
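On the sparse-matrix point: jblas stores matrix data densely, so Mahout's sparse vectors would have to be densified before each call. A toy sketch (plain Java, illustrative names only; not Mahout's or jblas's actual API) of why dense storage hurts for sparse data:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class SparseDot {
    // A dense dot product touches every index, zeros included.
    static double denseDot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    // A sparse representation stores only the non-zero entries, so the dot
    // product iterates over nnz(a) instead of the full dimension.
    static double sparseDot(Map<Integer, Double> a, double[] b) {
        double sum = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            sum += e.getValue() * b[e.getKey()];
        }
        return sum;
    }

    public static void main(String[] args) {
        int dim = 1_000_000;
        double[] dense = new double[dim];   // 8 MB for two non-zeros
        dense[3] = 2.0;
        dense[dim - 1] = 5.0;
        Map<Integer, Double> sparse = new HashMap<>();  // two entries
        sparse.put(3, 2.0);
        sparse.put(dim - 1, 5.0);
        double[] ones = new double[dim];
        Arrays.fill(ones, 1.0);
        System.out.println(denseDot(dense, ones));   // same result,
        System.out.println(sparseDot(sparse, ones)); // very different cost
    }
}
```

For typical recommender input (user-item matrices with a tiny fraction of non-zeros), that memory and time overhead is exactly the concern raised above.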

On Tuesday, March 4, 2014 4:57 PM, Giorgio Zoppi <giorgio.zo...@gmail.com> wrote:

I would like to find some way to speed up the matrix library, e.g. via JNI+C++.


2014-03-04 22:53 GMT+01:00 Frank Scholten <fr...@frankscholten.nl>:

Yes, I'd like to work on standardizing the code around input formats.


On Mon, Mar 3, 2014 at 7:37 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

To get things moving for 1.0:


a) Address the 4 issues that Sean had raised - we have already started
looking at the backlog and closing items, and started looking at converting
the old MapReduce code to the newer MapReduce API.

   If someone could start looking at standardizing the input/output
formats across classifiers, clustering and recommenders, that would be
great. I guess Frank S. has already started work in that direction.

b) We need a better and cleaner serialized form of Vectors to handle names
and other kinds of metadata; this is going to impact everything that's
presently implemented.

c) Agree with ssc; we should start looking at Spark-Mahout integration.


d) We need volunteers to QA and address issues with the present
classifier/clustering algorithms. I can personally vouch for how
disastrous it is to deploy any of Mahout's classifier/clustering
implementations in an operations environment. A good example of that is
Sean's recent patch for RDF.

The Naive Bayes code as it stands seems half-baked and incomplete. Not
every code path has been tested in Streaming KMeans.

This should go some way toward addressing the technical debt that has
piled up over the years.

On Monday, March 3, 2014 1:05 PM, Sebastian Schelter <s...@apache.org> wrote:

I would like to discuss whether we should start to have some
Spark-related code in Mahout.

--sebastian


On 03/03/2014 06:56 PM, Suneel Marthi wrote:
Grant had set up a Google Hangout for Mahout sometime last year before the
0.8 release. I had one set up too for the 0.9 release. I definitely wouldn't
want to have a hangout on a Saturday or the weekend.

On Monday, March 3, 2014 12:52 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Happy to organize a Google Hangout. That has the advantage of allowing
more attendees and supporting YouTube archiving.

Sent from my iPhone


On Mar 3, 2014, at 9:34, Giorgio Zoppi <giorgio.zo...@gmail.com> wrote:

Hello All,
Dr. Dunning, could you set up a meeting next Saturday morning, so we can
chat over Skype to discuss improvements, figure out what to do, and
identify volunteers and tasks?
Best Regards,
Giorgio


2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:

Me three


On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:

Ravi,

Good points.

On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ravi.mummu...@gmail.com> wrote:

- Natively support Windows (guidance, etc. No documentation exists today,
for instance)

There is a bit of demand for that.

- Faster time to first application (from discovery to first application
currently takes a non-trivial amount of effort; how can we lower the bar
and reduce the friction for adoption?)

There is huge evidence that this is important.

- Better documentation of use cases with working samples/examples
(documentation on https://mahout.apache.org/users/basics/algorithms.html
is spread out and there is too much focus on algorithms as opposed to
use cases - this is an adoption blocker)

This is also important.

- Uniformity of the API set across all algorithms (are we providing the
same experience across all APIs?)

And many people have been tripped up by this.

- Measuring/publishing scalability metrics of various algorithms (why
would we want users to adopt Mahout vs. other frameworks for ML at
scale?)

I don't see this as important as some of your other points, but it is
still useful.


--
I want to be the ray of sun that wakes you each day,
to make you breathe and live in me.
"Favola - Moda".
