Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-04-27 Thread Nikaash Puri
Hi Pat, Dmitriy

Thanks so much. Will run some more experiments to validate the initial
outcomes. A Spark upgrade is definitely in the pipeline and will likely
solve some of these performance issues.

Pat, yup, the tests were conducted under identical code bases and data sets
other than the Mahout version change. I'm sorry, the data is sensitive so
sharing it won't be possible.

Also, as far as I can tell, spark-itemsimilarity now uses computeAtBzipped3
instead of computeAtBzipped for AtB. That change is meant to speed things
up, though, so I'm not sure it's relevant to this problem.

Thank you,
Nikaash Puri

On Thu, Apr 28, 2016 at 2:50 AM Pat Ferrel  wrote:

> I have been using the same function through all those versions of Mahout.
> I’m running on newer versions of Spark 1.4-1.6.2. Using my datasets there
> has been no slowdown. I assume that you are only changing the Mahout
> version—leaving data, Spark, HDFS, and all config the same. In which case I
> wonder if you are somehow running into limits of your machine like memory?
> Have you allocated a fixed executor memory limit?
>
> There has been almost no code change to item similarity. Dmitriy, do you
> know if the underlying AtB has changed? I seem to recall the partitioning
> was set to “auto” about 0.11. We were having problems with large numbers of
> small part files from Spark Streaming causing partitioning headaches as I
> recall. In some unexpected way the input structure was trickling down into
> partitioning decisions made in Spark.
>
> The first thing I’d try is giving the job more executor memory; the second
> is to upgrade Spark. A 3x increase in execution time is a pretty big deal
> if it isn’t helped by these easy fixes, so can you share your data?
>
> On Apr 27, 2016, at 8:37 AM, Dmitriy Lyubimov  wrote:
>
> 0.11 targets 1.3+.
>
> I don't have anything off the top of my head affecting A'B specifically,
> but I think there were some changes affecting in-memory multiplication
> (which is of course used in distributed A'B).
>
> I am not particularly familiar with, nor do I remember, the details of row
> similarity off the top of my head; I really wish the original contributor
> would comment on that. Trying to see if I can come up with anything useful
> though.
>
> What behavior do you see in this job -- CPU-bound or I/O-bound?
>
> there are a few pointers to look at:
>
> (1) I/O many times exceeds the input size, so spills are inevitable.
> Tuning memory sizes and checking Spark spill locations to make sure the
> disks there are not slow is critical. Also, I think Spark 1.6 added a lot
> of flexibility in managing task/cache/shuffle memory sizes; it may help in
> some unexpected way.
>
> (2) Sufficient cache: many pipelines commit reused matrices into cache
> (MEMORY_ONLY), which is the default Mahout algebra behavior, assuming
> there is enough cache memory for only good things to happen. If there is
> not, however, it will cause recomputation of results that were evicted
> (not saying it is a known case for row similarity in particular). Make
> sure this is not the case. For scatter-type exchanges it is especially
> bad.
>
> (3) A'B -- try to hack and play with the implementation in the AtB
> (Spark-side) class. See if you can come up with a better arrangement.
>
> (4) In-memory computations (the MMul class), if that's the bottleneck, can
> in practice be quick-hacked with multithreaded multiplication and a bridge
> to native solvers (netlib-java), at least for dense cases. This has been
> found to improve the performance of distributed multiplications a bit. It
> works best if you give 2 threads to the backend and all threads to the
> front end.
>
> There are other known things that can improve multiplication speed over
> the public Mahout version; I hope Mahout will improve on those in the
> future.
>
> -d
>
> On Wed, Apr 27, 2016 at 6:14 AM, Nikaash Puri 
> wrote:
>
> > Hi,
> >
> > I’ve been working with LLR in Mahout for a while now, mostly using the
> > SimilarityAnalysis.cooccurenceIDss function. I recently upgraded the
> > Mahout libraries to 0.11, and subsequently also tried 0.12, and the same
> > program is running orders of magnitude slower (at least 3x based on
> > initial analysis).
> >
> > Looking into the tasks more carefully, comparing 0.10 and 0.11 shows
> > that the amount of shuffle being done in 0.11 is significantly higher,
> > especially in the AtB step. This could be a reason for the reduction in
> > performance.
> >
> > I am working on Spark 1.2.0, though, so it's possible that this could
> > be causing the problem. It works fine with Mahout 0.10.
> >
> > Any ideas why this might be happening?
> >
> > Thank you,
> > Nikaash Puri
>
>


[jira] [Closed] (MAHOUT-1705) Verify dependencies in job jar for mahout-examples

2016-04-27 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi closed MAHOUT-1705.
-

> Verify dependencies in job jar for mahout-examples
> --
>
> Key: MAHOUT-1705
> URL: https://issues.apache.org/jira/browse/MAHOUT-1705
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.10.0
>Reporter: Andrew Palumbo
>Assignee: Andrew Musselman
> Fix For: 0.12.0
>
>
> mahout-example-*-job.jar is around ~56M, and may package unused runtime 
> libraries.  We need to go through this and make sure that there is nothing 
> unneeded or redundant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1705) Verify dependencies in job jar for mahout-examples

2016-04-27 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1705:
--
Fix Version/s: 0.12.0

> Verify dependencies in job jar for mahout-examples
> --
>
> Key: MAHOUT-1705
> URL: https://issues.apache.org/jira/browse/MAHOUT-1705
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.10.0
>Reporter: Andrew Palumbo
>Assignee: Andrew Musselman
> Fix For: 0.12.0
>
>
> mahout-example-*-job.jar is around ~56M, and may package unused runtime 
> libraries.  We need to go through this and make sure that there is nothing 
> unneeded or redundant.





[jira] [Closed] (MAHOUT-1740) Layout on algorithms page broken

2016-04-27 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi closed MAHOUT-1740.
-

> Layout on algorithms page broken
> 
>
> Key: MAHOUT-1740
> URL: https://issues.apache.org/jira/browse/MAHOUT-1740
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
> Fix For: 0.12.0
>
>
> http://mahout.apache.org/users/basics/algorithms.html
> On Chrome on Linux the main body content is bleeding into the right nav. 





[jira] [Closed] (MAHOUT-1811) Fix calculation of second norm of DRM in Flink

2016-04-27 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi closed MAHOUT-1811.
-

> Fix calculation of second norm of DRM in Flink
> --
>
> Key: MAHOUT-1811
> URL: https://issues.apache.org/jira/browse/MAHOUT-1811
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
> Fix For: 0.12.0
>
>






[jira] [Updated] (MAHOUT-1764) Mahout DSL for Flink: Add standard backend tests for Flink

2016-04-27 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1764:
--
Fix Version/s: 0.12.0

> Mahout DSL for Flink: Add standard backend tests for Flink
> --
>
> Key: MAHOUT-1764
> URL: https://issues.apache.org/jira/browse/MAHOUT-1764
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> From github comment by Dmitriy:
> also on the topic of test suite coverage: we need to pass our standard tests. 
> The base clases for them are:
> https://github.com/apache/mahout/blob/master/math-scala/src/test/scala/org/apache/mahout/math/decompositions/DistributedDecompositionsSuiteBase.scala
> https://github.com/apache/mahout/blob/master/math-scala/src/test/scala/org/apache/mahout/math/drm/DrmLikeOpsSuiteBase.scala
> https://github.com/apache/mahout/blob/master/math-scala/src/test/scala/org/apache/mahout/math/drm/DrmLikeSuiteBase.scala
> https://github.com/apache/mahout/blob/master/math-scala/src/test/scala/org/apache/mahout/math/drm/RLikeDrmOpsSuiteBase.scala
> The technique here is to take these test cases as a base class for a 
> distributed test case (you may want to see how it was done for Spark and 
> H2O). This is our basic assertion that our main algorithms are passing on a 
> toy problem for a given backend.





[jira] [Closed] (MAHOUT-1764) Mahout DSL for Flink: Add standard backend tests for Flink

2016-04-27 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi closed MAHOUT-1764.
-

> Mahout DSL for Flink: Add standard backend tests for Flink
> --
>
> Key: MAHOUT-1764
> URL: https://issues.apache.org/jira/browse/MAHOUT-1764
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> From github comment by Dmitriy:
> also on the topic of test suite coverage: we need to pass our standard tests. 
> The base clases for them are:
> https://github.com/apache/mahout/blob/master/math-scala/src/test/scala/org/apache/mahout/math/decompositions/DistributedDecompositionsSuiteBase.scala
> https://github.com/apache/mahout/blob/master/math-scala/src/test/scala/org/apache/mahout/math/drm/DrmLikeOpsSuiteBase.scala
> https://github.com/apache/mahout/blob/master/math-scala/src/test/scala/org/apache/mahout/math/drm/DrmLikeSuiteBase.scala
> https://github.com/apache/mahout/blob/master/math-scala/src/test/scala/org/apache/mahout/math/drm/RLikeDrmOpsSuiteBase.scala
> The technique here is to take these test cases as a base class for a 
> distributed test case (you may want to see how it was done for Spark and 
> H2O). This is our basic assertion that our main algorithms are passing on a 
> toy problem for a given backend.





[jira] [Updated] (MAHOUT-1811) Fix calculation of second norm of DRM in Flink

2016-04-27 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1811:
--
Fix Version/s: 0.12.0

> Fix calculation of second norm of DRM in Flink
> --
>
> Key: MAHOUT-1811
> URL: https://issues.apache.org/jira/browse/MAHOUT-1811
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
> Fix For: 0.12.0
>
>






[jira] [Updated] (MAHOUT-1740) Layout on algorithms page broken

2016-04-27 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1740:
--
Fix Version/s: 0.12.0

> Layout on algorithms page broken
> 
>
> Key: MAHOUT-1740
> URL: https://issues.apache.org/jira/browse/MAHOUT-1740
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
> Fix For: 0.12.0
>
>
> http://mahout.apache.org/users/basics/algorithms.html
> On Chrome on Linux the main body content is bleeding into the right nav. 





[jira] [Updated] (MAHOUT-1777) move HDFSUtil classes into the HDFS module

2016-04-27 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1777:
--
Fix Version/s: 0.12.0

> move HDFSUtil classes into the HDFS module
> --
>
> Key: MAHOUT-1777
> URL: https://issues.apache.org/jira/browse/MAHOUT-1777
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Andrew Palumbo
> Fix For: 0.12.0
>
>
> The HDFSUtil classes are used by spark, h2o and flink and implemented in each 
> module.  Move them to the common HDFS module.  The spark implementation 
> includes a  {{delete(path: String)}} method used by the Spark Naive Bayes CLI 
> otherwise the others are nearly identical.
>  





[jira] [Closed] (MAHOUT-1777) move HDFSUtil classes into the HDFS module

2016-04-27 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi closed MAHOUT-1777.
-

> move HDFSUtil classes into the HDFS module
> --
>
> Key: MAHOUT-1777
> URL: https://issues.apache.org/jira/browse/MAHOUT-1777
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Andrew Palumbo
>
> The HDFSUtil classes are used by spark, h2o and flink and implemented in each 
> module.  Move them to the common HDFS module.  The spark implementation 
> includes a  {{delete(path: String)}} method used by the Spark Naive Bayes CLI 
> otherwise the others are nearly identical.
>  





Re: Mahout contributions

2016-04-27 Thread Andrew Palumbo
Saikat, 

One other thing that I should say is that you do not need clearance or input 
from the committers to begin work on your project; interest can and should 
come from the community as a whole. You can write a proposal as you've done, 
and if you don't see any "+1"s or responses from the community at large 
within a few days, you may want to explain in more detail and give examples 
and use cases.  If you are still not seeing +1s or any responses from others, 
then I think you can assume that there may not be interest; this is usually 
how things work.  

However, if it's something that you're passionate about and you feel you can 
deliver, this should not stop you.  People do not always read the dev@ emails 
or have time to respond.  You can still move forward with your proposed 
contribution by following the steps laid out in my previous email; follow the 
protocol at:
 
http://mahout.apache.org/developers/how-to-contribute.html

and create a JIRA.  When you have reached a significant amount of completion 
(around 70-80%), open a PR for review; this way you can explain in more 
detail. 

But please realize that when you open a JIRA for a new issue there is some 
expectation of a commitment on your part to complete it. 

For example, I am currently investigating some new plotting features.  I have 
spent a good deal of time this week and last already and am even mocking up 
code as a sketch of what may become an implementation before I open a "New 
Feature" JIRA for it.

My point is absolutely not to discourage you or anybody else from opening 
JIRAs for new features, but rather to let you know that when you open a JIRA 
for a new issue, it tells others that you are working on it, and thus may 
discourage someone with a similar idea from contributing that feature.  So it 
is best to open a JIRA once you've begun your work and are committed to it.
  
Andy


From: Saikat Kanjilal 
Sent: Wednesday, April 27, 2016 8:24 PM
To: dev@mahout.apache.org
Subject: RE: Mahout contributions

Andrew, thank you very much for your input. I actually want to start a new 
set of JIRAs. Here's what I want to work on: I want to build a framework that 
ties together search/visualization capability with some machine learning 
algorithms. Essentially, think of it as tying Elasticsearch and Kibana into 
Mahout: the user can search their data with Elasticsearch, and for deeper 
analysis they can feed that data into one or more Mahout backends. Another 
interesting tie-in might be to hack Kibana to render ggplot-like graphics 
based on the output of Mahout algorithms (assuming this can be a Kibana 
plugin).
Before I go hog wild and create a bunch of JIRAs, I'd like to know if there's 
interest in this initiative.  The tool will bring together the ELK stack with 
dynamic machine learning algorithms.  I can go into a lot more detail around 
use cases if there's enough interest.
Looking forward to your and the other committers' input. Thanks

> From: ap@outlook.com
> To: dev@mahout.apache.org
> Subject: Re: Mahout contributions
> Date: Wed, 27 Apr 2016 20:16:38 +
>
> Hello Saikat,
>
> #1 and #2 above are already implemented.  #4 is tricky, so I would not 
> recommend it without strong knowledge of the codebase, and #5 is now 
> deprecated (I've just updated the algorithms grid to reflect this).  The 
> algorithms page includes both algorithms implemented in the math-scala 
> library and algorithms which have CLI drivers written for them.
>
> Please see: http://mahout.apache.org/developers/how-to-contribute.html
>
> And please note that, per that documentation, it is in everybody's best 
> interest to keep messages on the list; contacting committers directly is 
> discouraged.
>
> The best way to contribute (if you have not found a new bug or issue) would 
> be to pick a single open issue in the Mahout JIRA which is not already 
> assigned, and start work on it.  When your work is ready for review, just 
> open up a PR and the committers will review it.  Please note that if you do 
> pick up an issue to work on, we do expect some amount of responsibility, 
> reliability, and a tangible amount of satisfactory work, since once you've 
> marked a JIRA as something you're working on, others will pass on it.
>
> Another good way to contribute would be to look for enhancements that you 
> could make to existing code, not necessarily open JIRAs that need to be 
> assigned to you.  For example, please see the recent contribution and 
> workflow on: 
> https://issues.apache.org/jira/browse/MAHOUT-1833 .
>
> If you have something new that you'd like to implement, simply start a new 
> JIRA issue and begin work on it.  In this case, when you have some code that 
> is ready for review, you can simply open up a PR for it and committers will 
> review it.  For new implementations, we generally say that you should do 
> this when you are at least 70-80% finished with your coding.

[jira] [Commented] (MAHOUT-1836) Order and add missing paramters for DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.

2016-04-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261285#comment-15261285
 ] 

Hudson commented on MAHOUT-1836:


SUCCESS: Integrated in Mahout-Quality #3346 (See 
[https://builds.apache.org/job/Mahout-Quality/3346/])
MAHOUT-1836:reorder javadoc paramter comments closes apache/mahout#227 
(apalumbo: rev 9df093c98c985dce06841e31b40de4f7b9233069)
* mr/src/main/java/org/apache/mahout/vectorizer/DictionaryVectorizer.java


> Order and add missing paramters for 
> DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.
> -
>
> Key: MAHOUT-1836
> URL: https://issues.apache.org/jira/browse/MAHOUT-1836
> Project: Mahout
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 0.12.0
>Reporter: Marku
>Assignee: Andrew Palumbo
>Priority: Trivial
> Fix For: 0.12.1
>
>






RE: Mahout contributions

2016-04-27 Thread Saikat Kanjilal
Andrew, thank you very much for your input. I actually want to start a new 
set of JIRAs. Here's what I want to work on: I want to build a framework that 
ties together search/visualization capability with some machine learning 
algorithms. Essentially, think of it as tying Elasticsearch and Kibana into 
Mahout: the user can search their data with Elasticsearch, and for deeper 
analysis they can feed that data into one or more Mahout backends. Another 
interesting tie-in might be to hack Kibana to render ggplot-like graphics 
based on the output of Mahout algorithms (assuming this can be a Kibana 
plugin).
Before I go hog wild and create a bunch of JIRAs, I'd like to know if there's 
interest in this initiative.  The tool will bring together the ELK stack with 
dynamic machine learning algorithms.  I can go into a lot more detail around 
use cases if there's enough interest.
Looking forward to your and the other committers' input. Thanks

> From: ap@outlook.com
> To: dev@mahout.apache.org
> Subject: Re: Mahout contributions
> Date: Wed, 27 Apr 2016 20:16:38 +
> 
> Hello Saikat,
> 
> #1 and #2 above are already implemented.  #4 is tricky, so I would not 
> recommend it without strong knowledge of the codebase, and #5 is now 
> deprecated (I've just updated the algorithms grid to reflect this).  The 
> algorithms page includes both algorithms implemented in the math-scala 
> library and algorithms which have CLI drivers written for them.
> 
> Please see: http://mahout.apache.org/developers/how-to-contribute.html
> 
> And please note that, per that documentation, it is in everybody's best 
> interest to keep messages on the list; contacting committers directly is 
> discouraged.
> 
> The best way to contribute (if you have not found a new bug or issue) would 
> be to pick a single open issue in the Mahout JIRA which is not already 
> assigned, and start work on it.  When your work is ready for review, just 
> open up a PR and the committers will review it.  Please note that if you do 
> pick up an issue to work on, we do expect some amount of responsibility, 
> reliability, and a tangible amount of satisfactory work, since once you've 
> marked a JIRA as something you're working on, others will pass on it.
> 
> Another good way to contribute would be to look for enhancements that you 
> could make to existing code, not necessarily open JIRAs that need to be 
> assigned to you.  For example, please see the recent contribution and 
> workflow on: 
> https://issues.apache.org/jira/browse/MAHOUT-1833 .
> 
> If you have something new that you'd like to implement, simply start a new 
> JIRA issue and begin work on it.  In this case, when you have some code that 
> is ready for review, you can simply open up a PR for it and committers will 
> review it.  For new implementations, we generally say that you should do 
> this when you are at least 70-80% finished with your coding.
> 
> Thank You,
> 
> Andy
> 
> 
> 
> 
> From: Saikat Kanjilal 
> Sent: Tuesday, April 26, 2016 7:17 PM
> To: dev@mahout.apache.org
> Subject: RE: Mahout contributions
> 
> Hello, following up on my last email with more specifics: I've looked 
> through the wiki (https://mahout.apache.org/users/basics/algorithms.html) 
> and I'm interested in implementing one or more of the following algorithms 
> with Mahout using Spark: 1) Matrix Factorization with ALS 2) Naive Bayes 
> 3) Weighted Matrix Factorization, SVD++ 4) Sparse TF-IDF Vectors from Text 
> 5) Lucene integration.
> I had a few questions: 1) Which of these should I start with, and where is 
> there the greatest need? 2) Should I fork the repo and create branches for 
> each of the above implementations? 3) Should I go ahead and create some 
> JIRAs for these?
> Would love some pointers to get started. Regards
> 
> From: sxk1...@hotmail.com
> To: dev@mahout.apache.org
> Subject: Mahout contributions
> Date: Wed, 30 Mar 2016 10:23:45 -0700
> 
> 
> 
> 
> Hello Committers, I was looking through the current JIRA tickets and was 
> wondering if there's a particular area of Mahout that needs more help than 
> others. Should I focus on contributing some algorithms using the DSL or on 
> Samsara-related efforts? I've finally got some bandwidth to do some work 
> and would love some guidance before assigning myself some tickets. Regards
  

Jenkins build is back to stable : mahout-nightly » Mahout Flink bindings #2074

2016-04-27 Thread Apache Jenkins Server
See 




Jenkins build is back to stable : mahout-nightly #2074

2016-04-27 Thread Apache Jenkins Server
See 



Jenkins build is back to normal : Mahout-Quality #3345

2016-04-27 Thread Apache Jenkins Server
See 



[jira] [Resolved] (MAHOUT-1836) Order and add missing paramters for DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.

2016-04-27 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo resolved MAHOUT-1836.

   Resolution: Fixed
 Assignee: Andrew Palumbo
Fix Version/s: 0.12.1

Thanks for the contribution [~mutekinootoko]!  Please do let us know if you see 
any more issues.

> Order and add missing paramters for 
> DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.
> -
>
> Key: MAHOUT-1836
> URL: https://issues.apache.org/jira/browse/MAHOUT-1836
> Project: Mahout
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 0.12.0
>Reporter: Marku
>Assignee: Andrew Palumbo
>Priority: Trivial
> Fix For: 0.12.1
>
>






[jira] [Updated] (MAHOUT-1836) Order and add missing paramters for DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.

2016-04-27 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1836:
---
Affects Version/s: 0.12.0

> Order and add missing paramters for 
> DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.
> -
>
> Key: MAHOUT-1836
> URL: https://issues.apache.org/jira/browse/MAHOUT-1836
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 0.12.0
>Reporter: Marku
>Priority: Trivial
>






[jira] [Updated] (MAHOUT-1836) Order and add missing paramters for DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.

2016-04-27 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1836:
---
Component/s: Documentation

> Order and add missing paramters for 
> DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.
> -
>
> Key: MAHOUT-1836
> URL: https://issues.apache.org/jira/browse/MAHOUT-1836
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 0.12.0
>Reporter: Marku
>Priority: Trivial
>






[jira] [Commented] (MAHOUT-1836) Order and add missing paramters for DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.

2016-04-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261110#comment-15261110
 ] 

ASF GitHub Bot commented on MAHOUT-1836:


Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/227


> Order and add missing paramters for 
> DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.
> -
>
> Key: MAHOUT-1836
> URL: https://issues.apache.org/jira/browse/MAHOUT-1836
> Project: Mahout
>  Issue Type: Documentation
>Reporter: Marku
>Priority: Trivial
>






[jira] [Work stopped] (MAHOUT-1750) Mahout DSL for Flink: Implement ABt

2016-04-27 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-1750 stopped by Andrew Palumbo.
--
> Mahout DSL for Flink: Implement ABt
> ---
>
> Key: MAHOUT-1750
> URL: https://issues.apache.org/jira/browse/MAHOUT-1750
> Project: Mahout
>  Issue Type: Task
>  Components: Flink, Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Andrew Palumbo
>Priority: Minor
> Fix For: 0.12.1
>
>
> Now ABt is expressed through AtB, which is not optimal, and we need to have a 
> special implementation for ABt





Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-04-27 Thread Pat Ferrel
I have been using the same function through all those versions of Mahout. I’m 
running on newer versions of Spark 1.4-1.6.2. Using my datasets there has been 
no slowdown. I assume that you are only changing the Mahout version—leaving 
data, Spark, HDFS, and all config the same. In which case I wonder if you are 
somehow running into limits of your machine like memory? Have you allocated a 
fixed executor memory limit?

There has been almost no code change to item similarity. Dmitriy, do you know 
if the underlying AtB has changed? I seem to recall the partitioning was set to 
“auto” about 0.11. We were having problems with large numbers of small part 
files from Spark Streaming causing partitioning headaches as I recall. In some 
unexpected way the input structure was trickling down into partitioning 
decisions made in Spark. 

The first thing I’d try is giving the job more executor memory; the second is 
to upgrade Spark. A 3x increase in execution time is a pretty big deal if it 
isn’t helped by these easy fixes, so can you share your data? 
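For readers trying Pat's first suggestion, here is a minimal sketch of what a fixed executor-memory limit looks like at submit time. The memory values, master URL, and jar name are placeholders, not recommendations:

```shell
# Hypothetical submit line; tune the values to your cluster.
# A fixed --executor-memory keeps an executor from silently hitting machine limits.
spark-submit \
  --master spark://your-master:7077 \
  --driver-memory 4g \
  --executor-memory 8g \
  --conf spark.executor.cores=4 \
  your-itemsimilarity-job.jar
```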

On Apr 27, 2016, at 8:37 AM, Dmitriy Lyubimov  wrote:

0.11 targets 1.3+.

I don't have anything off the top of my head affecting A'B specifically,
but I think there were some changes affecting in-memory multiplication
(which is of course used in distributed A'B).

I am not particularly familiar with, nor do I remember, the details of row
similarity off the top of my head; I really wish the original contributor
would comment on that. Trying to see if I can come up with anything useful
though.

What behavior do you see in this job -- CPU-bound or I/O-bound?

there are a few pointers to look at:

(1) I/O many times exceeds the input size, so spills are inevitable.
Tuning memory sizes and checking Spark spill locations to make sure the
disks there are not slow is critical. Also, I think Spark 1.6 added a lot
of flexibility in managing task/cache/shuffle memory sizes; it may help in
some unexpected way.
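A sketch of the knobs point (1) refers to, as a spark-defaults.conf fragment. The property names are from the Spark configuration docs; the values and paths are placeholders, and which properties apply depends on your Spark version:

```shell
# Put shuffle/spill scratch space on fast local disks.
spark.local.dir                 /fast-local-disk/spark-tmp

# Spark <= 1.5 (legacy static memory management):
spark.shuffle.memoryFraction    0.4
spark.storage.memoryFraction    0.4

# Spark 1.6+ (unified memory management):
spark.memory.fraction           0.6
spark.memory.storageFraction    0.5
```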

(2) Sufficient cache: many pipelines commit reused matrices into cache
(MEMORY_ONLY), which is the default Mahout algebra behavior, assuming there
is enough cache memory for only good things to happen. If there is not,
however, it will cause recomputation of results that were evicted (not
saying it is a known case for row similarity in particular). Make sure this
is not the case. For scatter-type exchanges it is especially bad.

(3) A'B -- try to hack and play with the implementation in the AtB
(Spark-side) class. See if you can come up with a better arrangement.

(4) In-memory computations (the MMul class), if that's the bottleneck, can
in practice be quick-hacked with multithreaded multiplication and a bridge
to native solvers (netlib-java), at least for dense cases. This has been
found to improve the performance of distributed multiplications a bit. It
works best if you give 2 threads to the backend and all threads to the
front end.
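On the netlib-java bridge in point (4): netlib-java selects its BLAS backend via system properties. A hedged sketch of forcing the native system BLAS on executors (the class names are taken from the netlib-java project, and this assumes a native BLAS such as OpenBLAS is installed on the workers):

```shell
# Hypothetical: route netlib-java to the system's native BLAS on each executor.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS" \
  your-job.jar   # plus your usual options
```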

There are other known things that can improve the multiplication speed of the
public Mahout version; I hope Mahout will improve on those in the future.

-d

On Wed, Apr 27, 2016 at 6:14 AM, Nikaash Puri  wrote:

> Hi,
> 
> I’ve been working with LLR in Mahout for a while now. Mostly using the
> SimilarityAnalysis.cooccurenceIDss function. I recently upgraded the Mahout
> libraries to 0.11, and subsequently also tried with 0.12 and the same
> program is running orders of magnitude slower (at least 3x based on initial
> analysis).
> 
> Looking into the tasks more carefully, comparing 0.10 and 0.11 shows that
> the amount of Shuffle being done in 0.11 is significantly higher,
> especially in the AtB step. This could possibly be a reason for the
> reduction in performance.
> 
> Although, I am working on Spark 1.2.0, so it's possible that this could be
> causing the problem. It works fine with Mahout 0.10.
> 
> Any ideas why this might be happening?
> 
> Thank you,
> Nikaash Puri



Re: Mahout contributions

2016-04-27 Thread Andrew Palumbo
Hello Saikat,

#1 and #2 above are already implemented.  #4 is tricky, so I would not recommend 
it without a strong knowledge of the codebase, and #5 is now deprecated.  (I've 
just updated the algorithms grid to reflect this.)  The algorithms page 
includes both algorithms implemented in the math-scala library and algorithms 
which have CLI drivers written for them.  

Please see: http://mahout.apache.org/developers/how-to-contribute.html

And please note that, per that documentation, it is in everybody's best interest 
to keep messages on the list; contacting committers directly is discouraged.

The best way to contribute (if you have not found a new bug or issue) would be 
to pick a single open issue in the Mahout JIRA which is not already assigned, 
and start work on it.  When your work is ready for review, just open up a PR 
and the committers will review it.  Please note that if you do pick up an issue 
to work on, we do expect some amount of responsibility, reliability, and a 
tangible amount of satisfactory work, since once you've marked a JIRA as 
something you're working on, others will pass on it.

Another good way to contribute would be to look for enhancements that you could 
make to existing code, not necessarily open JIRAs that need to be assigned to 
you.  For example, please see the recent contribution and workflow on: 
https://issues.apache.org/jira/browse/MAHOUT-1833 .

If you have something new that you'd like to implement, simply start a new JIRA 
issue and begin work on it.  In this case, when you have some code that is 
ready for review, you can open up a PR for it and committers will review it.  
For new implementations, we generally say that you should do this when you are 
at least 70-80% finished with your coding.

Thank You,

Andy




From: Saikat Kanjilal 
Sent: Tuesday, April 26, 2016 7:17 PM
To: dev@mahout.apache.org
Subject: RE: Mahout contributions

Hello,

Following up on my last email with more specifics: I've looked through the 
wiki (https://mahout.apache.org/users/basics/algorithms.html) and I'm 
interested in implementing one or more of the following algorithms with 
Mahout using Spark:
1) Matrix Factorization with ALS
2) Naive Bayes
3) Weighted Matrix Factorization, SVD++
4) Sparse TF-IDF Vectors from Text
5) Lucene integration

I had a few questions:
1) Which of these should I start with, and where is there the greatest need?
2) Should I fork the repo and create branches for each of the above implementations?
3) Should I go ahead and create some JIRAs for these?

Would love some pointers to get started.

Regards

From: sxk1...@hotmail.com
To: dev@mahout.apache.org
Subject: Mahout contributions
Date: Wed, 30 Mar 2016 10:23:45 -0700




Hello Committers,

I was looking through the current JIRA tickets and was wondering if there's a 
particular area of Mahout that needs more help than others. Should I focus on 
contributing some algorithms using the DSL, or on Samsara-related efforts? 
I've finally got some bandwidth to do some work and would love some guidance 
before assigning myself some tickets.

Regards


[jira] [Created] (MAHOUT-1837) Sparse/Dense Matrix analysis for Matrix Multiplication

2016-04-27 Thread Andrew Palumbo (JIRA)
Andrew Palumbo created MAHOUT-1837:
--

 Summary: Sparse/Dense Matrix analysis for Matrix Multiplication
 Key: MAHOUT-1837
 URL: https://issues.apache.org/jira/browse/MAHOUT-1837
 Project: Mahout
  Issue Type: Improvement
  Components: Math
Affects Versions: 0.12.0
Reporter: Andrew Palumbo
 Fix For: 0.12.1


In matrix multiplication, sparse matrices can easily turn dense and bloat 
memory: one fully dense column and one fully dense row can cause a sparse %*% 
sparse operation to have a dense result.  

There are two issues here, one with a quick fix and one a bit more involved:
   #  In {{ABt.scala}}, check the {{MatrixFlavor}} of the combiner and use the 
flavor of the block as the resulting sparse or dense matrix type:
{code}
val comb = if (block.getFlavor == MatrixFlavor.SPARSELIKE) {
  new SparseMatrix(prodNCol, block.nrow).t
} else {
  new DenseMatrix(prodNCol, block.nrow).t
}
{code}
 A similar check needs to be made in the {{blockify}} transformation.
 
   #  More importantly, and more involved, is to do an actual analysis of the 
resulting matrix data in the in-core {{mmul}} class and use a matrix of the 
appropriate structure as a result. 
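The more involved analysis in item 2 amounts to estimating how dense the product will be before allocating the result. Below is a minimal sketch of such a heuristic; it is hypothetical, not Mahout's actual {{mmul}} code -- the object name, method names, and the particular upper bound are illustrative assumptions only.

```scala
// Hypothetical density heuristic, not Mahout's actual mmul analysis.
// For C = A %*% B, each inner index p pairs column p of A with row p of B,
// contributing at most colNnzA(p) * rowNnzB(p) nonzeros to C. Summing over p
// gives an upper bound on nnz(C); compare it to the full size to pick a
// sparse or dense result structure.
object ProductDensity {
  def estimate(colNnzA: Array[Int], rowNnzB: Array[Int],
               nRowsA: Int, nColsB: Int): Double = {
    require(colNnzA.length == rowNnzB.length)
    val upperNnz = colNnzA.zip(rowNnzB)
      .map { case (ca, rb) => ca.toLong * rb }
      .sum
    math.min(1.0, upperNnz.toDouble / (nRowsA.toLong * nColsB))
  }

  // One fully dense column of A meeting one fully dense row of B already
  // forces a fully dense result -- exactly the blow-up described above.
  def preferDense(density: Double, threshold: Double = 0.25): Boolean =
    density > threshold
}
```

The threshold at which a sparse structure stops paying off is workload-dependent; 0.25 here is a placeholder, not a recommendation.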



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


About reuters-fkmeans-centroids

2016-04-27 Thread Prakash Poudyal
Hi!

I am using fuzzy clustering, but I could not understand the "-c
reuters-fkmeans-centroids" argument. How is this calculated?


$ /bin/mahout fkmeans -i reuters-vectors/tfidf-vectors/ -c
reuters-fkmeans-centroids -o reuters-fkmeans-clusters -cd 1.0 -k 21 -m 2
-ow -x 10 -dm
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure

-- 

Regards
Prakash Poudyal


[jira] [Commented] (MAHOUT-1836) Order and add missing paramters for DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.

2016-04-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260485#comment-15260485
 ] 

ASF GitHub Bot commented on MAHOUT-1836:


Github user smarthi commented on the pull request:

https://github.com/apache/mahout/pull/227#issuecomment-215148971
  
lgtm


> Order and add missing paramters for 
> DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.
> -
>
> Key: MAHOUT-1836
> URL: https://issues.apache.org/jira/browse/MAHOUT-1836
> Project: Mahout
>  Issue Type: Documentation
>Reporter: Marku
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-04-27 Thread Dmitriy Lyubimov
0.11 targets 1.3+.

I don't quite have anything off the top of my head affecting A'B specifically,
but I think there were some changes affecting in-memory multiplication
(which is of course used in distributed A'B).

I am not particularly familiar with, nor do I remember, the details of row
similarity off the top of my head; I really wish the original contributor
would comment on that. I am trying to see if I can come up with anything
useful, though.

What behavior do you see in this job -- CPU-bound or I/O-bound?

There are a few pointers to look at:

(1) I/O is many times the input size, so spills are inevitable. Tuning memory
sizes and checking the Spark spill locations to make sure the disks there are
not slow is critical. Also, I think Spark 1.6 added a lot of flexibility in
managing task/cache/shuffle memory sizes; it may help in some unexpected way.

(2) Sufficient cache: many pipelines commit reused matrices to cache
(MEMORY_ONLY), which is the default Mahout algebra behavior, assuming there
is enough cache memory for only good things to happen. If there is not,
however, it will cause recomputation of results that were evicted (not
saying this is a known case for row similarity in particular); make sure
this is not happening. For scatter-type exchanges it is especially bad.

(3) A'B -- try to hack and play with the implementation in the AtB
(Spark-side) class. See if you can come up with a better arrangement.
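For orientation when experimenting with AtB: the reason A'B distributes nicely is that it decomposes into a sum of outer products of co-indexed rows, so A and B only need to be co-partitioned by row and each partition contributes an independent partial product. A plain in-memory sketch of that identity (assumption: bare Scala arrays, no Spark or Mahout classes involved):

```scala
// A'B as a sum over rows i of outer(A(i, ::), B(i, ::)).
// This is the identity the distributed AtB operator exploits: each
// row-aligned block of A and B yields an independent partial product,
// and the partials are summed in the shuffle.
object AtBSketch {
  def atB(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
    require(a.length == b.length)      // A and B must have the same row count
    val m = a(0).length                // columns of A = rows of the result
    val n = b(0).length                // columns of B = columns of the result
    val c = Array.ofDim[Double](m, n)
    for (i <- a.indices; p <- 0 until m; q <- 0 until n)
      c(p)(q) += a(i)(p) * b(i)(q)     // accumulate one outer product per row
    c
  }
}
```

A better "arrangement" in the distributed operator is mostly about how these per-row (or per-block) partials are grouped and combined before the shuffle.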

(4) In-memory computations (the MMul class), if that is the bottleneck, can
in practice be quick-hacked with multithreaded multiplication and a bridge to
native solvers (netlib-java), at least for dense cases. This has been found
to improve the performance of distributed multiplications a bit. It works
best if you get 2 threads in the backend and all threads in the front end.
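The multithreading part of item 4 can be as simple as striping the rows of the left operand across a couple of worker threads. A self-contained sketch follows; it is an assumption-laden illustration using plain Scala threads over raw arrays, not a patch to Mahout's MMul, and a real netlib-java bridge would instead delegate the inner loops to native BLAS.

```scala
// Multithreaded dense multiply sketch (not Mahout's MMul): thread t computes
// rows t, t + nThreads, t + 2*nThreads, ... of C = A %*% B. Each thread
// writes a disjoint set of rows, so no synchronization beyond join() is needed.
object ParMMul {
  def multiply(a: Array[Array[Double]], b: Array[Array[Double]],
               nThreads: Int = 2): Array[Array[Double]] = {
    val n = a.length; val k = b.length; val m = b(0).length
    val c = Array.ofDim[Double](n, m)
    val workers = (0 until nThreads).map { t =>
      new Thread(new Runnable {
        def run(): Unit = {
          var i = t
          while (i < n) {                // stripe of rows owned by thread t
            var j = 0
            while (j < m) {
              var s = 0.0; var p = 0
              while (p < k) { s += a(i)(p) * b(p)(j); p += 1 }
              c(i)(j) = s
              j += 1
            }
            i += nThreads
          }
        }
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    c
  }
}
```

The "2 threads in the backend" advice maps to nThreads = 2 here; on an executor that already runs many tasks, more threads per multiplication mostly adds contention.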

There are other known things that can improve the multiplication speed of the
public Mahout version; I hope Mahout will improve on those in the future.

-d

On Wed, Apr 27, 2016 at 6:14 AM, Nikaash Puri  wrote:

> Hi,
>
> I’ve been working with LLR in Mahout for a while now. Mostly using the
> SimilarityAnalysis.cooccurenceIDss function. I recently upgraded the Mahout
> libraries to 0.11, and subsequently also tried with 0.12 and the same
> program is running orders of magnitude slower (at least 3x based on initial
> analysis).
>
> Looking into the tasks more carefully, comparing 0.10 and 0.11 shows that
> the amount of Shuffle being done in 0.11 is significantly higher,
> especially in the AtB step. This could possibly be a reason for the
> reduction in performance.
>
> Although, I am working on Spark 1.2.0, so it's possible that this could be
> causing the problem. It works fine with Mahout 0.10.
>
> Any ideas why this might be happening?
>
> Thank you,
> Nikaash Puri


spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-04-27 Thread Nikaash Puri
Hi,

I’ve been working with LLR in Mahout for a while now. Mostly using the 
SimilarityAnalysis.cooccurenceIDss function. I recently upgraded the Mahout 
libraries to 0.11, and subsequently also tried with 0.12 and the same program 
is running orders of magnitude slower (at least 3x based on initial analysis). 
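For context, the LLR scoring inside SimilarityAnalysis is Dunning's log-likelihood ratio over a 2x2 cooccurrence contingency table (k11 = users with both items, k12/k21 = users with only one of them, k22 = users with neither). A standalone sketch of the standard formulation, written here independently rather than copied from Mahout's math library:

```scala
// Dunning's log-likelihood ratio for a 2x2 contingency table:
//   LLR = 2 * (rowEntropy + columnEntropy - matrixEntropy)
// where entropy(xs) = xLogX(sum(xs)) - sum(xLogX(x)) and xLogX(0) = 0.
object Llr {
  private def xLogX(x: Long): Double =
    if (x == 0) 0.0 else x * math.log(x.toDouble)

  private def entropy(xs: Long*): Double =
    xLogX(xs.sum) - xs.map(xLogX).sum

  def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy    = entropy(k11 + k12, k21 + k22)
    val columnEntropy = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    2.0 * (rowEntropy + columnEntropy - matrixEntropy)
  }
}
```

Higher scores indicate stronger-than-chance cooccurrence; independent events score 0.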

Looking into the tasks more carefully, comparing 0.10 and 0.11 shows that the 
amount of Shuffle being done in 0.11 is significantly higher, especially in the 
AtB step. This could possibly be a reason for the reduction in performance. 

Although, I am working on Spark 1.2.0, so it's possible that this could be 
causing the problem. It works fine with Mahout 0.10. 

Any ideas why this might be happening?

Thank you,
Nikaash Puri

[jira] [Commented] (MAHOUT-1836) Order and add missing paramters for DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.

2016-04-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259898#comment-15259898
 ] 

ASF GitHub Bot commented on MAHOUT-1836:


Github user mutekinootoko commented on the pull request:

https://github.com/apache/mahout/pull/227#issuecomment-215035826
  
@andrewpalumbo  sure


> Order and add missing paramters for 
> DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.
> -
>
> Key: MAHOUT-1836
> URL: https://issues.apache.org/jira/browse/MAHOUT-1836
> Project: Mahout
>  Issue Type: Documentation
>Reporter: Marku
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1836) Order and add missing paramters for DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.

2016-04-27 Thread Marku (JIRA)
Marku created MAHOUT-1836:
-

 Summary: Order and add missing paramters for 
DictionaryVectorizer.createTermFrequencyVectors() javadoc parameter comments.
 Key: MAHOUT-1836
 URL: https://issues.apache.org/jira/browse/MAHOUT-1836
 Project: Mahout
  Issue Type: Documentation
Reporter: Marku
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)