Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Pat Ferrel
Matt, I’ll create a feature branch of Mahout in my git repo for simplicity (we are in code freeze for Mahout right now). Then you could peel off your changes and make a PR against it. Everyone can have a look before any change is made to the ASF repos. Do a PR against this

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Andrew Palumbo
I should mention that the density is currently set quite high, and we've been discussing a user-defined setting for this, something that we have not worked in yet. From: Andrew Palumbo Sent: Monday, August 21, 2017 2:44:35 PM To:

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Andrew Palumbo
We do currently have optimizations based on density analysis in use, e.g. in AtB: https://github.com/apache/mahout/blob/08e02602e947ff945b9bd73ab5f0b45863df3e53/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/package.scala#L431 +1 to PR. Thanks for pointing this out. --andy
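
For readers unfamiliar with the idea of a density-based optimization, here is a rough sketch in Python. It is not the Mahout code linked above: the names `density` and `choose_layout`, the dictionary-based matrix, and the default threshold of 0.25 are all assumptions made for illustration. It only shows the general pattern of measuring the fraction of non-zero cells and picking a dense or sparse code path from a configurable threshold.

```python
from typing import Dict, Tuple

# Hypothetical sparse matrix: maps (row, col) to a stored value.
SparseEntries = Dict[Tuple[int, int], float]


def density(entries: SparseEntries, n_rows: int, n_cols: int) -> float:
    """Return the fraction of cells that hold a non-zero value."""
    if n_rows == 0 or n_cols == 0:
        return 0.0
    non_zeros = sum(1 for value in entries.values() if value != 0.0)
    return non_zeros / (n_rows * n_cols)


def choose_layout(entries: SparseEntries, n_rows: int, n_cols: int,
                  threshold: float = 0.25) -> str:
    """Pick "dense" when density exceeds the threshold, else "sparse".

    The threshold is an illustrative, caller-supplied setting; it is not
    the value used by any particular library.
    """
    if density(entries, n_rows, n_cols) > threshold:
        return "dense"
    return "sparse"


small = {(0, 0): 1.0, (1, 1): 2.0}
print(choose_layout(small, 2, 2))    # half of the 4 cells are non-zero
print(choose_layout(small, 10, 10))  # 2 of 100 cells are non-zero
```

Exposing the threshold as a parameter mirrors the kind of user-defined setting discussed in the thread; a real implementation would likely sample rows rather than count every entry.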

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Pat Ferrel
Is it possible to add it to Mahout so as to get the unit tests run? If so we also have a bunch of integration tests as well as my real-world data. Again, I don’t see anything wrong with skipping zeros in any case but this method is known to be slower for certain types of math (IIRC). So I’d bet

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Pat Ferrel
That looks like ancient code from the old mapreduce days. If it passes unit tests, create a PR. Just a guess here, but there are times when this might not speed things up but slow them down. However, for very sparse matrices that you might see in CF, this could work quite well. Some of the GPU

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Scruggs, Matt
Good question :D For the dataset I mentioned in my first message, the entire run is almost 10x faster (I expect that speedup to be non-linear since it nearly eliminates a for loop...bigger gains for bigger datasets). It's possible there are other sections of the code I can't override (e.g.

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Pat Ferrel
Interesting indeed. What is “massive”? Does the change pass all unit tests? On Aug 17, 2017, at 1:04 PM, Scruggs, Matt wrote: Thanks for the remarks guys! I profiled the code running locally on my machine and discovered this loop is where these setQuick() and