[jira] [Updated] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-4823:
    Labels: bulk-closed  (was: )

> rowSimilarities
> ---
>
>                 Key: SPARK-4823
>                 URL: https://issues.apache.org/jira/browse/SPARK-4823
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Reza Zadeh
>            Priority: Major
>              Labels: bulk-closed
>         Attachments: MovieLensSimilarity Comparisons.pdf, SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf
>
> RowMatrix has a columnSimilarities method to find cosine similarities between columns.
> A rowSimilarities method would be useful to find similarities between rows.
> This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
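The 10^12 figure in the issue description follows from counting row pairs. The short sketch below works through that arithmetic. The function name and the 8 bytes per entry (one double-precision value per similarity) are assumptions made for illustration; they do not come from the issue.

```python
def brute_force_output_size(num_rows, bytes_per_entry=8):
    """Count the similarities a brute-force all-pairs pass would emit.

    bytes_per_entry=8 is an assumption (one 64-bit float per similarity).
    """
    all_pairs = num_rows * num_rows                 # full n x n similarity matrix
    unique_pairs = num_rows * (num_rows - 1) // 2   # upper triangle, no diagonal
    dense_bytes = all_pairs * bytes_per_entry       # storage for the full matrix
    return all_pairs, unique_pairs, dense_bytes


all_pairs, unique_pairs, dense_bytes = brute_force_output_size(10**6)
print(all_pairs)           # 1000000000000, i.e. 10^12
print(unique_pairs)        # 499999500000
print(dense_bytes / 1e12)  # 8.0 (terabytes, under the 8-byte assumption)
```

Even when symmetry is used to keep only the upper triangle, the output stays on the order of 10^12 entries, which is why the issue asks for something better than brute force.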
[jira] [Updated] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Debasish Das updated SPARK-4823:
    Attachment: SparkMeetup2015-Experiments2.pdf
                SparkMeetup2015-Experiments1.pdf

> rowSimilarities
> ---
>
>                 Key: SPARK-4823
>                 URL: https://issues.apache.org/jira/browse/SPARK-4823
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Reza Zadeh
>         Attachments: MovieLensSimilarity Comparisons.pdf, SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf
[jira] [Updated] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Debasish Das updated SPARK-4823:
    Attachment: MovieLensSimilarity Comparisons.pdf

The attached file shows the runtime comparison of the row-based and column-based flows on all items from the MovieLens dataset, run on my local MacBook with 8 cores, 1 GB driver memory, and 4 GB executor memory. A threshold of 1e-2 is set for both the row-based kernel flow and the column-based DIMSUM flow.

Stages 24-35 are the row similarity flow. Total runtime: ~20 s.
Stage 64 is the column similarity mapPartitions. Total runtime: ~4.6 min.

This shows the power of blocking in Spark, and I have not yet moved to gemv, which will decrease the runtime further.

I updated the driver code in examples.mllib.MovieLensSimilarity.

> rowSimilarities
> ---
>
>                 Key: SPARK-4823
>                 URL: https://issues.apache.org/jira/browse/SPARK-4823
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Reza Zadeh
>         Attachments: MovieLensSimilarity Comparisons.pdf
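The comment above does not show the row-based flow itself, so the following is only a rough single-process sketch of what a blocked, thresholded row cosine similarity could look like. It assumes that "blocking" means comparing contiguous blocks of rows against each other and that the threshold drops pairs whose cosine similarity falls below it. The function names, the block size, and the toy data are hypothetical, and no Spark API is used.

```python
import math


def normalize_rows(rows):
    """Scale each row to unit L2 norm so a dot product equals cosine similarity."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        # All-zero rows are kept as-is; their similarity to anything is 0.
        normalized.append([x / norm for x in row] if norm > 0 else list(row))
    return normalized


def blocked_row_similarities(rows, block_size, threshold):
    """Return {(i, j): cosine} for i < j with cosine >= threshold.

    Rows are visited block by block; in a distributed setting each
    (block, block) pair would be an independent unit of work.
    """
    unit = normalize_rows(rows)
    n = len(unit)
    starts = range(0, n, block_size)
    result = {}
    for bi in starts:
        for bj in starts:
            if bj < bi:
                continue  # symmetry: only the upper triangle of block pairs
            for i in range(bi, min(bi + block_size, n)):
                for j in range(max(bj, i + 1), min(bj + block_size, n)):
                    sim = sum(a * b for a, b in zip(unit[i], unit[j]))
                    if sim >= threshold:
                        result[(i, j)] = sim
    return result


# Toy data (hypothetical): four rows with three features each.
rows = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
sims = blocked_row_similarities(rows, block_size=2, threshold=1e-2)
for pair in sorted(sims):
    print(pair, round(sims[pair], 4))
```

With the 1e-2 threshold, orthogonal pairs such as rows 0 and 2 are dropped from the output. The sketch still evaluates every pair inside each block pair; it only illustrates the block structure and the thresholded output, not the runtime behaviour reported in the attachment.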