[
https://issues.apache.org/jira/browse/MAHOUT-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112686#comment-14112686
]
ASF GitHub Bot commented on MAHOUT-1604:
----------------------------------------
GitHub user pferrel opened a pull request:
https://github.com/apache/mahout/pull/47
MAHOUT-1604 Row Similarity
This implements a driver for spark-rowsimilarity, along with supporting
code. Because there are now two primary methods, one for item (column)
similarity and one for row similarity, CooccurrenceAnalysis has been
refactored into SimilarityAnalysis.
Added reader code that supports extended DRMs (text encoded with
application-specific IDs) as well as the one-element-per-row encoding.
Writing still supports only extended DRMs.
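To make the "extended DRM" idea concrete, here is a minimal sketch of parsing one text row keyed by application-specific IDs. It is not the reader added in this pull request. The class name `ExtendedDrmLine`, the tab between the row ID and its elements, the spaces between elements, and the `columnID:value` form are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not Mahout's reader. All delimiters are assumptions.
class ExtendedDrmLine {
    final String rowId;
    // Column ID -> strength, kept in the order the elements appear on the line.
    final Map<String, Double> elements = new LinkedHashMap<>();

    ExtendedDrmLine(String rowId) {
        this.rowId = rowId;
    }

    // Parses "rowID<TAB>colID:value colID:value ..." (assumed layout).
    static ExtendedDrmLine parse(String line) {
        String[] rowAndElements = line.split("\t", 2);
        ExtendedDrmLine parsed = new ExtendedDrmLine(rowAndElements[0].trim());
        if (rowAndElements.length < 2 || rowAndElements[1].trim().isEmpty()) {
            return parsed; // a row with no elements
        }
        for (String element : rowAndElements[1].trim().split("\\s+")) {
            int sep = element.lastIndexOf(':');
            if (sep < 0) {
                // No strength given: treat the element as present with weight 1.0 (assumption).
                parsed.elements.put(element, 1.0);
            } else {
                parsed.elements.put(element.substring(0, sep),
                        Double.parseDouble(element.substring(sep + 1)));
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        ExtendedDrmLine row = parse("doc1\tapple:2.0 pear:0.5");
        System.out.println(row.rowId + " -> " + row.elements);
    }
}
```

A real reader would also need to map the string IDs to integer matrix indices and keep that mapping so results can be written back with the original IDs.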
These changes also relate to MAHOUT-1568 and MAHOUT-1569: creating a driver
and parsing options are now simpler.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/pferrel/mahout mahout-1604
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/mahout/pull/47.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #47
----
commit 107a0ba9605241653a85b113661a8fa5c055529f
Author: pferrel <[email protected]>
Date: 2014-06-04T19:54:22Z
added Sebastian's CooccurrenceAnalysis patch updated it to use current
Mahout-DSL
commit 74b9921c4c9bd8903585bbd74d9e66298ea8b7a0
Author: pferrel <[email protected]>
Date: 2014-06-04T20:09:07Z
Adding stuff for itemsimilarity driver for Spark
commit a59265931ed3a51ba81e1a0a7171ebb102be4fa4
Author: pferrel <[email protected]>
Date: 2014-06-04T20:13:13Z
added scopt to pom deps
commit 16c03f7fa73c156859d1dba3a333ef9e8bf922b0
Author: pferrel <[email protected]>
Date: 2014-06-04T21:32:18Z
added Sebastian's MurmurHash changes
Signed-off-by: pferrel <[email protected]>
commit 8a4b4347ddb7b9ac97590aa20189d89d8a07a80a
Author: pferrel <[email protected]>
Date: 2014-06-04T21:33:11Z
Merge branch 'mahout-1464' into mahout-1541
commit 2f87f5433f90fa2c49ef386ca245943e1fc73beb
Author: pferrel <[email protected]>
Date: 2014-06-05T01:44:16Z
MAHOUT-1541 still working on this, some refactoring in the DSL for
abstracting away Spark has moved access to rdds. No Jira is closed yet
commit c6adaa44c80bba99d41600e260bbb1ad5c972e69
Author: pferrel <[email protected]>
Date: 2014-06-05T16:52:23Z
MAHOUT-1464 import cleanup, minor changes to examples for running on Spark
Cluster
commit 2caceab31703ed214c1e66d5fc63b8bdb05d37a3
Author: pferrel <[email protected]>
Date: 2014-06-05T16:55:09Z
Merge branch 'mahout-1464' into mahout-1541
commit 6df6a54e3ff174d39bd817caf7d16c2d362be3f8
Author: pferrel <[email protected]>
Date: 2014-06-07T20:39:25Z
Merge branch 'master' into mahout-1541
commit a2f84dea3f32d3df3e98c61f085bc1fabd453551
Author: pferrel <[email protected]>
Date: 2014-06-07T21:27:06Z
drmWrap seems to be the answer to the changed DrmLike interface. Code works
again but more to do.
commit d3a2ba5027436d0abef67a1a5e82557064f4ba49
Author: pferrel <[email protected]>
Date: 2014-06-17T16:00:38Z
merged master, got new cooccurrence code
commit 4b2fb07b21a8ac2d532ee51b65b27d1482293cb0
Author: pferrel <[email protected]>
Date: 2014-06-19T17:08:02Z
for high level review, not ready for merge
commit 996ccfb82a8ed3ff90f51968e661b2449f3c4759
Author: pferrel <[email protected]>
Date: 2014-06-19T17:46:23Z
for high level review, not ready for merge. changed to dot notation
commit f62ab071869ee205ad398a3e094d871138e11a9e
Author: pferrel <[email protected]>
Date: 2014-06-19T18:13:44Z
for high level review, not ready for merge. fixed a couple scaladoc refs
commit cbef0ee6264c28d0597cb2507427a647771c9bcd
Author: pferrel <[email protected]>
Date: 2014-06-23T21:49:20Z
adding tests, had to modify some test framework Scala to make the masterUrl
visible to tests
commit ab8009f6176f0c21a07e15cc5cc8a9717dd7cc4c
Author: pferrel <[email protected]>
Date: 2014-06-25T15:41:54Z
adding more tests for ItemSimilarityDriver
commit 47258f59df7f215b1bb25830d13d9b85fa8d19e9
Author: pferrel <[email protected]>
Date: 2014-06-25T15:44:47Z
merged master changes and fixed a pom conflict
commit 9a02e2a5ea8540723c1bfc6ea01b045bb4175922
Author: pferrel <[email protected]>
Date: 2014-06-25T16:57:55Z
remove tmp after all tests, fixed dangling comma in input file list
commit 3c343ff18600f0a0e59f5bfd63bd86db0db0e8c5
Author: pferrel <[email protected]>
Date: 2014-06-26T22:19:48Z
changes to pom, mahout driver script, and cleaned up help text
commit 213b18dee259925de82c703451bdea640e1f068e
Author: pferrel <[email protected]>
Date: 2014-06-26T22:26:17Z
added a job.xml assembly for creation of an all-dependencies jar
commit 627d39f30860e4ab43783c72cc2cf8926060b73c
Author: pferrel <[email protected]>
Date: 2014-06-27T16:44:37Z
registered HashBiMap with JavaSerializer in Kryo
commit c273dc7de3c740189ce8157b334c2eef3a4c23ea
Author: pferrel <[email protected]>
Date: 2014-06-27T21:30:13Z
increased the default max heap for mahout/JVM to 4g, using max of 4g for
Spark executor
commit 9dd2f2eabf1bf64660de6b5b5e49aafe18229a7a
Author: Pat Ferrel <[email protected]>
Date: 2014-06-30T17:06:49Z
tweaking memory requirements to process epinions with the
ItemSimilarityDriver
commit 6ec98f32775c791ee001fc996f475215e427f368
Author: pferrel <[email protected]>
Date: 2014-06-30T17:08:49Z
refactored to use a DistributedContext instead of raw SparkContext
commit 48774e154a6e55e04037c787f8d64bc9e545f1bd
Author: pferrel <[email protected]>
Date: 2014-06-30T17:08:59Z
merging changes made on to run a large dataset through itemsimilarity on
the cluster
commit 8e70091a564c8464ea70bf90006d8124c3a7f208
Author: pferrel <[email protected]>
Date: 2014-06-30T20:11:42Z
fixed a bug, SparkConf in driver was ignored and blank one passed in to
create a DistributedContext
commit 01a0341f56071d2244aabd6de8c6f528ad35b164
Author: pferrel <[email protected]>
Date: 2014-06-30T20:33:39Z
added option for configuring Spark executor memory
commit 2d9efd73def8207dded5cd1dd8699035a8cc1b34
Author: pferrel <[email protected]>
Date: 2014-06-30T22:37:19Z
removed some outdated examples
commit 9fb281022cba7666dd26701b3d97d200b13c35f8
Author: pferrel <[email protected]>
Date: 2014-07-01T18:17:42Z
test naming and pom changed to up the jvm heap max to 512m for scalatests
commit 674c9b7862f0bd0723de026eb4527546b52e8a0b
Author: pferrel <[email protected]>
Date: 2014-07-01T18:18:59Z
Merge branch 'mahout-1541' of https://github.com/pferrel/mahout into
mahout-1541
----
> Create a RowSimilarity for Spark
> --------------------------------
>
> Key: MAHOUT-1604
> URL: https://issues.apache.org/jira/browse/MAHOUT-1604
> Project: Mahout
> Issue Type: Bug
> Components: CLI
> Affects Versions: 1.0
> Environment: Spark
> Reporter: Pat Ferrel
> Assignee: Pat Ferrel
>
> Using CooccurrenceAnalysis.cooccurrence, create a driver that reads one or
> two text DRMs and produces LLR similarity/cross-similarity matrices.
> This will produce the same results as ItemSimilarity but take a DRM as input
> instead of individual cells.
> The first version will support only LLR; other similarity measures will need
> to be handled in separate Jiras.
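
Since the issue limits the first version to LLR, a small standalone sketch of the log-likelihood ratio on a 2x2 contingency table may help readers follow what the similarity score measures. This is the standard entropy-based formulation, written here for illustration and not copied from this pull request. The names `Llr`, `xLogX`, `entropy`, and `logLikelihoodRatio` are chosen for the example. For two rows, the sketch assumes `k11` counts columns present in both rows, `k12` and `k21` count columns present in only one of them, and `k22` counts columns present in neither.

```java
// Illustrative sketch of the log-likelihood ratio (LLR) on a 2x2 table.
class Llr {
    // x * ln(x), with the usual convention that 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts.
    static double entropy(long... counts) {
        long sum = 0;
        double sumOfXLogX = 0.0;
        for (long count : counts) {
            sumOfXLogX += xLogX(count);
            sum += count;
        }
        return xLogX(sum) - sumOfXLogX;
    }

    // LLR = 2 * (rowEntropy + columnEntropy - matrixEntropy), clamped at 0
    // to absorb small negative values caused by floating-point round-off.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    public static void main(String[] args) {
        // Perfectly associated counts score high; independent counts score 0.
        System.out.println(logLikelihoodRatio(1, 0, 0, 1));
        System.out.println(logLikelihoodRatio(1, 1, 1, 1));
    }
}
```

A higher score means the observed overlap is less likely under independence; a driver would typically keep only the top-scoring pairs per row.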