[
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065509#comment-14065509
]
ASF GitHub Bot commented on MAHOUT-1541:
----------------------------------------
GitHub user pferrel opened a pull request:
https://github.com/apache/mahout/pull/31
MAHOUT-1541
Covers MAHOUT-1464, and MAHOUT-1541
Fixes cross-cooccurrence, which originally computed the product in the wrong
order. If A is the primary interaction matrix, then the cross-cooccurrence
should be A'B but was B'A. Also added tests for A and B of different column
cardinality, which is allowed.
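A minimal in-core sketch of the ordering point, using plain Scala arrays rather than Mahout DRMs (the object and helper names here are illustrative, not Mahout API):

```scala
// Illustrative only: in-core dense matrices standing in for DRMs.
// A is users x itemsA (primary actions), B is users x itemsB (secondary
// actions). The cross-cooccurrence indicator must be A' %*% B, giving an
// itemsA x itemsB result keyed by the primary items -- not B' %*% A.
object CrossCooccurrence {
  type Mat = Array[Array[Double]]

  def transpose(m: Mat): Mat =
    Array.tabulate(m(0).length, m.length)((i, j) => m(j)(i))

  def mmul(x: Mat, y: Mat): Mat =
    Array.tabulate(x.length, y(0).length) { (i, j) =>
      (0 until y.length).map(k => x(i)(k) * y(k)(j)).sum
    }

  // A and B must share row (user) cardinality but may differ in column
  // (item) cardinality; the product is itemsA x itemsB.
  def cross(a: Mat, b: Mat): Mat = mmul(transpose(a), b)
}
```

Note the result's row dimension comes from A's columns, which is why tests with different column cardinalities for A and B exercise the ordering fix.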
Looked a bit deeper into the @dev thread on changing the cardinality of a
sparse matrix. This is needed in ItemSimilarity because the true cardinality of
the cross-interaction matrices can be known only once all of them have been
read in. The much more typical case is that cardinalities can be computed
directly from the input, so there needs to be a method to update the
cardinality after the matrices have been read in.
This implementation creates a new abstract method in CheckpointedDrm and an
implementation in CheckpointedDrmSpark. Here is the reasoning.
1) For sparse DRMs there is no need for any representation of an empty row
(or column); not even the keys need to be known, only the cardinality. You only
have to think about a transpose of sparse vectors to see that this must be so.
Further, it works, and I’ve relied on it since the Hadoop MR version. Barring
any revelation from the math gods--it is so.
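The transpose argument in point 1 can be sketched with a toy sparse representation (a map of row key to non-zero entries, not Mahout's actual DRM structure):

```scala
// Illustrative sketch: a sparse matrix as row-key -> (col -> value). An
// entirely empty row is simply absent from the map; only the cardinality
// metadata (kept elsewhere) records that it exists. Transposing shows why
// this is safe: an empty row contributes no entries to any row of the
// transpose, so no representation of it is ever needed.
object SparseTranspose {
  type Sparse = Map[Int, Map[Int, Double]]

  def transpose(m: Sparse): Sparse =
    m.toSeq
      .flatMap { case (r, row) => row.map { case (c, v) => (c, r, v) } }
      .groupBy(_._1)
      .map { case (c, triples) => c -> triples.map(t => (t._2, t._3)).toMap }
}
```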
2) rbind semantics apply to dense matrices. IMO this should be avoided here
because even if we rejigger rbind to only change the cardinality without
inserting real rows, it would seem to violate its semantics. Sparse matrices
don’t fit the default R semantics in a few areas (in my non-expert opinion) and
this is one of them. Unless someone feels strongly, it will be abstract in
CheckpointedDrm and implemented as
CheckpointedDrmSpark#addToRowCardinality(n: Int): Unit. Creating an op that
returns a new CheckpointedDrm is also possible if there is something unsafe
about my implementation, but rbind?
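A hypothetical sketch of what such a mutator amounts to (a stand-in class, not Mahout's actual CheckpointedDrmSpark): widening the row cardinality of a sparse DRM is pure metadata, so no rows are materialized and nothing forces evaluation of the backing data.

```scala
// Hypothetical stand-in for a checkpointed sparse DRM. The declared row
// count is metadata separate from the stored non-zero rows.
class SketchDrm(var nrowHint: Long, val ncol: Int,
                val rows: Map[Long, Map[Int, Double]]) {
  // Along the lines of the proposed addToRowCardinality: widen the declared
  // row count without touching, or even enumerating, the stored rows.
  def addToRowCardinality(n: Int): Unit = {
    require(n >= 0, "can only add to cardinality")
    nrowHint += n
  }
}
```

Because the stored rows are never read, this is consistent with point 3 below: nothing here would force a lazily evaluated nrow.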
3) I have implemented this so that there is no call to drm.nrow, either to
read or to modify it, so it will remain lazily evaluated until needed by other
math.
4) ItemSimilarity for the A’B case now passes several asymmetric input
cases and outputs the correct external IDs.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/pferrel/mahout mahout-1541
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/mahout/pull/31.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #31
----
commit 107a0ba9605241653a85b113661a8fa5c055529f
Author: pferrel <[email protected]>
Date: 2014-06-04T19:54:22Z
added Sebastian's CooccurrenceAnalysis patch, updated it to use the current
Mahout DSL
commit 74b9921c4c9bd8903585bbd74d9e66298ea8b7a0
Author: pferrel <[email protected]>
Date: 2014-06-04T20:09:07Z
Adding stuff for itemsimilarity driver for Spark
commit a59265931ed3a51ba81e1a0a7171ebb102be4fa4
Author: pferrel <[email protected]>
Date: 2014-06-04T20:13:13Z
added scopt to pom deps
commit 16c03f7fa73c156859d1dba3a333ef9e8bf922b0
Author: pferrel <[email protected]>
Date: 2014-06-04T21:32:18Z
added Sebastian's MurmurHash changes
Signed-off-by: pferrel <[email protected]>
commit 8a4b4347ddb7b9ac97590aa20189d89d8a07a80a
Author: pferrel <[email protected]>
Date: 2014-06-04T21:33:11Z
Merge branch 'mahout-1464' into mahout-1541
commit 2f87f5433f90fa2c49ef386ca245943e1fc73beb
Author: pferrel <[email protected]>
Date: 2014-06-05T01:44:16Z
MAHOUT-1541 still working on this, some refactoring in the DSL for
abstracting away Spark has moved access to rdds; no Jira is closed yet
commit c6adaa44c80bba99d41600e260bbb1ad5c972e69
Author: pferrel <[email protected]>
Date: 2014-06-05T16:52:23Z
MAHOUT-1464 import cleanup, minor changes to examples for running on Spark
Cluster
commit 2caceab31703ed214c1e66d5fc63b8bdb05d37a3
Author: pferrel <[email protected]>
Date: 2014-06-05T16:55:09Z
Merge branch 'mahout-1464' into mahout-1541
commit 6df6a54e3ff174d39bd817caf7d16c2d362be3f8
Author: pferrel <[email protected]>
Date: 2014-06-07T20:39:25Z
Merge branch 'master' into mahout-1541
commit a2f84dea3f32d3df3e98c61f085bc1fabd453551
Author: pferrel <[email protected]>
Date: 2014-06-07T21:27:06Z
drmWrap seems to be the answer to the changed DrmLike interface. Code works
again but more to do.
commit d3a2ba5027436d0abef67a1a5e82557064f4ba49
Author: pferrel <[email protected]>
Date: 2014-06-17T16:00:38Z
merged master, got new cooccurrence code
commit 4b2fb07b21a8ac2d532ee51b65b27d1482293cb0
Author: pferrel <[email protected]>
Date: 2014-06-19T17:08:02Z
for high level review, not ready for merge
commit 996ccfb82a8ed3ff90f51968e661b2449f3c4759
Author: pferrel <[email protected]>
Date: 2014-06-19T17:46:23Z
for high level review, not ready for merge. changed to dot notation
commit f62ab071869ee205ad398a3e094d871138e11a9e
Author: pferrel <[email protected]>
Date: 2014-06-19T18:13:44Z
for high level review, not ready for merge. fixed a couple scaladoc refs
commit cbef0ee6264c28d0597cb2507427a647771c9bcd
Author: pferrel <[email protected]>
Date: 2014-06-23T21:49:20Z
adding tests, had to modify some test framework Scala to make the masterUrl
visible to tests
commit ab8009f6176f0c21a07e15cc5cc8a9717dd7cc4c
Author: pferrel <[email protected]>
Date: 2014-06-25T15:41:54Z
adding more tests for ItemSimilarityDriver
commit 47258f59df7f215b1bb25830d13d9b85fa8d19e9
Author: pferrel <[email protected]>
Date: 2014-06-25T15:44:47Z
merged master changes and fixed a pom conflict
commit 9a02e2a5ea8540723c1bfc6ea01b045bb4175922
Author: pferrel <[email protected]>
Date: 2014-06-25T16:57:55Z
remove tmp after all tests, fixed dangling comma in input file list
commit 3c343ff18600f0a0e59f5bfd63bd86db0db0e8c5
Author: pferrel <[email protected]>
Date: 2014-06-26T22:19:48Z
changes to pom, mahout driver script, and cleaned up help text
commit 213b18dee259925de82c703451bdea640e1f068e
Author: pferrel <[email protected]>
Date: 2014-06-26T22:26:17Z
added a job.xml assembly for creation of an all-dependencies jar
commit 627d39f30860e4ab43783c72cc2cf8926060b73c
Author: pferrel <[email protected]>
Date: 2014-06-27T16:44:37Z
registered HashBiMap with JavaSerializer in Kryo
commit c273dc7de3c740189ce8157b334c2eef3a4c23ea
Author: pferrel <[email protected]>
Date: 2014-06-27T21:30:13Z
increased the default max heap for mahout/JVM to 4g, using max of 4g for
Spark executor
commit 9dd2f2eabf1bf64660de6b5b5e49aafe18229a7a
Author: Pat Ferrel <[email protected]>
Date: 2014-06-30T17:06:49Z
tweaking memory requirements to process epinions with the
ItemSimilarityDriver
commit 6ec98f32775c791ee001fc996f475215e427f368
Author: pferrel <[email protected]>
Date: 2014-06-30T17:08:49Z
refactored to use a DistributedContext instead of raw SparkContext
commit 48774e154a6e55e04037c787f8d64bc9e545f1bd
Author: pferrel <[email protected]>
Date: 2014-06-30T17:08:59Z
merging changes made to run a large dataset through itemsimilarity on
the cluster
commit 8e70091a564c8464ea70bf90006d8124c3a7f208
Author: pferrel <[email protected]>
Date: 2014-06-30T20:11:42Z
fixed a bug, SparkConf in driver was ignored and blank one passed in to
create a DistributedContext
commit 01a0341f56071d2244aabd6de8c6f528ad35b164
Author: pferrel <[email protected]>
Date: 2014-06-30T20:33:39Z
added option for configuring Spark executor memory
commit 2d9efd73def8207dded5cd1dd8699035a8cc1b34
Author: pferrel <[email protected]>
Date: 2014-06-30T22:37:19Z
removed some outdated examples
commit 9fb281022cba7666dd26701b3d97d200b13c35f8
Author: pferrel <[email protected]>
Date: 2014-07-01T18:17:42Z
test naming and pom changed to up the jvm heap max to 512m for scalatests
commit 674c9b7862f0bd0723de026eb4527546b52e8a0b
Author: pferrel <[email protected]>
Date: 2014-07-01T18:18:59Z
Merge branch 'mahout-1541' of https://github.com/pferrel/mahout into
mahout-1541
----
> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
> Key: MAHOUT-1541
> URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> Project: Mahout
> Issue Type: New Feature
> Components: CLI
> Reporter: Pat Ferrel
> Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an
> IndexedDataset with BiMap ID translation dictionaries, call the Spark
> CooccurrenceAnalysis with the appropriate params, then write output with
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy mr does but will
> support reading externally defined IDs and flexible formats. Output will be
> of the legacy format or text files of the user's specification with
> reattached Item IDs.
> Support for legacy formats is an open question; users can always use the
> legacy code if they want this. Internal to the IndexedDataset is a Spark DRM so
> pipelining can be accomplished without any writing to an actual file so the
> legacy sequence file output may not be needed.
> Opinions?
--
This message was sent by Atlassian JIRA
(v6.2#6252)