[
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065509#comment-14065509
]
ASF GitHub Bot commented on MAHOUT-1541:
----------------------------------------
GitHub user pferrel opened a pull request:
https://github.com/apache/mahout/pull/31
MAHOUT-1541
Covers MAHOUT-1464, and MAHOUT-1541
Fixes cross-cooccurrence, which originally computed the product in the wrong
order. If A is the primary interaction matrix, then the cross-cooccurrence
should be A'B but was B'A. Also added tests for A and B of different column
cardinality, which is allowed.
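A minimal in-core sketch of the ordering point, using plain Scala arrays rather than Mahout DRMs (the object and helper names here are illustrative, not Mahout API):

```scala
// Illustrative only: in-core dense matrices standing in for DRMs.
// A is users x itemsA (primary actions), B is users x itemsB (secondary
// actions). The cross-cooccurrence indicator must be A' %*% B, giving an
// itemsA x itemsB result keyed by the primary items -- not B' %*% A.
object CrossCooccurrence {
  type Mat = Array[Array[Double]]

  def transpose(m: Mat): Mat =
    Array.tabulate(m(0).length, m.length)((i, j) => m(j)(i))

  def mmul(x: Mat, y: Mat): Mat =
    Array.tabulate(x.length, y(0).length) { (i, j) =>
      (0 until y.length).map(k => x(i)(k) * y(k)(j)).sum
    }

  // A and B must share row (user) cardinality but may differ in column
  // (item) cardinality; the product is itemsA x itemsB.
  def cross(a: Mat, b: Mat): Mat = mmul(transpose(a), b)
}
```

Note the result's row dimension comes from A's columns, which is why tests with different column cardinalities for A and B exercise the ordering fix.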
Looked a bit deeper into the @dev thread on changing the cardinality of a
sparse matrix. This is needed in ItemSimilarity because the true cardinality of
the cross-interaction matrices can be known only once all of them have been
read in. The much more typical case is that cardinalities can be computed
directly from the input, so there needs to be a method to update the
cardinality after the matrices have been read in.
This implementation creates a new abstract method in CheckpointedDrm and an
implementation in CheckpointedDrmSpark. Here is the reasoning.
1) For sparse DRMs there is no need for any representation of an empty row
(or column); not even the keys need to be known, only the cardinality. You only
have to think about a transpose of sparse vectors to see that this must be so.
Further, it works, and I’ve relied on it since the Hadoop MR version. Barring
any revelation from the math gods--it is so.
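The transpose argument in point 1 can be sketched with a toy sparse representation (a map of row key to non-zero entries, not Mahout's actual DRM structure):

```scala
// Illustrative sketch: a sparse matrix as row-key -> (col -> value). An
// entirely empty row is simply absent from the map; only the cardinality
// metadata (kept elsewhere) records that it exists. Transposing shows why
// this is safe: an empty row contributes no entries to any row of the
// transpose, so no representation of it is ever needed.
object SparseTranspose {
  type Sparse = Map[Int, Map[Int, Double]]

  def transpose(m: Sparse): Sparse =
    m.toSeq
      .flatMap { case (r, row) => row.map { case (c, v) => (c, r, v) } }
      .groupBy(_._1)
      .map { case (c, triples) => c -> triples.map(t => (t._2, t._3)).toMap }
}
```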
2) rbind semantics apply to dense matrices. IMO this should be avoided here
because even if we rejigger rbind to only change the cardinality without
inserting real rows, it would seem to violate its semantics. Sparse matrices
don’t fit the default R semantics in a few areas (in my non-expert opinion) and
this is one of them. Unless someone feels strongly, it will be abstract in
CheckpointedDrm and implemented as
CheckpointedDrmSpark#addToRowCardinality(n: Int): Unit. Creating an op that
returns a new CheckpointedDrm is also possible if there is something unsafe
about my implementation, but rbind?
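A hypothetical sketch of what such a mutator amounts to (a stand-in class, not Mahout's actual CheckpointedDrmSpark): widening the row cardinality of a sparse DRM is pure metadata, so no rows are materialized and nothing forces evaluation of the backing data.

```scala
// Hypothetical stand-in for a checkpointed sparse DRM. The declared row
// count is metadata separate from the stored non-zero rows.
class SketchDrm(var nrowHint: Long, val ncol: Int,
                val rows: Map[Long, Map[Int, Double]]) {
  // Along the lines of the proposed addToRowCardinality: widen the declared
  // row count without touching, or even enumerating, the stored rows.
  def addToRowCardinality(n: Int): Unit = {
    require(n >= 0, "can only add to cardinality")
    nrowHint += n
  }
}
```

Because the stored rows are never read, this is consistent with point 3 below: nothing here would force a lazily evaluated nrow.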
3) I have implemented this so that there is no call to drm.nrow, either to
read or to modify it, so it will remain lazily evaluated until needed by other
math.
4) ItemSimilarity for the A’B case now passes several asymmetric input
cases and outputs the correct external IDs.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/pferrel/mahout mahout-1541
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/mahout/pull/31.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #31
----
commit 107a0ba9605241653a85b113661a8fa5c055529f
Author: pferrel <[email protected]>
Date: 2014-06-04T19:54:22Z
added Sebastian's CooccurrenceAnalysis patch, updated it to use the current
Mahout DSL
commit 74b9921c4c9bd8903585bbd74d9e66298ea8b7a0
Author: pferrel <[email protected]>
Date: 2014-06-04T20:09:07Z
Adding stuff for itemsimilarity driver for Spark
commit a59265931ed3a51ba81e1a0a7171ebb102be4fa4
Author: pferrel <[email protected]>
Date: 2014-06-04T20:13:13Z
added scopt to pom deps
commit 16c03f7fa73c156859d1dba3a333ef9e8bf922b0
Author: pferrel <[email protected]>
Date: 2014-06-04T21:32:18Z
added Sebastian's MurmurHash changes
Signed-off-by: pferrel <[email protected]>
commit 8a4b4347ddb7b9ac97590aa20189d89d8a07a80a
Author: pferrel <[email protected]>
Date: 2014-06-04T21:33:11Z
Merge branch 'mahout-1464' into mahout-1541
commit 2f87f5433f90fa2c49ef386ca245943e1fc73beb
Author: pferrel <[email protected]>
Date: 2014-06-05T01:44:16Z
MAHOUT-1541 still working on this, some refactoring in the DSL for
abstracting away Spark has moved access to rdds; no Jira is closed yet
commit c6adaa44c80bba99d41600e260bbb1ad5c972e69
Author: pferrel <[email protected]>
Date: 2014-06-05T16:52:23Z
MAHOUT-1464 import cleanup, minor changes to examples for running on Spark
Cluster
commit 2caceab31703ed214c1e66d5fc63b8bdb05d37a3
Author: pferrel <[email protected]>
Date: 2014-06-05T16:55:09Z
Merge branch 'mahout-1464' into mahout-1541
commit 6df6a54e3ff174d39bd817caf7d16c2d362be3f8
Author: pferrel <[email protected]>
Date: 2014-06-07T20:39:25Z
Merge branch 'master' into mahout-1541
commit a2f84dea3f32d3df3e98c61f085bc1fabd453551
Author: pferrel <[email protected]>
Date: 2014-06-07T21:27:06Z
drmWrap seems to be the answer to the changed DrmLike interface. Code works
again but more to do.
commit d3a2ba5027436d0abef67a1a5e82557064f4ba49
Author: pferrel <[email protected]>
Date: 2014-06-17T16:00:38Z
merged master, got new cooccurrence code
commit 4b2fb07b21a8ac2d532ee51b65b27d1482293cb0
Author: pferrel <[email protected]>
Date: 2014-06-19T17:08:02Z
for high level review, not ready for merge
commit 996ccfb82a8ed3ff90f51968e661b2449f3c4759
Author: pferrel <[email protected]>
Date: 2014-06-19T17:46:23Z
for high level review, not ready for merge. changed to dot notation
commit f62ab071869ee205ad398a3e094d871138e11a9e
Author: pferrel <[email protected]>
Date: 2014-06-19T18:13:44Z
for high level review, not ready for merge. fixed a couple scaladoc refs
commit cbef0ee6264c28d0597cb2507427a647771c9bcd
Author: pferrel <[email protected]>
Date: 2014-06-23T21:49:20Z
adding tests, had to modify some test framework Scala to make the masterUrl
visible to tests
commit ab8009f6176f0c21a07e15cc5cc8a9717dd7cc4c
Author: pferrel <[email protected]>
Date: 2014-06-25T15:41:54Z
adding more tests for ItemSimilarityDriver
commit 47258f59df7f215b1bb25830d13d9b85fa8d19e9
Author: pferrel <[email protected]>
Date: 2014-06-25T15:44:47Z
merged master changes and fixed a pom conflict
commit 9a02e2a5ea8540723c1bfc6ea01b045bb4175922
Author: pferrel <[email protected]>
Date: 2014-06-25T16:57:55Z
remove tmp after all tests, fixed dangling comma in input file list
commit 3c343ff18600f0a0e59f5bfd63bd86db0db0e8c5
Author: pferrel <[email protected]>
Date: 2014-06-26T22:19:48Z
changes to pom, mahout driver script, and cleaned up help text
commit 213b18dee259925de82c703451bdea640e1f068e
Author: pferrel <[email protected]>
Date: 2014-06-26T22:26:17Z
added a job.xml assembly for creation of an all-dependencies jar
commit 627d39f30860e4ab43783c72cc2cf8926060b73c
Author: pferrel <[email protected]>
Date: 2014-06-27T16:44:37Z
registered HashBiMap with JavaSerializer in Kryo
commit c273dc7de3c740189ce8157b334c2eef3a4c23ea
Author: pferrel <[email protected]>
Date: 2014-06-27T21:30:13Z
increased the default max heap for mahout/JVM to 4g, using max of 4g for
Spark executor
commit 9dd2f2eabf1bf64660de6b5b5e49aafe18229a7a
Author: Pat Ferrel <[email protected]>
Date: 2014-06-30T17:06:49Z
tweaking memory requirements to process epinions with the
ItemSimilarityDriver
commit 6ec98f32775c791ee001fc996f475215e427f368
Author: pferrel <[email protected]>
Date: 2014-06-30T17:08:49Z
refactored to use a DistributedContext instead of raw SparkContext
commit 48774e154a6e55e04037c787f8d64bc9e545f1bd
Author: pferrel <[email protected]>
Date: 2014-06-30T17:08:59Z
merging changes made to run a large dataset through itemsimilarity on
the cluster
commit 8e70091a564c8464ea70bf90006d8124c3a7f208
Author: pferrel <[email protected]>
Date: 2014-06-30T20:11:42Z
fixed a bug, SparkConf in driver was ignored and blank one passed in to
create a DistributedContext
commit 01a0341f56071d2244aabd6de8c6f528ad35b164
Author: pferrel <[email protected]>
Date: 2014-06-30T20:33:39Z
added option for configuring Spark executor memory
commit 2d9efd73def8207dded5cd1dd8699035a8cc1b34
Author: pferrel <[email protected]>
Date: 2014-06-30T22:37:19Z
removed some outdated examples
commit 9fb281022cba7666dd26701b3d97d200b13c35f8
Author: pferrel <[email protected]>
Date: 2014-07-01T18:17:42Z
test naming and pom changed to up the jvm heap max to 512m for scalatests
commit 674c9b7862f0bd0723de026eb4527546b52e8a0b
Author: pferrel <[email protected]>
Date: 2014-07-01T18:18:59Z
Merge branch 'mahout-1541' of https://github.com/pferrel/mahout into
mahout-1541
----
> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
> Key: MAHOUT-1541
> URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> Project: Mahout
> Issue Type: New Feature
> Components: CLI
> Reporter: Pat Ferrel
> Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an
> IndexedDataset with BiMap ID translation dictionaries, call the Spark
> CooccurrenceAnalysis with the appropriate params, then write output with
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy mr does but will
> support reading externally defined IDs and flexible formats. Output will be
> of the legacy format or text files of the user's specification with
> reattached Item IDs.
> Support for legacy formats is an open question; users can always use the
> legacy code if they want this. Internal to the IndexedDataset is a Spark DRM so
> pipelining can be accomplished without any writing to an actual file so the
> legacy sequence file output may not be needed.
> Opinions?
--
This message was sent by Atlassian JIRA
(v6.2#6252)