[jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis

ASF GitHub Bot (JIRA) Wed, 30 Jul 2014 13:15:07 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079886#comment-14079886
 ]


ASF GitHub Bot commented on MAHOUT-1541:
----------------------------------------

GitHub user pferrel opened a pull request:

    https://github.com/apache/mahout/pull/36

    MAHOUT-1541

    Parts of this address MAHOUT-1541, MAHOUT-1568, and MAHOUT-1569
    
    The previous merge of MAHOUT-1541 primarily supported A'A; this merge 
supports A'B as well, with all features. There is a lot of refactoring, plus 
new tests for A and B of different cardinality and using different item ID 
spaces. The forced cardinality matching has moved out of Math and into the 
data prep part. This means passing an nrow to drmWrap that may be larger than 
the actual number of rows embodied in the drm/rdd. I've added tests for 
B.t %*% A as well as for the actual driver in these missing-row cases.
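
    As a rough illustration of the missing-row case (plain Python, not the 
Mahout DSL and not drmWrap itself; the dict-of-rows layout and the function 
name are assumptions made for this sketch), declaring nrow once in data prep 
lets a matrix with fewer populated rows still line up for B' A:

```python
# Concept sketch only: plain Python, not the Mahout DSL or drmWrap.
# A row-sparse matrix is a dict {row_index: {col_index: value}};
# the row count is declared separately, as nrow is in the PR description.

def transpose_times(b_rows, a_rows, nrow, ncol_b, ncol_a):
    """Compute B' A for row-sparse B and A that share a declared row count."""
    result = [[0.0] * ncol_a for _ in range(ncol_b)]
    for r in range(nrow):
        b_row = b_rows.get(r, {})  # a missing row behaves like an all-zero row
        a_row = a_rows.get(r, {})
        for j, b_val in b_row.items():
            for k, a_val in a_row.items():
                result[j][k] += b_val * a_val
    return result

# A has data for rows 0..2; B only has data for rows 0 and 1.
a = {0: {0: 1.0, 1: 1.0}, 1: {1: 1.0}, 2: {0: 1.0}}
b = {0: {0: 1.0}, 1: {0: 1.0, 1: 1.0}}

# Declare nrow once, in data prep, as the larger of the two row counts.
nrow = max(max(a) + 1, max(b) + 1)
bt_a = transpose_times(b, a, nrow, ncol_b=2, ncol_a=2)
```

    Here B has no entry for row 2, so that row simply contributes nothing to 
the product, which is the effect of passing an nrow larger than the number of 
rows actually present.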
    
    The full epinions cross-cooccurrence can't complete on a single machine; 
it fails with a Java out-of-heap exception, so I'm now testing it on a 
cluster. The cooccurrence of A'A does complete on a single machine.
    
    One known improvement is to limit the use of dictionaries if they are not 
needed and to look at replacing the Guava HashBiMap with a minimal Scala 
version. This version uses dictionaries for IDs even if the input is using 
Mahout sequential int IDs.
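
    To make the dictionary idea concrete, here is a minimal sketch in plain 
Python (not the Guava HashBiMap and not the proposed Scala replacement). The 
class name, method names, and the needs_dictionary check are all hypothetical 
and exist only for illustration:

```python
# Concept sketch only: a tiny bidirectional ID dictionary.
# All names here are hypothetical; none of them are Mahout or Guava APIs.

class BiMap:
    """One-to-one mapping between external IDs and internal sequential ints."""

    def __init__(self):
        self._to_int = {}   # external ID -> internal int
        self._to_ext = []   # internal int -> external ID (index is the int)

    def get_or_add(self, external_id):
        """Return the internal int for external_id, assigning the next one if new."""
        if external_id not in self._to_int:
            self._to_int[external_id] = len(self._to_ext)
            self._to_ext.append(external_id)
        return self._to_int[external_id]

    def external(self, internal_id):
        """Translate an internal int back to the external ID."""
        return self._to_ext[internal_id]

    def __len__(self):
        return len(self._to_ext)


def needs_dictionary(ids):
    """Return False when the IDs are already exactly the ints 0..n-1."""
    unique = set(ids)
    all_ints = all(isinstance(i, int) for i in unique)
    return not all_ints or unique != set(range(len(unique)))


items = BiMap()
internal = [items.get_or_add(x) for x in ["iphone", "ipad", "iphone", "nexus"]]
```

    The needs_dictionary check is one possible way to skip the translation 
step when the input already uses sequential int IDs.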
    
    MAHOUT-1568: Proposed standards for text versions of DRM-ish output. These 
preserve the IDs of the application while using Mahout IDs internally. In other 
words, the output has application IDs. There are several configurable readers 
and writers of TD files. Reading tuples into a DRM is implemented, and writing 
a DRM-ish TD file is also implemented.
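
    A small sketch of that round trip, in plain Python rather than the actual 
readers and writers: the tab-delimited tuple input, the 
"row<tab>col:value col:value" output layout, and every function and parameter 
name below are assumptions for illustration, not the proposed standard itself.

```python
# Concept sketch only: read (row, column) tuples with application IDs,
# work with int IDs internally, and write text output with application IDs.
# Delimiters and the output layout are assumed, not taken from the proposal.

def read_tuples(lines, delim="\t"):
    """Build row-sparse data plus the ID dictionaries from delimited tuples."""
    row_ids, col_ids, rows = {}, {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row_ext, col_ext = line.split(delim)[:2]
        r = row_ids.setdefault(row_ext, len(row_ids))  # next int if unseen
        c = col_ids.setdefault(col_ext, len(col_ids))
        rows.setdefault(r, {})[c] = 1.0
    return rows, row_ids, col_ids


def write_rows(rows, row_ids, col_ids, row_delim="\t", elem_delim=" ", kv_delim=":"):
    """Render each row as text, translating int IDs back to application IDs."""
    row_ext = {i: e for e, i in row_ids.items()}
    col_ext = {i: e for e, i in col_ids.items()}
    out = []
    for r in sorted(rows):
        elems = elem_delim.join(
            f"{col_ext[c]}{kv_delim}{v}" for c, v in sorted(rows[r].items())
        )
        out.append(f"{row_ext[r]}{row_delim}{elems}")
    return out


lines = ["u1\tiphone", "u1\tipad", "u2\tiphone"]
rows, row_ids, col_ids = read_tuples(lines)
text = write_rows(rows, row_ids, col_ids)
```

    The internal representation only ever sees ints; the application IDs 
reappear when the rows are written out.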
    
    MAHOUT-1569: There is a refactored MahoutOptionParser and MahoutDriver with 
some default behavior that should make creating drivers a bit easier and DRYer 
than the last merge. The options are being proposed as standards across all 
drivers so we have one way to specify the formats of input/output files and 
other common options.
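
    The idea of shared default options can be sketched with Python's argparse 
(this is not MahoutOptionParser, which is built on scopt; every option name 
and default below is hypothetical and chosen only to show the pattern):

```python
# Concept sketch only: common options defined once and reused by each driver.
# Option names and defaults are hypothetical, not the proposed Mahout standards.
import argparse


def common_options():
    """Options that every driver would share, defined in one place."""
    parent = argparse.ArgumentParser(add_help=False)
    parent.add_argument("--input", required=True, help="input path")
    parent.add_argument("--output", required=True, help="output path")
    parent.add_argument("--in-delim", default="\t", help="input field delimiter")
    parent.add_argument("--master", default="local", help="cluster master URL")
    return parent


def item_similarity_parser():
    """A driver-specific parser that inherits the common options."""
    parser = argparse.ArgumentParser(
        prog="item-similarity", parents=[common_options()]
    )
    parser.add_argument("--max-similar-per-item", type=int, default=100)
    return parser


opts = item_similarity_parser().parse_args(["--input", "in.tsv", "--output", "out"])
```

    A second driver would call common_options() again and add only its own 
flags, so the input/output format options stay identical across drivers.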
    
    This is for comment; I won't merge until several larger datasets are 
working on a cluster.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pferrel/mahout mahout-1541

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/mahout/pull/36.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #36
    
----
commit 107a0ba9605241653a85b113661a8fa5c055529f
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-04T19:54:22Z

    added Sebastian's CooccurrenceAnalysis patch updated it to use current 
Mahout-DSL

commit 74b9921c4c9bd8903585bbd74d9e66298ea8b7a0
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-04T20:09:07Z

    Adding stuff for itemsimilarity driver for Spark

commit a59265931ed3a51ba81e1a0a7171ebb102be4fa4
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-04T20:13:13Z

    added scopt to pom deps

commit 16c03f7fa73c156859d1dba3a333ef9e8bf922b0
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-04T21:32:18Z

    added Sebastian's MurmurHash changes
    
    Signed-off-by: pferrel <p...@occamsmachete.com>

commit 8a4b4347ddb7b9ac97590aa20189d89d8a07a80a
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-04T21:33:11Z

    Merge branch 'mahout-1464' into mahout-1541

commit 2f87f5433f90fa2c49ef386ca245943e1fc73beb
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-05T01:44:16Z

    MAHOUT-1541 still working on this, some refactoring in the DSL for 
abstracting away Spark has moved access to rdds; no Jira is closed yet

commit c6adaa44c80bba99d41600e260bbb1ad5c972e69
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-05T16:52:23Z

    MAHOUT-1464 import cleanup, minor changes to examples for running on Spark 
Cluster

commit 2caceab31703ed214c1e66d5fc63b8bdb05d37a3
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-05T16:55:09Z

    Merge branch 'mahout-1464' into mahout-1541

commit 6df6a54e3ff174d39bd817caf7d16c2d362be3f8
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-07T20:39:25Z

    Merge branch 'master' into mahout-1541

commit a2f84dea3f32d3df3e98c61f085bc1fabd453551
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-07T21:27:06Z

    drmWrap seems to be the answer to the changed DrmLike interface. Code works 
again but more to do.

commit d3a2ba5027436d0abef67a1a5e82557064f4ba49
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-17T16:00:38Z

    merged master, got new cooccurrence code

commit 4b2fb07b21a8ac2d532ee51b65b27d1482293cb0
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-19T17:08:02Z

    for high level review, not ready for merge

commit 996ccfb82a8ed3ff90f51968e661b2449f3c4759
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-19T17:46:23Z

    for high level review, not ready for merge. changed to dot notation

commit f62ab071869ee205ad398a3e094d871138e11a9e
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-19T18:13:44Z

    for high level review, not ready for merge. fixed a couple scaladoc refs

commit cbef0ee6264c28d0597cb2507427a647771c9bcd
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-23T21:49:20Z

    adding tests, had to modify some test framework Scala to make the masterUrl 
visible to tests

commit ab8009f6176f0c21a07e15cc5cc8a9717dd7cc4c
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-25T15:41:54Z

    adding more tests for ItemSimilarityDriver

commit 47258f59df7f215b1bb25830d13d9b85fa8d19e9
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-25T15:44:47Z

    merged master changes and fixed a pom conflict

commit 9a02e2a5ea8540723c1bfc6ea01b045bb4175922
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-25T16:57:55Z

    remove tmp after all tests, fixed dangling comma in input file list

commit 3c343ff18600f0a0e59f5bfd63bd86db0db0e8c5
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-26T22:19:48Z

    changes to pom, mahout driver script, and cleaned up help text

commit 213b18dee259925de82c703451bdea640e1f068e
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-26T22:26:17Z

    added a job.xml assembly for creation of an all-dependencies jar

commit 627d39f30860e4ab43783c72cc2cf8926060b73c
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-27T16:44:37Z

    registered HashBiMap with JavaSerializer in Kryo

commit c273dc7de3c740189ce8157b334c2eef3a4c23ea
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-27T21:30:13Z

    increased the default max heap for mahout/JVM to 4g, using max of 4g for 
Spark executor

commit 9dd2f2eabf1bf64660de6b5b5e49aafe18229a7a
Author: Pat Ferrel <p...@farfetchers.com>
Date:   2014-06-30T17:06:49Z

    tweaking memory requirements to process epinions with the 
ItemSimilarityDriver

commit 6ec98f32775c791ee001fc996f475215e427f368
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-30T17:08:49Z

    refactored to use a DistributedContext instead of raw SparkContext

commit 48774e154a6e55e04037c787f8d64bc9e545f1bd
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-30T17:08:59Z

    merging changes made on to run a large dataset through itemsimilarity on 
the cluster

commit 8e70091a564c8464ea70bf90006d8124c3a7f208
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-30T20:11:42Z

    fixed a bug, SparkConf in driver was ignored and blank one passed in to 
create a DistributedContext

commit 01a0341f56071d2244aabd6de8c6f528ad35b164
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-30T20:33:39Z

    added option for configuring Spark executor memory

commit 2d9efd73def8207dded5cd1dd8699035a8cc1b34
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-30T22:37:19Z

    removed some outdated examples

commit 9fb281022cba7666dd26701b3d97d200b13c35f8
Author: pferrel <p...@occamsmachete.com>
Date:   2014-07-01T18:17:42Z

    test naming and pom changed to up the jvm heap max to 512m for scalatests

commit 674c9b7862f0bd0723de026eb4527546b52e8a0b
Author: pferrel <p...@occamsmachete.com>
Date:   2014-07-01T18:18:59Z

    Merge branch 'mahout-1541' of https://github.com/pferrel/mahout into 
mahout-1541

----


> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
>                 Key: MAHOUT-1541
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
>             Project: Mahout
>          Issue Type: New Feature
>          Components: CLI
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an 
> IndexedDataset with BiMap ID translation dictionaries, call the Spark 
> CooccurrenceAnalysis with the appropriate params, then write output with 
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy mr does but will 
> support reading externally defined IDs and flexible formats. Output will be 
> of the legacy format or text files of the user's specification with 
> reattached Item IDs. 
> Support for legacy formats is an open question; users can always use the legacy 
> code if they want this. Internal to the IndexedDataset is a Spark DRM so 
> pipelining can be accomplished without any writing to an actual file so the 
> legacy sequence file output may not be needed.
> Opinions?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
