[ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079886#comment-14079886 ]
ASF GitHub Bot commented on MAHOUT-1541:
----------------------------------------

GitHub user pferrel opened a pull request:

    https://github.com/apache/mahout/pull/36

    MAHOUT-1541

    Parts of this address MAHOUT-1541, MAHOUT-1568, and MAHOUT-1569.

    The previous merge of MAHOUT-1541 primarily supported A'A; this merge
    supports A'B as well, with all features. It includes a lot of
    refactoring and new tests for A and B of different cardinality and
    using different item ID spaces. I took the forced cardinality matching
    out of Math and put it in the data prep part. This means passing an
    nrow to drmWrap, which may be larger than the actual number of rows
    embodied in the drm/rdd. I've added tests for B.t %*% A as well as for
    the actual driver in these cases (missing row cases).

    I can't complete the full epinions cross-cooccurrence on a single
    machine; it fails with a Java out-of-heap exception, so I'm now
    testing it on a cluster. The cooccurrence of A'A does complete on a
    single machine. One known improvement is to limit the use of
    dictionaries when they are not needed and to look at replacing the
    Guava HashBiMap with a minimal Scala version. This version uses
    dictionaries for IDs even if the input uses Mahout sequential int IDs.

    MAHOUT-1568: Proposed standards for text versions of DRM-ish output.
    These preserve the IDs of the application while using Mahout IDs
    internally. In other words, the output has application IDs. There are
    several configurable readers and writers of TD files. Reading tuples
    into a DRM is implemented, and writing a DRM-ish TD file is also
    implemented.

    MAHOUT-1569: There is a refactored MahoutOptionParser and MahoutDriver
    with some default behavior that should make creating drivers a bit
    easier and DRYer than the last merge. The options are being proposed
    as standards across all drivers so we have one way to specify the
    formats of input/output files and other common options.

    This is for comment; I won't merge until several larger datasets are
    working on a cluster.
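The forced row cardinality and the ID dictionaries described above can be illustrated with a small sketch. This is plain Python, not Mahout code: `IdDictionary`, `read_tuples`, `at_b`, `to_external`, and the sample data are hypothetical names made up for illustration, and the product is a raw cooccurrence count without any of the downsampling or scoring a real cooccurrence analysis would apply. It assumes A and B share one user (row) dictionary but have separate item (column) dictionaries.

```python
from collections import defaultdict


class IdDictionary:
    """Hypothetical two-way map: application ID <-> sequential int ID."""

    def __init__(self):
        self.to_int = {}   # application ID -> int ID
        self.to_ext = []   # int ID -> application ID

    def add(self, ext_id):
        # Assign the next sequential int ID the first time an ID is seen.
        if ext_id not in self.to_int:
            self.to_int[ext_id] = len(self.to_ext)
            self.to_ext.append(ext_id)
        return self.to_int[ext_id]

    def __len__(self):
        return len(self.to_ext)


def read_tuples(tuples, row_ids, col_ids):
    """Turn (user, item) tuples into sparse rows {row_int: {col_int: 1.0}}."""
    rows = defaultdict(dict)
    for user, item in tuples:
        rows[row_ids.add(user)][col_ids.add(item)] = 1.0
    return dict(rows)


def at_b(a_rows, b_rows, nrow):
    """Compute A'B over a shared row space of size nrow.

    A row missing from either matrix is treated as an empty row, which is
    what forcing both matrices to the same nrow amounts to.
    """
    out = defaultdict(lambda: defaultdict(float))
    for r in range(nrow):
        a_row = a_rows.get(r, {})
        b_row = b_rows.get(r, {})
        for i, a_val in a_row.items():
            for j, b_val in b_row.items():
                out[i][j] += a_val * b_val
    return out


def to_external(matrix, row_ids, col_ids):
    """Reattach application IDs to an int-indexed sparse matrix."""
    return {
        row_ids.to_ext[i]: {col_ids.to_ext[j]: v for j, v in row.items()}
        for i, row in matrix.items()
    }


# Made-up sample data: two actions by an overlapping set of users.
purchases = [("u1", "iphone"), ("u2", "iphone"), ("u2", "ipad"), ("u3", "nexus")]
views = [("u1", "iphone-case"), ("u2", "ipad-cover"), ("u4", "galaxy")]

users = IdDictionary()      # shared row ID space
items_a = IdDictionary()    # item ID space of A
items_b = IdDictionary()    # item ID space of B

a = read_tuples(purchases, users, items_a)
b = read_tuples(views, users, items_b)

# nrow comes from the shared dictionary, so it is larger than the number
# of rows actually present in either A or B.
nrow = len(users)
result = to_external(at_b(a, b, nrow), items_a, items_b)
```

Here A and B each hold three rows, but `nrow` is four because the user dictionary spans both inputs; the rows for `u3` and `u4` contribute nothing to the product.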
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pferrel/mahout mahout-1541

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/mahout/pull/36.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #36

----
commit 107a0ba9605241653a85b113661a8fa5c055529f
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-04T19:54:22Z

    added Sebastian's CooccurrenceAnalysis patch updated it to use current Mahout-DSL

commit 74b9921c4c9bd8903585bbd74d9e66298ea8b7a0
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-04T20:09:07Z

    Adding stuff for itemsimilarity driver for Spark

commit a59265931ed3a51ba81e1a0a7171ebb102be4fa4
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-04T20:13:13Z

    added scopt to pom deps

commit 16c03f7fa73c156859d1dba3a333ef9e8bf922b0
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-04T21:32:18Z

    added Sebastian's MurmurHash changes

    Signed-off-by: pferrel <p...@occamsmachete.com>

commit 8a4b4347ddb7b9ac97590aa20189d89d8a07a80a
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-04T21:33:11Z

    Merge branch 'mahout-1464' into mahout-1541

commit 2f87f5433f90fa2c49ef386ca245943e1fc73beb
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-05T01:44:16Z

    MAHOUT-1541 still working on this, some refactoring in the DSL for abstracting away Spark has moved access to rdds

    no Jira is closed yet

commit c6adaa44c80bba99d41600e260bbb1ad5c972e69
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-05T16:52:23Z

    MAHOUT-1464 import cleanup, minor changes to examples for running on Spark Cluster

commit 2caceab31703ed214c1e66d5fc63b8bdb05d37a3
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-05T16:55:09Z

    Merge branch 'mahout-1464' into mahout-1541

commit 6df6a54e3ff174d39bd817caf7d16c2d362be3f8
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-07T20:39:25Z

    Merge branch 'master' into mahout-1541

commit a2f84dea3f32d3df3e98c61f085bc1fabd453551
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-07T21:27:06Z

    drmWrap seems to be the answer to the changed DrmLike interface. Code works again but more to do.

commit d3a2ba5027436d0abef67a1a5e82557064f4ba49
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-17T16:00:38Z

    merged master, got new cooccurrence code

commit 4b2fb07b21a8ac2d532ee51b65b27d1482293cb0
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-19T17:08:02Z

    for high level review, not ready for merge

commit 996ccfb82a8ed3ff90f51968e661b2449f3c4759
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-19T17:46:23Z

    for high level review, not ready for merge. changed to dot notation

commit f62ab071869ee205ad398a3e094d871138e11a9e
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-19T18:13:44Z

    for high level review, not ready for merge. fixed a couple scaladoc refs

commit cbef0ee6264c28d0597cb2507427a647771c9bcd
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-23T21:49:20Z

    adding tests, had to modify some test framework Scala to make the masterUrl visible to tests

commit ab8009f6176f0c21a07e15cc5cc8a9717dd7cc4c
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-25T15:41:54Z

    adding more tests for ItemSimilarityDriver

commit 47258f59df7f215b1bb25830d13d9b85fa8d19e9
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-25T15:44:47Z

    merged master changes and fixed a pom conflict

commit 9a02e2a5ea8540723c1bfc6ea01b045bb4175922
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-25T16:57:55Z

    remove tmp after all tests, fixed dangling comma in input file list

commit 3c343ff18600f0a0e59f5bfd63bd86db0db0e8c5
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-26T22:19:48Z

    changes to pom, mahout driver script, and cleaned up help text

commit 213b18dee259925de82c703451bdea640e1f068e
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-26T22:26:17Z

    added a job.xml assembly for creation of an all-dependencies jar

commit 627d39f30860e4ab43783c72cc2cf8926060b73c
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-27T16:44:37Z

    registered HashBiMap with JavaSerializer in Kryo

commit c273dc7de3c740189ce8157b334c2eef3a4c23ea
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-27T21:30:13Z

    increased the default max heep for mahout/JVM to 4g, using max of 4g for Spark executor

commit 9dd2f2eabf1bf64660de6b5b5e49aafe18229a7a
Author: Pat Ferrel <p...@farfetchers.com>
Date:   2014-06-30T17:06:49Z

    tweaking memory requirements to process epinions with the ItemSimilarityDriver

commit 6ec98f32775c791ee001fc996f475215e427f368
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-30T17:08:49Z

    refactored to use a DistributedContext instead of raw SparkContext

commit 48774e154a6e55e04037c787f8d64bc9e545f1bd
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-30T17:08:59Z

    merging changes made on to run a large dataset through itemsimilarity on the cluster

commit 8e70091a564c8464ea70bf90006d8124c3a7f208
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-30T20:11:42Z

    fixed a bug, SparkConf in driver was ignored and blank one passed in to create a DistributedContext

commit 01a0341f56071d2244aabd6de8c6f528ad35b164
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-30T20:33:39Z

    added option for configuring Spark executor memory

commit 2d9efd73def8207dded5cd1dd8699035a8cc1b34
Author: pferrel <p...@occamsmachete.com>
Date:   2014-06-30T22:37:19Z

    removed some outdated examples

commit 9fb281022cba7666dd26701b3d97d200b13c35f8
Author: pferrel <p...@occamsmachete.com>
Date:   2014-07-01T18:17:42Z

    test naming and pom changed to up the jvm heap max to 512m for scalatests

commit 674c9b7862f0bd0723de026eb4527546b52e8a0b
Author: pferrel <p...@occamsmachete.com>
Date:   2014-07-01T18:18:59Z

    Merge branch 'mahout-1541' of https://github.com/pferrel/mahout into mahout-1541

----

> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
>                 Key: MAHOUT-1541
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
>             Project: Mahout
>          Issue Type: New Feature
>          Components: CLI
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an
> IndexedDataset with BiMap ID translation dictionaries, call the Spark
> CooccurrenceAnalysis with the appropriate params, then write output with
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy mr does but will
> support reading externally defined IDs and flexible formats. Output will be
> of the legacy format or text files of the user's specification with
> reattached Item IDs.
> Support for legacy formats is a question, users can always use the legacy
> code if they want this. Internal to the IndexedDataset is a Spark DRM so
> pipelining can be accomplished without any writing to an actual file so the
> legacy sequence file output may not be needed.
> Opinions?

--
This message was sent by Atlassian JIRA
(v6.2#6252)