[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088160#comment-14088160 ]

ASF GitHub Bot commented on MAHOUT-1603:

GitHub user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51389335

@pferrel perhaps you could look at ItemSimilaritySuite -- it doesn't work on Spark 1.0 here. I disabled the tests for now since they are failing.

Tweaks for Spark 1.0.x
Key: MAHOUT-1603 / URL: https://issues.apache.org/jira/browse/MAHOUT-1603 / Project: Mahout / Issue Type: Task / Affects Versions: 0.9 / Reporter: Dmitriy Lyubimov / Assignee: Dmitriy Lyubimov / Fix For: 1.0
Description: Tweaks necessary to run the current codebase on top of Spark 1.0.x.
Requiring Java 1.7 for Mahout
As far as I can tell there should be no problems with declaring Java 1.7 as the official minimum Java version for building and running Mahout. Are there any objections to this or problems that I am missing? Andy
Re: Requiring Java 1.7 for Mahout
The only problem is that we are not really requiring it. We are not using any 1.7 functionality. If people compile Mahout (as I do), they can target any bytecode version they want. There are some 1.7 artifact dependencies in H2O, but 1.7 would be required at run time only, and only if people are actually using h2obindings as a dependency (which I expect the majority would not care about).
[jira] [Updated] (MAHOUT-1601) Add javadoc for the classes - as there is no clue what the class is for.
[ https://issues.apache.org/jira/browse/MAHOUT-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harish Kayarohanam updated MAHOUT-1601: Issue Type: Documentation (was: Bug)

Add javadoc for the classes - as there is no clue what the class is for.
Key: MAHOUT-1601 / URL: https://issues.apache.org/jira/browse/MAHOUT-1601 / Project: Mahout / Issue Type: Documentation / Components: Documentation / Reporter: Harish Kayarohanam / Priority: Minor / Labels: documentation
Description: I found that the following classes

org.apache.mahout.cf.taste.impl.neighborhood.DummySimilarity
org.apache.mahout.cf.taste.impl.similarity.GenericUserSimilarity
org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity

did not have javadoc, so I was unable to find what these classes are for. Shall we add javadoc for the same?
[jira] [Updated] (MAHOUT-1601) Add javadoc for the classes - as there is no clue what the class is for.
[ https://issues.apache.org/jira/browse/MAHOUT-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harish Kayarohanam updated MAHOUT-1601: Priority: Minor (was: Major)
RE: Requiring Java 1.7 for Mahout
You're right -- my big concern is that on our (probably outdated) building-from-source page we have 1.6 listed: http://mahout.apache.org/developers/buildingmahout.html The obvious simple fix here is to make the quick change on the webpage to 1.7 in order to build and test successfully. I do remember something about being limited in our current Lucene version by 1.6, though, so I am wondering if this may be a good time to push for or require 1.7. Just covering our bases, so I'll drop it if there's no problem here. Thanks
Re: Requiring Java 1.7 for Mahout
My current Java is 1.6.0_38; I have no problem building.
Re: Requiring Java 1.7 for Mahout
Or testing.
RE: Requiring Java 1.7 for Mahout
Oracle?
RE: Requiring Java 1.7 for Mahout
Also, sorry -- btw, I'm assuming MAHOUT-1500 will be merged.
Re: Requiring Java 1.7 for Mahout
My own feeling is that 1.6 is finally dying out, and officially moving to 1.7 allows some nice capabilities which assist in writing reliable code. Of the major changes, here are my reactions after about a year of using 1.7 seriously:

IO and New IO - The nio package is significantly better and more coherent in 7. The differences creep up on you if you are not looking for them, but add up over time. It is hard to point at any single important change, however.

Networking - The networking integrates IPv6 better. This probably has no impact on Mahout.

Concurrency Utilities - Concurrency has some much better capabilities in Java 7. Fork/join and work-stealing both make significant differences in capabilities for threaded applications. Obviously, we don't benefit from these capabilities in existing code, but it would be nice to be free to use them in new code.

Java XML (JAXP, JAXB, and JAX-WS) - I don't know how much this matters, given that Jackson works so very well.

Strings in switch statements - This really helps some code.

The try-with-resources statement - This is the biggest deal for me. Handling exceptions and closeable resources correctly is incredibly difficult without this (see the Guava rationale for removing closeQuietly).

Catching multiple exception types and rethrowing exceptions with improved type checking - This makes a lot of code much simpler. Exception-handling code is verbose and difficult to get right with many branches doing nearly the same thing.

Underscores in numeric literals - Cute. Helps readability.

Type inference for generic instance creation - For me this is a small issue, since IntelliJ does the type inference for me and hides type parameters that I don't want to know about.

Java Virtual Machine (JVM) - The Java 7 JVM seems to me to be a bit more performant in a number of areas. These differences aren't night and day.

JVM support for non-Java languages - This impacts the ability to integrate non-Java languages such as Jython and JavaScript with programs. Little direct impact on Mahout, given our recent focus on Scala and given the fact that Scala is reportedly jumping directly from Java 6 to Java 8 bytecode as of 2.12.

Garbage-First collector - The G1 collector is heaps better (pun intended) than previous collectors. I have had several programs with long-lived data structures work much better with G1. Server configuration is vastly simpler.

Java HotSpot Virtual Machine performance enhancements - These speak for themselves.
Re: Requiring Java 1.7 for Mahout
I am not sure it actually would require 1.7 to build either, since my understanding is that those dependencies are second-order and deeper, not immediate. Did you try to compile it yet?
RE: Requiring Java 1.7 for Mahout
It does not require 1.7 to build. I've been running 1.6 as well. I did compile m-1500. It builds fine with 1.6, but tests fail (only the h2o module -- as you said, due to the h2o artifact being built with 1.7). My thinking is that we don't want new Mahout users building with 1.6, having tests fail, and walking away. Can we release with failing tests (even if it's 1.6-specific)? As well, if there were other issues with 1.6 holding us back -- 1.6 is getting old and there are no real drawbacks -- maybe we should consider 1.7 as the official version. Or, as I said, just make a quick fix on the building-from-source page.
Re: Requiring Java 1.7 for Mahout
Sorry if this is not adding to the discussion, but based on what you are saying, my feeling is that all of that is a false dilemma. (Assuming h2obindings also compile with 1.6 and we will find a way to iron out the 1.6 test issue easily enough. If not, bummer then.)

Requiring != supporting. Current master supports both things in terms of build/runtime compatibility; requiring 1.7 means supporting only one thing. Where I come from, having two things is usually better than having one, unless one of the two is given away in favor of something substantially better. The only such thing presumably worth the sacrifice would be new code contributions to Mahout that absolutely require 1.7 for semantic reasons (since runtime 1.7, and I suspect even 1.8, are already supported). Some new dependencies that come only in 1.7 artifacts might be another one.

As it stands, Mahout's master branch currently has exactly zero semantically-1.7 or -1.8 Java in its code base. Until such a contribution appears, the issue seems moot (not sure about 1500 -- I never tried to compile it, so it might be a valid reason to move up). And such a contribution is not terribly likely to appear, because Mahout is leaning towards Scala contributions now, so new substantial Java contributions are not terribly likely. Also, in the community where I tweet, there's a general sense that there's nothing in 1.7 or 1.8 that upends Scala in any meaningful way, so the tools around will likely see just more Scala-based stuff (Scalding, Spark, Mahout algebra, MLOptimizer, Breeze, to name just a few examples of the newer and more popular Scala stuff). On the Java side, on the other hand, there have been practically no new projects of similar scale introduced in the past couple of years.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088350#comment-14088350 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51404288

So, can we move the package, please?

Port Naive Bayes to the Spark DSL
Key: MAHOUT-1493 / URL: https://issues.apache.org/jira/browse/MAHOUT-1493 / Project: Mahout / Issue Type: Bug / Components: Classification / Reporter: Sebastian Schelter / Assignee: Sebastian Schelter / Fix For: 1.0 / Attachments: MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493a.patch
Description: Port our Naive Bayes implementation to the new Spark DSL. Shouldn't require more than a few lines of code.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088365#comment-14088365 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51405333

Absolutely! I was just getting ready to do this and write some tests.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088367#comment-14088367 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51405454

I am hoping to see how the (short) Spark and H2O tests compare.
[jira] [Commented] (MAHOUT-1597) A + 1.0 (element-wise scalar operation) gives wrong result if rdd is missing rows, Spark side
[ https://issues.apache.org/jira/browse/MAHOUT-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088369#comment-14088369 ]

Hudson commented on MAHOUT-1597:

SUCCESS: Integrated in Mahout-Quality #2732 (See [https://builds.apache.org/job/Mahout-Quality/2732/])
MAHOUT-1597: A + 1.0 (fixes) (dlyubimov: rev 7a50a291b4598e9809f9acf609b92175ce7f953b)
* spark/src/test/scala/org/apache/mahout/sparkbindings/drm/DrmLikeSuite.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAewScalar.scala

A + 1.0 (element-wise scalar operation) gives wrong result if rdd is missing rows, Spark side
Key: MAHOUT-1597 / URL: https://issues.apache.org/jira/browse/MAHOUT-1597 / Project: Mahout / Issue Type: Bug / Affects Versions: 0.9 / Reporter: Dmitriy Lyubimov / Assignee: Dmitriy Lyubimov / Fix For: 1.0
Description:

{code}
// Concoct an rdd with missing rows
val aRdd: DrmRdd[Int] = sc.parallelize(
  0 -> dvec(1, 2, 3) ::
  3 -> dvec(3, 4, 5) :: Nil
).map { case (key, vec) => key -> (vec: Vector) }

val drmA = drmWrap(rdd = aRdd)

val controlB = inCoreA + 1.0
val drmB = drmA + 1.0

(drmB -: controlB).norm should be < 1e-10
{code}

should not fail. It was failing because the elementwise scalar operator only evaluates rows actually present in the dataset. In the case of Int-keyed row matrices, there are implied rows that may not be present in the RDD. Our goal is to detect the condition and evaluate missing rows prior to physical operators that don't work with missing implied rows.
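The issue description suggests the shape of the fix: implied rows must be materialized before elementwise physical operators run. Below is a toy sketch of that idea only, not the committed fix (which landed in the OpAewScalar/CheckpointedDrmSpark files listed above); the DrmRdd alias location is an assumption, and the collect-based key scan is only viable for small matrices.

{code}
import org.apache.spark.SparkContext
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.sparkbindings.DrmRdd // assumed alias for RDD[(Int, Vector)]

// Toy sketch: pad an Int-keyed row RDD so that every row index in [0, nrow)
// is present, giving absent rows an all-zero sparse vector. The real fix
// detects the condition and evaluates missing rows lazily before the
// physical operator, per the issue description.
def withImpliedRows(rdd: DrmRdd[Int], nrow: Int, ncol: Int)
                   (implicit sc: SparkContext): DrmRdd[Int] = {
  val present = rdd.map(_._1).collect().toSet
  val fillers = (0 until nrow).filterNot(present).map { key =>
    key -> (new RandomAccessSparseVector(ncol): Vector)
  }
  rdd.union(sc.parallelize(fillers))
}
{code}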
Re: Requiring Java 1.7 for Mahout
On Wed, Aug 6, 2014 at 3:48 PM, Suneel Marthi suneel.mar...@gmail.com wrote:

> It should work fine with Java 1.7. Mahout's presently at Lucene 4.6.x, and Lucene versions >= 4.7 mandate JDK 1.7.

For what it's worth, the current version of Lucene is 4.9.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088376#comment-14088376 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user avati commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51406299

Actually, I'm not sure if this would work against H2O, as the code is doing observationsPerLabel.map(new MatrixOps(_).colSums), which happens on RDD (and not on DRM, because of the implicit conversion). We would need to generic'ize that somehow.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088382#comment-14088382 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51406631

@andrewpalumbo can you perhaps comment on the code itself so we see what you are talking about?
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088385#comment-14088385 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user avati commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51406691

Oops, I misread... the map() happens on Array, my bad! I must admit I do not (yet) know how this code works on DRM in a distributed way (i.e., to compare two backends).
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088391#comment-14088391 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908429

--- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala ---

+package org.apache.mahout.sparkbindings.drm.classification
+
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.scalabindings
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.classifier.naivebayes.NaiveBayesModel
+import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer

--- End diff --

Imports look engine-independent; they should not be Spark-coupled. The imports probably should also include the implicit operations (which is why the code does weird stuff like `new MatrixOps(m)`). The standard recommended way to do imports for engine-independent code:

    import org.apache.mahout.math._
    import scalabindings._
    import RLikeOps._
    import drm._
    import RLikeDrmOps._

If Java collections are used (e.g. something like `for (row <- matrix) { ... }`), then it also needs

    import collection._
    import JavaConversions._

to enable all implicits.
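To make the recommended import stack concrete, here is a small self-contained sketch (matrix values invented for illustration) of the implicit enrichments those imports enable on an in-core matrix:

{code}
import org.apache.mahout.math._
import scalabindings._
import RLikeOps._
import collection._
import JavaConversions._

object ImportsDemo extends App {
  val m = dense((1, 2, 3), (4, 5, 6))

  // With RLikeOps in scope, Matrix is implicitly enriched, so no explicit
  // `new MatrixOps(m)` wrapping is needed:
  val perFeature: Vector = m.colSums  // column sums
  val perLabel: Vector = m.rowSums    // row sums
  val firstRow: Vector = m(0, ::)     // R-like row slicing

  // JavaConversions enables iterating the rows of a Matrix directly:
  for (row <- m) println(row.vector())
}
{code}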
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088394#comment-14088394 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908475

--- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala ---

+/**
+ * Distributed training of a Naive Bayes model. Follows the approach presented in Rennie et al.: "Tackling the poor
+ * assumptions of Naive Bayes text classifiers", ICML 2003, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
+ */
+object NaiveBayes {
+
+  /** default value for the smoothing parameter */
+  def defaultAlphaI = 1f
+
+  /**
+   * Distributed training of a Naive Bayes model.
+   *
+   * @param observationsPerLabel an array of matrices. Every matrix contains the observations for a particular label.
+   * @param trainComplementary whether to train a complementary Naive Bayes model
+   * @param alphaI smoothing parameter
+   * @return trained naive bayes model
+   */
+  def trainNB[K: ClassTag](observationsPerLabel: Array[DrmLike[K]], trainComplementary: Boolean = true,
+                           alphaI: Float = defaultAlphaI): NaiveBayesModel = {
+
+    // distributed summation of all observations per label
+    val weightsPerLabelAndFeature = scalabindings.dense(observationsPerLabel.map(new MatrixOps(_).colSums))

--- End diff --

If imports are properly done, this should just be

    val weightsPerLabelAndFeature = dense(observationsPerLabel.map(_.colSums))

Note that this is not Spark-dependent code; the `map` here is a Scala collection map.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088401#comment-14088401 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908555

--- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala ---

+    // perLabelThetaNormalizer Vector is expected by NaiveBayesModel. We can pass a null value
+    // in the case of a standard NB model
+    var thetaNormalizer: org.apache.mahout.math.Vector = null
+
+    // instantiate a trainer and retrieve the perLabelThetaNormalizer Vector from it in the case of
+    // a complementary NB model
+    if (trainComplementary) {
+      val thetaTrainer = new ComplementaryThetaTrainer(weightsPerFeature, weightsPerLabel, alphaI)
+      // local training of the theta normalization
+      for (labelIndex <- 0 until new MatrixOps(weightsPerLabelAndFeature).nrow) {
+        thetaTrainer.train(labelIndex, weightsPerLabelAndFeature.viewRow(labelIndex))

--- End diff --

In Mahout Scala, this slicing should look like

    ... weightsPerLabelAndFeature(labelIndex, ::)
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088397#comment-14088397 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908508

--- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala ---

+    // local summation of all weights per feature
+    val weightsPerFeature = new MatrixOps(weightsPerLabelAndFeature).colSums
+    // local summation of all weights per label
+    val weightsPerLabel = new MatrixOps(weightsPerLabelAndFeature).rowSums

--- End diff --

Same thing here: no need for `new MatrixOps`... here and elsewhere.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088400#comment-14088400 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51407423

Yes -- and as a reminder, this code is a compilation of patches that were written before the abstraction away from Spark (not by myself). I've not looked at it too closely, and just put it up for comment and feedback for the Berlin TU students. I've been away almost the entire summer and had planned on doing some work on this to get back up to speed and to try out the H2O bindings. So please -- yes, any comments are welcome.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088396#comment-14088396 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908489

--- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala ---

+    // local summation of all weights per feature
+    val weightsPerFeature = new MatrixOps(weightsPerLabelAndFeature).colSums

--- End diff --

Similarly, this should be just

    val weightsPerFeature = weightsPerLabelAndFeature.colSums
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088415#comment-14088415 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908947

--- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala ---

+  /** default value for the smoothing parameter */
+  def defaultAlphaI = 1f

--- End diff --

Mahout convention is to write these as `1.0` rather than a float.
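Putting the review comments from this thread together, the summation and theta-training portion of trainNB would read roughly as below. This is a sketch assembled from the comments, not the merged code: the object wrapper, the `: _*` splat, the returned tuple, and the use of retrievePerLabelThetaNormalizer are assumptions, and model construction is left out as in the patch under review.

{code}
import org.apache.mahout.math.{Matrix, Vector}
import org.apache.mahout.math.scalabindings._
import RLikeOps._
import org.apache.mahout.math.drm._
import RLikeDrmOps._
import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer
import scala.reflect.ClassTag

object NaiveBayesSketch {
  // Summation and complementary theta training after the review fixes:
  // engine-independent imports, no `new MatrixOps` wrapping, R-like slicing.
  def summarize[K: ClassTag](observationsPerLabel: Array[DrmLike[K]],
                             alphaI: Float = 1.0f): (Matrix, Vector) = {
    // Distributed summation per label; the outer map is a plain Scala
    // collection map over a local Array, as noted in the review.
    val weightsPerLabelAndFeature: Matrix =
      dense(observationsPerLabel.map(_.colSums): _*)
    // Local summations over the small aggregate matrix.
    val weightsPerFeature = weightsPerLabelAndFeature.colSums
    val weightsPerLabel = weightsPerLabelAndFeature.rowSums
    // Local training of the theta normalization (complementary model).
    val thetaTrainer =
      new ComplementaryThetaTrainer(weightsPerFeature, weightsPerLabel, alphaI)
    for (labelIndex <- 0 until weightsPerLabelAndFeature.nrow)
      thetaTrainer.train(labelIndex, weightsPerLabelAndFeature(labelIndex, ::))
    (weightsPerLabelAndFeature, thetaTrainer.retrievePerLabelThetaNormalizer())
  }
}
{code}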
CooccurrenceAnalysis[Suite].scala -> math-scala (?)
Sorry, I can't recollect how this discussion ended. Why are we not moving these files to math-scala? The code seems to be engine-independent.
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088433#comment-14088433 ]

ASF GitHub Bot commented on MAHOUT-1603:

GitHub user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51409540

On Wed, Aug 6, 2014 at 3:55 PM, Pat Ferrel notificati...@github.com wrote:

> Sorry, was off the internet during a move (curse you, giant nameless cable company!). Anyway, these tests are substantially changed in #36 (https://github.com/apache/mahout/pull/36), but I haven't been able to get the new build until now; will check and push 36 first.
>
> As to building and tearing down contexts, I'm not helping things. For each driver test, DistributedSparkSuite creates a context in beforeEach, so I use that to start the test. Then the driver I am using needs to start a context, so every time I call a driver I precede it with the afterEach call to shut down the context, then call the driver, then call beforeEach to restore the test context. I also had to tell the driver, via a special invisible option --dontAddMahoutJars, not to load Mahout jars. So the context is being built 3 times for every test. But that hasn't changed; it's always been that way.
>
> We could reuse a single context per test, but it would require disabling some stuff in the driver along the lines of what I had to do with --dontAddMahoutJars. Since I've already had to do this, I don't think it would be a big deal to disable a little more. I'll look at it once 36 is pushed.

Is there any reason to build the context more than once per suite? Usually there's not, and that's exactly what this branch is moving towards (note: this PR is not against master but against a side branch called `spark-1.0.x`). Also, that's what they seem to have done in Spark 1.0 as well. There is sometimes (in my other projects) a need to create a custom context, but not in the Mahout codebase.

> Seems like if I disable the context things in the driver, we could run all tests in a single context, right?

Right. This branch has already switched to doing that.

> All algebra tests seem to be fine, but these tests are failing now. Not sure why; seems functional to me.
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088443#comment-14088443 ] ASF GitHub Bot commented on MAHOUT-1603: Github user pferrel commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51410370 OK, so DistributedSparkSuite moved the context creation into beforeAll? Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088452#comment-14088452 ] ASF GitHub Bot commented on MAHOUT-1603: Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51410605 OK, so DistributedSparkSuite moved the context creation into beforeAll? On this branch, yes. Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
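For readers following the thread, a minimal sketch of the one-context-per-suite pattern being discussed. The suite name is hypothetical, and it assumes ScalaTest's BeforeAndAfterAll plus the sparkbindings context factory; the actual DistributedSparkSuite may differ in detail:

  import org.scalatest.{BeforeAndAfterAll, FunSuite}
  import org.apache.mahout.math.drm.DistributedContext
  import org.apache.mahout.sparkbindings._

  class ExampleDistributedSuite extends FunSuite with BeforeAndAfterAll {

    var mahoutCtx: DistributedContext = _

    override def beforeAll() {
      // One context for the whole suite instead of one per test.
      mahoutCtx = mahoutSparkContext(masterUrl = "local", appName = "ExampleDistributedSuite")
    }

    override def afterAll() {
      // Tear the context down once, after the last test has run.
      mahoutCtx.close()
    }
  }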
co-occurrence paper and code
So, compared to the original paper [1], similarity is now hardcoded and always LLR? Do we have any plans to parameterize that further? Is there any reason to parameterize it? Also, reading the paper, I am a bit puzzled -- similarity and distance are functions that usually move in opposite directions (i.e. cosine similarity and angular distance), yet in the paper distance scores are also considered similarities. How's that? I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
Re: co-occurrence paper and code
On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at the LLR paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of the p-value, if it had been a classic test). [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
Re: co-occurrence paper and code
The entire reference to similarity harks back to the original formulation of the MovieLens and Firefox recommenders, which looked for similarity of rating patterns. That made some sense then, but it is a bit of a tortured turn of phrase when other formulations of recommendation are used. There are currently two general approaches that seem to generate reasonable recommendation results in practice: LLR-based sparsification of cooccurrence and cross-occurrence matrices, and matrix completion techniques typically implemented as some form of factorization. The enormous number of options that Mahout's map-reduce recommender implements have little practical utility and are more an artifact of a desire to implement most of the research algorithms in a single framework. The concept of distance can be useful in matrix factorization since it allows efficient algorithms to be derived. But with the sparsification problem, the concepts of similarity and distance break down, because with cooccurrence we don't just have two answers. Instead, we have three: anomalous cooccurrence, non-anomalous cooccurrence and insufficient data. For the purposes of sparsification, we lump non-anomalous cooccurrence and insufficient data together, but this lumping has the side effect that the score that we get is not a useful measure of association, distance or similarity. Instead, we just put down that anomalously cooccurrent pairs are anomalous (a binary decision) and leave the weighting of them until later. If you are strict about thinking of cooccurrence measures as a distance, you get into measures of the strength of association. These measures will separate anomalous cooccurrence from non-anomalous cooccurrence, but they will smear the insufficient-data cases into both options. Since most pairs have insufficient data, this would be a relatively disastrous thing to do, causing massive numbers of false positives that swamp the valid pairs. The virtue of LLR is that it does not do this, but there is a corollary vice in that the resulting score is not useful as a distance. Regarding the question about similarities and distances being used essentially synonymously, this is relatively common because of the fairly strict anti-correlation between them. Yes, there is a sign change, but they still represent basically the same thing. Elevation and depth are similarly very closely related, and somebody might refer to the elevation of an underwater mountain range above its base or to its depth below the surface. These expressions refer to the same z-axis measurement with different origins. On Wed, Aug 6, 2014 at 5:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: So, compared to original paper [1], similarity is now hardcoded and always LLR? Do we have any plans to parameterize that further? Is there any reason to parameterize it? Also, reading the paper, i am a bit wondering -- similarity and distance are functions that usually are moving into different directions (i.e. cosine similarity and angular distance) but in the paper distance scores are also considered similarities? How's that? I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
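To make the three-outcome point concrete, here is a minimal sketch of the binary keep/drop decision that LLR-based sparsification reduces to. It is an illustration only, not the Mahout implementation; it assumes the `LogLikelihood` helper from mahout-math and a caller-chosen threshold:

  import org.apache.mahout.math.stats.LogLikelihood

  // k11: A and B cooccur; k12: A without B; k21: B without A; k22: neither.
  def keepPair(k11: Long, k12: Long, k21: Long, k22: Long, threshold: Double): Boolean = {
    val llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22)
    // Non-anomalous cooccurrence and insufficient data both fall below the
    // threshold and are lumped together; only anomalous pairs survive.
    llr > threshold
  }

Note that the Boolean result deliberately discards the score, matching the point above: the LLR value decides anomaly, but it is not kept as a distance or similarity weight.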
Re: co-occurrence paper and code
Is this: `val bcastNumInteractions = drmBroadcast(drmI.numNonZeroElementsPerColumn)` any different from just saying `drmI.colSums`? On Wed, Aug 6, 2014 at 4:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088486#comment-14088486 ] ASF GitHub Bot commented on MAHOUT-1603: Github user pferrel commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51413783 Do you want to push this with the ignores and I'll fix them to use the new DistributedSparkSuite as it gets into the master? BTW any reason we aren't doing Scala 2.11 since we are upping to Java 7 and Spark 1? Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis
[ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088487#comment-14088487 ] ASF GitHub Bot commented on MAHOUT-1541: Github user pferrel closed the pull request at: https://github.com/apache/mahout/pull/36 Create CLI Driver for Spark Cooccurrence Analysis - Key: MAHOUT-1541 URL: https://issues.apache.org/jira/browse/MAHOUT-1541 Project: Mahout Issue Type: New Feature Components: CLI Reporter: Pat Ferrel Assignee: Pat Ferrel Create a CLI driver to import data in a flexible manner, create an IndexedDataset with BiMap ID translation dictionaries, call the Spark CooccurrenceAnalysis with the appropriate params, then write output with external IDs optionally reattached. Ultimately it should be able to read input as the legacy mr does but will support reading externally defined IDs and flexible formats. Output will be of the legacy format or text files of the user's specification with reattached Item IDs. Support for legacy formats is a question, users can always use the legacy code if they want this. Internal to the IndexedDataset is a Spark DRM so pipelining can be accomplished without any writing to an actual file so the legacy sequence file output may not be needed. Opinions? -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: co-occurrence paper and code
On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
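For reference, the statistic under discussion in its usual G^2 form, for observed counts $O_{ij}$ in a 2x2 contingency table and expected counts $E_{ij}$ under independence:

$$ G^2 = 2 \sum_{i,j} O_{ij} \ln \frac{O_{ij}}{E_{ij}} $$

Asymptotically it is $\chi^2$-distributed with one degree of freedom, just like Pearson's $X^2 = \sum_{i,j} (O_{ij} - E_{ij})^2 / E_{ij}$, which it replaces without the normal approximation.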
Re: co-occurrence paper and code
On Wed, Aug 6, 2014 at 4:57 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test LLR is a classic test. What I meant here is that it doesn't produce a p-value. Or does it? It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088492#comment-14088492 ] ASF GitHub Bot commented on MAHOUT-1493: Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51414355 @dlyubimov thanks for all the comments. I'm going to try to get the changes in shortly. Port Naive Bayes to the Spark DSL - Key: MAHOUT-1493 URL: https://issues.apache.org/jira/browse/MAHOUT-1493 Project: Mahout Issue Type: Bug Components: Classification Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493a.patch Port our Naive Bayes implementation to the new spark dsl. Shouldn't require more than a few lines of code. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: co-occurrence paper and code
Yes, because the matrix A’A is not necessarily boolean. The actual value is ignored, but it’s in the matrix, so colSums was not correct. On Aug 6, 2014, at 4:54 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: is this val bcastNumInteractions = drmBroadcast(drmI.numNonZeroElementsPerColumn) any different from just saying `drmI.colSums`? On Wed, Aug 6, 2014 at 4:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
Re: co-occurrence paper and code
I thought the `drmI` argument here meant an indicator matrix (i.e. 1.0 or 0.0)? On Wed, Aug 6, 2014 at 5:03 PM, Pat Ferrel pat.fer...@gmail.com wrote: Yes because the matrix A’A is not necessarily boolean. The actual value is ignored but it’s in the matrix so the colSums was not correct. On Aug 6, 2014, at 4:54 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: is this val bcastNumInteractions = drmBroadcast(drmI.numNonZeroElementsPerColumn) any different from just saying `drmI.colSums`? On Wed, Aug 6, 2014 at 4:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
Re: co-occurrence paper and code
On Wed, Aug 6, 2014 at 6:01 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: LLR is a classic test. What i meant here it doesn't produce a p-value. or does it? It produces an asymptotically chi^2-distributed statistic with 1 degree of freedom (for our case of 2x2 contingency tables), which can be reduced trivially to a p-value in the standard way. It is as much a classic test as a t-test, a chi^2 test, an F-test or any other of the gazillion usual suspects.
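A minimal sketch of that reduction, assuming Apache commons-math3 on the classpath (the helper name is hypothetical; the statistic is simply pushed through the upper tail of a chi^2 distribution with one degree of freedom):

  import org.apache.commons.math3.distribution.ChiSquaredDistribution

  // llr: the G^2 / LLR statistic computed from a 2x2 contingency table.
  def pValue(llr: Double): Double = {
    val chiSq = new ChiSquaredDistribution(1.0) // one degree of freedom
    1.0 - chiSq.cumulativeProbability(llr)      // upper-tail probability
  }

  // e.g. pValue(3.84) comes out to roughly 0.05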
Re: co-occurrence paper and code
I chose against porting all the similarity measures to the DSL version of the cooccurrence analysis for two reasons. First, adding the measures in a generalizable way makes the code super hard to read. Second, in practice, I have never seen anything give better results than LLR. As Ted pointed out, a lot of the foundations of using similarity measures come from wanting to predict ratings, which people never do in practice. I think we should restrict ourselves to approaches that work with implicit, count-like data. -s On 06.08.2014 16:58, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
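For readers who want to see where the 2x2 counts come from, here is a sketch of the standard construction from interaction totals. The variable names are hypothetical, and the actual CooccurrenceAnalysis code differs in its surrounding plumbing:

  import org.apache.mahout.math.stats.LogLikelihood

  // numAB:      cooccurrences of items A and B
  // numA, numB: total interactions involving each item
  // total:      total interactions in the data set
  def llrScore(numAB: Long, numA: Long, numB: Long, total: Long): Double = {
    val k11 = numAB                       // both A and B
    val k12 = numA - numAB                // A without B
    val k21 = numB - numAB                // B without A
    val k22 = total - numA - numB + numAB // neither
    LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22)
  }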
Re: co-occurrence paper and code
On Wed, Aug 6, 2014 at 6:03 PM, Pat Ferrel pat.fer...@gmail.com wrote: Yes because the matrix A’A is not necessarily boolean. The actual value is ignored but it’s in the matrix so the colSums was not correct. On Aug 6, 2014, at 4:54 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: is this val bcastNumInteractions = drmBroadcast(drmI.numNonZeroElementsPerColumn) any different from just saying `drmI.colSums`? Ignore what I said and listen to this guy. I forgot that this was after the cooccurrence counting.
Re: co-occurrence paper and code
On Wed, Aug 6, 2014 at 4:53 PM, Ted Dunning ted.dunn...@gmail.com wrote: strict anti-correlation between them. Yes, there is a sign change, but they still are representing basically the same thing. Elevation and depth Sure. This is basic knowledge. The reason I asked is that the original paper gives the formulation without a sign change in section 4.4 (e.g. the cosine similarity and Manhattan distance formulas) and bills it as a functional parameter to the similarity calculation. That would seem to result in a technical error as described there, since they make no mention of this distinction at all. I was just wondering if this was compensated for somewhere else that I don't immediately see. On Wed, Aug 6, 2014 at 5:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: So, compared to original paper [1], similarity is now hardcoded and always LLR? Do we have any plans to parameterize that further? Is there any reason to parameterize it? Also, reading the paper, i am a bit wondering -- similarity and distance are functions that usually are moving into different directions (i.e. cosine similarity and angular distance) but in the paper distance scores are also considered similarities? How's that? I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
Re: co-occurrence paper and code
We went around this maypole a while ago. It is the same if the matrix is binary. It isn't otherwise. Whether this code might someday be used in a context with non-binary inputs is an open question. Whether it is worth saving some time by omitting a thresholding operation to binarize the matrix is similarly unclear. My feeling is that assuming a binary matrix is fine. On Wed, Aug 6, 2014 at 5:54 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: is this val bcastNumInteractions = drmBroadcast(drmI.numNonZeroElementsPerColumn) any different from just saying `drmI.colSums`? On Wed, Aug 6, 2014 at 4:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
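A toy illustration of the point, in plain Scala and independent of the DRM API, showing that the two quantities coincide exactly when the matrix is binary and diverge otherwise:

  val m = Array(Array(1.0, 0.0, 2.0),
                Array(1.0, 1.0, 0.0))

  val colSums   = m.transpose.map(_.sum)                      // Array(2.0, 1.0, 2.0)
  val nnzPerCol = m.transpose.map(_.count(_ != 0.0).toDouble) // Array(2.0, 1.0, 1.0)
  // Equal in the first two columns, whose entries are 0/1; the third column
  // differs as soon as a non-unit value such as 2.0 appears.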
Re: co-occurrence paper and code
On Wed, Aug 6, 2014 at 5:04 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 6:01 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: LLR is a classic test. What i meant here it doesn't produce a p-value. or does it? It produces an asymptotically chi^2 distributed statistic with 1-degree of freedom (for our case of 2x2 contingency tables) which can be reduced trivially to a p-value in the standard way. Great. So that means we can do H_0 rejection based on a percentage-expressed significance level?
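For reference, the rejection rule implied here, at significance level $\alpha$:

$$ \text{reject } H_0 \quad \text{if} \quad G^2 > \chi^2_{1,\,1-\alpha}, \qquad \text{e.g. } \chi^2_{1,\,0.95} \approx 3.84 \text{ for } \alpha = 0.05. $$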
Re: co-occurrence paper and code
Asking because I am considering pulling this implementation, but for some (mostly political) reasons people want to try different things here. I may also have to start with a different way of constructing co-occurrences, and may do a few optimizations there (e.g. the priority queue queuing/enqueuing does twice the work it really needs to, etc.). On Wed, Aug 6, 2014 at 5:05 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: I chose against porting all the similarity measures to the dsl version of the cooccurrence analysis for two reasons. First, adding the measures in a generalizable way makes the code superhard to read. Second, in practice, I have never seen something giving better results than llr. As Ted pointed out, a lot of the foundations of using similarity measures comes from wanting to predict ratings, which people never do in practice. I think we should restrict ourselves to approaches that work with implicit, count-like data. -s On 06.08.2014 16:58, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
Re: co-occurrence paper and code
What I mean here is that I probably need to refactor it a little, so that there's a part of the algorithm that accepts co-occurrence input directly and is somewhat decoupled from the part that accepts user x item input and does downsampling and co-occurrence construction. That way I could do some customization of my own to the co-occurrence construction. Would that be reasonable? On Wed, Aug 6, 2014 at 5:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Asking because i am considering pulling this implementation but for some (mostly political) reasons people want to try different things here. I may also have to start with a different way of constructing co-occurrences, and may do a few optimizations there (i.e. priority queue queing/enqueing does twice the work it really needs to do etc.) On Wed, Aug 6, 2014 at 5:05 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: I chose against porting all the similarity measures to the dsl version of the cooccurrence analysis for two reasons. First, adding the measures in a generalizable way makes the code superhard to read. Second, in practice, I have never seen something giving better results than llr. As Ted pointed out, a lot of the foundations of using similarity measures comes from wanting to predict ratings, which people never do in practice. I think we should restrict ourselves to approaches that work with implicit, count-like data. -s On 06.08.2014 16:58, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
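A sketch of the split being proposed, with hypothetical signatures only (this is not the current Mahout API): co-occurrence construction decoupled from LLR scoring, so a caller can supply its own cooccurrence matrix:

  import scala.reflect.ClassTag
  import org.apache.mahout.math.drm.DrmLike

  // Stage 1: downsample the user x item interactions and build cooccurrences.
  def cooccurrences[K: ClassTag](drmInteractions: DrmLike[K]): DrmLike[Int] = ???

  // Stage 2: LLR-sparsify an already-built cooccurrence matrix.
  // Callers with a custom cooccurrence construction would enter here directly.
  def llrSparsify(drmCooccurrences: DrmLike[Int]): DrmLike[Int] = ???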
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088506#comment-14088506 ] ASF GitHub Bot commented on MAHOUT-1603: Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51415419 On Wed, Aug 6, 2014 at 4:56 PM, Pat Ferrel notificati...@github.com wrote: Do you want to push this with the ignores and I'll fix them to use the new DistributedSparkSuite as it gets into the master? No, I probably don't want to merge it with non-working tests. As usual, I can add you as a collaborator in my account (if I have not yet done so) so you can push directly to my source branch of this (so it reflects in the PR instantaneously), or you can PR against my spark-1.0.x branch, or you can just send me a regular git patch by email, whichever works. BTW any reason we aren't doing Scala 2.11 since we are upping to Java 7 and Spark 1? The reason the Scala version is fixed where it is is that it is paired to Spark's version of Scala. Migration between major versions of Scala is a big deal, for Spark and otherwise. Stuff will not work. Minor versions of Scala should generally be portable. — Reply to this email directly or view it on GitHub https://github.com/apache/mahout/pull/40#issuecomment-51413783. Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: co-occurrence paper and code
Sounds good to me. -s On 06.08.2014 17:15, Dmitriy Lyubimov dlie...@gmail.com wrote: what i mean here i probably need to refactor it a little so that there's part of algorithm that accepts co-occurrence input directly and which is somewhat decoupled from the part that accepts u x item input and does downsampling and co-occurrence construction. So i could do some customization of my own to co-occurrence construction. Would that be reasonable if i do that? On Wed, Aug 6, 2014 at 5:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Asking because i am considering pulling this implementation but for some (mostly political) reasons people want to try different things here. I may also have to start with a different way of constructing co-occurrences, and may do a few optimizations there (i.e. priority queue queing/enqueing does twice the work it really needs to do etc.) On Wed, Aug 6, 2014 at 5:05 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: I chose against porting all the similarity measures to the dsl version of the cooccurrence analysis for two reasons. First, adding the measures in a generalizable way makes the code superhard to read. Second, in practice, I have never seen something giving better results than llr. As Ted pointed out, a lot of the foundations of using similarity measures comes from wanting to predict ratings, which people never do in practice. I think we should restrict ourselves to approaches that work with implicit, count-like data. -s On 06.08.2014 16:58, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088513#comment-14088513 ] ASF GitHub Bot commented on MAHOUT-1603: Github user avati commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51415479 Scala 2.11 port of Spark is in progress [https://issues.apache.org/jira/browse/SPARK-1812] Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088517#comment-14088517 ] ASF GitHub Bot commented on MAHOUT-1603: Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51415624 Sure. There's tons of stuff in progress, but we can only use released artifacts as dependencies. On Wed, Aug 6, 2014 at 5:19 PM, Anand Avati notificati...@github.com wrote: Scala 2.11 port of Spark is in progress [ https://issues.apache.org/jira/browse/SPARK-1812] — Reply to this email directly or view it on GitHub https://github.com/apache/mahout/pull/40#issuecomment-51415479. Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088520#comment-14088520 ] ASF GitHub Bot commented on MAHOUT-1603: Github user avati commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51415697 Only meant as an FYI (in case someone is planning anything). Of course we have to wait for a release. Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088526#comment-14088526 ] ASF GitHub Bot commented on MAHOUT-1603: Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51416019 Alternatively, you can also just give me a verbal hint about what I need to fix, and I can try to patch it to the best of my ability. On Wed, Aug 6, 2014 at 5:18 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:56 PM, Pat Ferrel notificati...@github.com wrote: Do you want to push this with the ignores and I'll fix them to use the new DistributedSparkSuite as it gets into the master? No, I probably don't want to merge it with non-working tests. As usual, I can add you as a collaborator in my account (if I have not yet done so) so you can push directly to my source branch of this (so it reflects in the PR instantaneously), or you can PR against my spark-1.0.x branch, or you can just send me a regular git patch by email, whichever works. BTW any reason we aren't doing Scala 2.11 since we are upping to Java 7 and Spark 1? The reason the Scala version is fixed where it is is that it is paired to Spark's version of Scala. Migration between major versions of Scala is a big deal, for Spark and otherwise. Stuff will not work. Minor versions of Scala should generally be portable. — Reply to this email directly or view it on GitHub https://github.com/apache/mahout/pull/40#issuecomment-51413783. Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: co-occurrence paper and code
BTW, the cooccurrence code is going into RSJ (RowSimilarityJob) too, and there are uses of that where cosine is expected. I don’t know how to think about cross-cosine. Is there an argument for LLR only in RSJ? On Aug 6, 2014, at 5:20 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: Sounds good to me. -s On 06.08.2014 17:15, Dmitriy Lyubimov dlie...@gmail.com wrote: what i mean here i probably need to refactor it a little so that there's part of algorithm that accepts co-occurrence input directly and which is somewhat decoupled from the part that accepts u x item input and does downsampling and co-occurrence construction. So i could do some customization of my own to co-occurrence construction. Would that be reasonable if i do that? On Wed, Aug 6, 2014 at 5:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Asking because i am considering pulling this implementation but for some (mostly political) reasons people want to try different things here. I may also have to start with a different way of constructing co-occurrences, and may do a few optimizations there (i.e. priority queue queing/enqueing does twice the work it really needs to do etc.) On Wed, Aug 6, 2014 at 5:05 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: I chose against porting all the similarity measures to the dsl version of the cooccurrence analysis for two reasons. First, adding the measures in a generalizable way makes the code superhard to read. Second, in practice, I have never seen something giving better results than llr. As Ted pointed out, a lot of the foundations of using similarity measures comes from wanting to predict ratings, which people never do in practice. I think we should restrict ourselves to approaches that work with implicit, count-like data. -s On 06.08.2014 16:58, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088538#comment-14088538 ] ASF GitHub Bot commented on MAHOUT-1603: Github user pferrel commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51417124 OK, I just pushed the new tests; maybe they work. Don't laugh, it could happen. There are likely to be problems with my calling afterEach and beforeEach, since their meaning has changed. Fixing this will require mods to the driver too, I expect, and it'll probably be easier for me to do it. If you are almost ready with this, I'll upgrade to Spark 1.0.1 and grab your branch. Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1568) Build an I/O model that can replace sequence files for import/export
[ https://issues.apache.org/jira/browse/MAHOUT-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088620#comment-14088620 ] Hudson commented on MAHOUT-1568: SUCCESS: Integrated in Mahout-Quality #2733 (See [https://builds.apache.org/job/Mahout-Quality/2733/]) MAHOUT-1541, MAHOUT-1568, MAHOUT-1569 refactoring the options parser and option defaults to DRY up individual driver code putting more in base classes, tightened up the test suite with a better way of comparing actual with correct (pat: rev a80974037853c5227f9e5ef1c384a1fca134746e)
* math-scala/src/main/scala/org/apache/mahout/math/cf/CooccurrenceAnalysis.scala
* spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/io/MahoutKryoRegistrator.scala
* spark/src/main/scala/org/apache/mahout/drivers/MahoutOptionParser.scala
* spark/src/main/scala/org/apache/mahout/drivers/IndexedDataset.scala
* spark/src/main/scala/org/apache/mahout/drivers/MahoutDriver.scala
* spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala
* spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala
* spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
* spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
* spark/src/main/scala/org/apache/mahout/drivers/Schema.scala
* spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
Build an I/O model that can replace sequence files for import/export Key: MAHOUT-1568 URL: https://issues.apache.org/jira/browse/MAHOUT-1568 Project: Mahout Issue Type: New Feature Components: CLI Environment: Scala, Spark Reporter: Pat Ferrel Assignee: Pat Ferrel Implement mechanisms to read and write data from/to flexible stores. These will support tuples streams and drms but with extensions that allow keeping user defined values for IDs. The mechanism in some sense can replace Sequence Files for import/export and will make the operation much easier for the user. In many cases directly consuming their input files. Start with text delimited files for input/output in the Spark version of ItemSimilarity A proposal is running with ItemSimilarity on Spark and is documented on the github wiki here: https://github.com/pferrel/harness/wiki Comments are appreciated -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis
[ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088622#comment-14088622 ] Hudson commented on MAHOUT-1541: SUCCESS: Integrated in Mahout-Quality #2733 (See [https://builds.apache.org/job/Mahout-Quality/2733/]) MAHOUT-1541, MAHOUT-1568, MAHOUT-1569 refactoring the options parser and option defaults to DRY up individual driver code putting more in base classes, tightened up the test suite with a better way of comparing actual with correct (pat: rev a80974037853c5227f9e5ef1c384a1fca134746e)
* math-scala/src/main/scala/org/apache/mahout/math/cf/CooccurrenceAnalysis.scala
* spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/io/MahoutKryoRegistrator.scala
* spark/src/main/scala/org/apache/mahout/drivers/MahoutOptionParser.scala
* spark/src/main/scala/org/apache/mahout/drivers/IndexedDataset.scala
* spark/src/main/scala/org/apache/mahout/drivers/MahoutDriver.scala
* spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala
* spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala
* spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
* spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
* spark/src/main/scala/org/apache/mahout/drivers/Schema.scala
* spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
Create CLI Driver for Spark Cooccurrence Analysis - Key: MAHOUT-1541 URL: https://issues.apache.org/jira/browse/MAHOUT-1541 Project: Mahout Issue Type: New Feature Components: CLI Reporter: Pat Ferrel Assignee: Pat Ferrel Create a CLI driver to import data in a flexible manner, create an IndexedDataset with BiMap ID translation dictionaries, call the Spark CooccurrenceAnalysis with the appropriate params, then write output with external IDs optionally reattached. Ultimately it should be able to read input as the legacy mr does but will support reading externally defined IDs and flexible formats. Output will be of the legacy format or text files of the user's specification with reattached Item IDs. Support for legacy formats is a question, users can always use the legacy code if they want this. Internal to the IndexedDataset is a Spark DRM so pipelining can be accomplished without any writing to an actual file so the legacy sequence file output may not be needed. Opinions? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1569) Create CLI driver that supports Spark jobs
[ https://issues.apache.org/jira/browse/MAHOUT-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088621#comment-14088621 ] Hudson commented on MAHOUT-1569: SUCCESS: Integrated in Mahout-Quality #2733 (See [https://builds.apache.org/job/Mahout-Quality/2733/]) MAHOUT-1541, MAHOUT-1568, MAHOUT-1569 refactoring the options parser and option defaults to DRY up individual driver code putting more in base classes, tightened up the test suite with a better way of comparing actual with correct (pat: rev a80974037853c5227f9e5ef1c384a1fca134746e)
* math-scala/src/main/scala/org/apache/mahout/math/cf/CooccurrenceAnalysis.scala
* spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/io/MahoutKryoRegistrator.scala
* spark/src/main/scala/org/apache/mahout/drivers/MahoutOptionParser.scala
* spark/src/main/scala/org/apache/mahout/drivers/IndexedDataset.scala
* spark/src/main/scala/org/apache/mahout/drivers/MahoutDriver.scala
* spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala
* spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala
* spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
* spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
* spark/src/main/scala/org/apache/mahout/drivers/Schema.scala
* spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
Create CLI driver that supports Spark jobs -- Key: MAHOUT-1569 URL: https://issues.apache.org/jira/browse/MAHOUT-1569 Project: Mahout Issue Type: New Feature Components: CLI Environment: Scala, Spark Reporter: Pat Ferrel Assignee: Pat Ferrel Create a design for CLI drivers, including an option parser, base MahoutDriver for Spark, that uses a text file I/O mechanism MAHOUT-1568 A version of the proposal is implemented and running for ItemSimilarity on Spark. MAHOUT-1541 A proposal is running with ItemSimilarity on Spark and is documented on the github wiki here: https://github.com/pferrel/harness/wiki Comments are appreciated -- This message was sent by Atlassian JIRA (v6.2#6252)