[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088160#comment-14088160 ]

ASF GitHub Bot commented on MAHOUT-1603:

GitHub user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51389335

@pferrel perhaps you could look at ItemSimilaritySuite -- it doesn't work on Spark 1.0 here. I disabled the tests for now since they are failing.

Tweaks for Spark 1.0.x
Key: MAHOUT-1603 / URL: https://issues.apache.org/jira/browse/MAHOUT-1603 / Project: Mahout / Issue Type: Task / Affects Versions: 0.9 / Reporter: Dmitriy Lyubimov / Assignee: Dmitriy Lyubimov / Fix For: 1.0
Description: Tweaks necessary to run the current codebase on top of Spark 1.0.x.
Requiring Java 1.7 for Mahout
As far as I can tell there should be no problems with declaring Java 1.7 as the official minimum Java version for building and running Mahout. Are there any objections to this or problems that I am missing? Andy
Re: Requiring Java 1.7 for Mahout
The only problem is that we are not really requiring it. We are not using any 1.7 functionality. If people compile Mahout (as I do), they can target any bytecode version they want. There are some 1.7 artifact dependencies in H2O, but 1.7 would be required at run time only, and only if people are actually using h2obindings as a dependency (which I expect the majority would not care about).
[jira] [Updated] (MAHOUT-1601) Add javadoc for the classes - as there is no clue what the class is for.
[ https://issues.apache.org/jira/browse/MAHOUT-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harish Kayarohanam updated MAHOUT-1601: Issue Type: Documentation (was: Bug)

Add javadoc for the classes - as there is no clue what the class is for.
Key: MAHOUT-1601 / URL: https://issues.apache.org/jira/browse/MAHOUT-1601 / Project: Mahout / Issue Type: Documentation / Components: Documentation / Reporter: Harish Kayarohanam / Priority: Minor / Labels: documentation
Description: I found that the following classes

org.apache.mahout.cf.taste.impl.neighborhood.DummySimilarity
org.apache.mahout.cf.taste.impl.similarity.GenericUserSimilarity
org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity

did not have javadoc, so I was unable to find what these classes are for. Shall we add javadoc for the same?
[jira] [Updated] (MAHOUT-1601) Add javadoc for the classes - as there is no clue what the class is for.
[ https://issues.apache.org/jira/browse/MAHOUT-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harish Kayarohanam updated MAHOUT-1601: Priority: Minor (was: Major)
RE: Requiring Java 1.7 for Mahout
You're right -- my big concern is that on our (probably outdated) building-from-source page we have 1.6 listed: http://mahout.apache.org/developers/buildingmahout.html The obvious simple fix here is to make the quick change on the webpage to 1.7 in order to build and test successfully. I do remember something about being limited in our current Lucene version by 1.6, though, so I am wondering if this may be a good time to push for or require 1.7. Just covering our bases, so I'll drop it if there's no problem here. Thanks
Re: Requiring Java 1.7 for Mahout
My current Java is 1.6.0_38; I have no problem building.
Re: Requiring Java 1.7 for Mahout
Or testing.
RE: Requiring Java 1.7 for Mahout
Oracle?
RE: Requiring Java 1.7 for Mahout
Also, sorry -- btw, I'm assuming MAHOUT-1500 will be merged.
Re: Requiring Java 1.7 for Mahout
My own feeling is that 1.6 is finally dying out, and officially moving to 1.7 allows some nice capabilities which assist in writing reliable code. Of the major changes, here are my reactions after about a year of using 1.7 seriously:

IO and New IO - The nio package is significantly better and more coherent in 7. The differences creep up on you if you are not looking for them, but add up over time. It is hard to point at any single important change, however.

Networking - The networking integrates IPv6 better. This probably has no impact on Mahout.

Concurrency Utilities - Concurrency has some much better capabilities in Java 7. Fork/join and work-stealing both make significant differences in capabilities for threaded applications. Obviously, we don't benefit from these capabilities in existing code, but it would be nice to be free to use them in new code.

Java XML (JAXP, JAXB, and JAX-WS) - I don't know how much this matters, given that Jackson works so very well.

Strings in switch statements - This really helps some code.

The try-with-resources statement - This is the biggest deal for me. Handling exceptions and closeable resources correctly is incredibly difficult without this (see the Guava rationale for removing closeQuietly).

Catching multiple exception types and rethrowing exceptions with improved type checking - This makes a lot of code much simpler. Exception-handling code is verbose and difficult to get right with many branches doing nearly the same thing.

Underscores in numeric literals - Cute. Helps readability.

Type inference for generic instance creation - For me this is a small issue, since IntelliJ does the type inference for me and hides type parameters that I don't want to know about.

Java Virtual Machine (JVM) - The Java 7 JVM seems to me to be a bit more performant in a number of areas. These differences aren't night and day.

JVM support for non-Java languages - This impacts the ability to integrate non-Java languages such as Jython and JavaScript with programs. Little direct impact on Mahout, given our recent focus on Scala and given the fact that Scala is reportedly jumping directly from Java 6 to Java 8 bytecode as of 2.12.

Garbage-First collector - The G1 collector is heaps better (pun intended) than previous collectors. I have had several programs with long-lived data structures work much better with G1. Server configuration is vastly simpler.

Java HotSpot Virtual Machine performance enhancements - These speak for themselves.
Re: Requiring Java 1.7 for Mahout
I am not sure it actually would require 1.7 to build either, since my understanding is that those dependencies are second-order and deeper, not immediate. Did you try to compile it yet?
RE: Requiring Java 1.7 for Mahout
It does not require 1.7 to build. I've been running 1.6 as well. I did compile m-1500. It builds fine with 1.6, but tests fail (only the h2o module -- as you said, due to the h2o artifact being built with 1.7). My thinking is that we don't want new Mahout users building with 1.6, having tests fail, and walking away. Can we release with failing tests (even if it's 1.6-specific)? As well, if there were other issues with 1.6 holding us back -- 1.6 is getting old and there are no real drawbacks -- maybe we should consider 1.7 as the official version. Or, as I said, just make a quick fix on the building-from-source page.
Re: Requiring Java 1.7 for Mahout
Sorry if this is not adding to the discussion, but based on what you are saying, my feeling is that all of that is a false dilemma. (Assuming h2obindings also compile with 1.6 and we will find a way to iron out the 1.6 test issue easily enough. If not, bummer then.)

Requiring != supporting. Current master supports both things in terms of build/runtime compatibility; requiring 1.7 means supporting only one thing. Where I come from, having two things is usually better than having one, unless one of the two is given away in favor of something substantially better. The only such thing presumably worth the sacrifice would be new code contributions to Mahout that absolutely require 1.7 for semantic reasons (since runtime 1.7, and I suspect even 1.8, are already supported). Some new dependencies that come only in 1.7 artifacts might be another one.

As it stands, Mahout's master branch currently has exactly zero semantically-1.7 or -1.8 Java in its code base. Until such a contribution appears, the issue seems moot (not sure about 1500 -- I never tried to compile it, so it might be a valid reason to move up). And such a contribution is not terribly likely to appear, because Mahout is leaning towards Scala contributions now, so new substantial Java contributions are not terribly likely. Also, in the community where I tweet, there's a general sense that there's nothing in 1.7 or 1.8 that upends Scala in any meaningful way, so the tools around will likely see just more Scala-based stuff (Scalding, Spark, Mahout algebra, MLOptimizer, Breeze, to name just a few examples of the newer and more popular Scala stuff). On the Java side, on the other hand, there have been practically no new projects of similar scale introduced in the past couple of years.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088350#comment-14088350 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51404288

So, can we move the package, please?

Port Naive Bayes to the Spark DSL
Key: MAHOUT-1493 / URL: https://issues.apache.org/jira/browse/MAHOUT-1493 / Project: Mahout / Issue Type: Bug / Components: Classification / Reporter: Sebastian Schelter / Assignee: Sebastian Schelter / Fix For: 1.0 / Attachments: MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493a.patch
Description: Port our Naive Bayes implementation to the new Spark DSL. Shouldn't require more than a few lines of code.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088365#comment-14088365 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51405333

Absolutely! I was just getting ready to do this and write some tests.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088367#comment-14088367 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51405454

I am hoping to see how the (short) Spark and H2O tests compare.
[jira] [Commented] (MAHOUT-1597) A + 1.0 (element-wise scalar operation) gives wrong result if rdd is missing rows, Spark side
[ https://issues.apache.org/jira/browse/MAHOUT-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088369#comment-14088369 ]

Hudson commented on MAHOUT-1597:

SUCCESS: Integrated in Mahout-Quality #2732 (See [https://builds.apache.org/job/Mahout-Quality/2732/])
MAHOUT-1597: A + 1.0 (fixes) (dlyubimov: rev 7a50a291b4598e9809f9acf609b92175ce7f953b)
* spark/src/test/scala/org/apache/mahout/sparkbindings/drm/DrmLikeSuite.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAewScalar.scala

A + 1.0 (element-wise scalar operation) gives wrong result if rdd is missing rows, Spark side
Key: MAHOUT-1597 / URL: https://issues.apache.org/jira/browse/MAHOUT-1597 / Project: Mahout / Issue Type: Bug / Affects Versions: 0.9 / Reporter: Dmitriy Lyubimov / Assignee: Dmitriy Lyubimov / Fix For: 1.0
Description:

{code}
// Concoct an rdd with missing rows
val aRdd: DrmRdd[Int] = sc.parallelize(
  0 -> dvec(1, 2, 3) ::
  3 -> dvec(3, 4, 5) :: Nil
).map { case (key, vec) => key -> (vec: Vector) }

val drmA = drmWrap(rdd = aRdd)

val controlB = inCoreA + 1.0
val drmB = drmA + 1.0

(drmB -: controlB).norm should be < 1e-10
{code}

should not fail. It was failing because the elementwise scalar operator only evaluates rows actually present in the dataset. In the case of Int-keyed row matrices, there are implied rows that may not be present in the RDD. Our goal is to detect the condition and evaluate missing rows prior to physical operators that don't work with missing implied rows.
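The issue description suggests the shape of the fix: implied rows must be materialized before elementwise physical operators run. Below is a toy sketch of that idea only, not the committed fix (which landed in the OpAewScalar/CheckpointedDrmSpark files listed above); the DrmRdd alias location is an assumption, and the collect-based key scan is only viable for small matrices.

{code}
import org.apache.spark.SparkContext
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.sparkbindings.DrmRdd // assumed alias for RDD[(Int, Vector)]

// Toy sketch: pad an Int-keyed row RDD so that every row index in [0, nrow)
// is present, giving absent rows an all-zero sparse vector. The real fix
// detects the condition and evaluates missing rows lazily before the
// physical operator, per the issue description.
def withImpliedRows(rdd: DrmRdd[Int], nrow: Int, ncol: Int)
                   (implicit sc: SparkContext): DrmRdd[Int] = {
  val present = rdd.map(_._1).collect().toSet
  val fillers = (0 until nrow).filterNot(present).map { key =>
    key -> (new RandomAccessSparseVector(ncol): Vector)
  }
  rdd.union(sc.parallelize(fillers))
}
{code}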
Re: Requiring Java 1.7 for Mahout
On Wed, Aug 6, 2014 at 3:48 PM, Suneel Marthi suneel.mar...@gmail.com wrote:

> It should work fine with Java 1.7. Mahout's presently at Lucene 4.6.x, and Lucene versions >= 4.7 mandate JDK 1.7.

For what it's worth, the current version of Lucene is 4.9.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088376#comment-14088376 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user avati commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51406299

Actually, I'm not sure if this would work against H2O, as the code is doing observationsPerLabel.map(new MatrixOps(_).colSums), which happens on RDD (and not on DRM, because of the implicit conversion). We would need to generic'ize that somehow.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088382#comment-14088382 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51406631

@andrewpalumbo can you perhaps comment on the code itself so we see what you are talking about?
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088385#comment-14088385 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user avati commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51406691

Oops, I misread... the map() happens on Array, my bad! I must admit I do not (yet) know how this code works on DRM in a distributed way (i.e., to compare two backends).
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088391#comment-14088391 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908429

--- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala ---

+package org.apache.mahout.sparkbindings.drm.classification
+
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.scalabindings
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.classifier.naivebayes.NaiveBayesModel
+import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer

--- End diff --

Imports look engine-independent; they should not be Spark-coupled. The imports probably should also include the implicit operations (which is why the code does weird stuff like `new MatrixOps(m)`). The standard recommended way to do imports for engine-independent code:

    import org.apache.mahout.math._
    import scalabindings._
    import RLikeOps._
    import drm._
    import RLikeDrmOps._

If Java collections are used (e.g. something like `for (row <- matrix) { ... }`), then it also needs

    import collection._
    import JavaConversions._

to enable all implicits.
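To make the recommended import stack concrete, here is a small self-contained sketch (matrix values invented for illustration) of the implicit enrichments those imports enable on an in-core matrix:

{code}
import org.apache.mahout.math._
import scalabindings._
import RLikeOps._
import collection._
import JavaConversions._

object ImportsDemo extends App {
  val m = dense((1, 2, 3), (4, 5, 6))

  // With RLikeOps in scope, Matrix is implicitly enriched, so no explicit
  // `new MatrixOps(m)` wrapping is needed:
  val perFeature: Vector = m.colSums  // column sums
  val perLabel: Vector = m.rowSums    // row sums
  val firstRow: Vector = m(0, ::)     // R-like row slicing

  // JavaConversions enables iterating the rows of a Matrix directly:
  for (row <- m) println(row.vector())
}
{code}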
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088394#comment-14088394 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908475

--- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala ---

+/**
+ * Distributed training of a Naive Bayes model. Follows the approach presented in Rennie et al.: "Tackling the poor
+ * assumptions of Naive Bayes text classifiers", ICML 2003, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
+ */
+object NaiveBayes {
+
+  /** default value for the smoothing parameter */
+  def defaultAlphaI = 1f
+
+  /**
+   * Distributed training of a Naive Bayes model.
+   *
+   * @param observationsPerLabel an array of matrices. Every matrix contains the observations for a particular label.
+   * @param trainComplementary whether to train a complementary Naive Bayes model
+   * @param alphaI smoothing parameter
+   * @return trained naive bayes model
+   */
+  def trainNB[K: ClassTag](observationsPerLabel: Array[DrmLike[K]], trainComplementary: Boolean = true,
+                           alphaI: Float = defaultAlphaI): NaiveBayesModel = {
+
+    // distributed summation of all observations per label
+    val weightsPerLabelAndFeature = scalabindings.dense(observationsPerLabel.map(new MatrixOps(_).colSums))

--- End diff --

If imports are properly done, this should just be

    val weightsPerLabelAndFeature = dense(observationsPerLabel.map(_.colSums))

Note that this is not Spark-dependent code; the `map` here is a Scala collection map.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088401#comment-14088401 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908555

--- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala ---

+    // perLabelThetaNormalizer Vector is expected by NaiveBayesModel. We can pass a null value
+    // in the case of a standard NB model
+    var thetaNormalizer: org.apache.mahout.math.Vector = null
+
+    // instantiate a trainer and retrieve the perLabelThetaNormalizer Vector from it in the case of
+    // a complementary NB model
+    if (trainComplementary) {
+      val thetaTrainer = new ComplementaryThetaTrainer(weightsPerFeature, weightsPerLabel, alphaI)
+      // local training of the theta normalization
+      for (labelIndex <- 0 until new MatrixOps(weightsPerLabelAndFeature).nrow) {
+        thetaTrainer.train(labelIndex, weightsPerLabelAndFeature.viewRow(labelIndex))

--- End diff --

In Mahout Scala, this slicing should look like

    ... weightsPerLabelAndFeature(labelIndex, ::)
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088397#comment-14088397 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908508

--- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala ---

+    // local summation of all weights per feature
+    val weightsPerFeature = new MatrixOps(weightsPerLabelAndFeature).colSums
+    // local summation of all weights per label
+    val weightsPerLabel = new MatrixOps(weightsPerLabelAndFeature).rowSums

--- End diff --

Same thing here: no need for `new MatrixOps`... here and elsewhere.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088400#comment-14088400 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51407423

Yes -- and as a reminder, this code is a compilation of patches that were written before the abstraction away from Spark (not by myself). I've not looked at it too closely, and just put it up for comment and feedback for the Berlin TU students. I've been away almost the entire summer and had planned on doing some work on this to get back up to speed and to try out the H2O bindings. So please -- yes, any comments are welcome.
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088396#comment-14088396 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908489

--- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala ---

+    // local summation of all weights per feature
+    val weightsPerFeature = new MatrixOps(weightsPerLabelAndFeature).colSums

--- End diff --

Similarly, this should be just

    val weightsPerFeature = weightsPerLabelAndFeature.colSums
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088415#comment-14088415 ]

ASF GitHub Bot commented on MAHOUT-1493:

GitHub user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/32#discussion_r15908947

--- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/classification/NaiveBayes.scala ---

+  /** default value for the smoothing parameter */
+  def defaultAlphaI = 1f

--- End diff --

Mahout convention is to write these as `1.0` rather than a float.
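Putting the review comments from this thread together, the summation and theta-training portion of trainNB would read roughly as below. This is a sketch assembled from the comments, not the merged code: the object wrapper, the `: _*` splat, the returned tuple, and the use of retrievePerLabelThetaNormalizer are assumptions, and model construction is left out as in the patch under review.

{code}
import org.apache.mahout.math.{Matrix, Vector}
import org.apache.mahout.math.scalabindings._
import RLikeOps._
import org.apache.mahout.math.drm._
import RLikeDrmOps._
import org.apache.mahout.classifier.naivebayes.training.ComplementaryThetaTrainer
import scala.reflect.ClassTag

object NaiveBayesSketch {
  // Summation and complementary theta training after the review fixes:
  // engine-independent imports, no `new MatrixOps` wrapping, R-like slicing.
  def summarize[K: ClassTag](observationsPerLabel: Array[DrmLike[K]],
                             alphaI: Float = 1.0f): (Matrix, Vector) = {
    // Distributed summation per label; the outer map is a plain Scala
    // collection map over a local Array, as noted in the review.
    val weightsPerLabelAndFeature: Matrix =
      dense(observationsPerLabel.map(_.colSums): _*)
    // Local summations over the small aggregate matrix.
    val weightsPerFeature = weightsPerLabelAndFeature.colSums
    val weightsPerLabel = weightsPerLabelAndFeature.rowSums
    // Local training of the theta normalization (complementary model).
    val thetaTrainer =
      new ComplementaryThetaTrainer(weightsPerFeature, weightsPerLabel, alphaI)
    for (labelIndex <- 0 until weightsPerLabelAndFeature.nrow)
      thetaTrainer.train(labelIndex, weightsPerLabelAndFeature(labelIndex, ::))
    (weightsPerLabelAndFeature, thetaTrainer.retrievePerLabelThetaNormalizer())
  }
}
{code}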
CooccurrenceAnalysis[Suite].scala -> math-scala (?)
Sorry, I can't recollect how this discussion ended. Why are we not moving these files to math-scala? The code seems to be engine-independent.
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088433#comment-14088433 ]

ASF GitHub Bot commented on MAHOUT-1603:

GitHub user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51409540

On Wed, Aug 6, 2014 at 3:55 PM, Pat Ferrel notificati...@github.com wrote:

> Sorry, was off the internet during a move (curse you, giant nameless cable company!). Anyway, these tests are substantially changed in #36 (https://github.com/apache/mahout/pull/36), but I haven't been able to get the new build until now; will check and push 36 first.
>
> As to building and tearing down contexts, I'm not helping things. For each driver test, DistributedSparkSuite creates a context in beforeEach, so I use that to start the test. Then the driver I am using needs to start a context, so every time I call a driver I precede it with the afterEach call to shut down the context, then call the driver, then call beforeEach to restore the test context. I also had to tell the driver, via a special invisible option --dontAddMahoutJars, not to load Mahout jars. So the context is being built 3 times for every test. But that hasn't changed; it's always been that way.
>
> We could reuse a single context per test, but it would require disabling some stuff in the driver along the lines of what I had to do with --dontAddMahoutJars. Since I've already had to do this, I don't think it would be a big deal to disable a little more. I'll look at it once 36 is pushed.

Is there any reason to build the context more than once per suite? Usually there's not, and that's exactly what this branch is moving towards (note: this PR is not against master but against a side branch called `spark-1.0.x`). Also, that's what they seem to have done in Spark 1.0 as well. There is sometimes (in my other projects) a need to create a custom context, but not in the Mahout codebase.

> Seems like if I disable the context things in the driver, we could run all tests in a single context, right?

Right. This branch has already switched to doing that.

> All algebra tests seem to be fine, but these tests are failing now. Not sure why; seems functional to me.
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088443#comment-14088443 ] ASF GitHub Bot commented on MAHOUT-1603: Github user pferrel commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51410370 OK, so DistributedSparkSuite moved the context creation into beforeAll? Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088452#comment-14088452 ] ASF GitHub Bot commented on MAHOUT-1603: Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51410605 OK, so DistributedSparkSuite moved the context creation into beforeAll? On this branch, yes. Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
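For readers following the thread, a minimal sketch of the one-context-per-suite pattern being discussed. The suite name is hypothetical, and it assumes ScalaTest's BeforeAndAfterAll plus the sparkbindings context factory; the actual DistributedSparkSuite may differ in detail:

  import org.scalatest.{BeforeAndAfterAll, FunSuite}
  import org.apache.mahout.math.drm.DistributedContext
  import org.apache.mahout.sparkbindings._

  class ExampleDistributedSuite extends FunSuite with BeforeAndAfterAll {

    var mahoutCtx: DistributedContext = _

    override def beforeAll() {
      // One context for the whole suite instead of one per test.
      mahoutCtx = mahoutSparkContext(masterUrl = "local", appName = "ExampleDistributedSuite")
    }

    override def afterAll() {
      // Tear the context down once, after the last test has run.
      mahoutCtx.close()
    }
  }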
co-occurrence paper and code
So, compared to the original paper [1], similarity is now hardcoded and always LLR? Do we have any plans to parameterize that further? Is there any reason to parameterize it? Also, reading the paper, I am a bit puzzled -- similarity and distance are functions that usually move in opposite directions (i.e. cosine similarity and angular distance), yet in the paper distance scores are also considered similarities. How's that? I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
Re: co-occurrence paper and code
On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at the LLR paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of the p-value, if it had been a classic test). [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
Re: co-occurrence paper and code
The entire reference to similarity harks back to the original formulation of the MovieLens and Firefox recommenders, which looked for similarity of rating patterns. That made some sense then, but it is a bit of a tortured turn of phrase when other formulations of recommendation are used. There are currently two general approaches that seem to generate reasonable recommendation results in practice: LLR-based sparsification of cooccurrence and cross-occurrence matrices, and matrix completion techniques typically implemented as some form of factorization. The enormous number of options that Mahout's map-reduce recommender implements have little practical utility and are more an artifact of a desire to implement most of the research algorithms in a single framework. The concept of distance can be useful in matrix factorization since it allows efficient algorithms to be derived. But with the sparsification problem, the concepts of similarity and distance break down, because with cooccurrence we don't just have two answers. Instead, we have three: anomalous cooccurrence, non-anomalous cooccurrence and insufficient data. For the purposes of sparsification, we lump non-anomalous cooccurrence and insufficient data together, but this lumping has the side effect that the score that we get is not a useful measure of association, distance or similarity. Instead, we just put down that anomalously cooccurrent pairs are anomalous (a binary decision) and leave the weighting of them until later. If you are strict about thinking of cooccurrence measures as a distance, you get into measures of the strength of association. These measures will separate anomalous cooccurrence from non-anomalous cooccurrence, but they will smear the insufficient-data cases into both options. Since most pairs have insufficient data, this would be a relatively disastrous thing to do, causing massive numbers of false positives that swamp the valid pairs. The virtue of LLR is that it does not do this, but there is a corollary vice in that the resulting score is not useful as a distance. Regarding the question about similarities and distances being used essentially synonymously, this is relatively common because of the fairly strict anti-correlation between them. Yes, there is a sign change, but they still represent basically the same thing. Elevation and depth are similarly very closely related, and somebody might refer to the elevation of an underwater mountain range above its base or to its depth below the surface. These expressions refer to the same z-axis measurement with different origins. On Wed, Aug 6, 2014 at 5:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: So, compared to original paper [1], similarity is now hardcoded and always LLR? Do we have any plans to parameterize that further? Is there any reason to parameterize it? Also, reading the paper, i am a bit wondering -- similarity and distance are functions that usually are moving into different directions (i.e. cosine similarity and angular distance) but in the paper distance scores are also considered similarities? How's that? I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
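To make the three-outcome point concrete, here is a minimal sketch of the binary keep/drop decision that LLR-based sparsification reduces to. It is an illustration only, not the Mahout implementation; it assumes the `LogLikelihood` helper from mahout-math and a caller-chosen threshold:

  import org.apache.mahout.math.stats.LogLikelihood

  // k11: A and B cooccur; k12: A without B; k21: B without A; k22: neither.
  def keepPair(k11: Long, k12: Long, k21: Long, k22: Long, threshold: Double): Boolean = {
    val llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22)
    // Non-anomalous cooccurrence and insufficient data both fall below the
    // threshold and are lumped together; only anomalous pairs survive.
    llr > threshold
  }

Note that the Boolean result deliberately discards the score, matching the point above: the LLR value decides anomaly, but it is not kept as a distance or similarity weight.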
Re: co-occurrence paper and code
Is this: `val bcastNumInteractions = drmBroadcast(drmI.numNonZeroElementsPerColumn)` any different from just saying `drmI.colSums`? On Wed, Aug 6, 2014 at 4:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088486#comment-14088486 ] ASF GitHub Bot commented on MAHOUT-1603: Github user pferrel commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51413783 Do you want to push this with the ignores and I'll fix them to use the new DistributedSparkSuite as it gets into the master? BTW any reason we aren't doing Scala 2.11 since we are upping to Java 7 and Spark 1? Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis
[ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088487#comment-14088487 ] ASF GitHub Bot commented on MAHOUT-1541: Github user pferrel closed the pull request at: https://github.com/apache/mahout/pull/36 Create CLI Driver for Spark Cooccurrence Analysis - Key: MAHOUT-1541 URL: https://issues.apache.org/jira/browse/MAHOUT-1541 Project: Mahout Issue Type: New Feature Components: CLI Reporter: Pat Ferrel Assignee: Pat Ferrel Create a CLI driver to import data in a flexible manner, create an IndexedDataset with BiMap ID translation dictionaries, call the Spark CooccurrenceAnalysis with the appropriate params, then write output with external IDs optionally reattached. Ultimately it should be able to read input as the legacy mr does but will support reading externally defined IDs and flexible formats. Output will be of the legacy format or text files of the user's specification with reattached Item IDs. Support for legacy formats is a question, users can always use the legacy code if they want this. Internal to the IndexedDataset is a Spark DRM so pipelining can be accomplished without any writing to an actual file so the legacy sequence file output may not be needed. Opinions? -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: co-occurrence paper and code
On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
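For reference, the statistic under discussion in its usual G^2 form, for observed counts $O_{ij}$ in a 2x2 contingency table and expected counts $E_{ij}$ under independence:

$$ G^2 = 2 \sum_{i,j} O_{ij} \ln \frac{O_{ij}}{E_{ij}} $$

Asymptotically it is $\chi^2$-distributed with one degree of freedom, just like Pearson's $X^2 = \sum_{i,j} (O_{ij} - E_{ij})^2 / E_{ij}$, which it replaces without the normal approximation.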
Re: co-occurrence paper and code
On Wed, Aug 6, 2014 at 4:57 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test LLR is a classic test. What I meant here is that it doesn't produce a p-value. Or does it? It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
[jira] [Commented] (MAHOUT-1493) Port Naive Bayes to the Spark DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088492#comment-14088492 ] ASF GitHub Bot commented on MAHOUT-1493: Github user andrewpalumbo commented on the pull request: https://github.com/apache/mahout/pull/32#issuecomment-51414355 @dlyubimov thanks for all the comments. I'm going to try to get the changes in shortly. Port Naive Bayes to the Spark DSL - Key: MAHOUT-1493 URL: https://issues.apache.org/jira/browse/MAHOUT-1493 Project: Mahout Issue Type: Bug Components: Classification Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493a.patch Port our Naive Bayes implementation to the new spark dsl. Shouldn't require more than a few lines of code. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: co-occurrence paper and code
Yes, because the matrix A’A is not necessarily boolean. The actual value is ignored, but it’s in the matrix, so colSums was not correct. On Aug 6, 2014, at 4:54 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: is this val bcastNumInteractions = drmBroadcast(drmI.numNonZeroElementsPerColumn) any different from just saying `drmI.colSums`? On Wed, Aug 6, 2014 at 4:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
Re: co-occurrence paper and code
I thought the `drmI` argument here meant an indicator matrix (i.e. 1.0 or 0.0)? On Wed, Aug 6, 2014 at 5:03 PM, Pat Ferrel pat.fer...@gmail.com wrote: Yes because the matrix A’A is not necessarily boolean. The actual value is ignored but it’s in the matrix so the colSums was not correct. On Aug 6, 2014, at 4:54 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: is this val bcastNumInteractions = drmBroadcast(drmI.numNonZeroElementsPerColumn) any different from just saying `drmI.colSums`? On Wed, Aug 6, 2014 at 4:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
Re: co-occurrence paper and code
On Wed, Aug 6, 2014 at 6:01 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: LLR is a classic test. What i meant here it doesn't produce a p-value. or does it? It produces an asymptotically chi^2-distributed statistic with 1 degree of freedom (for our case of 2x2 contingency tables), which can be reduced trivially to a p-value in the standard way. It is as much a classic test as a t-test, a chi^2 test, an F-test or any other of the gazillion usual suspects.
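A minimal sketch of that reduction, assuming Apache commons-math3 on the classpath (the helper name is hypothetical; the statistic is simply pushed through the upper tail of a chi^2 distribution with one degree of freedom):

  import org.apache.commons.math3.distribution.ChiSquaredDistribution

  // llr: the G^2 / LLR statistic computed from a 2x2 contingency table.
  def pValue(llr: Double): Double = {
    val chiSq = new ChiSquaredDistribution(1.0) // one degree of freedom
    1.0 - chiSq.cumulativeProbability(llr)      // upper-tail probability
  }

  // e.g. pValue(3.84) comes out to roughly 0.05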
Re: co-occurrence paper and code
I chose against porting all the similarity measures to the DSL version of the cooccurrence analysis for two reasons. First, adding the measures in a generalizable way makes the code super hard to read. Second, in practice, I have never seen anything give better results than LLR. As Ted pointed out, a lot of the foundations of using similarity measures come from wanting to predict ratings, which people never do in practice. I think we should restrict ourselves to approaches that work with implicit, count-like data. -s On 06.08.2014 16:58, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
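For readers who want to see where the 2x2 counts come from, here is a sketch of the standard construction from interaction totals. The variable names are hypothetical, and the actual CooccurrenceAnalysis code differs in its surrounding plumbing:

  import org.apache.mahout.math.stats.LogLikelihood

  // numAB:      cooccurrences of items A and B
  // numA, numB: total interactions involving each item
  // total:      total interactions in the data set
  def llrScore(numAB: Long, numA: Long, numB: Long, total: Long): Double = {
    val k11 = numAB                       // both A and B
    val k12 = numA - numAB                // A without B
    val k21 = numB - numAB                // B without A
    val k22 = total - numA - numB + numAB // neither
    LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22)
  }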
Re: co-occurrence paper and code
On Wed, Aug 6, 2014 at 6:03 PM, Pat Ferrel pat.fer...@gmail.com wrote: Yes because the matrix A’A is not necessarily boolean. The actual value is ignored but it’s in the matrix so the colSums was not correct. On Aug 6, 2014, at 4:54 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: is this val bcastNumInteractions = drmBroadcast(drmI.numNonZeroElementsPerColumn) any different from just saying `drmI.colSums`? Ignore what I said and listen to this guy. I forgot that this was after the cooccurrence counting.
Re: co-occurrence paper and code
On Wed, Aug 6, 2014 at 4:53 PM, Ted Dunning ted.dunn...@gmail.com wrote: strict anti-correlation between them. Yes, there is a sign change, but they still are representing basically the same thing. Elevation and depth Sure. This is basic knowledge. The reason I asked is that the original paper gives the formulation without a sign change in section 4.4 (e.g. the cosine similarity and Manhattan distance formulas) and bills it as a functional parameter to the similarity calculation. That would seem to result in a technical error as described there, since they make no mention of this distinction at all. I was just wondering if this was compensated for somewhere else that I don't immediately see. On Wed, Aug 6, 2014 at 5:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: So, compared to original paper [1], similarity is now hardcoded and always LLR? Do we have any plans to parameterize that further? Is there any reason to parameterize it? Also, reading the paper, i am a bit wondering -- similarity and distance are functions that usually are moving into different directions (i.e. cosine similarity and angular distance) but in the paper distance scores are also considered similarities? How's that? I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
Re: co-occurrence paper and code
We went around this maypole a while ago. It is the same if the matrix is binary. It isn't otherwise. Whether this code might someday be used in a context with non-binary inputs is an open question. Whether it is worth saving some time by omitting a thresholding operation to binarize the matrix is similarly unclear. My feeling is that assuming a binary matrix is fine. On Wed, Aug 6, 2014 at 5:54 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: is this val bcastNumInteractions = drmBroadcast(drmI.numNonZeroElementsPerColumn) any different from just saying `drmI.colSums`? On Wed, Aug 6, 2014 at 4:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test [1] http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf -d
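A toy illustration of the point, in plain Scala and independent of the DRM API, showing that the two quantities coincide exactly when the matrix is binary and diverge otherwise:

  val m = Array(Array(1.0, 0.0, 2.0),
                Array(1.0, 1.0, 0.0))

  val colSums   = m.transpose.map(_.sum)                      // Array(2.0, 1.0, 2.0)
  val nnzPerCol = m.transpose.map(_.count(_ != 0.0).toDouble) // Array(2.0, 1.0, 1.0)
  // Equal in the first two columns, whose entries are 0/1; the third column
  // differs as soon as a non-unit value such as 2.0 appears.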
Re: co-occurrence paper and code
On Wed, Aug 6, 2014 at 5:04 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 6:01 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: LLR is a classic test. What i meant here it doesn't produce a p-value. or does it? It produces an asymptotically chi^2 distributed statistic with 1-degree of freedom (for our case of 2x2 contingency tables) which can be reduced trivially to a p-value in the standard way. Great. So that means we can do H_0 rejection based on a percentage-expressed significance level?
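For reference, the rejection rule implied here, at significance level $\alpha$:

$$ \text{reject } H_0 \quad \text{if} \quad G^2 > \chi^2_{1,\,1-\alpha}, \qquad \text{e.g. } \chi^2_{1,\,0.95} \approx 3.84 \text{ for } \alpha = 0.05. $$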
Re: co-occurrence paper and code
Asking because I am considering pulling this implementation, but for some (mostly political) reasons people want to try different things here. I may also have to start with a different way of constructing co-occurrences, and may do a few optimizations there (e.g. the priority queue queuing/enqueuing does twice the work it really needs to, etc.). On Wed, Aug 6, 2014 at 5:05 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: I chose against porting all the similarity measures to the dsl version of the cooccurrence analysis for two reasons. First, adding the measures in a generalizable way makes the code superhard to read. Second, in practice, I have never seen something giving better results than llr. As Ted pointed out, a lot of the foundations of using similarity measures comes from wanting to predict ratings, which people never do in practice. I think we should restrict ourselves to approaches that work with implicit, count-like data. -s On 06.08.2014 16:58, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
Re: co-occurrence paper and code
What I mean here is that I probably need to refactor it a little, so that there's a part of the algorithm that accepts co-occurrence input directly and is somewhat decoupled from the part that accepts user x item input and does downsampling and co-occurrence construction. That way I could do some customization of my own to the co-occurrence construction. Would that be reasonable? On Wed, Aug 6, 2014 at 5:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Asking because i am considering pulling this implementation but for some (mostly political) reasons people want to try different things here. I may also have to start with a different way of constructing co-occurrences, and may do a few optimizations there (i.e. priority queue queing/enqueing does twice the work it really needs to do etc.) On Wed, Aug 6, 2014 at 5:05 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: I chose against porting all the similarity measures to the dsl version of the cooccurrence analysis for two reasons. First, adding the measures in a generalizable way makes the code superhard to read. Second, in practice, I have never seen something giving better results than llr. As Ted pointed out, a lot of the foundations of using similarity measures comes from wanting to predict ratings, which people never do in practice. I think we should restrict ourselves to approaches that work with implicit, count-like data. -s On 06.08.2014 16:58, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
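A sketch of the split being proposed, with hypothetical signatures only (this is not the current Mahout API): co-occurrence construction decoupled from LLR scoring, so a caller can supply its own cooccurrence matrix:

  import scala.reflect.ClassTag
  import org.apache.mahout.math.drm.DrmLike

  // Stage 1: downsample the user x item interactions and build cooccurrences.
  def cooccurrences[K: ClassTag](drmInteractions: DrmLike[K]): DrmLike[Int] = ???

  // Stage 2: LLR-sparsify an already-built cooccurrence matrix.
  // Callers with a custom cooccurrence construction would enter here directly.
  def llrSparsify(drmCooccurrences: DrmLike[Int]): DrmLike[Int] = ???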
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088506#comment-14088506 ] ASF GitHub Bot commented on MAHOUT-1603: Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51415419 On Wed, Aug 6, 2014 at 4:56 PM, Pat Ferrel notificati...@github.com wrote: Do you want to push this with the ignores and I'll fix them to use the new DistributedSparkSuite as it gets into the master? No, I probably don't want to merge it with non-working tests. As usual, I can add you as a collaborator in my account (if I have not yet done so) so you can push directly to my source branch of this (so it reflects in the PR instantaneously), or you can PR against my spark-1.0.x branch, or you can just send me a regular git patch by email, whichever works. BTW any reason we aren't doing Scala 2.11 since we are upping to Java 7 and Spark 1? The reason the Scala version is fixed where it is is that it is paired to Spark's version of Scala. Migration between major versions of Scala is a big deal, for Spark and otherwise. Stuff will not work. Minor versions of Scala should generally be portable. — Reply to this email directly or view it on GitHub https://github.com/apache/mahout/pull/40#issuecomment-51413783. Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: co-occurrence paper and code
Sounds good to me. -s On 06.08.2014 17:15, Dmitriy Lyubimov dlie...@gmail.com wrote: what i mean here i probably need to refactor it a little so that there's part of algorithm that accepts co-occurrence input directly and which is somewhat decoupled from the part that accepts u x item input and does downsampling and co-occurrence construction. So i could do some customization of my own to co-occurrence construction. Would that be reasonable if i do that? On Wed, Aug 6, 2014 at 5:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Asking because i am considering pulling this implementation but for some (mostly political) reasons people want to try different things here. I may also have to start with a different way of constructing co-occurrences, and may do a few optimizations there (i.e. priority queue queing/enqueing does twice the work it really needs to do etc.) On Wed, Aug 6, 2014 at 5:05 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: I chose against porting all the similarity measures to the dsl version of the cooccurrence analysis for two reasons. First, adding the measures in a generalizable way makes the code superhard to read. Second, in practice, I have never seen something giving better results than llr. As Ted pointed out, a lot of the foundations of using similarity measures comes from wanting to predict ratings, which people never do in practice. I think we should restrict ourselves to approaches that work with implicit, count-like data. -s On 06.08.2014 16:58, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088513#comment-14088513 ] ASF GitHub Bot commented on MAHOUT-1603: Github user avati commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51415479 Scala 2.11 port of Spark is in progress [https://issues.apache.org/jira/browse/SPARK-1812] Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088517#comment-14088517 ] ASF GitHub Bot commented on MAHOUT-1603: Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51415624 Sure. There's tons of stuff in progress, but we can only use released artifacts as dependencies. On Wed, Aug 6, 2014 at 5:19 PM, Anand Avati notificati...@github.com wrote: Scala 2.11 port of Spark is in progress [ https://issues.apache.org/jira/browse/SPARK-1812] — Reply to this email directly or view it on GitHub https://github.com/apache/mahout/pull/40#issuecomment-51415479. Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088520#comment-14088520 ] ASF GitHub Bot commented on MAHOUT-1603: Github user avati commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51415697 Only meant as an FYI (in case someone is planning anything). Of course we have to wait for a release. Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088526#comment-14088526 ] ASF GitHub Bot commented on MAHOUT-1603: Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51416019 Alternatively, you can also just give me a verbal hint about what I need to fix, and I can try to patch it to the best of my ability. On Wed, Aug 6, 2014 at 5:18 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:56 PM, Pat Ferrel notificati...@github.com wrote: Do you want to push this with the ignores and I'll fix them to use the new DistributedSparkSuite as it gets into the master? No, I probably don't want to merge it with non-working tests. As usual, I can add you as a collaborator in my account (if I have not yet done so) so you can push directly to my source branch of this (so it reflects in the PR instantaneously), or you can PR against my spark-1.0.x branch, or you can just send me a regular git patch by email, whichever works. BTW any reason we aren't doing Scala 2.11 since we are upping to Java 7 and Spark 1? The reason the Scala version is fixed where it is is that it is paired to Spark's version of Scala. Migration between major versions of Scala is a big deal, for Spark and otherwise. Stuff will not work. Minor versions of Scala should generally be portable. — Reply to this email directly or view it on GitHub https://github.com/apache/mahout/pull/40#issuecomment-51413783. Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: co-occurrence paper and code
BTW, the cooccurrence code is going into RSJ (RowSimilarityJob) too, and there are uses of that where cosine is expected. I don’t know how to think about cross-cosine. Is there an argument for LLR only in RSJ? On Aug 6, 2014, at 5:20 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: Sounds good to me. -s On 06.08.2014 17:15, Dmitriy Lyubimov dlie...@gmail.com wrote: what i mean here i probably need to refactor it a little so that there's part of algorithm that accepts co-occurrence input directly and which is somewhat decoupled from the part that accepts u x item input and does downsampling and co-occurrence construction. So i could do some customization of my own to co-occurrence construction. Would that be reasonable if i do that? On Wed, Aug 6, 2014 at 5:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Asking because i am considering pulling this implementation but for some (mostly political) reasons people want to try different things here. I may also have to start with a different way of constructing co-occurrences, and may do a few optimizations there (i.e. priority queue queing/enqueing does twice the work it really needs to do etc.) On Wed, Aug 6, 2014 at 5:05 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: I chose against porting all the similarity measures to the dsl version of the cooccurrence analysis for two reasons. First, adding the measures in a generalizable way makes the code superhard to read. Second, in practice, I have never seen something giving better results than llr. As Ted pointed out, a lot of the foundations of using similarity measures comes from wanting to predict ratings, which people never do in practice. I think we should restrict ourselves to approaches that work with implicit, count-like data. -s On 06.08.2014 16:58, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at llr paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of p-value if it had been a classic test LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088538#comment-14088538 ] ASF GitHub Bot commented on MAHOUT-1603: Github user pferrel commented on the pull request: https://github.com/apache/mahout/pull/40#issuecomment-51417124 OK, I just pushed the new tests; maybe they work. Don't laugh, it could happen. There are likely to be problems with my calling afterEach and beforeEach, since their meaning has changed. Fixing this will require mods to the driver too, I expect, and it'll probably be easier for me to do it. If you are almost ready with this, I'll upgrade to Spark 1.0.1 and grab your branch. Tweaks for Spark 1.0.x --- Key: MAHOUT-1603 URL: https://issues.apache.org/jira/browse/MAHOUT-1603 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 Tweaks necessary current codebase on top of spark 1.0.x -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1568) Build an I/O model that can replace sequence files for import/export
[ https://issues.apache.org/jira/browse/MAHOUT-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088620#comment-14088620 ] Hudson commented on MAHOUT-1568: SUCCESS: Integrated in Mahout-Quality #2733 (See [https://builds.apache.org/job/Mahout-Quality/2733/]) MAHOUT-1541, MAHOUT-1568, MAHOUT-1569 refactoring the options parser and option defaults to DRY up individual driver code putting more in base classes, tightened up the test suite with a better way of comparing actual with correct (pat: rev a80974037853c5227f9e5ef1c384a1fca134746e)
* math-scala/src/main/scala/org/apache/mahout/math/cf/CooccurrenceAnalysis.scala
* spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/io/MahoutKryoRegistrator.scala
* spark/src/main/scala/org/apache/mahout/drivers/MahoutOptionParser.scala
* spark/src/main/scala/org/apache/mahout/drivers/IndexedDataset.scala
* spark/src/main/scala/org/apache/mahout/drivers/MahoutDriver.scala
* spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala
* spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala
* spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
* spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
* spark/src/main/scala/org/apache/mahout/drivers/Schema.scala
* spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
Build an I/O model that can replace sequence files for import/export Key: MAHOUT-1568 URL: https://issues.apache.org/jira/browse/MAHOUT-1568 Project: Mahout Issue Type: New Feature Components: CLI Environment: Scala, Spark Reporter: Pat Ferrel Assignee: Pat Ferrel Implement mechanisms to read and write data from/to flexible stores. These will support tuples streams and drms but with extensions that allow keeping user defined values for IDs. The mechanism in some sense can replace Sequence Files for import/export and will make the operation much easier for the user. In many cases directly consuming their input files. Start with text delimited files for input/output in the Spark version of ItemSimilarity A proposal is running with ItemSimilarity on Spark and is documented on the github wiki here: https://github.com/pferrel/harness/wiki Comments are appreciated -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis
[ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088622#comment-14088622 ] Hudson commented on MAHOUT-1541: SUCCESS: Integrated in Mahout-Quality #2733 (See [https://builds.apache.org/job/Mahout-Quality/2733/]) MAHOUT-1541, MAHOUT-1568, MAHOUT-1569 refactoring the options parser and option defaults to DRY up individual driver code putting more in base classes, tightened up the test suite with a better way of comparing actual with correct (pat: rev a80974037853c5227f9e5ef1c384a1fca134746e)
* math-scala/src/main/scala/org/apache/mahout/math/cf/CooccurrenceAnalysis.scala
* spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/io/MahoutKryoRegistrator.scala
* spark/src/main/scala/org/apache/mahout/drivers/MahoutOptionParser.scala
* spark/src/main/scala/org/apache/mahout/drivers/IndexedDataset.scala
* spark/src/main/scala/org/apache/mahout/drivers/MahoutDriver.scala
* spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala
* spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala
* spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
* spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
* spark/src/main/scala/org/apache/mahout/drivers/Schema.scala
* spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
Create CLI Driver for Spark Cooccurrence Analysis - Key: MAHOUT-1541 URL: https://issues.apache.org/jira/browse/MAHOUT-1541 Project: Mahout Issue Type: New Feature Components: CLI Reporter: Pat Ferrel Assignee: Pat Ferrel Create a CLI driver to import data in a flexible manner, create an IndexedDataset with BiMap ID translation dictionaries, call the Spark CooccurrenceAnalysis with the appropriate params, then write output with external IDs optionally reattached. Ultimately it should be able to read input as the legacy mr does but will support reading externally defined IDs and flexible formats. Output will be of the legacy format or text files of the user's specification with reattached Item IDs. Support for legacy formats is a question, users can always use the legacy code if they want this. Internal to the IndexedDataset is a Spark DRM so pipelining can be accomplished without any writing to an actual file so the legacy sequence file output may not be needed. Opinions? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1569) Create CLI driver that supports Spark jobs
[ https://issues.apache.org/jira/browse/MAHOUT-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088621#comment-14088621 ] Hudson commented on MAHOUT-1569: SUCCESS: Integrated in Mahout-Quality #2733 (See [https://builds.apache.org/job/Mahout-Quality/2733/]) MAHOUT-1541, MAHOUT-1568, MAHOUT-1569 refactoring the options parser and option defaults to DRY up individual driver code putting more in base classes, tightened up the test suite with a better way of comparing actual with correct (pat: rev a80974037853c5227f9e5ef1c384a1fca134746e)
* math-scala/src/main/scala/org/apache/mahout/math/cf/CooccurrenceAnalysis.scala
* spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/io/MahoutKryoRegistrator.scala
* spark/src/main/scala/org/apache/mahout/drivers/MahoutOptionParser.scala
* spark/src/main/scala/org/apache/mahout/drivers/IndexedDataset.scala
* spark/src/main/scala/org/apache/mahout/drivers/MahoutDriver.scala
* spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala
* spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala
* spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
* spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
* spark/src/main/scala/org/apache/mahout/drivers/Schema.scala
* spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
Create CLI driver that supports Spark jobs -- Key: MAHOUT-1569 URL: https://issues.apache.org/jira/browse/MAHOUT-1569 Project: Mahout Issue Type: New Feature Components: CLI Environment: Scala, Spark Reporter: Pat Ferrel Assignee: Pat Ferrel Create a design for CLI drivers, including an option parser, base MahoutDriver for Spark, that uses a text file I/O mechanism MAHOUT-1568 A version of the proposal is implemented and running for ItemSimilarity on Spark. MAHOUT-1541 A proposal is running with ItemSimilarity on Spark and is documented on the github wiki here: https://github.com/pferrel/harness/wiki Comments are appreciated -- This message was sent by Atlassian JIRA (v6.2#6252)