Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes

2014-09-18 Thread Dmitriy Lyubimov
You want a REALLY-REALLY big matrix? As in a distributed matrix?

On Thu, Sep 18, 2014 at 12:28 PM, Saikat Kanjilal 
wrote:

>
> http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html
> I need to implement the above in the scala world and expose a DSL API to
> call the computation when computing the affinity matrix.
>
> > From: ted.dunn...@gmail.com
> > Date: Thu, 18 Sep 2014 10:04:34 -0700
> > Subject: Re: Mahout-1539-computation of gaussian kernel between 2 arrays
> > of shapes
> > To: dev@mahout.apache.org
> >
> > There are a number of non-traditional linear algebra operations like this
> > that are important to implement.
> >
> > Can you describe what you intend to do so that we can discuss the shape of
> > the API and computation?
> >
> >
> >
> > On Wed, Sep 17, 2014 at 9:28 PM, Saikat Kanjilal 
> > wrote:
> >
> > > Dmitry et al, As part of the above JIRA I need to calculate the gaussian
> > > kernel between 2 shapes. I looked through mahout-math-scala and didn't
> > > see anything to do this; any objections to me adding some code under
> > > scalabindings to do this?
> > > Thanks in advance.
>
>
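For reference, the computation Saikat describes can be sketched in a few lines. This is a minimal, self-contained illustration of the RBF (Gaussian) kernel from the scikit-learn page, K(x, y) = exp(-gamma * ||x - y||^2), written against plain Scala arrays rather than the Mahout DSL; the function name and signature are illustrative, not an existing Mahout API.

```scala
// RBF (Gaussian) kernel between two vectors, mirroring scikit-learn's
// rbf_kernel: K(x, y) = exp(-gamma * ||x - y||^2).
def rbfKernel(x: Array[Double], y: Array[Double], gamma: Double): Double = {
  require(x.length == y.length, "vectors must have the same dimension")
  // Squared Euclidean distance between the two vectors.
  val sqDist = x.zip(y).map { case (a, b) => val d = a - b; d * d }.sum
  math.exp(-gamma * sqDist)
}
```

A DSL version would presumably take Mahout vectors instead of arrays, but the arithmetic is the same.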


Jenkins build is back to stable : mahout-nightly #1687

2014-09-18 Thread Apache Jenkins Server
See 



Jenkins build is back to stable : mahout-nightly » Mahout Spark bindings #1687

2014-09-18 Thread Apache Jenkins Server
See 




Re: rowsimilarity

2014-09-18 Thread Pat Ferrel
spark-rowsimilarity is implemented with LLR. It produces exactly what is shown 
below; it’s in the test case. It is not really suited to textual doc similarity 
yet since, as you say, more is needed. For text it would be better to: 

1) Run the docs through a Lucene analyzer.
2) Use LLR to filter unneeded terms.
3) TF-IDF-weight the remaining terms.
4) Use cosine to determine similarity strengths. 

Which is what I believe you said. Eventually I’ll get to this. As-is it’s more 
like user similarity.
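A toy sketch of steps 3 and 4 in Pat's list, with docs represented as sparse term-to-count maps. All names here are illustrative, not an existing Mahout API, and steps 1 and 2 (analysis and LLR filtering) are assumed to have already happened.

```scala
// Step 3: TF-IDF weighting. Takes raw term counts for one doc plus the
// corpus document frequencies; a smoothed IDF (1 + df in the denominator)
// is used to avoid division by zero for unseen terms.
def tfidf(counts: Map[String, Double], docFreq: Map[String, Int], numDocs: Int): Map[String, Double] =
  counts.map { case (t, tf) =>
    t -> tf * math.log(numDocs.toDouble / (1 + docFreq.getOrElse(t, 0)))
  }

// Step 4: cosine similarity between two weighted docs (sparse maps).
def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
  val dot = a.keySet.intersect(b.keySet).toSeq.map(t => a(t) * b(t)).sum
  val norm = (v: Map[String, Double]) => math.sqrt(v.values.map(x => x * x).sum)
  if (norm(a) == 0 || norm(b) == 0) 0.0 else dot / (norm(a) * norm(b))
}
```

In a real job these would of course operate on DRM rows rather than Scala maps.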


On Sep 18, 2014, at 11:15 AM, Ted Dunning  wrote:


LLR with text is commonly done (that is where it comes from).

The simple approach would be to have sentences be users and words be items.  
This will result in word-word connections.

This doesn't directly give document-document similarities.  That could be done 
by transposing the original data (word is user, document is item) but I don't 
quite understand how to interpret that.  Another approach is simply using term 
weighting and document normalization and scoring every doc against every other. 
 That comes down to a matrix multiplication which is very similar to the 
transposed LLR problem so that may give an interpretation.


On Mon, Aug 25, 2014 at 10:15 AM, Pat Ferrel  wrote:
LLR with text or non-interaction data. What do we use for counts? Do we care 
how many times a token is seen in a doc, for instance, or do we just look to see 
if it was seen? I assume the latter, which means we need a new 
numNonZeroElementsPerRow in several places in math-scala, right?

All the same questions are going to come up over this as they did for 
numNonZeroElementsPerColumn, so please speak now or I’ll just mirror its 
implementation.
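The counting question above can be made concrete with a toy sketch, assuming the binary "was it seen" semantics and representing rows as sparse term maps rather than a real DRM (the name matches the proposed helper, but the signature here is illustrative):

```scala
// For binary presence/absence counts, each row's contribution to the LLR
// tables is just its number of non-zero entries, not their magnitudes.
def numNonZeroElementsPerRow(rows: Seq[Map[String, Double]]): Seq[Int] =
  rows.map(_.values.count(_ != 0.0))
```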


On Aug 25, 2014, at 9:38 AM, Pat Ferrel  wrote:

Turning itemsimilarity into rowsimilarity is fairly simple but requires 
altering CooccurrenceAnalysis.cooccurrence to swap the transposes and calculate 
the LLR values for rows rather than columns. The input will be something like a 
DRM. Row similarity does something like AA’ with LLR weighting and uses 
downsampling similar to the Hadoop code’s, as I take it. Let me know if I’m on 
the wrong track here.

With the new application ID preserving code the following input could be 
directly processed (it’s my test case)

doc1\tNow is the time for all good people to come to aid of their party
doc2\tNow is the time for all good people to come to aid of their country
doc3\tNow is the time for all good people to come to aid of their hood
doc4\tNow is the time for all good people to come to aid of their friends
doc5\tNow is the time for all good people to come to aid of their looser brother
doc6\tThe quick brown fox jumped over the lazy dog
doc7\tThe quick brown fox jumped over the lazy boy
doc8\tThe quick brown fox jumped over the lazy cat
doc9\tThe quick brown fox jumped over the lazy wolverine
doc10\tThe quick brown fox jumped over the lazy cantelope

The output will be something like the following, with or without LLR strengths.
doc1\tdoc2 doc3 doc4 doc5
…
doc6\tdoc7 doc8 doc9 doc10
...

It would be pretty easy to tack on a text analyzer from Lucene to turn this 
into a full-function doc similarity job since LLR doesn’t need TF-IDF.

One question is: is there any reason to do the cross-similarity in RSJ, so 
[AB’]? I can’t picture what this would mean so am assuming the answer is no.
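The LLR weighting referred to throughout this thread is Dunning's log-likelihood ratio on a 2x2 contingency table. A minimal sketch, following the standard formulation (as in Mahout's LogLikelihood class), where k11 counts both events seen together, k12/k21 one without the other, and k22 neither:

```scala
// x * log(x), with the usual 0 * log(0) = 0 convention.
def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x.toDouble)

// Unnormalized Shannon entropy of a set of counts: N*H.
def entropy(ks: Long*): Double = xLogX(ks.sum) - ks.map(xLogX).sum

// LLR = 2 * N * (mutual information of the 2x2 table).
def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
  val rowE = entropy(k11 + k12, k21 + k22)
  val colE = entropy(k11 + k21, k12 + k22)
  val matE = entropy(k11, k12, k21, k22)
  math.max(0.0, 2.0 * (rowE + colE - matE)) // clamp tiny negative round-off
}
```

Independent counts score near zero; strongly associated counts score high, which is what makes it usable as a term filter without TF-IDF.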






RE: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes

2014-09-18 Thread Saikat Kanjilal
OK great, I'll use the Spark cartesian() API call. What I'd still like is some 
thoughts on where the code that calls cartesian() should live in our 
directory structure.
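For reference, the shape of the computation cartesian() enables can be sketched with plain collections: pair every point with every other, then apply the RBF kernel to each pair to build the affinity matrix. A real version would use RDD.cartesian (or a DSL equivalent); the names here are illustrative.

```scala
// RBF kernel between two points (see the scikit-learn rbf_kernel page).
def rbf(x: Array[Double], y: Array[Double], gamma: Double): Double =
  math.exp(-gamma * x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum)

// Emulates cartesian(): every point paired with every point, each pair
// scored by the kernel, yielding the (symmetric) affinity matrix.
def affinityMatrix(points: IndexedSeq[Array[Double]], gamma: Double): IndexedSeq[IndexedSeq[Double]] =
  for (x <- points) yield for (y <- points) yield rbf(x, y, gamma)
```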
> Date: Thu, 18 Sep 2014 15:33:59 -0400
> From: squ...@gatech.edu
> To: dev@mahout.apache.org
> Subject: Re: Mahout-1539-computation of gaussian kernel between 2 arrays of 
> shapes
> 
> Saikat,
> 
> Spark has the cartesian() method that will align all pairs of points; 
> that's the nontrivial part of determining an RBF kernel. After that it's 
> a simple matter of performing the equation that's given on the 
> scikit-learn doc page.
> 
> However, like you said, it'll also have to be implemented using the 
> Mahout DSL. I can envision that users would like to compute pairwise 
> metrics for a lot more than just RBF kernels (pairwise Euclidean 
> distance, etc), so my guess would be a DSL implementation of cartesian() 
> is what you're looking for. You can build the other methods on top of that.
> 
> Correct me if I'm wrong.
> 
> Shannon
> 
> On 9/18/14, 3:28 PM, Saikat Kanjilal wrote:
> > http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html
> > I need to implement the above in the scala world and expose a DSL API to 
> > call the computation when computing the affinity matrix.
> >
> >> From: ted.dunn...@gmail.com
> >> Date: Thu, 18 Sep 2014 10:04:34 -0700
> >> Subject: Re: Mahout-1539-computation of gaussian kernel between 2 arrays 
> >> of shapes
> >> To: dev@mahout.apache.org
> >>
> >> There are a number of non-traditional linear algebra operations like this
> >> that are important to implement.
> >>
> >> Can you describe what you intend to do so that we can discuss the shape of
> >> the API and computation?
> >>
> >>
> >>
> >> On Wed, Sep 17, 2014 at 9:28 PM, Saikat Kanjilal 
> >> wrote:
> >>
> >>> Dmitry et al, As part of the above JIRA I need to calculate the gaussian
> >>> kernel between 2 shapes. I looked through mahout-math-scala and didn't see
> >>> anything to do this; any objections to me adding some code under
> >>> scalabindings to do this?
> >>> Thanks in advance.
> > 
> 
  

Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes

2014-09-18 Thread Shannon Quinn

Saikat,

Spark has the cartesian() method that will align all pairs of points; 
that's the nontrivial part of determining an RBF kernel. After that it's 
a simple matter of performing the equation that's given on the 
scikit-learn doc page.


However, like you said, it'll also have to be implemented using the 
Mahout DSL. I can envision that users would like to compute pairwise 
metrics for a lot more than just RBF kernels (pairwise Euclidean 
distance, etc), so my guess would be a DSL implementation of cartesian() 
is what you're looking for. You can build the other methods on top of that.


Correct me if I'm wrong.

Shannon

On 9/18/14, 3:28 PM, Saikat Kanjilal wrote:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html
I need to implement the above in the scala world and expose a DSL API to call 
the computation when computing the affinity matrix.


From: ted.dunn...@gmail.com
Date: Thu, 18 Sep 2014 10:04:34 -0700
Subject: Re: Mahout-1539-computation of gaussian kernel between 2 arrays of 
shapes
To: dev@mahout.apache.org

There are a number of non-traditional linear algebra operations like this
that are important to implement.

Can you describe what you intend to do so that we can discuss the shape of
the API and computation?



On Wed, Sep 17, 2014 at 9:28 PM, Saikat Kanjilal 
wrote:


Dmitry et al, As part of the above JIRA I need to calculate the gaussian
kernel between 2 shapes. I looked through mahout-math-scala and didn't see
anything to do this; any objections to me adding some code under
scalabindings to do this?
Thanks in advance.






RE: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes

2014-09-18 Thread Saikat Kanjilal
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html
I need to implement the above in the scala world and expose a DSL API to call 
the computation when computing the affinity matrix.

> From: ted.dunn...@gmail.com
> Date: Thu, 18 Sep 2014 10:04:34 -0700
> Subject: Re: Mahout-1539-computation of gaussian kernel between 2 arrays of 
> shapes
> To: dev@mahout.apache.org
> 
> There are a number of non-traditional linear algebra operations like this
> that are important to implement.
> 
> Can you describe what you intend to do so that we can discuss the shape of
> the API and computation?
> 
> 
> 
> On Wed, Sep 17, 2014 at 9:28 PM, Saikat Kanjilal 
> wrote:
> 
> > Dmitry et al, As part of the above JIRA I need to calculate the gaussian
> > kernel between 2 shapes. I looked through mahout-math-scala and didn't see
> > anything to do this; any objections to me adding some code under
> > scalabindings to do this?
> > Thanks in advance.
  

Re: rowsimilarity

2014-09-18 Thread Ted Dunning
LLR with text is commonly done (that is where it comes from).

The simple approach would be to have sentences be users and words be items.
 This will result in word-word connections.

This doesn't directly give document-document similarities.  That could be
done by transposing the original data (word is user, document is item) but
I don't quite understand how to interpret that.  Another approach is simply
using term weighting and document normalization and scoring every doc
against every other.  That comes down to a matrix multiplication which is
very similar to the transposed LLR problem so that may give an
interpretation.
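Ted's "sentences are users, words are items" framing can be sketched directly: count, for each pair of words, how many sentences contain both. These counts feed the 2x2 tables that LLR scores. A minimal illustration on plain Scala collections (function name is illustrative):

```scala
// Word-word co-occurrence counts from tokenized sentences: for each
// unordered pair of distinct words, the number of sentences containing both.
def cooccurrences(sentences: Seq[Seq[String]]): Map[(String, String), Int] =
  sentences.flatMap { s =>
    val ws = s.distinct                       // presence, not frequency
    for (a <- ws; b <- ws if a < b) yield (a, b) // canonical pair order
  }.groupBy(identity).map { case (pair, hits) => pair -> hits.size }
```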


On Mon, Aug 25, 2014 at 10:15 AM, Pat Ferrel  wrote:

> LLR with text or non-interaction data. What do we use for counts? Do we
> care how many times a token is seen in a doc, for instance, or do we just
> look to see if it was seen? I assume the latter, which means we need a new
> numNonZeroElementsPerRow in several places in math-scala, right?
>
> All the same questions are going to come up over this as they did for
> numNonZeroElementsPerColumn, so please speak now or I’ll just mirror its
> implementation.
>
>
> On Aug 25, 2014, at 9:38 AM, Pat Ferrel  wrote:
>
> Turning itemsimilarity into rowsimilarity is fairly simple but requires
> altering CooccurrenceAnalysis.cooccurrence to swap the transposes and
> calculate the LLR values for rows rather than columns. The input will be
> something like a DRM. Row similarity does something like AA’ with LLR
> weighting and uses downsampling similar to the Hadoop code’s, as I take it.
> Let me know if I’m on the wrong track here.
>
> With the new application ID preserving code the following input could be
> directly processed (it’s my test case)
>
> doc1\tNow is the time for all good people to come to aid of their party
> doc2\tNow is the time for all good people to come to aid of their country
> doc3\tNow is the time for all good people to come to aid of their hood
> doc4\tNow is the time for all good people to come to aid of their friends
> doc5\tNow is the time for all good people to come to aid of their looser
> brother
> doc6\tThe quick brown fox jumped over the lazy dog
> doc7\tThe quick brown fox jumped over the lazy boy
> doc8\tThe quick brown fox jumped over the lazy cat
> doc9\tThe quick brown fox jumped over the lazy wolverine
> doc10\tThe quick brown fox jumped over the lazy cantelope
>
> The output will be something like the following, with or without LLR
> strengths.
> doc1\tdoc2 doc3 doc4 doc5
> …
> doc6\tdoc7 doc8 doc9 doc10
> ...
>
> It would be pretty easy to tack on a text analyzer from Lucene to turn
> this into a full-function doc similarity job since LLR doesn’t need TF-IDF.
>
> One question is: is there any reason to do the cross-similarity in RSJ, so
> [AB’]? I can’t picture what this would mean so am assuming the answer is no.
>
>
>


Re: Mahout-1539-computation of gaussian kernel between 2 arrays of shapes

2014-09-18 Thread Ted Dunning
There are a number of non-traditional linear algebra operations like this
that are important to implement.

Can you describe what you intend to do so that we can discuss the shape of
the API and computation?



On Wed, Sep 17, 2014 at 9:28 PM, Saikat Kanjilal 
wrote:

> Dmitry et al, As part of the above JIRA I need to calculate the gaussian
> kernel between 2 shapes. I looked through mahout-math-scala and didn't see
> anything to do this; any objections to me adding some code under
> scalabindings to do this?
> Thanks in advance.