Re: Detecting rank-deficiency, or worse, via QR decomposition
PS I think the issue is really more like this, after some more testing. When lambda (the regularization parameter) is high, the X and Y in the factorization A = X*Y' are forced to have a small (Frobenius) norm. They underfit A, potentially a lot, if lambda is high; the values of A are always small and can't easily reach 1 where the original input was 1.

Later you get a new click, a new row A_u = [ 0 0 ... 0 1 0 ... 0 0 ], and you're roughly solving A_u = X_u * Y' for X_u. But the only way to actually get a row like that, with even one 1, given how small Y is, is to have a very large X_u. The simple fold-in has no concept of the loss function and (by design) overstates the importance of the new data point, by unilaterally trying to make the new element in A a 1. In the presence of way-too-strong regularization, this overstatement becomes a huge overstatement and it falls down.

Anyway -- long story short, a simple check on the infinity norm of X' * X or Y' * Y seems to suffice to decide that lambda is too big and complain about it rather than proceed.

On Sun, Apr 7, 2013 at 10:00 AM, Sean Owen sro...@gmail.com wrote:
> All that said I don't think inverting is the issue here. Using the SVD to invert didn't change things, and neither did actually solving the Ax=b problem instead of inverting A by using Householder reflections.
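A minimal NumPy sketch of the blow-up described above (not Mahout code; the matrix sizes, the 0.001 scale standing in for "heavily regularized", and the clicked item index are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, k = 50, 5

# Pretend heavy regularization shrank the item-factor matrix Y to a tiny norm.
Y = 0.001 * rng.standard_normal((n_items, k))

# New user row: a single click on one item.
a_u = np.zeros(n_items)
a_u[7] = 1.0

# Simple least-squares fold-in: solve a_u ~= Y @ x_u for the user factors x_u.
# This has no notion of the original regularized loss.
x_u, *_ = np.linalg.lstsq(Y, a_u, rcond=None)

# Because Y is tiny, x_u must be enormous to reproduce that single 1.
print(np.linalg.norm(x_u))

# The proposed sanity check: if the infinity norm of Y'Y is tiny,
# lambda was almost certainly too big -- complain rather than proceed.
gram_inf_norm = np.linalg.norm(Y.T @ Y, ord=np.inf)
print(gram_inf_norm)
```

Scaling Y down by another factor of ten inflates the folded-in x_u by roughly the same factor, which is why the Gram-matrix norm works as a cheap early-warning signal.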
Re: Detecting rank-deficiency, or worse, via QR decomposition
Okay, it sheds some light on the problem. Thanks for sharing.

On Mon, Apr 8, 2013 at 4:33 AM, Sean Owen sro...@gmail.com wrote:
> PS I think the issue is really more like this, after some more testing. [...]
Re: Integrating Mahout with existing nlp libraries
This sounds like the best suggestion so far.

On Apr 3, 2013, at 8:45 AM, Julien Nioche wrote:
This is typically what Behemoth can be used for: https://github.com/DigitalPebble/behemoth. It has a Mahout module to generate vectors in the same format as SparseVectorsFromSequenceFiles. Assuming that the document similarity job itself can run on the same input as the clustering, you'd be able to use that in combination with the other Behemoth modules, e.g. import the documents, parse with Tika, tokenize, do some NLP with GATE or UIMA, find the similarities with Mahout, send to SOLR, etc. Julien

On 3 April 2013 16:28, Sebastian Schelter ssc.o...@googlemail.com wrote:
Thinking out loud here: it would be great to have a DocumentSimilarityJob that is supplied a collection of documents and then applies the necessary preprocessing (tokenization, vectorization, etc.) and computes document similarities. Could be a nice starter task to add something like this.

On 03.04.2013 17:09, Suneel Marthi wrote:
Akshay, if you are trying to determine document similarity using MapReduce, Mahout's RowSimilarity may be useful here. Have a look at the following thread: http://markmail.org/message/ddkd3qbuub3ak6gl#query:+page:1+mid:x5or2x4rsv2kl4wv+state:results I had tried this on a corpus of 2 million web sites and had good results. Let us know if this works for you.

From: akshay bhatt akshay22bh...@gmail.com
To: user@mahout.apache.org
Sent: Wednesday, April 3, 2013 5:36 AM
Subject: Integrating Mahout with existing nlp libraries

I tried searching for it here and there, but could not find any good solution, so thought of asking NLP experts. I am developing a text-similarity application for which I need to match thousands and thousands of documents (of around 1000 words each) with each other.
For the NLP part, my best bet is NLTK (seeing its capabilities and the algorithm-friendliness of Python). But now that part-of-speech tagging by itself is taking so much time, I believe NLTK may not be the best fit. Java or C won't hurt me, hence any solution will work for me. Please note, I have already started migrating from MySQL to HBase in order to work with more freedom on such a large amount of data. But the question still remains: how to run the algorithms? Mahout may be a choice, but that too is for machine learning, not dedicated to NLP (maybe good for speech recognition). What other options are available? In gist, I need high-performance NLP (a step down from high-performance machine learning). (I am inclined a bit towards Mahout, seeing future usage.) (Already asked at http://stackoverflow.com/questions/15782898/how-can-i-imporve-performance-of-nltk-alternatives )
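The pipeline the thread converges on (tokenize, vectorize, then compute row-wise similarities) can be sketched in-memory with the standard library; this is only an illustration of the idea behind SparseVectorsFromSequenceFiles plus RowSimilarityJob, not the Hadoop jobs themselves, and the toy documents are invented:

```python
import math
from collections import Counter

docs = [
    "mahout scales machine learning to large data",
    "nltk is a python toolkit for natural language processing",
    "mahout offers distributed machine learning on hadoop",
]

# Tokenize and build term-frequency vectors: the "vectorization" step
# that SparseVectorsFromSequenceFiles performs at scale.
tf = [Counter(d.split()) for d in docs]

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# Pairwise similarities, as a RowSimilarityJob would compute row by row.
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(i, j, round(cosine(tf[i], tf[j]), 3))
```

At millions of documents the all-pairs loop is exactly what becomes infeasible in memory, which is why the thread points at the MapReduce implementation.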
Re: I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure
I don't see the problem here. We only want to compare two items, so Jaccard and Tanimoto are identical. Could you file a JIRA and suggest a javadoc patch? Why did this take you to an ancient journal instead of Wikipedia?

On Apr 7, 2013, at 6:54 AM, James Endicott wrote:
As far as I can tell, the difference between the two is that the Jaccard similarity can only be used to compare two items, using the formula:

items appearing in both documents / (items just appearing in one + items just appearing in the other + items appearing in both)

But the Tanimoto similarity measure allows for comparing between any number of items by generalizing the formula to:

items appearing in all documents / (items just appearing in one + items just appearing in another + ... + items appearing in some but not all + ... + items appearing in all)

I think the class could be generalized to implement the full Tanimoto similarity without too much difficulty (though I don't think it's a high priority), but at the moment it does not do so. While I realize this is probably a trivial matter, I hope the docs get updated at some point so another grad student doesn't have to muddle through a botany article in a Swiss journal from 1901 again.
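The two-item case Ted describes can be checked directly: for exactly two sets, the Jaccard index and the binary-vector Tanimoto coefficient are the same number. A small sketch (the preference sets are invented for illustration):

```python
def jaccard(a, b):
    # Jaccard index: |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

def tanimoto(a, b):
    # Tanimoto coefficient, binary form: |A ∩ B| / (|A| + |B| - |A ∩ B|)
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Users who expressed a preference for each of two items.
users_who_liked_x = {1, 2, 3, 5}
users_who_liked_y = {2, 3, 8}

print(jaccard(users_who_liked_x, users_who_liked_y))   # 0.4
print(tanimoto(users_who_liked_x, users_who_liked_y))  # 0.4
```

Since |A ∪ B| = |A| + |B| - |A ∩ B| by inclusion-exclusion, the two denominators are identical for any pair of sets; the measures only diverge in the generalized many-set form James describes.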
Re: I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure
I didn't want to file a suggestion for a javadoc patch without hearing from someone who knows a bit more about the math history behind it, because I didn't want to suggest something that may be in error. When I checked the Wikipedia article on it, the article noted that there was confusion and inconsistency between papers as to what Tanimoto actually was and how it compared to Jaccard. So, I went to the primary source for Jaccard and am getting the primary source for Tanimoto when/if interlibrary loan comes through.

On Mon, Apr 8, 2013 at 12:04 PM, Ted Dunning ted.dunn...@gmail.com wrote:
> I don't see the problem here. We only want to compare two items so Jaccard and Tanimoto are identical. Could you file a JIRA and suggest a javadoc patch? [...]
In-memory kmeans clustering
Hi, it seems that in-memory kmeans clustering was removed in Mahout 0.7. Does this mean that it is no longer possible to do in-memory kmeans clustering with Mahout? Or is Hadoop-based kmeans clustering the only option? Thanks, Ahmet
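For reference, the in-memory variant being asked about is just Lloyd's algorithm over data that fits in RAM. A minimal NumPy sketch of it (purely illustrative, not the Mahout API; the deterministic seeding and toy blobs are made up for the example):

```python
import numpy as np

def kmeans(points, k, iters=10):
    """Plain in-memory Lloyd's k-means over an (n, d) array of points."""
    # Naive deterministic seeding from the data itself; real implementations
    # use random or k-means++ style seeding.
    idx = np.linspace(0, len(points) - 1, k).astype(int)
    centers = points[idx].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Two well-separated blobs: the two centers should land on the blob means.
pts = np.vstack([np.zeros((10, 2)), 10.0 * np.ones((10, 2))])
centers, labels = kmeans(pts, 2)
```

The Hadoop-based job implements the same assign/update iteration, just with each pass expressed as a MapReduce over the point set.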
Re: cross recommender
On Sat, Apr 6, 2013 at 3:26 PM, Pat Ferrel p...@occamsmachete.com wrote:
I guess I don't understand this issue. In my case both the item ids and user ids of the separate DistributedRowMatrix will match, and I know the size of the entire space from a previous step where I create id maps. I suppose you are saying that the m/r code would be super simple if a row of B' and a column of A could be processed together, which I understand as an optimal implementation. Well, rows of B and A should match, so columns of B' and rows of A rather than the reverse. So calculating [B'A] seems like TransposeJob and MultiplyJob, and does seem to work. You lose the ability to substitute different RowSimilarityJob measures. I assume this creates something like the co-occurrence similarity measure. But oh well, maybe I'll look at that later.

Yes. Exactly.

I also see why you say the two matrices A and B don't have to have the same size, since [B'A]H_v = [B'A]A', so the dimensions will work out as long as the users dimension is the same throughout.

Yes. All we need is a user id match.
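The dimension argument can be made concrete with toy matrices (invented here for illustration; in Mahout the product would come from TransposeJob plus MultiplyJob over DistributedRowMatrix inputs):

```python
import numpy as np

# The same 3 users index the rows of both matrices;
# the item spaces (columns) differ and need not be the same size.
A = np.array([[1, 0],        # primary actions, e.g. purchases: 3 users x 2 items
              [0, 1],
              [1, 1]])
B = np.array([[1, 1, 0],     # secondary actions, e.g. views: 3 users x 3 items
              [0, 1, 1],
              [1, 0, 1]])

# Cross matrix [B'A] is (3 B-items) x (2 A-items); entry (i, j) counts
# users who did B-action i and A-action j. The user dimension cancels.
cross = B.T @ A
print(cross.shape)  # (3, 2)
```

This is why only the user dimension has to agree: B' is items_B x users and A is users x items_A, so the product is defined regardless of how the two item spaces compare.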
Re: I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure
To my mind, you as the reader have a major voice here. So if you were confused/not happy with the doc, then there is a problem. You will know best how to fix that when you get done. So let us know how!

On Mon, Apr 8, 2013 at 2:16 PM, James Endicott endicott.ja...@gmail.com wrote:
> I didn't want to file a suggestion for a javadoc patch without hearing from someone who knows a bit more about the math history behind it because I didn't want to suggest something that may be in error. [...]