Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-08 Thread Sean Owen
PS I think the issue is really more like this, after some more testing.

When lambda (the regularization parameter) is high, the X and Y in the
factorization A = X*Y' are forced to have a small (Frobenius) norm.
They underfit A, potentially a lot, if lambda is high; the values of the
reconstruction X*Y' are always small and can't easily reach 1 where the
original input was 1.

Later you get a new click, a new row A_u = [ 0 0 ... 0 1 0 ... 0 0 ],
and you're roughly solving A_u = X_u * Y' , for X_u. But the only way
to actually get a row like that, with even one 1, given how small Y
is, is to have a very large X_u.

The simple fold-in doesn't have a concept of the loss function and (by
design) over-states the importance of the new data point, by
unilaterally trying to make the new element in A a 1. In the
presence of way-too-strong regularization, this over-statement becomes
a huge over-statement and it falls down.

Anyway -- long story short, a simple check on the inf norm of X' * X
or Y' * Y seems to suffice to decide that lambda is too big and go
complain about it rather than proceed.
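
A minimal plain-Java sketch of that kind of sanity check (not the actual
code referred to above): form the Gram matrix Y'Y for an n x k factor
matrix Y and refuse to proceed if its infinity norm (maximum absolute row
sum) is tiny, on the reasoning that an over-large lambda has crushed the
factors. The matrix values and the threshold below are made up for
illustration.

public class FactorNormCheck {

  // G = Y' * Y, a k x k Gram matrix for an n x k factor matrix Y.
  static double[][] gram(double[][] y) {
    int n = y.length;
    int k = y[0].length;
    double[][] g = new double[k][k];
    for (int row = 0; row < n; row++) {
      for (int i = 0; i < k; i++) {
        for (int j = 0; j < k; j++) {
          g[i][j] += y[row][i] * y[row][j];
        }
      }
    }
    return g;
  }

  // Infinity norm: maximum absolute row sum.
  static double infNorm(double[][] m) {
    double max = 0.0;
    for (double[] row : m) {
      double sum = 0.0;
      for (double v : row) {
        sum += Math.abs(v);
      }
      max = Math.max(max, sum);
    }
    return max;
  }

  public static void main(String[] args) {
    // Tiny factor values, as if lambda were far too large.
    double[][] y = { {0.001, 0.002}, {0.0015, 0.001}, {0.002, 0.0005} };
    double norm = infNorm(gram(y));
    if (norm < 1.0e-3) {  // illustrative threshold only
      System.out.println("||Y'Y||_inf = " + norm
          + " is tiny; lambda looks far too large -- complain, don't proceed");
    }
  }
}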

On Sun, Apr 7, 2013 at 10:00 AM, Sean Owen sro...@gmail.com wrote:
 All that said I don't think inverting is the issue here. Using the SVD
 to invert didn't change things, and neither did actually solving the
 Ax=b problem instead of inverting A by using Householder reflections.


Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-08 Thread Koobas
Okay, it sheds some light on the problem.
Thanks for sharing.


On Mon, Apr 8, 2013 at 4:33 AM, Sean Owen sro...@gmail.com wrote:

 PS I think the issue is really more like this, after some more testing.

 When lambda (the regularization parameter) is high, the X and Y in the
 factorization A = X*Y' are forced to have a small (Frobenius) norm.
 They underfit A, potentially a lot, if lambda is high; the values of the
 reconstruction X*Y' are always small and can't easily reach 1 where the
 original input was 1.

 Later you get a new click, a new row A_u = [ 0 0 ... 0 1 0 ... 0 0 ],
 and you're roughly solving A_u = X_u * Y' , for X_u. But the only way
 to actually get a row like that, with even one 1, given how small Y
 is, is to have a very large X_u.

 The simple fold-in doesn't have a concept of the loss function and (by
 design) over-states the importance of the new data point, by
 unilaterally trying to make the new element in A a 1. In the
 presence of way-too-strong regularization, this over-statement becomes
 a huge over-statement and it falls down.

 Anyway -- long story short, a simple check on the inf norm of X' * X
 or Y' * Y seems to suffice to decide that lambda is too big and go
 complain about it rather than proceed.

 On Sun, Apr 7, 2013 at 10:00 AM, Sean Owen sro...@gmail.com wrote:
  All that said I don't think inverting is the issue here. Using the SVD
  to invert didn't change things, and neither did actually solving the
  Ax=b problem instead of inverting A by using Householder reflections.



Re: Integrating Mahout with existing nlp libraries

2013-04-08 Thread Ted Dunning
This sounds like the best suggestion so far.

On Apr 3, 2013, at 8:45 AM, Julien Nioche wrote:

 This is typically what Behemoth can be used for:
 https://github.com/DigitalPebble/behemoth. It has a Mahout module to
 generate vectors in the same format as SparseVectorsFromSequenceFiles.
 Assuming that the document similarity job itself can run on the same input
 as the clustering, you'd be able to use it in combination with the other
 Behemoth modules, e.g. import the documents, parse with Tika, tokenize, do
 some NLP with GATE or UIMA, find the similarities with Mahout, send to
 SOLR, etc.
 
 Julien
 
 
 
 On 3 April 2013 16:28, Sebastian Schelter ssc.o...@googlemail.com wrote:
 
 Thinking out loud here: it would be great to have a DocumentSimilarityJob
 that is supplied a collection of documents and then applies necessary
 preprocessing (tokenization, vectorization, etc) and computes document
 similarities.
 
 Could be a nice starter task to add something like this.
 
 On 03.04.2013 17:09, Suneel Marthi wrote:
 Akshay,
 
 If you are trying to determine document similarity using MapReduce,
 Mahout's RowSimilarityJob may be useful here.
 
 Have a look at the following thread:-
 
 
 http://markmail.org/message/ddkd3qbuub3ak6gl#query:+page:1+mid:x5or2x4rsv2kl4wv+state:results
 
 
 I had tried this on a corpus of 2 million web sites and had good results.
 
 Let us know if this works for you.
 
 
 
 
 From: akshay bhatt akshay22bh...@gmail.com
 To: user@mahout.apache.org
 Sent: Wednesday, April 3, 2013 5:36 AM
 Subject: Integrating Mahout with existing nlp libraries
 
 I tried searching for this here and there but could not find a good
 solution, so I thought of asking the NLP experts. I am developing a
 text-similarity application for which I need to match thousands and
 thousands of documents (of around 1000 words each) against each other.
 For the NLP part, my best bet is NLTK (given its capabilities and the
 algorithm-friendliness of Python). But now that part-of-speech tagging by
 itself is taking so much time, I believe NLTK may not be the best fit.
 Java or C won't hurt me, so any solution will work. Please note that I
 have already started migrating from MySQL to HBase in order to work with
 more freedom on such a large amount of data. But the question still
 remains: how do I run the algorithms? Mahout may be a choice, but it too
 is for machine learning, not dedicated to NLP (it may be good for speech
 recognition). What other options are available? In short, I need
 high-performance NLP (a step down from high-performance machine
 learning). (I am inclined a bit towards Mahout, with future usage in
 mind.)
 
 (Already asked at
 http://stackoverflow.com/questions/15782898/how-can-i-imporve-performance-of-nltk-alternatives
 )
 
 
 
 
 
 -- 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
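
A minimal sketch of the RowSimilarityJob route Suneel describes above. It
assumes the 0.7-era option names and that the input is already a
SequenceFile of IntWritable/VectorWritable document vectors (e.g. produced
by seq2sparse followed by rowid); the paths and the column count are
placeholders.

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

public class DocumentSimilarityDriver {
  public static void main(String[] args) throws Exception {
    // Run the distributed row-similarity computation over TF-IDF document vectors.
    ToolRunner.run(new RowSimilarityJob(), new String[] {
        "--input", "/docs/matrix",                    // placeholder path
        "--output", "/docs/similarities",             // placeholder path
        "--numberOfColumns", "50000",                 // vocabulary size, placeholder
        "--similarityClassname", "SIMILARITY_COSINE"
    });
  }
}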



Re: I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure

2013-04-08 Thread Ted Dunning
I don't see the problem here.  We only want to compare two items so Jaccard and 
Tanimoto are identical.

Could you file a JIRA and suggest a javadoc patch?

Why did this take you to an ancient journal instead of Wikipedia?


On Apr 7, 2013, at 6:54 AM, James Endicott wrote:

 As far as I can tell, the difference between the two is that the Jaccard
 similarity can only be used to compare two items using the formula:
 items appearing in both documents/(items just appearing in one + items just
 appearing in the other + items appearing in both)
 But the Tanimoto similarity measure allows for comparing between any number
 of items by generalizing the formula to:
 items appearing in all documents/(items just appearing in one + items just
 appearing in another + ... + items appearing in some but not all + ... +
 items appearing in all)
 
 I think the class could be generalized to implement the full Tanimoto
 similarity without too much difficulty (though I don't think it's a high
 priority) but at the moment it does not do so. While I realize this is
 probably a trivial matter, I hope the docs get updated at some point so
 another grad student doesn't have to muddle through a botany article in a
 Swiss journal from 1901 again.
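
For the pairwise, binary case the two really do collapse into the same
number, which is the point above: with 0/1 vectors the dot product a.b is
the intersection size and |a|^2 + |b|^2 - a.b is the union size, so
Tanimoto's a.b / (|a|^2 + |b|^2 - a.b) is exactly Jaccard's
|intersection| / |union|. A small illustrative sketch (not Mahout's
TanimotoCoefficientSimilarity itself; the class and data below are made up):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TanimotoVsJaccard {

  static double jaccard(Set<String> a, Set<String> b) {
    Set<String> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    Set<String> union = new HashSet<>(a);
    union.addAll(b);
    return (double) intersection.size() / union.size();
  }

  // Tanimoto computed on the 0/1 indicator vectors of the two sets.
  static double tanimotoBinary(Set<String> a, Set<String> b) {
    Set<String> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    double dot = intersection.size();            // a . b
    return dot / (a.size() + b.size() - dot);    // |a|^2 = |A|, |b|^2 = |B| for 0/1 vectors
  }

  public static void main(String[] args) {
    Set<String> a = new HashSet<>(Arrays.asList("x", "y", "z"));
    Set<String> b = new HashSet<>(Arrays.asList("y", "z", "w"));
    System.out.println(jaccard(a, b));           // 0.5
    System.out.println(tanimotoBinary(a, b));    // 0.5 as well
  }
}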



Re: I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure

2013-04-08 Thread James Endicott
I didn't want to file a suggestion for a javadoc patch without hearing from
someone who knows a bit more about the math history behind it because I
didn't want to suggest something that may be in error. When I checked the
Wikipedia article on it, the article noted that there was confusion and
inconsistency between papers as to what Tanimoto actually was and how it
compared to Jaccard. So, I went to the primary source for Jaccard and am
getting the primary source for Tanimoto when/if interlibrary loan comes
through.


On Mon, Apr 8, 2013 at 12:04 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 I don't see the problem here.  We only want to compare two items so
 Jaccard and Tanimoto are identical.

 Could you file a JIRA and suggest a javadoc patch?

 Why did this take you to an ancient journal instead of Wikipedia?


 On Apr 7, 2013, at 6:54 AM, James Endicott wrote:

  As far as I can tell, the difference between the two is that the Jaccard
  similarity can only be used to compare two items using the formula:
  items appearing in both documents/(items just appearing in one + items
 just
  appearing in the other + items appearing in both)
  But the Tanimoto similarity measure allows for comparing between any
 number
  of items by generalizing the formula to:
  items appearing in all documents/(items just appearing in one + items
 just
  appearing in another + ... + items appearing in some but not all + ... +
  items appearing in all)
 
  I think the class could be generalized to implement the full Tanimoto
  similarity without too much difficulty (though I don't think it's a high
  priority) but at the moment it does not do so. While I realize this is
  probably a trivial matter, I hope the docs get updated at some point so
  another grad student doesn't have to muddle through a botany article in a
  Swiss journal from 1901 again.




In-memory kmeans clustering

2013-04-08 Thread Ahmet Yılmaz
Hi,

It seems that in-memory k-means clustering was removed in Mahout 0.7.

Does this mean that it is no longer possible to do in-memory k-means
clustering with Mahout?
Or is Hadoop-based k-means clustering the only option?


Thanks
Ahmet


Re: cross recommender

2013-04-08 Thread Ted Dunning
On Sat, Apr 6, 2013 at 3:26 PM, Pat Ferrel p...@occamsmachete.com wrote:

 I guess I don't understand this issue.

 In my case both the item ids and user ids of the separate DistributedRow
 Matrix will match and I know the size for the entire space from a previous
 step where I create id maps. I suppose you are saying the the m/r code
 would be super simple if a row of B' and a  column of A could be processed
 together, which I understand as an optimal implementation.


Well rows of B and A should match so columns of B' and rows of A rather
than the reverse.


 So calculating [B'A] seems like TransposeJob and MultiplyJob and does seem
 to work. You lose the ability to substitute different RowSimilarityJob
 measures. I assume this creates something like the co-occurrence similarity
 measure. But oh, well. Maybe I'll look at that later.


Yes.  Exactly.


 I also see why you say the two matrices A and B don't have to have the
 same size since [B'A]H_v = [B'A]A' so the dimensions will work out as long
 as the users dimension is the same throughout.


Yes.  All we need is user id match.
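
A tiny in-memory sketch (plain Java, not the TransposeJob / matrix-multiply
pipeline discussed above; class and data are made up) of the algebra in
this thread: A and B are user-by-item matrices for two different actions
that share user ids, B'A is the cross-occurrence matrix, and multiplying it
by one user's row of A (the history vector H_v above) gives that user
scores over B's item space.

public class CrossOccurrenceSketch {

  // B' * A: (itemsB x users) times (users x itemsA) = itemsB x itemsA.
  static double[][] crossOccurrence(double[][] b, double[][] a) {
    int users = a.length;
    int itemsA = a[0].length;
    int itemsB = b[0].length;
    double[][] result = new double[itemsB][itemsA];
    for (int u = 0; u < users; u++) {            // shared user dimension
      for (int i = 0; i < itemsB; i++) {
        for (int j = 0; j < itemsA; j++) {
          result[i][j] += b[u][i] * a[u][j];
        }
      }
    }
    return result;
  }

  // Scores over B's items for one user: [B'A] times that user's row of A.
  static double[] score(double[][] btA, double[] userHistoryInA) {
    double[] scores = new double[btA.length];
    for (int i = 0; i < btA.length; i++) {
      for (int j = 0; j < userHistoryInA.length; j++) {
        scores[i] += btA[i][j] * userHistoryInA[j];
      }
    }
    return scores;
  }

  public static void main(String[] args) {
    double[][] a = { {1, 0, 1}, {0, 1, 1} };     // 2 users x 3 items of type A
    double[][] b = { {1, 1}, {0, 1} };           // same 2 users x 2 items of type B
    double[][] btA = crossOccurrence(b, a);
    double[] scores = score(btA, a[0]);          // scores over B's items for user 0
    System.out.println(java.util.Arrays.toString(scores));
  }
}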


Re: I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure

2013-04-08 Thread Ted Dunning
To my mind, you as the reader have a major voice here.

So if you were confused/not happy with the doc, then there is a problem.
 You will know best how to fix that when you get done.

So let us know how!



On Mon, Apr 8, 2013 at 2:16 PM, James Endicott endicott.ja...@gmail.com wrote:

 I didn't want to file a suggestion for a javadoc patch without hearing from
 someone who knows a bit more about the math history behind it because I
 didn't want to suggest something that may be in error. When I checked the
 Wikipedia article on it, the article noted that there was confusion and
 inconsistency between papers as to what Tanimoto actually was and how it
 compared to Jaccard. So, I went to the primary source for Jaccard and am
 getting the primary source for Tanimoto when/if interlibrary loan comes
 through.


 On Mon, Apr 8, 2013 at 12:04 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  I don't see the problem here.  We only want to compare two items so
  Jaccard and Tanimoto are identical.
 
  Could you file a JIRA and suggest a javadoc patch?
 
  Why did this take you to an ancient journal instead of Wikipedia?
 
 
  On Apr 7, 2013, at 6:54 AM, James Endicott wrote:
 
   As far as I can tell, the difference between the two is that the
 Jaccard
   similarity can only be used to compare two items using the formula:
   items appearing in both documents/(items just appearing in one + items
  just
   appearing in the other + items appearing in both)
   But the Tanimoto similarity measure allows for comparing between any
  number
   of items by generalizing the formula to:
   items appearing in all documents/(items just appearing in one + items
  just
   appearing in another + ... + items appearing in some but not all + ...
 +
   items appearing in all)
  
   I think the class could be generalized to implement the full Tanimoto
   similarity without too much difficulty (though I don't think it's a
 high
   priority) but at the moment it does not do so. While I realize this is
   probably a trivial matter, I hope the docs get updated at some point so
   another grad student doesn't have to muddle through a botany article
 in a
   Swiss journal from 1901 again.