Re: Detecting rank-deficiency, or worse, via QR decomposition
PS I think the issue is really more like this, after some more testing. When lambda (the regularization parameter) is high, the X and Y in the factorization A = X*Y' are forced to have a small (Frobenius) norm. They underfit A, potentially a lot, if lambda is high; the values of A are always small and can't easily reach 1 where the original input was 1.

Later you get a new click, a new row A_u = [ 0 0 ... 0 1 0 ... 0 0 ], and you're roughly solving A_u = X_u * Y' for X_u. But the only way to actually get a row like that, with even one 1, given how small Y is, is to have a very large X_u. The simple fold-in has no concept of the loss function and (by design) overstates the importance of the new data point, by unilaterally trying to make the new element in A a 1. In the presence of way-too-strong regularization, this overstatement becomes a huge overstatement and it falls down.

Anyway -- long story short, a simple check on the infinity norm of X' * X or Y' * Y seems to suffice to decide that lambda is too big and complain about it rather than proceed.

On Sun, Apr 7, 2013 at 10:00 AM, Sean Owen sro...@gmail.com wrote:
> All that said I don't think inverting is the issue here. Using the SVD to invert didn't change things, and neither did actually solving the Ax=b problem instead of inverting A by using Householder reflections.
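A minimal NumPy sketch of the blow-up described above (not Mahout code; the matrix sizes, the 0.001 scale standing in for "heavily regularized", and the clicked item index are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, k = 50, 5

# Pretend heavy regularization shrank the item-factor matrix Y to a tiny norm.
Y = 0.001 * rng.standard_normal((n_items, k))

# New user row: a single click on one item.
a_u = np.zeros(n_items)
a_u[7] = 1.0

# Simple least-squares fold-in: solve a_u ~= Y @ x_u for the user factors x_u.
# This has no notion of the original regularized loss.
x_u, *_ = np.linalg.lstsq(Y, a_u, rcond=None)

# Because Y is tiny, x_u must be enormous to reproduce that single 1.
print(np.linalg.norm(x_u))

# The proposed sanity check: if the infinity norm of Y'Y is tiny,
# lambda was almost certainly too big -- complain rather than proceed.
gram_inf_norm = np.linalg.norm(Y.T @ Y, ord=np.inf)
print(gram_inf_norm)
```

Scaling Y down by another factor of ten inflates the folded-in x_u by roughly the same factor, which is why the Gram-matrix norm works as a cheap early-warning signal.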
Re: Detecting rank-deficiency, or worse, via QR decomposition
Okay, it sheds some light on the problem. Thanks for sharing.

On Mon, Apr 8, 2013 at 4:33 AM, Sean Owen sro...@gmail.com wrote:
> PS I think the issue is really more like this, after some more testing. [...]
Re: Integrating Mahout with existing nlp libraries
This sounds like the best suggestion so far.

On Apr 3, 2013, at 8:45 AM, Julien Nioche wrote:
This is typically what Behemoth can be used for: https://github.com/DigitalPebble/behemoth. It has a Mahout module to generate vectors in the same format as SparseVectorsFromSequenceFiles. Assuming that the document similarity job itself can run on the same input as the clustering, you'd be able to use that in combination with the other Behemoth modules, e.g. import the documents, parse with Tika, tokenize, do some NLP with GATE or UIMA, find the similarities with Mahout, send to SOLR, etc. Julien

On 3 April 2013 16:28, Sebastian Schelter ssc.o...@googlemail.com wrote:
Thinking out loud here: it would be great to have a DocumentSimilarityJob that is supplied a collection of documents and then applies the necessary preprocessing (tokenization, vectorization, etc.) and computes document similarities. Could be a nice starter task to add something like this.

On 03.04.2013 17:09, Suneel Marthi wrote:
Akshay, if you are trying to determine document similarity using MapReduce, Mahout's RowSimilarity may be useful here. Have a look at the following thread: http://markmail.org/message/ddkd3qbuub3ak6gl#query:+page:1+mid:x5or2x4rsv2kl4wv+state:results I had tried this on a corpus of 2 million web sites and had good results. Let us know if this works for you.

From: akshay bhatt akshay22bh...@gmail.com
To: user@mahout.apache.org
Sent: Wednesday, April 3, 2013 5:36 AM
Subject: Integrating Mahout with existing nlp libraries

I tried searching for it here and there, but could not find any good solution, so thought of asking NLP experts. I am developing a text-similarity application for which I need to match thousands and thousands of documents (of around 1000 words each) with each other.
For the NLP part, my best bet is NLTK (seeing its capabilities and the algorithm-friendliness of Python). But now that part-of-speech tagging by itself is taking so much time, I believe NLTK may not be the best fit. Java or C won't hurt me, hence any solution will work for me. Please note, I have already started migrating from MySQL to HBase in order to work with more freedom on such a large amount of data. But the question still remains: how to run the algorithms? Mahout may be a choice, but that too is for machine learning, not dedicated to NLP (maybe good for speech recognition). What other options are available? In gist, I need high-performance NLP (a step down from high-performance machine learning). (I am inclined a bit towards Mahout, seeing future usage.) (Already asked at http://stackoverflow.com/questions/15782898/how-can-i-imporve-performance-of-nltk-alternatives )
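The pipeline the thread converges on (tokenize, vectorize, then compute row-wise similarities) can be sketched in-memory with the standard library; this is only an illustration of the idea behind SparseVectorsFromSequenceFiles plus RowSimilarityJob, not the Hadoop jobs themselves, and the toy documents are invented:

```python
import math
from collections import Counter

docs = [
    "mahout scales machine learning to large data",
    "nltk is a python toolkit for natural language processing",
    "mahout offers distributed machine learning on hadoop",
]

# Tokenize and build term-frequency vectors: the "vectorization" step
# that SparseVectorsFromSequenceFiles performs at scale.
tf = [Counter(d.split()) for d in docs]

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# Pairwise similarities, as a RowSimilarityJob would compute row by row.
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(i, j, round(cosine(tf[i], tf[j]), 3))
```

At millions of documents the all-pairs loop is exactly what becomes infeasible in memory, which is why the thread points at the MapReduce implementation.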
Re: I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure
I don't see the problem here. We only want to compare two items, so Jaccard and Tanimoto are identical. Could you file a JIRA and suggest a javadoc patch? Why did this take you to an ancient journal instead of Wikipedia?

On Apr 7, 2013, at 6:54 AM, James Endicott wrote:
As far as I can tell, the difference between the two is that the Jaccard similarity can only be used to compare two items, using the formula:

items appearing in both documents / (items just appearing in one + items just appearing in the other + items appearing in both)

But the Tanimoto similarity measure allows for comparing between any number of items by generalizing the formula to:

items appearing in all documents / (items just appearing in one + items just appearing in another + ... + items appearing in some but not all + ... + items appearing in all)

I think the class could be generalized to implement the full Tanimoto similarity without too much difficulty (though I don't think it's a high priority), but at the moment it does not do so. While I realize this is probably a trivial matter, I hope the docs get updated at some point so another grad student doesn't have to muddle through a botany article in a Swiss journal from 1901 again.
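The two-item case Ted describes can be checked directly: for exactly two sets, the Jaccard index and the binary-vector Tanimoto coefficient are the same number. A small sketch (the preference sets are invented for illustration):

```python
def jaccard(a, b):
    # Jaccard index: |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

def tanimoto(a, b):
    # Tanimoto coefficient, binary form: |A ∩ B| / (|A| + |B| - |A ∩ B|)
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Users who expressed a preference for each of two items.
users_who_liked_x = {1, 2, 3, 5}
users_who_liked_y = {2, 3, 8}

print(jaccard(users_who_liked_x, users_who_liked_y))   # 0.4
print(tanimoto(users_who_liked_x, users_who_liked_y))  # 0.4
```

Since |A ∪ B| = |A| + |B| - |A ∩ B| by inclusion-exclusion, the two denominators are identical for any pair of sets; the measures only diverge in the generalized many-set form James describes.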
Re: I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure
I didn't want to file a suggestion for a javadoc patch without hearing from someone who knows a bit more about the math history behind it, because I didn't want to suggest something that may be in error. When I checked the Wikipedia article on it, the article noted that there was confusion and inconsistency between papers as to what Tanimoto actually was and how it compared to Jaccard. So, I went to the primary source for Jaccard and am getting the primary source for Tanimoto when/if interlibrary loan comes through.

On Mon, Apr 8, 2013 at 12:04 PM, Ted Dunning ted.dunn...@gmail.com wrote:
> I don't see the problem here. We only want to compare two items so Jaccard and Tanimoto are identical. Could you file a JIRA and suggest a javadoc patch? [...]
In-memory kmeans clustering
Hi, it seems that in-memory kmeans clustering was removed in Mahout 0.7. Does this mean that it is no longer possible to do in-memory kmeans clustering with Mahout? Or is Hadoop-based kmeans clustering the only option? Thanks, Ahmet
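For reference, the in-memory variant being asked about is just Lloyd's algorithm over data that fits in RAM. A minimal NumPy sketch of it (purely illustrative, not the Mahout API; the deterministic seeding and toy blobs are made up for the example):

```python
import numpy as np

def kmeans(points, k, iters=10):
    """Plain in-memory Lloyd's k-means over an (n, d) array of points."""
    # Naive deterministic seeding from the data itself; real implementations
    # use random or k-means++ style seeding.
    idx = np.linspace(0, len(points) - 1, k).astype(int)
    centers = points[idx].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Two well-separated blobs: the two centers should land on the blob means.
pts = np.vstack([np.zeros((10, 2)), 10.0 * np.ones((10, 2))])
centers, labels = kmeans(pts, 2)
```

The Hadoop-based job implements the same assign/update iteration, just with each pass expressed as a MapReduce over the point set.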
Re: cross recommender
On Sat, Apr 6, 2013 at 3:26 PM, Pat Ferrel p...@occamsmachete.com wrote:
I guess I don't understand this issue. In my case both the item ids and user ids of the separate DistributedRowMatrix will match, and I know the size of the entire space from a previous step where I create id maps. I suppose you are saying that the m/r code would be super simple if a row of B' and a column of A could be processed together, which I understand as an optimal implementation. Well, rows of B and A should match, so columns of B' and rows of A rather than the reverse. So calculating [B'A] seems like TransposeJob and MultiplyJob, and does seem to work. You lose the ability to substitute different RowSimilarityJob measures. I assume this creates something like the co-occurrence similarity measure. But oh well, maybe I'll look at that later.

Yes. Exactly.

I also see why you say the two matrices A and B don't have to have the same size, since [B'A]H_v = [B'A]A', so the dimensions will work out as long as the users dimension is the same throughout.

Yes. All we need is a user id match.
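The dimension argument can be made concrete with toy matrices (invented here for illustration; in Mahout the product would come from TransposeJob plus MultiplyJob over DistributedRowMatrix inputs):

```python
import numpy as np

# The same 3 users index the rows of both matrices;
# the item spaces (columns) differ and need not be the same size.
A = np.array([[1, 0],        # primary actions, e.g. purchases: 3 users x 2 items
              [0, 1],
              [1, 1]])
B = np.array([[1, 1, 0],     # secondary actions, e.g. views: 3 users x 3 items
              [0, 1, 1],
              [1, 0, 1]])

# Cross matrix [B'A] is (3 B-items) x (2 A-items); entry (i, j) counts
# users who did B-action i and A-action j. The user dimension cancels.
cross = B.T @ A
print(cross.shape)  # (3, 2)
```

This is why only the user dimension has to agree: B' is items_B x users and A is users x items_A, so the product is defined regardless of how the two item spaces compare.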
Re: I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure
To my mind, you as the reader have a major voice here. So if you were confused/not happy with the doc, then there is a problem. You will know best how to fix that when you get done. So let us know how!

On Mon, Apr 8, 2013 at 2:16 PM, James Endicott endicott.ja...@gmail.com wrote:
> I didn't want to file a suggestion for a javadoc patch without hearing from someone who knows a bit more about the math history behind it because I didn't want to suggest something that may be in error. [...]