Re: Is there any more detailed documentation about the SGD logistic regression example?

2011-04-13 Thread Lance Norskog
Woohoo! On Wed, Apr 13, 2011 at 7:15 AM, Eric Charles wrote: > Can't wait for that :) > Just bought PDF. > Tks, > - Eric > > On 13/04/2011 06:57, Ted Dunning wrote: >> >> Yes.  That's the one. >> >> The hard copy should be out before long.  The final passes by the >> production >> editors are hap

Re: Identify "less similar" documents

2011-04-13 Thread Ted Dunning
I think that our estimates of whether this would work differ a bit. In the very high dimensional space that we are working in, proximities can be a bit surprising. For one thing, the bias term provides a mechanism so that a logistic regression can attribute score to an 'other' category. This all

Re: Identify "less similar" documents

2011-04-13 Thread Daniel McEnnis
The official solution is to assign outliers in the training set to other. These are defined as high mean distance to other points. A hack to get this to work would be to perform a knn-like distance comparison with all trained sets and classify as other anything that exceeds the threshold distance
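Daniel's threshold hack can be sketched in a few lines outside Mahout. A minimal illustration in Python (the cosine distance, the toy centroids, and the 0.5 cutoff are all assumptions made for the example, not part of the original suggestion):

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity; assumes non-zero vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def classify_with_other(doc, centroids, threshold):
    """Return the nearest trained category, or 'other' if every
    category centroid is farther away than the threshold."""
    label, dist = min(
        ((lbl, cosine_distance(doc, c)) for lbl, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= threshold else "other"

# Toy term-frequency centroids for two trained categories
centroids = {"sports": [3.0, 0.0, 1.0], "politics": [0.0, 3.0, 1.0]}
print(classify_with_other([2.0, 0.0, 1.0], centroids, 0.5))  # -> sports
print(classify_with_other([0.0, 0.0, 9.0], centroids, 0.5))  # -> other
```

Anything far from every trained centroid falls through to 'other', which is the knn-like comparison described above in miniature.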

Re: Identify "less similar" documents

2011-04-13 Thread Dmitriy Lyubimov
I suspect part of the problem might be creating the training set for the 'other' category, since the documents are distinctly 'different' from anything else, including from each other. I guess the definition for the 'other' category is 'low relevance to everything yet trained' but not 'high relevance to so

Re: 20NewsGroups Error: Illegal Capacity: -40

2011-04-13 Thread Ted Dunning
I filed https://issues.apache.org/jira/browse/MAHOUT-669 for this. Anyone who would like to is welcome to file a patch to fix one or more scripts. On Wed, Apr 13, 2011 at 9:34 AM, Ken Williams wrote: > Ted Dunning gmail.com> writes: > > > > > This may be a bit of a regression. > > Thanks for th

Re: 20NewsGroups Error: Illegal Capacity: -40

2011-04-13 Thread Ted Dunning
Very good idea. On Wed, Apr 13, 2011 at 9:49 AM, Frank Scholten wrote: > This sh error also occurred for the reuters script but has been fixed. > Maybe good to update all scripts to bash? > > On Apr 13, 2011, at 18:34, Ken Williams wrote: > > > Ted Dunning gmail.com> writes: > > > >> > >> This

Re: Identify "less similar" documents

2011-04-13 Thread Ted Dunning
On Wed, Apr 13, 2011 at 8:56 AM, Claudia Grieco wrote: > Thanks for the help :) > > Why not just train with those documents and put a category tag of "other" on > them and run normal categorization? If you can distinguish these documents > by word frequencies, then this should do the trick.

Re: 20NewsGroups Error: Illegal Capacity: -40

2011-04-13 Thread Frank Scholten
This sh error also occurred for the reuters script but has been fixed. Maybe good to update all scripts to bash? On Apr 13, 2011, at 18:34, Ken Williams wrote: > Ted Dunning gmail.com> writes: > >> >> This may be a bit of a regression. > > Thanks for the reply. > > Just out of interest, I al

Re: 20NewsGroups Error: Illegal Capacity: -40

2011-04-13 Thread Ken Williams
Ted Dunning gmail.com> writes: > > This may be a bit of a regression. Thanks for the reply. Just out of interest, I also reckon your 'build-cluster-syntheticcontrol.sh' script should be a bash script (#!/bin/bash) rather than a standard shell (#!/bin/sh) script. $ trunk/examples/bin/build-clu

Re: Identify "less similar" documents

2011-04-13 Thread Daniel McEnnis
Claudia, The term to look up is 'one-class classifier'. It's built around exactly this problem, with a set of pre-made solutions. I don't know if anyone has put it in a general classifier before, but the theory is there. Daniel. On Wed, Apr 13, 2011 at 11:56 AM, Claudia Grieco wrote: > Thanks for the help

RE: Choosing appropriate values for T1 and T2 for canopy clustering

2011-04-13 Thread Jeff Eastman
The T2 value you select will determine the number of clusters you get. The T1 value determines how much influence points near each cluster have on its final centroid calculation. Your choice of distance measure will also have a big impact on the outcome. If T2 is too small you
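The interplay Jeff describes can be sketched with the classic single-pass canopy algorithm. This is a pure-Python illustration, not Mahout's implementation; the 1-D points and absolute-difference distance are made up for the example:

```python
def canopy_cluster(points, t1, t2, distance):
    """Single-pass canopy clustering. T2 (< T1) governs how many canopies
    form: a point within T2 of a center can never seed a new canopy, while
    a point within T1 also joins the current canopy and influences its
    eventual centroid."""
    assert t2 < t1
    canopies = []
    remaining = list(points)
    while remaining:
        center = remaining.pop(0)
        members = [center]
        survivors = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)    # close enough to influence this canopy
            if d >= t2:
                survivors.append(p)  # may still seed or join another canopy
        remaining = survivors
        canopies.append((center, members))
    return canopies

dist = lambda a, b: abs(a - b)
points = [1.0, 1.2, 1.1, 5.0, 5.2, 9.0]
canopies = canopy_cluster(points, t1=2.0, t2=0.5, distance=dist)
print([center for center, _ in canopies])  # -> [1.0, 5.0, 9.0]
```

Shrinking T2 here would let near-duplicate points seed their own canopies (more clusters); growing T1 would let distant points leak into each canopy's membership.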

R: Identify "less similar" documents

2011-04-13 Thread Claudia Grieco
Thanks for the help :) > Why not just train with those documents and put a category tag of "other" on > them and run normal categorization? If you can distinguish these documents > by word frequencies, then this should do the trick. I don't know if this will help. 1) I'm still not sure where to put th

Re: 20NewsGroups Error: Illegal Capacity: -40

2011-04-13 Thread Ted Dunning
This may be a bit of a regression. On Wed, Apr 13, 2011 at 4:48 AM, Ken Williams wrote: > I'm not sure what to try next. Any help would be very welcome. >

Re: Identify "less similar" documents

2011-04-13 Thread Ted Dunning
I think that what you are doing is inventing an "other" category and building a classifier for that category. Why not just train with those documents and put a category tag of "other" on them and run normal categorization? If you can distinguish these documents by word frequencies, then this shou

R: Identify "less similar" documents

2011-04-13 Thread Claudia Grieco
Let's see if this approach makes sense: I have the documents to classify on a Lucene index (Index A) and the training set in another Lucene index (Index B). With a VectorMapper I map Term-Frequency Vectors of Index A to Term-Frequency Vectors of Index B. In this way the transformed vectors have onl
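That remapping step can be sketched independently of Lucene. A minimal Python version (the dict-based term-frequency representation and the toy vocabularies are assumptions for the example):

```python
def remap_tf_vector(tf_from_a, vocab_b):
    """Project a term->frequency mapping from index A onto index B's
    vocabulary: terms unknown to B are dropped, B-only terms become 0."""
    return [tf_from_a.get(term, 0) for term in vocab_b]

vocab_b = ["apple", "banana", "cherry"]              # index B's dictionary
doc_from_a = {"banana": 2, "durian": 7, "apple": 1}  # a document from index A
print(remap_tf_vector(doc_from_a, vocab_b))  # -> [1, 2, 0]; 'durian' dropped
```

After this projection the vectors share index B's coordinate system, which is what lets them be scored against the training-set vectors.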

Re: Is there any more detailed documentation about the SGD logistic regression example?

2011-04-13 Thread Eric Charles
Can't wait for that :) Just bought the PDF. Tks, - Eric On 13/04/2011 06:57, Ted Dunning wrote: Yes. That's the one. The hard copy should be out before long. The final passes by the production editors are happening now. On Tue, Apr 12, 2011 at 9:19 PM, Eric Charles wrote: You were talking about

20NewsGroups Error: Illegal Capacity: -40

2011-04-13 Thread Ken Williams
Hi All, I'm having trouble getting the 20 Newsgroups (https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups, and https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html) example to run. I've downloaded the data and tried to train the Naive Bayes classifier, but I ran the 'traincl

Re: How about a LSH recommender ?

2011-04-13 Thread Benson Margulies
It takes a truly gargantuan amount of data to justify map-reducing LSH. You can get very far with a plain single-machine implementation. On Wed, Apr 13, 2011 at 5:57 AM, Sebastian Schelter wrote: > They are using PLSI which we already tried to implement in > https://issues.apache.org/jira/browse/

Re: How about a LSH recommender ?

2011-04-13 Thread Sebastian Schelter
They are using PLSI, which we already tried to implement in https://issues.apache.org/jira/browse/MAHOUT-106. We didn't get it scalable. As far as I remember the paper, they do a nasty trick when sending data to the reducers in a certain step so that they only have to load a certain porti

Identify "less similar" documents

2011-04-13 Thread Claudia Grieco
Hi guys, I'm using SGD to classify a set of documents but I have a problem: there are some documents that are not related to any of the categories and I want to be able to identify them and exclude them from the classification. My idea is to read the documents of the training set (that are current

Re: How about a LSH recommender ?

2011-04-13 Thread Sean Owen
One of the three approaches that they combine is latent semantic indexing -- that is what I was referring to. On Wed, Apr 13, 2011 at 8:33 AM, Ted Dunning wrote: > Sean, > > Do you mean LSI (latent semantic indexing)? Or LSH (locality-sensitive > hashing)? > > (are you a victim of aggressive err

Re: How about a LSH recommender ?

2011-04-13 Thread Miles Osborne
Not for recommenders, but we have worked on using LSH for spotting breaking news in Twitter. Our experience is that it works well when the points are actually close together, but you do need to tweak it (e.g. work out the number of hash functions to use and the number of tables). There are also tec
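The table/hash-function trade-off Miles mentions is usually reasoned about with the standard banding estimate. A small sketch (r=4 rows per band and b=20 bands are arbitrary example values, not a recommendation):

```python
def candidate_probability(similarity, rows_per_band, n_bands):
    """Standard LSH banding estimate: a pair with similarity s becomes a
    lookup candidate with probability 1 - (1 - s^r)^b, where r is the
    number of rows (hash functions) per band and b the number of bands."""
    return 1.0 - (1.0 - similarity ** rows_per_band) ** n_bands

# Sweeping s shows the S-curve: more bands catch more distant pairs,
# while more rows per band sharpen the cutoff.
for s in (0.3, 0.6, 0.9):
    print(s, candidate_probability(s, rows_per_band=4, n_bands=20))
```

Tuning amounts to sliding the steep part of this curve onto the similarity level you care about, which matches the "works well when the points are actually close together" observation.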

Re: How about a LSH recommender ?

2011-04-13 Thread Jake Mannix
You can do LSH on real-valued vectors - the 1's and 0's are just the +/- signs of projections onto randomly chosen hyperplanes. Ullman's book is a great reference for this, and also goes over how to do all the parameter choosing. On Wed, Apr 13, 2011 at 12:43 AM, ke xie wrote: > Ok, I would try
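A toy version of the sign-of-random-projection scheme Jake describes (the dimensionality, bit count, and vectors are made up for the example; this is a sketch, not Mahout code):

```python
import random

def random_hyperplanes(dim, n_bits, seed=0):
    """One Gaussian random hyperplane (normal vector) per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def srp_signature(vec, planes):
    """Each bit is the sign of the projection onto one random hyperplane;
    vectors at a small angle agree on most bits."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

planes = random_hyperplanes(dim=4, n_bits=16)
a = [1.0, 2.0, 0.5, 0.0]
b = [1.1, 1.9, 0.6, 0.0]     # nearly the same direction as a
c = [-1.0, -2.0, -0.5, 0.0]  # exactly the opposite direction
agree = lambda s, t: sum(x == y for x, y in zip(s, t))
same_ab = agree(srp_signature(a, planes), srp_signature(b, planes))
same_ac = agree(srp_signature(a, planes), srp_signature(c, planes))
print(same_ab, same_ac)  # a and b share far more bits than a and c
```

This works directly on real-valued vectors, as noted above: the 1s and 0s are only the signs of the projections, so no binarization of the input is needed.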

Re: How about a LSH recommender ?

2011-04-13 Thread ke xie
OK, I will try to implement a non-distributed one. Actually I have a Python version now. But I have a problem. When doing min-hash, the matrix should be either 1 or 0 before applying the hash functions. Then what about rating data? If the matrix is filled with numbers from 1 to 5, should we convert them use
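One common answer to the ratings question is to binarize first, then min-hash the resulting item sets. A sketch (the "rating >= 4 means liked" threshold, the toy data, and the permutation-based min-hash are all assumptions for illustration):

```python
import random

def minhash_signature(item_set, universe, n_hashes, seed=0):
    """One signature entry per random permutation of the item universe:
    the rank of the first universe item that appears in the set."""
    rng = random.Random(seed)
    signature = []
    for _ in range(n_hashes):
        perm = list(universe)
        rng.shuffle(perm)
        rank = {item: i for i, item in enumerate(perm)}
        signature.append(min(rank[item] for item in item_set))
    return signature

# Binarize 1-5 ratings: treat a rating >= 4 as "liked" (threshold assumed).
ratings = {
    "u1": {"a": 5, "b": 4, "c": 1},
    "u2": {"a": 4, "b": 5, "d": 4},
}
liked = {u: {i for i, r in items.items() if r >= 4}
         for u, items in ratings.items()}
universe = sorted({i for s in liked.values() for i in s})
s1 = minhash_signature(liked["u1"], universe, n_hashes=64)
s2 = minhash_signature(liked["u2"], universe, n_hashes=64)
estimate = sum(x == y for x, y in zip(s1, s2)) / 64  # approximates Jaccard
print(liked["u1"], liked["u2"], estimate)
```

The fraction of matching signature entries estimates the Jaccard similarity of the binarized sets (2/3 for this toy data), so the rating scale only enters through the chosen threshold.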

Re: How about a LSH recommender ?

2011-04-13 Thread Ted Dunning
Sean, Do you mean LSI (latent semantic indexing)? Or LSH (locality-sensitive hashing)? (Are you a victim of aggressive error correction?) (Or am I the victim of too little?) On Wed, Apr 13, 2011 at 12:28 AM, Sean Owen wrote: > This approach is really three approaches put together. Elements of

Re: How about a LSH recommender ?

2011-04-13 Thread Sean Owen
This approach is really three approaches put together. Elements of two of the approaches exist in the project -- recommendations based on co-occurrence, and based on clustering (though not MinHash). I don't believe there's much proper LSI in the project at the moment? I would steer you towards loo

Re: How about a LSH recommender ?

2011-04-13 Thread Ted Dunning
Sure. LSH is a fine candidate for parallelism and scaling. I would recommend starting small and testing as you go rather than leaping into a parallelized, full-fledged implementation. Look for other open-source implementations of LSH algorithms. Be warned that the parameter selection for LSH can

How about a LSH recommender ?

2011-04-13 Thread ke xie
Dear all: I've read a paper from Google about their news recommender system. They implemented an LSH algorithm to find the nearest neighbors, and the algorithm is fast. Can we implement one and contribute it to the Mahout project? Any suggestions? The paper is here: http://iws.seu