Re: Preserving pairwise distances while normalizing vectors

2011-07-22 Thread Sean Owen
I think Ted is suggesting augmenting the vectors to (1,0,0,100) and (10,0,0,100) and projecting onto the unit sphere in 4 dimensions. Then the distance is not 0 on the surface of that sphere. On Fri, Jul 22, 2011 at 7:29 AM, Jake Mannix wrote: > (1, 0, 0) and (10, 0, 0) have very large distance

is there exist lda classifier with trained probabilistic model?

2011-07-22 Thread jun li
Hi all, I found in the LingPipe book that there is a ldaclassifer which just loads a trained model and symbol table (id mapping to word string) and classifies new documents. Can LDA in Mahout provide the same function or command? Thanks. -- Li Jun

Re: Wald's Test / parameter significance tests (Logistic Regression)

2011-07-22 Thread Svetlomir Kasabov
Hello Ted, thanks for your reply and detailed answer. I will probably use the L_1 regularization since you recommended it. Can I use Mahout's class L1 for this case? Which other classes can be useful? Actually, I thought this problem could be solved more easily: Quote from: http://webcache.googleu

Pairwise Document Similarity

2011-07-22 Thread Niall Riddell
Hi, I would like to sense check an approach to near-duplicate detection of documents using Mahout. After some basic research I've implemented a basic proof which works effectively on a small corpus. I have taken the following pre-processing steps: 1) Parse the document 2) Remove unnecessary tok

df-count/data does not exist

2011-07-22 Thread Liliana Mamani Sanchez
Hello all, I was trying to run a basic canopy clustering command: bin/mahout canopy -i vectordata -o output1 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 6 -t2 2 and I get the exception: Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs:

Re: df-count/data does not exist

2011-07-22 Thread Robin Anil
What does the input folder vectordata contain? I am guessing you gave the top-level directory instead of the tfidf-vectors folder as input. Robin On Fri, Jul 22, 2011 at 8:33 PM, Liliana Mamani Sanchez wrote: > Hello all, > > I was trying to run a basic canopy clustering command: > > > bin

Re: Pairwise Document Similarity

2011-07-22 Thread Grant Ingersoll
On Jul 22, 2011, at 7:23 AM, Niall Riddell wrote: > > > I've gone through MIA and felt the rowsimilarityjob was a > possibility, however I understand that a JIRA has been raised to make > this potentially less general and in its current form it may not > match my performance/cost criteria (

Re: Wald's Test / parameter significance tests (Logistic Regression)

2011-07-22 Thread Ted Dunning
On Fri, Jul 22, 2011 at 3:33 AM, Svetlomir Kasabov < skasa...@smail.inf.fh-brs.de> wrote: > thanks for your reply and detailed answer. I will probably use the L_1 > regularization since you recommended it. Can I use Mahout's class L1 for > this case ? Which other classes can be useful? > OnlineL

Re: is there exist lda classifier with trained probabilistic model?

2011-07-22 Thread Ted Dunning
Not in the same form for LDA. You can definitely use LDA to build feature vectors and then classify using those features with OnlineLogisticRegression. On Fri, Jul 22, 2011 at 12:56 AM, jun li wrote: > > I found in lingpipe book, there is a ldaclassifer which just load trained > model and s

Re: Preserving pairwise distances while normalizing vectors

2011-07-22 Thread Ted Dunning
Sean is correct. And this will change the distances, but not the ratios of the distances, because a small patch of the sphere is nearly isometric with the original space. On Fri, Jul 22, 2011 at 12:46 AM, Sean Owen wrote: > I think Ted is suggesting augmenting the vectors to (1,0,0,100) and > (10
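
A minimal sketch of the augmentation idea discussed in this thread, in plain Python rather than Mahout. The helper names, the third sample point (0, 5, 0), and the choice of 100 as the appended constant (taken from Sean's example) are assumptions for illustration only. It compares ordinary normalization, which collapses (1, 0, 0) and (10, 0, 0) onto the same point, with appending a large constant coordinate before normalizing, which keeps the points distinct and roughly preserves the ratios between pairwise distances.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def augment_and_normalize(v, c):
    """Append the constant c as an extra coordinate, then scale to unit length."""
    return normalize(list(v) + [c])


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two points from the thread plus one hypothetical extra point for a ratio.
points = [(1, 0, 0), (10, 0, 0), (0, 5, 0)]
c = 100.0  # assumed constant; it should be large relative to the data

# Plain normalization maps the first two points to the same unit vector.
plain_d01 = dist(normalize(points[0]), normalize(points[1]))

# Augmented normalization keeps them apart on the 4-dimensional unit sphere.
proj = [augment_and_normalize(p, c) for p in points]
proj_d01 = dist(proj[0], proj[1])

# Ratios of distances before and after the projection.
orig_ratio = dist(points[0], points[1]) / dist(points[0], points[2])
proj_ratio = proj_d01 / dist(proj[0], proj[2])

print("plain normalized distance:", plain_d01)
print("augmented distance:", proj_d01)
print("distance ratio, original vs projected:", orig_ratio, proj_ratio)
```

The larger the constant is relative to the data, the smaller the patch of the sphere the points occupy and the closer the projected ratios stay to the original ones, at the cost of all absolute distances shrinking.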

Re: Broken links

2011-07-22 Thread Joanne Sun
Hi, I have a humble question. I wonder what the relation between Lucene and Mahout is. Thanks, Joanne On Fri, Jul 8, 2011 at 7:12 AM, Sean Owen wrote: > (I've just removed that old page to avoid confusion.) > > On Fri, Jul 8, 2011 at 1:46 PM, Maël Thomas > wrote: > >> Hello >> >> The page http:/

Re: Broken links

2011-07-22 Thread Ted Dunning
It is a family relationship for the most part. Mahout came from the Lucene community. Mahout still uses Lucene. Some Lucene users use Mahout, but Lucene and Solr themselves do not depend on Mahout. On Fri, Jul 22, 2011 at 2:57 PM, Joanne Sun wrote: > Hi I have a humble question. I wonder wh

Re: is there exist lda classifier with trained probabilistic model?

2011-07-22 Thread Jake Mannix
We don't currently have inference on unseen documents as part of the mahout shell script, but there is a method you can use with very little modification: LDADriver.computeDocumentTopicProbabilities() It will take in a SequenceFile with any kind of keys, and values which are VectorWritable, as lo

Re: Preserving pairwise distances while normalizing vectors

2011-07-22 Thread Lance Norskog
Thanks, folks. On Fri, Jul 22, 2011 at 2:55 PM, Ted Dunning wrote: > Sean is correct. > > And this will change the distances, but not the ratios of the distances > because small patch of the sphere is nearly isometric with the original > space. > > > On Fri, Jul 22, 2011 at 12:46 AM, Sean Owen w