Re: Cluster dumper crashes when run on a large dataset

2011-11-04 Thread gaurav redkar
Actually, I have to run the mean shift algorithm on a large dataset for my project. The clusterdumper facility works on smaller data sets, but my project will mostly involve large-scale data (sizes will mostly extend to gigabytes). So I need to modify the clusterdumper facility to work on such

Re: Cluster dumper crashes when run on a large dataset

2011-11-04 Thread Paritosh Ranjan
Such big data would need to run on a Hadoop cluster. Right now, I think there is no utility which can help you collect data in the form you want. You will have to read it line by line and group the vectors belonging to the same cluster. It would be good if you can write it to the file system incrementally, as

Re: Cluster dumper crashes when run on a large dataset

2011-11-04 Thread gaurav redkar
Thanks a lot for your help. Yes, I will be running it on a Hadoop cluster. Can you elaborate a bit on writing to the file system incrementally? On Fri, Nov 4, 2011 at 11:51 AM, Paritosh Ranjan pran...@xebia.com wrote: Such big data would need to run on Hadoop cluster. Right now, I think there is no

Re: Cluster dumper crashes when run on a large dataset

2011-11-04 Thread Paritosh Ranjan
pseudo code:
    while (has next record in clustered output) {
        readNextRecord();
        extractVectorAndClusterIdFromRecord();
        if (directory of name clusterId does not exist) {
            create directory of name clusterId
        }
        writeVectorInDirectoryNamedClusterId();
    }
On 04-11-2011 12:09, gaurav redkar
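The loop in the pseudo code above could be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions, not Mahout code: the record layout (`<clusterId>`, a tab, then the vector text, one record per line), the class name `IncrementalClusterWriter`, and the per-cluster file name `vectors.txt` are all hypothetical, and it writes to the local file system rather than HDFS. The point is that each record is appended as soon as it is read, so memory use does not grow with the size of the clustered output.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class IncrementalClusterWriter {

    // Assumed (hypothetical) record layout: "<clusterId>\t<vector text>", one record per line.
    static void writeByCluster(Path clusteredOutput, Path outputRoot) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(clusteredOutput, StandardCharsets.UTF_8)) {
            String line;
            // Read one record at a time; nothing is accumulated in memory.
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines that do not match the assumed layout
                }
                String clusterId = line.substring(0, tab);
                String vector = line.substring(tab + 1);

                // Create the directory named after the cluster id if it does not exist yet.
                Path clusterDir = outputRoot.resolve(clusterId);
                Files.createDirectories(clusterDir);

                // Append the vector to that cluster's file immediately.
                Files.write(clusterDir.resolve("vectors.txt"),
                        (vector + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }
}
```

Reopening the file for every record keeps the sketch short; for a large run one would probably keep a bounded set of open writers instead, and on a Hadoop cluster the same loop would go through the Hadoop file system API rather than `java.nio.file`.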

Re: Cluster dumper crashes when run on a large dataset

2011-11-04 Thread gaurav redkar
Thanks a lot, Paritosh. I really appreciate your help. On Fri, Nov 4, 2011 at 12:15 PM, Paritosh Ranjan pran...@xebia.com wrote: pseudo code: while(has next record in clustered output) { readNextRecord(); extractVectorAndClusterIdFromRecord(); if(directory of name ClusterId does not

RE: How to find which point belongs which cluster after running KMeansClusterer

2011-11-04 Thread WangRamon
Thanks, that's what I need. I have another question: is there a recommended value for the number of iterations and convergenceDelta in K-Means? Thanks a lot. Cheers, Ramon Date: Fri, 4 Nov 2011 08:07:01 +0530 From: pran...@xebia.com To: user@mahout.apache.org Subject: Re: How to find which point belongs

Watchmaker framework usage

2011-11-04 Thread Grant Ingersoll
We've been debating removing/archiving the Watchmaker integration in Mahout due to seeming lack of maintenance and interest. Is anybody actually using it? -Grant

RE: How to find which point belongs which cluster after running KMeansClusterer

2011-11-04 Thread WangRamon
Subject: Re: How to find which point belongs which cluster after running KMeansClusterer From: gsing...@apache.org Date: Fri, 4 Nov 2011 06:49:49 -0400 To: user@mahout.apache.org On Nov 4, 2011, at 3:28 AM, WangRamon wrote: Thanks, that's what i need. I have another question,

Can anybody explain the distance method in SquaredEuclideanDistanceMeasure?

2011-11-04 Thread WangRamon
Hi All, I'm reading the code of SquaredEuclideanDistanceMeasure. The distance(double centroidLengthSquare, Vector centroid, Vector v) method confused me a lot; I don't know why we choose the expression centroidLengthSquare - 2 * v.dot(centroid) + v.getLengthSquared() to calculate the

Re: Can anybody explain the distance method in SquaredEuclideanDistanceMeasure?

2011-11-04 Thread Sebastian Schelter
c = centroid
v = vector
(c - v)^2 = c^2 - 2cv + v^2
On 04.11.2011 15:58, WangRamon wrote: Hi All I'm reading the code of SquaredEuclideanDistanceMeasure, the distance(double centroidLengthSquare, Vector centroid, Vector v) method confused me a lot, i don't know why we choose this
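The identity can be checked numerically with a small sketch. It uses plain `double[]` arrays instead of Mahout's `Vector` type, and the class and method names (`SquaredDistanceSketch`, `direct`, `expanded`) are made up for illustration; only the expression inside `expanded` mirrors the one quoted from `SquaredEuclideanDistanceMeasure`. A plausible reason for the expanded form, judging from the method signature, is that the centroid's squared length is passed in precomputed and can be reused for every vector compared against that centroid.

```java
class SquaredDistanceSketch {

    // Dot product of two equal-length arrays.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Direct definition: sum over i of (c_i - v_i)^2.
    static double direct(double[] c, double[] v) {
        double sum = 0.0;
        for (int i = 0; i < c.length; i++) {
            double d = c[i] - v[i];
            sum += d * d;
        }
        return sum;
    }

    // Expanded form: c.c - 2 * v.c + v.v, with c.c supplied by the caller.
    static double expanded(double centroidLengthSquare, double[] c, double[] v) {
        return centroidLengthSquare - 2 * dot(v, c) + dot(v, v);
    }

    public static void main(String[] args) {
        double[] c = {1, 2, 3};
        double[] v = {4, 6, 8};
        System.out.println(direct(c, v));
        System.out.println(expanded(dot(c, c), c, v));
    }
}
```

For c = (1, 2, 3) and v = (4, 6, 8), the direct sum is 9 + 16 + 25 = 50, and the expanded form gives 14 - 2 * 40 + 116 = 50.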

RE: Can anybody explain the distance method in SquaredEuclideanDistanceMeasure?

2011-11-04 Thread WangRamon
Haha, thanks for the math, I had almost forgotten it. Date: Fri, 4 Nov 2011 16:01:02 +0100 From: s...@apache.org To: user@mahout.apache.org Subject: Re: Can anybody explain the distance method in SquaredEuclideanDistanceMeasure? c = centroid v = vector (c - v)^2 = c^2 - 2cv + v^2 On

Re: Can anybody explain the distance method in SquaredEuclideanDistanceMeasure?

2011-11-04 Thread Grant Ingersoll
On Nov 4, 2011, at 10:58 AM, WangRamon wrote: Hi All I'm reading the code of SquaredEuclideanDistanceMeasure, the distance(double centroidLengthSquare, Vector centroid, Vector v) method confused me a lot, i don't know why we choose this expression centroidLengthSquare - 2 *

classification of search queries

2011-11-04 Thread abhayd
Hi, I'm very new to Mahout. We use Solr as our search engine and we have user queries stored for processing purposes. I was wondering if these user search terms can be classified into groups based on what data is being indexed. Say, for example, our content is divided in two groups

Re: Can anybody explain the distance method in SquaredEuclideanDistanceMeasure?

2011-11-04 Thread Ted Dunning
2011/11/4 Grant Ingersoll gsing...@apache.org Hi All I'm reading the code of SquaredEuclideanDistanceMeasure, the distance(double centroidLengthSquare, Vector centroid, Vector v) method confused me a lot, i don't know why we choose this expression centroidLengthSquare - 2 * v.dot(centroid)

Re: classification of search queries

2011-11-04 Thread Daniel Allen
The last third of this book is on classification in Mahout: http://www.manning.com/owen/ On Nov 4, 2011, at 12:20 PM, abhayd ajdabhol...@hotmail.com wrote: hi I m very new to mahout. We use solr as our search engine and we have user query stored for processing purposes. I was

Re: classification of search queries

2011-11-04 Thread Ted Dunning
This can be a very hard problem. Do you have access to a history of recent queries by the same user? That could make things much easier. On Nov 4, 2011, at 12:20 PM, abhayd ajdabhol...@hotmail.com wrote: We use solr as our search engine and we have user query stored for processing

Re: classification of search queries

2011-11-04 Thread Daniel Allen
You may also get some ideas here: http://scholar.google.com/scholar?hl=en&sciodt=0%2C10&q=+lucene&btnG=Search&cites=12145019647228075107&scipsc=1&as_sdt=0%2C10&as_ylo=&as_vis=0 Sent from my iPad On Nov 4, 2011, at 12:53 PM, Ted Dunning ted.dunn...@gmail.com wrote: This can be a very hard problem.

Re: NaN - classification results (cbayes)

2011-11-04 Thread Sam Cunningham
Here are the files:
http://lucene.472066.n3.nabble.com/file/n3480755/Entertainment.zip
http://lucene.472066.n3.nabble.com/file/n3480755/Health.zip
http://lucene.472066.n3.nabble.com/file/n3480755/SciTech.zip

Re: classification of search queries

2011-11-04 Thread Gustavo Enrique Salazar Torres
There is an excellent paper on query chaining which you may find interesting: http://dl.acm.org/citation.cfm?id=1081899 Gustavo On Fri, Nov 4, 2011 at 3:19 PM, Daniel Allen assis...@gmail.com wrote: You may also get some ideas here:

creating vectors from lucene index which does NOT store vectors

2011-11-04 Thread Robert Stewart
I have a relatively large existing Lucene index which does not store vectors. Size is approx. 100 million documents (about 1.5 TB in size). I am thinking of using some lower level Lucene API code to extract vectors, by enumerating terms and term docs collections. Something like the following
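The general idea of enumerating terms and their postings to rebuild per-document vectors can be sketched without any Lucene API. This is not the code from the original message (which is cut off above) and it does not call Lucene; the in-memory `postings` map stands in for the term and term-docs enumeration, and the names `PostingsToVectors` and `buildDocVectors` are hypothetical. Each term, visited in sorted order, gets the next vector index, and every posting (document id, term frequency) becomes one entry in that document's sparse vector.

```java
import java.util.Map;
import java.util.TreeMap;

class PostingsToVectors {

    // postings: term -> (docId -> term frequency); a stand-in for enumerating an index.
    // Returns: docId -> (termIndex -> term frequency), i.e. one sparse vector per document.
    static Map<Integer, Map<Integer, Integer>> buildDocVectors(
            Map<String, Map<Integer, Integer>> postings) {
        // Visit terms in sorted order so that term indexes are assigned deterministically.
        TreeMap<String, Map<Integer, Integer>> sorted = new TreeMap<>(postings);
        Map<Integer, Map<Integer, Integer>> vectors = new TreeMap<>();
        int termIndex = 0;
        for (Map.Entry<String, Map<Integer, Integer>> termEntry : sorted.entrySet()) {
            for (Map.Entry<Integer, Integer> posting : termEntry.getValue().entrySet()) {
                // Add this term's frequency to the sparse vector of the document it occurs in.
                vectors.computeIfAbsent(posting.getKey(), d -> new TreeMap<>())
                       .put(termIndex, posting.getValue());
            }
            termIndex++;
        }
        return vectors;
    }
}
```

At the scale mentioned (about 100 million documents) the vectors would not fit in memory like this; the partial vectors would presumably have to be spilled to disk or produced per split and merged afterwards.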

Re: creating vectors from lucene index which does NOT store vectors

2011-11-04 Thread Grant Ingersoll
Should be doable, but likely slow. Relative to the other things you are likely doing, probably not a big deal. In fact, I've thought about adding such a piece of code, so if you are looking to contrib, it would be welcome. On Nov 4, 2011, at 1:55 PM, Robert Stewart wrote: I have a relatively

Re: creating vectors from lucene index which does NOT store vectors

2011-11-04 Thread Robert Stewart
Ok that was what I thought. I'll give it a shot. On Nov 4, 2011, at 2:05 PM, Grant Ingersoll wrote: Should be doable, but likely slow. Relative to the other things you are likely doing, probably not a big deal. In fact, I've thought about adding such a piece of code, so if you are

Re: creating vectors from lucene index which does NOT store vectors

2011-11-04 Thread Ted Dunning
It looks like a fine solution. It should be map-reducable as well if you can build good splits on term space. That isn't quite as simple as it looks since you probably want each mapper to read a consecutive sequence of term id's and earlier term id's will be much, much more common than later
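One way to picture "good splits on term space" is to cut the ordered term list into consecutive ranges that carry roughly the same number of postings, rather than the same number of terms. The sketch below is only an illustration of that idea under assumptions of my own: the document frequencies are given as an array in term order, and the class and method names (`TermRangeSplitter`, `splitByPostingCount`) are hypothetical. It closes a range each time the running posting count passes the next equal share of the total.

```java
import java.util.ArrayList;
import java.util.List;

class TermRangeSplitter {

    // docFreqs[i] is the number of postings for the i-th term, in term order.
    // Returns the exclusive end index of each consecutive term range (at most numSplits ranges).
    // A range may be empty if all remaining terms have zero postings.
    static List<Integer> splitByPostingCount(long[] docFreqs, int numSplits) {
        long total = 0;
        for (long df : docFreqs) {
            total += df;
        }
        List<Integer> ends = new ArrayList<>();
        long running = 0;
        for (int i = 0; i < docFreqs.length; i++) {
            running += docFreqs[i];
            // Cut once the running count reaches the next multiple of total / numSplits.
            if (ends.size() < numSplits - 1
                    && running * numSplits >= total * (long) (ends.size() + 1)) {
                ends.add(i + 1);
            }
        }
        ends.add(docFreqs.length); // the last range always runs to the end of the term list
        return ends;
    }
}
```

With frequencies 100, 50, 25, 10, 5, 5, 3, 2 and two splits, the first range holds only the first term (100 postings) and the second holds the remaining seven (also 100 postings), which is the kind of imbalance in term counts that skewed frequencies produce.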

Re: creating vectors from lucene index which does NOT store vectors

2011-11-04 Thread Robert Stewart
Thanks Ted. One thing I don't get: why would earlier term id's be much, much more common than later ones? AFAIK, terms are sorted lexicographically, so earlier ones are just AAA... instead of ZZZ... so I don't understand how that relates to frequency. Probably I misunderstand what you mean

Re: creating vectors from lucene index which does NOT store vectors

2011-11-04 Thread Ted Dunning
Hmmm... I think I may be out of date. Or not. Grant may be able to resolve the question. If term id's are assigned in order of appearance then the first id's assigned will tend to be common terms. But I think you are right that the current Lucene index structure uses lexicographic order for the terms

SF Apache Mahout User Meeting (MUM) Nov 29th @ Lucid Imagination HQ

2011-11-04 Thread Grant Ingersoll
For those interested in Apache Mahout in the Bay Area, I'd like to invite you to a Mahout User Meetup (I liked MUM better than MUG) on Nov. 29th at Lucid Imagination. Unlike previous informal meetups we've had, this one is going to be a bit more formal, in the sense that we are going to have

Nearest Neighbor Recommender and Euclidean distance similarity

2011-11-04 Thread Lance Norskog
Now that EuclideanDistanceSimilarity is different, how should it be changed for the BookCrossing example? -- Lance Norskog goks...@gmail.com