Actually I have to run the mean shift algorithm on a large dataset for my
project. The ClusterDumper facility works on smaller datasets, but my
project will mostly involve large-scale data (sizes extending to
gigabytes), so I need to modify the ClusterDumper facility to work on
such data.
Such big data would need to run on a Hadoop cluster.
Right now, I think there is no utility which can help you collect the data
in the form you want. You will have to read it line by line and group
vectors belonging to the same cluster. It would be good if you can write it
to the file system incrementally.
Thanks a lot for your help. Yes, I will be running it on a Hadoop cluster.
Can you elaborate a bit on writing to the file system incrementally?
On Fri, Nov 4, 2011 at 11:51 AM, Paritosh Ranjan pran...@xebia.com wrote:
pseudo code:
while (has next record in clustered output)
{
    readNextRecord();
    extractVectorAndClusterIdFromRecord();
    if (directory named clusterId does not exist) {
        create directory named clusterId
    }
    writeVectorInDirectoryNamedClusterId();
}
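A rough Java sketch of that loop, assuming Mahout's usual clusteredPoints
output (a SequenceFile of IntWritable cluster ids to WeightedVectorWritable
points); the class name, paths, and text output format are illustrative,
not part of Mahout:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

public class IncrementalClusterDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path(args[0]);    // e.g. a clusteredPoints part file
    Path outDir = new Path(args[1]);   // base directory for per-cluster output

    // One open writer per cluster id, so vectors stream out incrementally
    // instead of being collected in memory.
    Map<Integer, FSDataOutputStream> writers = new HashMap<Integer, FSDataOutputStream>();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    IntWritable clusterId = new IntWritable();
    WeightedVectorWritable point = new WeightedVectorWritable();
    try {
      while (reader.next(clusterId, point)) {        // read next record
        int id = clusterId.get();
        FSDataOutputStream out = writers.get(id);
        if (out == null) {                           // first point seen for this cluster
          Path dir = new Path(outDir, String.valueOf(id));
          fs.mkdirs(dir);                            // directory named after the cluster id
          out = fs.create(new Path(dir, "points.txt"));
          writers.put(id, out);
        }
        out.writeBytes(point.getVector().asFormatString() + "\n");
      }
    } finally {
      reader.close();
      for (FSDataOutputStream out : writers.values()) {
        out.close();
      }
    }
  }
}

Keeping one writer per cluster avoids reopening files on every record, but
with a very large number of clusters you would want to close idle writers
to stay under the open-file limit.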
On 04-11-2011 12:09, gaurav redkar wrote:
Thanks a lot, Paritosh. I really appreciate your help.
On Fri, Nov 4, 2011 at 12:15 PM, Paritosh Ranjan pran...@xebia.com wrote:
Thanks, that's what I need. I have another question: is there a recommended value
for the number of iterations and the convergenceDelta in k-means? Thanks a lot. Cheers, Ramon
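Both parameters go straight into KMeansDriver.run; a hypothetical
invocation, assuming the 0.5-era signature (paths and values here are
placeholders, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class RunKMeans {
  public static void main(String[] args) throws Exception {
    KMeansDriver.run(new Configuration(),
        new Path("testdata/points"),    // input vectors
        new Path("testdata/clusters"),  // initial cluster centers
        new Path("output"),             // where each iteration is written
        new EuclideanDistanceMeasure(),
        0.001,                          // convergenceDelta: stop once centers move less than this
        10,                             // maxIterations: hard cap on passes over the data
        true,                           // run the final clustering step over all points
        false);                         // false = run as MapReduce jobs
  }
}

In practice the right values depend on the data: convergenceDelta decides
how little the centroids may move before iteration stops, and maxIterations
is only a safety cap.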
Date: Fri, 4 Nov 2011 08:07:01 +0530
From: pran...@xebia.com
To: user@mahout.apache.org
Subject: Re: How to find which point belongs which cluster after running KMeansClusterer
We've been debating removing/archiving the Watchmaker integration in Mahout due
to a seeming lack of maintenance and interest. Is anybody actually using it?
-Grant
Subject: Re: How to find which point belongs which cluster after running
KMeansClusterer
From: gsing...@apache.org
Date: Fri, 4 Nov 2011 06:49:49 -0400
To: user@mahout.apache.org
On Nov 4, 2011, at 3:28 AM, WangRamon wrote:
Hi all, I'm reading the code of SquaredEuclideanDistanceMeasure. The
distance(double centroidLengthSquare, Vector centroid, Vector v) method
confused me a lot; I don't know why we choose the expression
centroidLengthSquare - 2 * v.dot(centroid) + v.getLengthSquared() to
calculate the distance.
c = centroid
v = vector
(c - v)^2 = c^2 - 2cv + v^2
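That expansion is what lets the caller precompute centroidLengthSquare
(c^2) once per centroid instead of recomputing it for every point. A quick
check of the identity against Mahout's Vector API (a hypothetical snippet;
the class name and values are just for illustration):

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class SquaredDistanceCheck {
  public static void main(String[] args) {
    Vector c = new DenseVector(new double[] {1.0, 2.0, 3.0});   // centroid
    Vector v = new DenseVector(new double[] {4.0, 0.0, 1.0});   // point

    // Direct squared Euclidean distance: sum of (c_i - v_i)^2
    double direct = c.minus(v).getLengthSquared();

    // Expanded form used by the distance() method
    double expanded = c.getLengthSquared() - 2 * v.dot(c) + v.getLengthSquared();

    System.out.println(direct + " == " + expanded);   // both are 17.0
  }
}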
On 04.11.2011 15:58, WangRamon wrote:
Haha, thanks for the math, I almost forgot.
Date: Fri, 4 Nov 2011 16:01:02 +0100
From: s...@apache.org
To: user@mahout.apache.org
Subject: Re: Can anybody explain the distance method in SquaredEuclideanDistanceMeasure?
On Nov 4, 2011, at 10:58 AM, WangRamon wrote:
Hi,
I'm very new to Mahout.
We use Solr as our search engine, and we have user queries stored for
processing purposes.
I was wondering whether these user search terms can be classified into
groups based on what data is being indexed.
Say, for example, our content is divided into two groups
The last third of this book is on classification in Mahout:
http://www.manning.com/owen/
On Nov 4, 2011, at 12:20 PM, abhayd ajdabhol...@hotmail.com wrote:
This can be a very hard problem.
Do you have access to a history of recent queries by the same user? That
could make things much easier.
On Nov 4, 2011, at 12:20 PM, abhayd ajdabhol...@hotmail.com wrote:
You may also get some ideas here:
http://scholar.google.com/scholar?hl=en&sciodt=0%2C10&q=+lucene&btnG=Search&cites=12145019647228075107&scipsc=1&as_sdt=0%2C10&as_ylo=&as_vis=0
On Nov 4, 2011, at 12:53 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Here are the files:
http://lucene.472066.n3.nabble.com/file/n3480755/Entertainment.zip
http://lucene.472066.n3.nabble.com/file/n3480755/Health.zip
http://lucene.472066.n3.nabble.com/file/n3480755/SciTech.zip
There is an excellent paper on query chaining which you may find
interesting:
http://dl.acm.org/citation.cfm?id=1081899
Gustavo
On Fri, Nov 4, 2011 at 3:19 PM, Daniel Allen assis...@gmail.com wrote:
I have a relatively large existing Lucene index which does not store term vectors.
Its size is approx. 100 million documents (about 1.5 TB).
I am thinking of using some lower-level Lucene API code to extract vectors by
enumerating the terms and term docs collections.
Something like the following:
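(A rough sketch of that approach with the Lucene 3.x TermEnum/TermDocs API;
the class name and the tab-separated output are illustrative only:)

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

// Walk every term in the index and emit (docId, termId, freq) triples,
// from which per-document vectors can be assembled downstream.
public class VectorExtractor {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    try {
      TermEnum terms = reader.terms();             // enumerate all terms
      int termId = 0;
      while (terms.next()) {
        Term term = terms.term();
        TermDocs docs = reader.termDocs(term);     // postings for this term
        while (docs.next()) {
          // one (document, term, term-frequency) cell of the matrix
          System.out.println(docs.doc() + "\t" + termId + "\t" + docs.freq());
        }
        docs.close();
        termId++;
      }
      terms.close();
    } finally {
      reader.close();
    }
  }
}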
Should be doable, but likely slow. Relative to the other things you are likely
doing, probably not a big deal.
In fact, I've thought about adding such a piece of code, so if you are looking
to contribute, it would be welcome.
On Nov 4, 2011, at 1:55 PM, Robert Stewart wrote:
OK, that was what I thought. I'll give it a shot.
On Nov 4, 2011, at 2:05 PM, Grant Ingersoll wrote:
It looks like a fine solution. It should be map-reducible as well if you
can build good splits on the term space. That isn't quite as simple as it
looks, since you probably want each mapper to read a consecutive sequence of
term ids, and earlier term ids will be much, much more common than later ones.
Thanks Ted,
One thing I don't get: why would earlier term ids be much, much more common
than later ones? AFAIK, terms are sorted lexicographically, so earlier ones
are just AAA... instead of ZZZ..., and I don't understand how that relates to
frequency. Probably I misunderstand what you mean.
Hmmm... I think I may be out of date. Or not. Grant may be able to
resolve the question.
If term ids are assigned in order of appearance, then the first ids
assigned will tend to be common terms.
But I think you are right that the current Lucene index structure uses
lexicographic order for the terms.
For those interested in Apache Mahout in the Bay Area,
I'd like to invite you to a Mahout User Meetup (I liked MUM better than MUG) on
Nov. 29th at Lucid Imagination. Unlike previous informal meetups we've had,
this one is going to be a bit more formal, in the sense that we are going to
have
Now that EuclideanDistanceSimilarity is different, how should it be changed
for the BookCrossing example?
--
Lance Norskog
goks...@gmail.com