Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Ankit Goel

Oh, I thought kmeans gave me a point vector as a centroid, not a calculated
point central to a cluster. I guess in this case I would be looking for the
most central point vector (from the index) that I can use as a
representative of the cluster.


-- 
Regards,
Ankit Goel
http://about.me/ankitgoel

Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Ted Dunning

The most central point in a cluster is often referred to as a medoid
(similar to median, but multi-dimensional).

The Mahout code does not compute medoids. In general, medoids are difficult
to compute, and implementing a full k-medoids clustering algorithm is even
more so.
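
For a cluster as small as the one discussed in this thread, a brute-force search is one way to make the definition concrete. The following is only a sketch, not Mahout code: the class and method names are hypothetical, points are plain double arrays, and Euclidean distance is an assumption. It takes the medoid to be the member with the smallest total distance to all other members.

```java
/** Brute-force medoid search; an illustrative sketch, not part of Mahout. */
final class MedoidSketch {

    /** Euclidean distance between two vectors of equal length. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the index of the point whose summed distance to every
     * other point is smallest, or -1 if the array is empty.
     */
    static int medoidIndex(double[][] points) {
        int best = -1;
        double bestTotal = Double.POSITIVE_INFINITY;
        for (int i = 0; i < points.length; i++) {
            double total = 0.0;
            for (int j = 0; j < points.length; j++) {
                total += euclidean(points[i], points[j]);
            }
            if (total < bestTotal) {
                bestTotal = total;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up two-dimensional points standing in for document vectors.
        double[][] cluster = { {0.0, 0.0}, {1.0, 0.0}, {5.0, 0.0} };
        System.out.println("medoid index = " + medoidIndex(cluster));
    }
}
```

The double loop is quadratic in the cluster size, which is fine for a handful of articles and is part of why medoids get expensive on large clusters.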




Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Ankit Goel

That kind of puts me in a tough position. I was planning to use kmeans as a
method for aggregating similar articles from multiple news sources, and
then picking a representative article from each group. By similar I mean
articles that come from different news sources but are about the exact same
thing. Intuitively it seems that these articles would get grouped
together. Any suggestions on how I should go about that? So far I'm using
Nutch to crawl, Solr to index, and now I'm here on Mahout.

-- 
Regards,
Ankit Goel
http://about.me/ankitgoel

Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Andrew Musselman

It's possible you could write a post-processing step to find the closest
point to the centroid based on the distance property if I'm recalling it
correctly.
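
A rough sketch of such a post-processing step, assuming the point lines of one cluster look like the sample quoted in this thread ("1.0 : [distance=0.3963903651622338]: []"). The class name, the method name and the regular expression are hypothetical and would need adjusting to the real dump. It only selects the line with the smallest distance value; whether that line identifies the article depends on what the point field contains in your output.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Picks the clusterdump point line with the smallest "distance" property. */
final class ClosestByDistanceProperty {

    // Assumed format: the property appears as "[distance=<number>]",
    // as in the sample output quoted in this thread.
    private static final Pattern DISTANCE =
            Pattern.compile("\\[distance=([^\\]]+)\\]");

    /**
     * Returns the index of the line with the smallest distance value,
     * or -1 if no line carries a parsable distance property.
     */
    static int closestLine(List<String> pointLines) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < pointLines.size(); i++) {
            Matcher m = DISTANCE.matcher(pointLines.get(i));
            if (!m.find()) {
                continue; // not a point line, e.g. a header or a top-terms row
            }
            double d;
            try {
                d = Double.parseDouble(m.group(1).trim());
            } catch (NumberFormatException e) {
                continue; // skip values that are not plain numbers
            }
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }
}
```

The lines passed in should belong to a single cluster, so the dump would first have to be split per cluster.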


Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Andrew Musselman

I'm not sure centroid id is even a defined thing, especially since the
centroid, in my understanding, is just a point in space, not necessarily a
point in your data.
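
To illustrate that point, here is a minimal sketch that treats the centroid as the component-wise mean of the cluster members, which is the usual k-means definition. The class name and the sample points are made up, and plain double arrays stand in for Mahout vectors.

```java
import java.util.Arrays;

/** Computes a centroid as the component-wise mean of a set of points. */
final class CentroidSketch {

    /** Mean of the given points; all points are assumed to have the same length. */
    static double[] centroid(double[][] points) {
        double[] mean = new double[points[0].length];
        for (double[] p : points) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += p[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= points.length;
        }
        return mean;
    }

    public static void main(String[] args) {
        // Three made-up points; none of them is located at (1.0, 1.0).
        double[][] cluster = { {0.0, 0.0}, {2.0, 0.0}, {1.0, 3.0} };
        System.out.println(Arrays.toString(centroid(cluster)));
    }
}
```

For these three points the mean is (1.0, 1.0), which is not one of the inputs, so there is no document id to attach to it.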

Are you trying to find the most-central point in a given cluster?

On Mon, Jul 20, 2015 at 5:18 PM, Ankit Goel ankitgoel2...@gmail.com wrote:

 Hi,
 I've been experimenting with Mahout 0.10 and kmeans clustering on a Solr
 4.6.1 index. The data is news articles. The --field option for kmeans is
 set to content. The idField is set to title (just so I can analyse it
 faster). The clusterdump of the kmeans result gives me proper output, but
 I can't figure out the id of the vector chosen as the center. There are
 only 14-15 articles, so I am not hung up on cluster performance at this
 time.

 I used random seeds for the kmeans command line.
 For reference, this is the clusterdump command line I am executing:

 bin/mahout clusterdump -i $MAHOUT_HOME/testCluster/clusters-3-final
 -p $MAHOUT_HOME/testCluster/clusteredPoints -d $MAHOUT_HOME/dict.txt -b 5

 The output I get is of the form

 :{r:

 top terms

 x==x

 Weight : [props - optional]:  Point:

  1.0 : [distance=0.0]: [{account:0.026}...other features]

 1.0 : [distance=0.3963903651622338]: []


 So how exactly do I get the centroid id? I have even tried accessing it
 with Java:

 ClusterWritable value.getValue().getCenter() but this just gives me the
 features and values of the centroid.

 Also, please explain the meaning of account:0.026 (just making sure I
 understand it correctly). I used TF-IDF.

 --
 Regards,
 Ankit Goel
 http://about.me/ankitgoel

Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Ankit Goel

Hmm, kmeans algorithmically is supposed to anoint only existing vectors
(documents) as the centroid for a cluster at every step (or so I
believe). If Mahout is generating a non-document vector as a centroid, it
changes a lot of things.

That would also explain the -distanceMeasure option in clusterdump. As
Andrew mentions, running clusterdump with the default Euclidean measure
should give me the closest document vector to the calculated centroid.
Please correct me if I'm wrong anywhere.
Thanks

-- 
Regards,
Ankit Goel
http://about.me/ankitgoel

Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Ted Dunning

You can always just pick the article closest to the centroid.
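
As a sketch of that idea, assuming the article vectors and the centroid are available as plain double arrays of equal length (the class and method names below are hypothetical, and squared Euclidean distance is an assumption; ideally the measure matches the one used for clustering):

```java
/** Finds the data point nearest to a given centroid. */
final class NearestToCentroid {

    /** Squared Euclidean distance; sufficient for comparing which point is closer. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Returns the index of the point closest to the centroid, or -1 if
     * there are no points. On ties the earliest index is kept.
     */
    static int nearestIndex(double[][] points, double[] centroid) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < points.length; i++) {
            double d = squaredDistance(points[i], centroid);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }
}
```

The returned index can then be mapped back to whatever identifier was stored alongside each vector, for example the title used as idField.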

But I think you may find that with simple k-means, clusters are going to
be about more than one thing.




Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Ankit Goel

True that. Kmeans is just a first step anyway. It definitely needs tuning.
Thanks, guys.
