Re: Kmeans clusterdump Interpretation
Oh, I thought kmeans gave me a point vector as a centroid, not a calculated point central to a cluster. I guess in this case I would be looking for the most central point vector (from the index) that I can use as a representative of the cluster.

On Tue, Jul 21, 2015 at 6:41 AM, Andrew Musselman andrew.mussel...@gmail.com wrote:

I'm not sure centroid id is even a defined thing, especially since the centroid, in my understanding, is just a point in space, not necessarily a point in your data. Are you trying to find the most-central point in a given cluster?

On Mon, Jul 20, 2015 at 5:18 PM, Ankit Goel ankitgoel2...@gmail.com wrote:

Hi,

I've been messing with Mahout 0.10 and kmeans clustering with a Solr 4.6.1 index. The data is news articles. The --field option for kmeans is set to content. The idField is set to title (just so I can analyse it faster). The clusterdump of the kmeans result gives me a proper output, but I can't figure out the id of the vector chosen as the center. There are only 14-15 articles, so I am not hung up about the cluster performance at this time. I used random seeds for the kmeans command line.

For reference, this is the clusterdump command line I am executing:

    bin/mahout clusterdump -i $MAHOUT_HOME/testCluster/clusters-3-final -p $MAHOUT_HOME/testCluster/clusteredPoints -d $MAHOUT_HOME/dict.txt -b 5

The output I get is of the form:

    :{r: top terms x==x
    Weight : [props - optional]: Point:
    1.0 : [distance=0.0]: [{account:0.026}...other features]
    1.0 : [distance=0.3963903651622338]: []

So how exactly do I get the centroid id? I have even tried accessing it from Java with ClusterWritable value.getValue().getCenter(), but this just gives me the features and values of the centroid. Also, please do explain the meaning of account:0.026 (just making sure I know it right). I used tfidf.

--
Regards,
Ankit Goel
http://about.me/ankitgoel

--
Regards,
Ankit Goel
http://about.me/ankitgoel
Re: Kmeans clusterdump Interpretation
The most central point in a cluster is often referred to as a medoid (similar to median, but multi-dimensional). The Mahout code does not compute medoids. In general, they are difficult to compute and implementing a full k-medoid clustering algorithm even more so.

On Mon, Jul 20, 2015 at 6:25 PM, Ankit Goel ankitgoel2...@gmail.com wrote:

Oh, I thought kmeans gave me a point vector as a centroid, not a calculated point central to a cluster. I guess in this case I would be looking for the most central point vector (from the index) that I can use as a representative of the cluster.
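To make the medoid idea concrete, here is a minimal brute-force sketch in plain Python. It is not Mahout code; the function names and the toy points are made up for illustration, and Euclidean distance is an assumption. It compares every point with every other point, which is the quadratic cost that makes medoids awkward on large clusters.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def medoid(points):
    """Index of the point with the smallest total distance to all points."""
    best_index, best_cost = None, math.inf
    for i, p in enumerate(points):
        cost = sum(euclidean(p, q) for q in points)  # O(n) work per point
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index

# Hypothetical toy cluster with one outlier.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
print(medoid(cluster))
```

In this toy cluster the mean sits at x = 3.2, so the point nearest the mean is (3.0, 0.0), while the medoid is (2.0, 0.0). The two notions of "most central" need not agree.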
Re: Kmeans clusterdump Interpretation
That kind of puts me in a tough position. I was planning to use kmeans as a method for aggregating similar articles from multiple news sources, and then getting a representative article from those. By similar I mean that the articles are from different news sources but are about the exact same thing. Intuitively it seems that these articles would get grouped together. Any suggestions how I should go about that? So far I'm using Nutch to crawl, Solr to index, and now I'm here on Mahout.

On Tue, Jul 21, 2015 at 7:10 AM, Ted Dunning ted.dunn...@gmail.com wrote:

The most central point in a cluster is often referred to as a medoid (similar to median, but multi-dimensional). The Mahout code does not compute medoids. In general, they are difficult to compute and implementing a full k-medoid clustering algorithm even more so.

--
Regards,
Ankit Goel
http://about.me/ankitgoel
Re: Kmeans clusterdump Interpretation
It's possible you could write a post-processing step to find the closest point to the centroid based on the distance property if I'm recalling it correctly.

On Mon, Jul 20, 2015 at 6:45 PM, Ankit Goel ankitgoel2...@gmail.com wrote:

That kind of puts me in a tough position. I was planning to use kmeans as a method for aggregating similar articles from multiple news sources, and then getting a representative article from those. Here I mean similar as in the articles are from different news sources but are about the exact same thing. Intuitively it seems that these articles would get grouped together. Any suggestions how I should go about that? So far I'm using nutch to crawl, solr to index and now I'm here on mahout.
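A post-processing step of that kind could look roughly like the sketch below. It assumes the point lines of one cluster follow the "Weight : [props - optional]: Point:" layout quoted earlier in the thread, with the distance carried in the bracketed properties. The regular expression, the function name, and the sample point labels are all hypothetical, so check them against real clusterdump output before relying on them.

```python
import re

# Assumed layout of a point line: "<weight> : [distance=<d>]: <point>"
POINT_LINE = re.compile(
    r"^\s*(?P<weight>[\d.]+)\s*:\s*\[distance=(?P<distance>[\d.eE+-]+)\]:\s*(?P<point>.*)$"
)

def closest_point(lines):
    """Return (distance, point_text) for the smallest distance, or None."""
    best = None
    for line in lines:
        match = POINT_LINE.match(line)
        if not match:
            continue  # skip headers and anything that is not a point line
        d = float(match.group("distance"))
        if best is None or d < best[0]:
            best = (d, match.group("point").strip())
    return best

# Hypothetical lines for a single cluster; the labels are placeholders.
lines = [
    "Weight : [props - optional]:  Point:",
    "1.0 : [distance=0.3963903651622338]: article-b",
    "1.0 : [distance=0.125]: article-a",
    "1.0 : [distance=0.8]: article-c",
]
print(closest_point(lines))
```

The function handles one cluster at a time, so a full dump would first need to be split at each cluster header.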
Re: Kmeans clusterdump Interpretation
I'm not sure centroid id is even a defined thing, especially since the centroid, in my understanding, is just a point in space, not necessarily a point in your data. Are you trying to find the most-central point in a given cluster?
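A small sketch of why the centroid need not be one of the documents: it is the component-wise mean of the member vectors. The sparse term-to-weight dictionaries and their weights below are invented for illustration and are not taken from the poster's data.

```python
def centroid(vectors):
    """Component-wise mean of sparse vectors given as {term: weight} dicts."""
    terms = sorted(set().union(*vectors))
    n = len(vectors)
    return {t: sum(v.get(t, 0.0) for v in vectors) / n for t in terms}

# Hypothetical tf-idf style vectors for four documents in one cluster.
docs = [
    {"account": 0.5, "bank": 0.25},
    {"account": 0.25},
    {"bank": 0.75, "loan": 0.5},
    {"loan": 1.0, "account": 0.25},
]
c = centroid(docs)
print(c)          # averaged weights per term
print(c in docs)  # the mean vector is not one of the inputs
```

If the dumped center is built the same way, an entry such as account:0.026 would be the cluster's average weight for the term "account". That reading is an assumption and is not confirmed anywhere in this thread.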
Re: Kmeans clusterdump Interpretation
Hmm, algorithmically kmeans is supposed to anoint only existing vectors (documents) as the centroid of a cluster at every step (or so I believe). If Mahout is generating a non-document vector as a centroid, it changes a lot of things. That would also explain the -distanceMeasure option in clusterdump. As Andrew mentions, running clusterdump with the default Euclidean measure should give me the closest document vector to the calculated centroid. Please correct me if I'm wrong anywhere. Thanks

On Tue, Jul 21, 2015 at 7:33 AM, Andrew Musselman andrew.mussel...@gmail.com wrote:

It's possible you could write a post-processing step to find the closest point to the centroid based on the distance property if I'm recalling it correctly.

--
Regards,
Ankit Goel
http://about.me/ankitgoel
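For reference, textbook k-means (Lloyd's algorithm) does not pick an existing vector at each step. It assigns each point to its nearest centroid and then replaces each centroid with the mean of its members. The sketch below runs one such iteration on made-up 2-D points. It illustrates the standard algorithm, not Mahout's implementation.

```python
def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        sq = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(sq.index(min(sq)))
    return labels

def update(points, labels, centroids):
    """Replace each centroid with the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's old centroid
            continue
        new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centroids

# Hypothetical data; the seeds are two of the points, as with random seeding.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
seeds = [points[0], points[2]]
labels = assign(points, seeds)
centroids = update(points, labels, seeds)
print(labels)
print(centroids)
```

Even though the seeds were data points, after a single update the centroids are (0.0, 1.0) and (4.0, 1.0), neither of which is in the data.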
Re: Kmeans clusterdump Interpretation
You can always just pick the article closest to the centroid. But I think you may find that, with simple k-means, clusters are going to be about more than one thing.

On Mon, Jul 20, 2015 at 8:21 PM, Ankit Goel ankitgoel2...@gmail.com wrote:

Hmm, algorithmically kmeans is supposed to anoint only existing vectors (documents) as the centroid of a cluster at every step (or so I believe). If Mahout is generating a non-document vector as a centroid, it changes a lot of things. That would also explain the -distanceMeasure option in clusterdump. As Andrew mentions, running clusterdump with the default Euclidean measure should give me the closest document vector to the calculated centroid. Please correct me if I'm wrong anywhere. Thanks
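Picking the article closest to the centroid can be sketched as below when the document vectors are available directly instead of being parsed from a dump. The titles, weights, and function names are hypothetical, and Euclidean distance is assumed. A different distance measure would only change the `distance` function.

```python
import math

def centroid(vectors):
    """Component-wise mean of sparse {term: weight} vectors."""
    terms = sorted(set().union(*vectors))
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors) for t in terms}

def distance(u, v):
    """Euclidean distance between two sparse vectors."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

def representative(docs):
    """Id of the document whose vector lies nearest the cluster centroid."""
    c = centroid(list(docs.values()))
    return min(docs, key=lambda doc_id: distance(docs[doc_id], c))

# Hypothetical cluster keyed by title, mirroring idField = title.
docs = {
    "Bank raises rates": {"bank": 1.0, "rates": 1.0},
    "Rates rise again": {"rates": 1.0},
    "Bank opens branch": {"bank": 1.0, "branch": 1.0},
}
print(representative(docs))
```

This needs a single pass over the cluster once the centroid is known, which is far cheaper than a true medoid search.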
Re: Kmeans clusterdump Interpretation
True that. Kmeans is just a first step anyway. Definitely needs tuning. Thanks guys

On Tue, Jul 21, 2015 at 9:46 AM, Ted Dunning ted.dunn...@gmail.com wrote:

You can always just pick the article closest to the centroid. But I think you may find that, with simple k-means, clusters are going to be about more than one thing.

--
Regards,
Ankit Goel