Re: Clustering from DB

Grant Ingersoll Wed, 01 Jul 2009 20:33:15 -0700


On Jul 1, 2009, at 9:37 AM, nfantone wrote:

Ok, so I managed to write a VectorIterable implementation to draw data
from my database. Now, I'm in the process of understanding the output
file that kMeans (with a Canopy input) produces. Someone, please,
correct me if I'm mistaken. At first, my thought was that there were
as many "cluster-i" directories as clusters detected from the dataset
by the algorithm(s), until I printed out the content of the
"part-00000" file in them. It seems as though it stores a <Writable>
cluster ID and then a <Writable> Cluster, each line. Are those all the
actual clusters detected? If so, what's the reason behind the
directory nomenclature and its consecutive enumeration?

I was wondering the same thing myself. I believe it has to do with the number of iterations or reduce tasks, but I haven't looked closely at the code yet. Maybe Jeff can jump in here.
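
If the numbering does follow iterations (my guess above, not something I have checked in the code), then the highest-numbered directory would hold the final clusters. Here is a minimal Java sketch of picking it out of a directory listing. The class and method names are made up for illustration, and the "cluster-<i>" pattern is simply taken from the directory names you describe:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout.
public class LatestClusterDir {

    // Assumed naming scheme: "cluster-" followed by a non-negative integer.
    private static final Pattern NAME = Pattern.compile("cluster-(\\d+)");

    /** Returns the matching name with the largest numeric suffix, if any. */
    public static Optional<String> latest(List<String> names) {
        String best = null;
        int bestIndex = -1;
        for (String name : names) {
            Matcher m = NAME.matcher(name);
            if (m.matches()) {
                // Compare numerically so that cluster-10 ranks above cluster-2.
                int index = Integer.parseInt(m.group(1));
                if (index > bestIndex) {
                    bestIndex = index;
                    best = name;
                }
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList("cluster-0", "cluster-1", "cluster-2", "points");
        System.out.println(latest(listing).orElse("(no cluster directories found)"));
    }
}
```

If the directories turn out to map to reduce tasks instead, this selection would be wrong and every directory would need to be read.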

Does every
"part-00000", in different "cluster-i" directories, hold different
clusters? And, what about the "points" directory? I can tell it
follows a <VectorID, Value> record format. What's that value
supposed to represent? The ID of the cluster it belongs to, perhaps?


I believe this is the case.
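
Assuming that reading is right, i.e. each record in "points" pairs a vector ID with the ID of the cluster it was assigned to, cluster membership can be recovered by grouping on the value. Below is a rough sketch in plain Java, with the records modeled as String pairs rather than the actual Writables. The class name, method name and sample IDs are all hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout.
public class PointsByCluster {

    /**
     * Groups vector IDs by assigned cluster ID.
     * Each record is {vectorId, clusterId}, mirroring the assumed
     * <VectorID, Value> layout of the "points" output.
     */
    public static Map<String, List<String>> groupByCluster(List<String[]> records) {
        // TreeMap keeps the cluster IDs sorted for stable output.
        Map<String, List<String>> members = new TreeMap<>();
        for (String[] record : records) {
            String vectorId = record[0];
            String clusterId = record[1];
            members.computeIfAbsent(clusterId, k -> new ArrayList<>()).add(vectorId);
        }
        return members;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"v1", "C0"});
        records.add(new String[] {"v2", "C1"});
        records.add(new String[] {"v3", "C0"});
        System.out.println(groupByCluster(records));
    }
}
```

Reading the real files would mean going through Hadoop's SequenceFile reader instead of an in-memory list; that part is left out here.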


There really ought to be documentation about this somewhere. I don't
know if I need some kind of permission, but I'm offering myself to
write it and upload it to the Mahout wiki or wherever it should be,
once I have finished my project.

+1

Thanks in advance.

On Fri, Jun 26, 2009 at 1:54 PM, Sean Owen <[email protected]> wrote:

All of Mahout is generally Hadoop/HDFS based. Taste is a bit of an
exception since it has a core that is independent of Hadoop and can
use data from files, databases, etc. It also happens to have some
clustering logic. So you can use, say, TreeClusteringRecommender to
generate user clusters, based on data in a database. This isn't
Mahout's primary clustering support, but, if it fits what you need, at
least it is there.

On Fri, Jun 26, 2009 at 12:21 PM, nfantone <[email protected]> wrote:

Thanks for the fast response, Grant.

I am aware of what you pointed out about Taste. I just mentioned it to
make a reference to something similar to what I needed to
implement/use, namely the "DataModel" interface.

I'm going to try the solution you suggested and write an
implementation of VectorIterable. Expect me to come back here for
feedback.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
