Thanks a lot, Kate. A follow-up question: how do I actually use the clusters?
Do you know which API I should use to load the generated cluster file and
query it (send a document id and get back its cluster id and the other
documents in that cluster)?
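To make the question concrete: the lookup I have in mind is basically a
docId -> clusterId map plus a reverse index from clusterId to its member
docs. A minimal sketch of that pattern, in plain Python with made-up
document ids (the real assignments would have to be read from Mahout's
clustering output, which is the part I'm asking about):

```python
from collections import defaultdict

# Hypothetical assignments: in practice these would be read from the
# clustering output files rather than hard-coded.
doc_to_cluster = {"doc1": 0, "doc2": 1, "doc3": 0, "doc4": 1, "doc5": 0}

# Build the reverse index: cluster id -> list of document ids.
cluster_to_docs = defaultdict(list)
for doc_id, cluster_id in doc_to_cluster.items():
    cluster_to_docs[cluster_id].append(doc_id)

def query(doc_id):
    """Return the cluster id and the other documents in that cluster."""
    cluster_id = doc_to_cluster[doc_id]
    others = [d for d in cluster_to_docs[cluster_id] if d != doc_id]
    return cluster_id, others

print(query("doc1"))  # -> (0, ['doc3', 'doc5'])
```

The part I'm missing is where doc_to_cluster comes from, i.e. how to load
Mahout's generated files into such a map.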

Thanks a lot,

Weide

On Mon, Oct 3, 2011 at 11:10 AM, Kate Ericson <[email protected]> wrote:

> Hi Weide,
>
> Does this mean you have only 60 data points you are trying to cluster?
> This may be part of why it seems to be running so quickly.
> The k flag tells the program how many points to cluster around, so
> having k=3 means you are trying to group your data into 3 clusters.
> As for the folder names: after every iteration of clustering, kmeans
> writes out the current cluster positions.  If you hit the max number
> of iterations, or the cluster centers don't move more than a
> predetermined distance, the clustering stops.
> Since you have clusters-1 and clusters-2 folders, it ran for only 2
> iterations.
> It looks like you set the max iterations to 1000 (-x 1000), so it's
> definitely hitting the point where your cluster centers are no longer
> moving more than the minimum amount (-cd 1.0).
> You may want to try a higher k - maybe 10 - and see how many
> iterations it goes through.  Another thing to look at is how the
> initial clusters are chosen.  By default, the starting clusters are
> randomly chosen.  Working with the Canopy Clustering program may let
> you find better initial clusters.
>
> Hope this helps,
>
> -Kate
>
> On Mon, Oct 3, 2011 at 11:38 AM, Walter Chang <[email protected]>
> wrote:
> > Hi Kate,
> >
> > I have 60 rows of data with text descriptions. I just generated tf-idf
> > vectors using my analyzer, and the tf-idf vectors are passed into the
> > clustering algorithm to do the clustering. I use k=3, and it generates
> > clusters-1 and clusters-2 folders. What does each folder mean?  How
> > does the clustering process generate those?
> >
> > Weide
> >
> > On Mon, Oct 3, 2011 at 8:04 AM, Kate Ericson <[email protected]>
> > wrote:
> >
> >> Hi Weide,
> >>
> >> As a disclaimer, I only know enough to try to help you figure out your
> >> first problem.
> >> First of all, can you tell us about the dataset you are using?
> >> How many points are you clustering?
> >>
> >> As a guess without knowing either of these things, part of the reason
> >> why your clusters look the same is that you're only clustering around
> >> 3 points.  You're only running for 2 iterations, so it looks like it's
> >> just not moving your cluster centers around at all.  Can you try again
> >> with a larger k?
> >> This may let it run for more iterations, so you should be able to see
> >> more changes in the results.
> >>
> >> Good luck!
> >>
> >> -Kate
> >>
> >> On Sun, Oct 2, 2011 at 9:52 PM, Walter Chang <[email protected]>
> >> wrote:
> >> > Hi,
> >> >
> >> > I have used Mahout to produce a k-means clustering of my tf-idf
> >> > result. I use the Mahout command line to produce the clusters, and
> >> > it seems to complete successfully:
> >> >
> >> > $MAHOUT_HOME/bin/mahout kmeans -i ./tfidf-vectors -c ./initialclusters \
> >> >     -o ./kmeans-clusters -cd 1.0 -k 3 -x 1000
> >> >
> >> > Two cluster directories are generated (clusters-1 and clusters-2).
> >> > When I use clusterdump on each of them, the clustered top terms look
> >> > the same. Any idea why?
> >> >
> >> > Also, how can I see which documents have been assigned to each
> >> > cluster?  Right now, I can see the number of documents assigned but
> >> > not the complete list.
> >> >
> >> > Most importantly, for production purposes, I assume it makes sense
> >> > for kmeans to always run on Hadoop to generate the clustering files.
> >> > But how do I consume these during serving?  Ideally, serving would
> >> > take a doc id or query as input, and the server would return the top
> >> > documents within the same cluster, ranked by score. How do I do that
> >> > in code?  Any good examples?
> >> >
> >> > Thanks a lot,
> >> >
> >> > Weide
> >> >
> >>
> >
>
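For my own notes: the stopping rule Kate describes (halt at the max
iteration count from -x, or when no cluster center moves more than the
-cd convergence distance) can be sketched as a toy loop. This is plain
Python on made-up 1-D points, an illustration of the rule rather than
Mahout's implementation:

```python
import random

def kmeans_1d(points, k, max_iter=1000, cd=1.0):
    """Toy 1-D k-means illustrating the two stopping conditions."""
    random.seed(0)
    # Random initial centers, like Mahout's default when -k is given.
    centers = random.sample(points, k)
    for iteration in range(1, max_iter + 1):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute centers (keep the old center if a cluster is empty).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # Converged if no center moved more than the -cd threshold.
        if all(abs(n - o) <= cd for n, o in zip(new_centers, centers)):
            return new_centers, iteration
        centers = new_centers
    return centers, max_iter

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]
centers, iters = kmeans_1d(points, k=3)
print(iters)  # centers stop moving within a few iterations here
```

On well-separated data like this, the centers stop moving after only a
few iterations, which would match seeing just clusters-1 and clusters-2
despite -x 1000.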
