It makes hard cluster assignments, but that can be helpful in two ways:

a) it will help you diagnose data issues
b) it can produce good starting points for fuzzy k-means

On Thu, Mar 28, 2013 at 7:19 AM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:

> Sebastian, if you're interested I'd be glad to walk you through the main
> ideas, point you to the code and tell you how to run it.
> Testing it on more data would be very helpful to the project.
>
> But, it makes hard cluster assignments.
>
> On Mar 28, 2013, at 2:23, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > The streaming k-means stuff is what Dan has been working on.
> >
> > On Wed, Mar 27, 2013 at 6:14 PM, Sebastian Briesemeister <
> > sebastian.briesemeis...@unister.de> wrote:
> >
> >> I did change it as you suggested. Now I normalize after the frequency
> >> weighting.
> >>
> >> The results from non-fuzzy clustering are similar, but I require
> >> probabilities.
> >>
> >> Streaming k-means stuff? I don't get you here.
> >>
> >> Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>
> >>> I think you should change your vector preparation method.
> >>>
> >>> What kind of results do you get from non-fuzzy clustering?
> >>>
> >>> What about from the streaming k-means stuff?
> >>>
> >>> On Wed, Mar 27, 2013 at 5:02 PM, Sebastian Briesemeister <
> >>> sebastian.briesemeis...@unister-gmbh.de> wrote:
> >>>
> >>>> Thanks for your input.
> >>>>
> >>>> The problem wasn't the high-dimensional space itself but the cluster
> >>>> initialization. I validated the document cosine distances and they
> >>>> look fairly well distributed.
> >>>>
> >>>> I now use canopy in a pre-clustering step. Interestingly, canopy
> >>>> suggests using a large number of clusters, which might make sense
> >>>> since a lot of documents are unrelated due to their sparse word
> >>>> vectors. If I reduce the number of clusters, a lot of documents
> >>>> remain unclustered in the center of the cluster space.
> >>>> Further, I would like to note that the random cluster initialization
> >>>> tends to choose initial centers that are close to each other. For
> >>>> some reason this leads to overlapping or even identical clusters.
> >>>>
> >>>> The problem of parameter tuning (T1 and T2) for canopy remains.
> >>>> However, I assume there is no general strategy for this problem.
> >>>>
> >>>> Cheers
> >>>> Sebastian
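On the T1/T2 question above: the canopy construction is simple enough that the role of the two thresholds is easiest to see in code. Here is a minimal plain-Java sketch of the standard canopy loop, not Mahout's actual CanopyClusterer (the class name and the cosineDistance helper are illustrative only): points within the loose threshold T1 join a canopy, and points within the tight threshold T2 are consumed, so lowering T2 yields more canopies.

    import java.util.ArrayList;
    import java.util.List;

    public class CanopySketch {

        // Cosine distance between two dense vectors; in the real setting these
        // would be sparse document vectors, this helper is just for illustration.
        static double cosineDistance(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na += a[i] * a[i];
                nb += b[i] * b[i];
            }
            return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        // Core canopy loop (requires t1 > t2). Points within T1 of a seed join
        // its canopy loosely; points within T2 are consumed and can no longer
        // seed new canopies. Lowering T2 therefore produces more canopies.
        static List<List<double[]>> canopies(List<double[]> points, double t1, double t2) {
            List<List<double[]>> result = new ArrayList<>();
            List<double[]> pool = new ArrayList<>(points);
            while (!pool.isEmpty()) {
                double[] center = pool.remove(0);      // arbitrary seed point
                List<double[]> canopy = new ArrayList<>();
                for (double[] p : points) {
                    if (cosineDistance(center, p) < t1) {
                        canopy.add(p);                 // loose membership within T1
                    }
                }
                pool.removeIf(p -> cosineDistance(center, p) < t2);
                result.add(canopy);
            }
            return result;
        }
    }

In this form, tuning largely means sweeping T2 until the number of canopies is close to the cluster count you expect, with T1 kept somewhat larger; as Sebastian notes, there is no general closed-form strategy.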
> >>>> On 27.03.2013 06:43, Dan Filimon wrote:
> >>>>> Ah, so Ted, it looks like there's a bug with the mapreduce after all
> >>>>> then.
> >>>>>
> >>>>> Pity, I liked the higher-dimensionality argument, but thinking it
> >>>>> through, it doesn't make that much sense.
> >>>>>
> >>>>> On Mar 27, 2013, at 6:52, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>>>
> >>>>>> Reducing to a lower-dimensional space is a convenience, no more.
> >>>>>>
> >>>>>> Clustering in the original space is fine. I still have trouble with
> >>>>>> your normalizing before weighting, but I don't know what effect that
> >>>>>> will have on anything. It certainly will interfere with the
> >>>>>> interpretation of the cosine metrics.
> >>>>>>
> >>>>>> On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
> >>>>>> sebastian.briesemeis...@unister.de> wrote:
> >>>>>>
> >>>>>>> I am not quite sure whether this will solve the problem, though I
> >>>>>>> will try it of course.
> >>>>>>>
> >>>>>>> I always thought that clustering documents based on their words is
> >>>>>>> a common problem and is usually tackled in the word space, not in a
> >>>>>>> reduced one.
> >>>>>>> Besides, the distances look reasonable. Still, I end up with very
> >>>>>>> similar and very distant documents unclustered in the middle of all
> >>>>>>> clusters.
> >>>>>>>
> >>>>>>> So I think the problem lies in the clustering method, not in the
> >>>>>>> distances.
> >>>>>>>
> >>>>>>> Dan Filimon <dangeorge.fili...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> So you're clustering 90K-dimensional data?
> >>>>>>>>
> >>>>>>>> I'm faced with a very similar problem to yours (working on
> >>>>>>>> StreamingKMeans for Mahout) and from what I read [1], the problem
> >>>>>>>> might be that in very high-dimensional spaces the distances become
> >>>>>>>> meaningless.
> >>>>>>>>
> >>>>>>>> I'm pretty sure this is the case and I was considering
> >>>>>>>> implementing the test mentioned in the paper (also, I feel it's a
> >>>>>>>> very useful algorithm to have).
> >>>>>>>>
> >>>>>>>> In any case, since the vectors are so sparse, why not reduce their
> >>>>>>>> dimension?
> >>>>>>>>
> >>>>>>>> You can try principal component analysis (just getting the first k
> >>>>>>>> eigenvectors in the singular value decomposition of the matrix
> >>>>>>>> that has your vectors as rows). The class that does this is
> >>>>>>>> SSVDSolver (there's also SingularValueDecomposition, but that
> >>>>>>>> tries making dense matrices and those might not fit into memory;
> >>>>>>>> I've never personally used it though).
> >>>>>>>> Once you have the first k eigenvectors of size n, make them rows
> >>>>>>>> in a matrix (U) and multiply each vector x you have with it (U x),
> >>>>>>>> getting a reduced vector.
> >>>>>>>>
> >>>>>>>> Or, use random projections to reduce the size of the data set. You
> >>>>>>>> want to create a matrix whose entries are sampled from a uniform
> >>>>>>>> distribution (0, 1) (Functions.random in
> >>>>>>>> o.a.m.math.function.Functions), normalize its rows and multiply
> >>>>>>>> each vector x with it.
> >>>>>>>>
> >>>>>>>> So, reduce the size of your vectors, thereby making the
> >>>>>>>> dimensionality less of a problem, and you'll get a decent
> >>>>>>>> approximation (you can actually quantify how good it is with SVD).
> >>>>>>>> From what I've seen, the clusters separate at smaller dimensions,
> >>>>>>>> but there's the question of how good an approximation of the
> >>>>>>>> uncompressed data you have.
> >>>>>>>>
> >>>>>>>> See if this helps, I need to do the same thing :)
> >>>>>>>>
> >>>>>>>> What do you think?
> >>>>>>>>
> >>>>>>>> [1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
> >>>>>>>>
> >>>>>>>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
> >>>>>>>> <sebastian.briesemeis...@unister-gmbh.de> wrote:
> >>>>>>>>> The dataset consists of about 4000 documents and is encoded by
> >>>>>>>>> 90,000 words. However, each document usually contains only about
> >>>>>>>>> 10 to 20 words. Only some contain more than 1000 words.
> >>>>>>>>>
> >>>>>>>>> For each document, I set a field in the corresponding vector to 1
> >>>>>>>>> if it contains a word. Then I normalize each vector using the
> >>>>>>>>> L2-norm. Finally, I multiply each element (representing a word)
> >>>>>>>>> in the vector by log(#documents/#documents_with_word).
> >>>>>>>>>
> >>>>>>>>> For clustering, I am using cosine similarity.
> >>>>>>>>>
> >>>>>>>>> Regards
> >>>>>>>>> Sebastian
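The preparation order in this last message (binarize, normalize, then weight) is exactly what Ted objects to further up, and the fix adopted in the thread is to weight first and normalize last. A minimal plain-Java sketch of that corrected pipeline, with hash maps standing in for Mahout's sparse vectors (the class and method names here are illustrative, not Mahout API):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TfIdfSketch {

        // docs: each document as a list of its distinct words.
        static Map<String, Double> idf(List<List<String>> docs) {
            Map<String, Integer> df = new HashMap<>();
            for (List<String> doc : docs)
                for (String w : doc)
                    df.merge(w, 1, Integer::sum);          // document frequency
            Map<String, Double> idf = new HashMap<>();
            for (Map.Entry<String, Integer> e : df.entrySet())
                idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
            return idf;                                    // log(#docs / #docs_with_word)
        }

        // Weight FIRST (binary presence * idf), normalize LAST.
        static Map<String, Double> vectorize(List<String> doc, Map<String, Double> idf) {
            Map<String, Double> v = new HashMap<>();
            for (String w : doc)
                v.put(w, idf.get(w));                      // presence (1) times idf weight
            double norm = 0;
            for (double x : v.values())
                norm += x * x;
            norm = Math.sqrt(norm);
            if (norm > 0)
                for (Map.Entry<String, Double> e : v.entrySet())
                    e.setValue(e.getValue() / norm);       // L2-normalize after weighting
            return v;
        }

        // On unit-length vectors, cosine similarity is just the sparse dot
        // product, so it keeps its usual geometric interpretation.
        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0;
            for (Map.Entry<String, Double> e : a.entrySet())
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            return dot;
        }
    }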
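And for Dan's random-projection alternative, a sketch of the shape of the computation in plain Java (again illustrative only, not Mahout's API; Dan suggests uniform(0, 1) entries via Functions.random, though Gaussian entries are the more common Johnson-Lindenstrauss choice):

    import java.util.Random;

    public class RandomProjectionSketch {

        // Build a k x n projection matrix with row-normalized random entries.
        static double[][] projectionMatrix(int k, int n, long seed) {
            Random rnd = new Random(seed);
            double[][] u = new double[k][n];
            for (int i = 0; i < k; i++) {
                double norm = 0;
                for (int j = 0; j < n; j++) {
                    u[i][j] = rnd.nextDouble();            // uniform(0, 1), as suggested
                    norm += u[i][j] * u[i][j];
                }
                norm = Math.sqrt(norm);
                for (int j = 0; j < n; j++)
                    u[i][j] /= norm;                       // unit-length rows
            }
            return u;
        }

        // Reduce one n-dimensional document vector to k dimensions: y = U x.
        // (For 90K-dimensional sparse vectors you would only visit the handful
        // of non-zero entries of x rather than loop over all of it.)
        static double[] project(double[][] u, double[] x) {
            double[] y = new double[u.length];
            for (int i = 0; i < u.length; i++)
                for (int j = 0; j < x.length; j++)
                    y[i] += u[i][j] * x[j];
            return y;
        }
    }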
> >>>>>>>>> On 26.03.2013 17:33, Dan Filimon wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> Could you tell us more about the kind of data you're clustering?
> >>>>>>>>>> What distance measure are you using, and what is the
> >>>>>>>>>> dimensionality of the data?
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
> >>>>>>>>>> <sebastian.briesemeis...@unister-gmbh.de> wrote:
> >>>>>>>>>>> Dear Mahout-users,
> >>>>>>>>>>>
> >>>>>>>>>>> I am facing two problems when clustering instances with fuzzy
> >>>>>>>>>>> c-means clustering (cosine distance, random initial
> >>>>>>>>>>> clustering):
> >>>>>>>>>>>
> >>>>>>>>>>> 1.) I always end up with one large set of rubbish instances.
> >>>>>>>>>>> All of them have a uniform cluster probability distribution
> >>>>>>>>>>> and are, hence, in the exact middle of the cluster space.
> >>>>>>>>>>> The cosine distance between instances within this cluster
> >>>>>>>>>>> ranges from 0 to 1.
> >>>>>>>>>>>
> >>>>>>>>>>> 2.) Some of my clusters have the same or a very, very similar
> >>>>>>>>>>> center.
> >>>>>>>>>>>
> >>>>>>>>>>> Besides the above-described problems, the clustering seems to
> >>>>>>>>>>> work fine. Does somebody have an idea how my clustering can be
> >>>>>>>>>>> improved?
> >>>>>>>>>>>
> >>>>>>>>>>> Regards
> >>>>>>>>>>> Sebastian
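A closing note on problem 1.): the uniform memberships follow directly from the standard fuzzy c-means membership formula, u_i = 1 / sum_c (d_i/d_c)^(2/(m-1)). A self-contained sketch (textbook formula, not Mahout's implementation) showing that a document roughly equidistant from all centers necessarily receives memberships near 1/k:

    import java.util.Arrays;

    public class FuzzyMembershipSketch {

        // d[i] = distance from one document to center i; m > 1 is the
        // fuzziness exponent (commonly 2). Memberships sum to 1.
        static double[] memberships(double[] d, double m) {
            int k = d.length;
            double[] u = new double[k];
            double exp = 2.0 / (m - 1.0);
            for (int i = 0; i < k; i++) {
                double sum = 0;
                for (int c = 0; c < k; c++)
                    sum += Math.pow(d[i] / d[c], exp);     // each ratio ~1 when distances are equal
                u[i] = 1.0 / sum;                          // so each membership ~1/k
            }
            return u;
        }

        public static void main(String[] args) {
            // A document nearly equidistant from three centers:
            System.out.println(Arrays.toString(
                    memberships(new double[]{0.49, 0.50, 0.51}, 2.0)));
            // prints roughly [0.35, 0.33, 0.32] -- close to uniform
        }
    }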