I think you should change your vector preparation method.

What kind of results do you get from non-fuzzy clustering?

What about from the streaming k-means stuff?

On Wed, Mar 27, 2013 at 5:02 PM, Sebastian Briesemeister <
sebastian.briesemeis...@unister-gmbh.de> wrote:

> Thanks for your input.
>
> The problem wasn't the high-dimensional space itself but the cluster
> initialization. I validated the document cosine distances and they look
> fairly well distributed.
>
> I now use canopy as a pre-clustering step. Interestingly, canopy
> suggests using a large number of clusters, which might make sense
> since a lot of the documents are unrelated due to their sparse word
> vectors. If I reduce the number of clusters, a lot of documents remain
> unclustered in the center of the cluster space.
> Further, I would like to note that the random cluster initialization
> tends to choose initial centers that are close to each other. For some
> reason this leads to overlapping or even identical clusters.
>
> The problem of parameter tuning (T1 and T2) for canopy remains. However,
> I assume there is no general strategy for this problem.
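>
> For reference, a rough illustration of how T1 and T2 interact (just a
> sketch of the canopy idea, not Mahout's actual CanopyDriver code;
> CosineDistanceMeasure is only one possible measure here):
>
>   import java.util.ArrayList;
>   import java.util.Iterator;
>   import java.util.LinkedList;
>   import java.util.List;
>   import org.apache.mahout.common.distance.CosineDistanceMeasure;
>   import org.apache.mahout.common.distance.DistanceMeasure;
>   import org.apache.mahout.math.Vector;
>
>   public class CanopySketch {
>     // T1 > T2: T1 is the loose membership radius, T2 the tight radius that
>     // removes points from the candidate list. A larger T2 removes more
>     // candidates per pass and therefore produces fewer canopies.
>     public static List<List<Vector>> canopies(List<Vector> points, double t1, double t2) {
>       DistanceMeasure dm = new CosineDistanceMeasure();
>       List<Vector> candidates = new LinkedList<Vector>(points);
>       List<List<Vector>> result = new ArrayList<List<Vector>>();
>       while (!candidates.isEmpty()) {
>         Vector center = candidates.remove(0);   // next remaining point seeds a canopy
>         List<Vector> canopy = new ArrayList<Vector>();
>         canopy.add(center);
>         for (Iterator<Vector> it = candidates.iterator(); it.hasNext(); ) {
>           Vector p = it.next();
>           double d = dm.distance(center, p);
>           if (d < t1) { canopy.add(p); }        // loosely bound: joins this canopy
>           if (d < t2) { it.remove(); }          // tightly bound: cannot seed a new canopy
>         }
>         result.add(canopy);
>       }
>       return result;
>     }
>   }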
>
> Cheers
> Sebastian
>
> On 27.03.2013 06:43, Dan Filimon wrote:
> > Ah, so Ted, it looks like there's a bug with the mapreduce after all then.
> >
> > Pity, I liked the higher dimensionality argument, but thinking it through,
> > it doesn't make that much sense.
> >
> > On Mar 27, 2013, at 6:52, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> >> Reducing to a lower dimensional space is a convenience, no more.
> >>
> >> Clustering in the original space is fine.  I still have trouble with your
> >> normalizing before weighting, but I don't know what effect that will have
> >> on anything.  It certainly will interfere with the interpretation of the
> >> cosine metrics.
> >>
> >> On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
> >> sebastian.briesemeis...@unister.de> wrote:
> >>
> >>> I am not quite sure whether this will solve the problem, though I will
> >>> try it of course.
> >>>
> >>> I always thought that clustering documents based on their words is a
> >>> common problem and is usually tackled in the word space rather than in a
> >>> reduced one.
> >>> Besides, the distances look reasonable. Still, I end up with very similar
> >>> and very distant documents unclustered in the middle of all clusters.
> >>>
> >>> So I think the problem lies in the clustering method, not in the
> >>> distances.
> >>>
> >>>
> >>>
> >>> Dan Filimon <dangeorge.fili...@gmail.com> wrote:
> >>>
> >>>> So you're clustering 90K dimensional data?
> >>>>
> >>>> I'm faced with a very similar problem to yours (working on
> >>>> StreamingKMeans for Mahout), and from what I read [1], the problem
> >>>> might be that in very high-dimensional spaces the distances become
> >>>> meaningless.
> >>>>
> >>>> I'm pretty sure this is the case and I was considering implementing
> >>>> the test mentioned in the paper (also I feel like it's a very useful
> >>>> algorithm to have).
> >>>>
> >>>> In any case, since the vectors are so sparse, why not reduce their
> >>>> dimension?
> >>>>
> >>>> You can try principal component analysis (taking the first k right
> >>>> singular vectors from the singular value decomposition of the matrix
> >>>> that has your vectors as rows). The class that does this is SSVDSolver
> >>>> (there's also SingularValueDecomposition, but that builds dense
> >>>> matrices and those might not fit into memory). I've never personally
> >>>> used it, though.
> >>>> Once you have the first k singular vectors of size n, make them the
> >>>> rows of a matrix (U) and multiply each of your vectors x by it (U x) to
> >>>> get a reduced vector.
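> >>>>
> >>>> That projection step could look roughly like this (just a sketch,
> >>>> assuming the top-k singular vectors already sit in an in-memory Matrix;
> >>>> for 90,000-dimensional data you would get them from SSVDSolver rather
> >>>> than a dense decomposition):
> >>>>
> >>>>   import org.apache.mahout.math.DenseVector;
> >>>>   import org.apache.mahout.math.Matrix;
> >>>>   import org.apache.mahout.math.Vector;
> >>>>
> >>>>   public class ProjectionSketch {
> >>>>     // basis: k x n matrix whose rows are the top-k singular vectors;
> >>>>     // x: an n-dimensional document vector. Returns the k-dimensional U x.
> >>>>     public static Vector project(Matrix basis, Vector x) {
> >>>>       Vector reduced = new DenseVector(basis.numRows());
> >>>>       for (int i = 0; i < basis.numRows(); i++) {
> >>>>         reduced.set(i, basis.viewRow(i).dot(x));  // i-th coordinate = <u_i, x>
> >>>>       }
> >>>>       return reduced;
> >>>>     }
> >>>>   }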
> >>>>
> >>>> Or, use random projections to reduce the size of the data set. You
> >>>> want to create a matrix whose entries are sampled from a uniform
> >>>> (0, 1) distribution (Functions.random in
> >>>> o.a.m.math.function.Functions), normalize its rows, and multiply each
> >>>> vector x by it.
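> >>>>
> >>>> Roughly like this (again only a sketch; the k x n projection matrix is
> >>>> assumed to fit in memory, and Functions.random() supplies the uniform
> >>>> (0, 1) entries mentioned above):
> >>>>
> >>>>   import org.apache.mahout.math.DenseMatrix;
> >>>>   import org.apache.mahout.math.DenseVector;
> >>>>   import org.apache.mahout.math.Matrix;
> >>>>   import org.apache.mahout.math.Vector;
> >>>>   import org.apache.mahout.math.function.Functions;
> >>>>
> >>>>   public class RandomProjectionSketch {
> >>>>     // Build a k x n matrix with uniform (0, 1) entries and L2-normalized rows.
> >>>>     public static Matrix randomBasis(int k, int n) {
> >>>>       Matrix r = new DenseMatrix(k, n);
> >>>>       r.assign(Functions.random());                     // uniform (0, 1) entries
> >>>>       for (int i = 0; i < k; i++) {
> >>>>         Vector row = r.viewRow(i);
> >>>>         row.assign(Functions.mult(1.0 / row.norm(2)));  // L2-normalize each row
> >>>>       }
> >>>>       return r;
> >>>>     }
> >>>>
> >>>>     // Map an n-dimensional document vector x to the k-dimensional R x.
> >>>>     public static Vector project(Matrix r, Vector x) {
> >>>>       Vector reduced = new DenseVector(r.numRows());
> >>>>       for (int i = 0; i < r.numRows(); i++) {
> >>>>         reduced.set(i, r.viewRow(i).dot(x));
> >>>>       }
> >>>>       return reduced;
> >>>>     }
> >>>>   }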
> >>>>
> >>>> So, reduce the size of your vectors, thereby making the dimensionality
> >>>> less of a problem, and you'll get a decent approximation (with SVD you
> >>>> can actually quantify how good it is). From what I've seen, the
> >>>> clusters separate at smaller dimensions, but there's the question of
> >>>> how good an approximation of the uncompressed data you have.
> >>>>
> >>>> See if this helps, I need to do the same thing :)
> >>>>
> >>>> What do you think?
> >>>>
> >>>> [1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
> >>>>
> >>>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
> >>>> <sebastian.briesemeis...@unister-gmbh.de> wrote:
> >>>>> The dataset consists of about 4000 documents and is encoded with 90,000
> >>>>> words. However, each document usually contains only about 10 to 20
> >>>>> words. Only some contain more than 1000 words.
> >>>>>
> >>>>> For each document, I set a field in the corresponding vector to 1 if it
> >>>>> contains the word. Then I normalize each vector using the L2-norm.
> >>>>> Finally, I multiply each element (representing a word) of the vector by
> >>>>> log(#documents / #documents_with_word).
> >>>>>
> >>>>> For clustering, I am using cosine similarity.
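> >>>>>
> >>>>> In code, that preparation is roughly the following (only a sketch:
> >>>>> wordIndex, docFreq, numDocs and dimension are hypothetical lookups and
> >>>>> corpus constants, not anything from Mahout):
> >>>>>
> >>>>>   import java.util.Map;
> >>>>>   import java.util.Set;
> >>>>>   import org.apache.mahout.math.RandomAccessSparseVector;
> >>>>>   import org.apache.mahout.math.Vector;
> >>>>>
> >>>>>   public class DocVectorSketch {
> >>>>>     // words: the distinct words of one document; wordIndex: word -> dimension;
> >>>>>     // docFreq: word -> number of documents containing it; numDocs: corpus size.
> >>>>>     public static Vector encode(Set<String> words, Map<String, Integer> wordIndex,
> >>>>>                                 Map<String, Integer> docFreq, int numDocs,
> >>>>>                                 int dimension) {
> >>>>>       Vector v = new RandomAccessSparseVector(dimension);
> >>>>>       for (String w : words) {
> >>>>>         v.set(wordIndex.get(w), 1.0);      // binary occurrence
> >>>>>       }
> >>>>>       v = v.normalize(2);                  // L2-normalize
> >>>>>       for (String w : words) {
> >>>>>         int i = wordIndex.get(w);
> >>>>>         double idf = Math.log((double) numDocs / docFreq.get(w));
> >>>>>         v.set(i, v.get(i) * idf);          // IDF weight, applied after normalizing
> >>>>>       }
> >>>>>       return v;
> >>>>>     }
> >>>>>   }
> >>>>>
> >>>>> (The more common TF-IDF ordering applies the IDF weights first and
> >>>>> normalizes last, so that the final vectors have unit length.)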
> >>>>>
> >>>>> Regards
> >>>>> Sebastian
> >>>>>
> >>>>> On 26.03.2013 17:33, Dan Filimon wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Could you tell us more about the kind of data you're clustering?
> >>>>>> What distance measure are you using, and what is the dimensionality
> >>>>>> of the data?
> >>>>>>
> >>>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
> >>>>>> <sebastian.briesemeis...@unister-gmbh.de> wrote:
> >>>>>>> Dear Mahout-users,
> >>>>>>>
> >>>>>>> I am facing two problems when clustering instances with Fuzzy c-Means
> >>>>>>> clustering (cosine distance, random initial clustering):
> >>>>>>>
> >>>>>>> 1.) I always end up with one large set of rubbish instances. All of them
> >>>>>>> have a uniform cluster probability distribution and are, hence, in the
> >>>>>>> exact middle of the cluster space.
> >>>>>>> The cosine distance between instances within this cluster ranges from 0
> >>>>>>> to 1.
> >>>>>>>
> >>>>>>> 2.) Some of my clusters have the same or a very similar center.
> >>>>>>>
> >>>>>>> Apart from the problems described above, the clustering seems to work
> >>>>>>> fine. Does somebody have an idea how my clustering can be improved?
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> Sebastian
> >>> --
> >>> This message was sent from my Android mobile phone with K-9 Mail.
>
>
