Reducing to a lower-dimensional space is a convenience, no more.

Clustering in the original space is fine. I still have trouble with your
normalizing before weighting (the usual order is to apply the IDF weights
first and then normalize), and I don't know what effect the reversal will
have on anything. It certainly will interfere with the interpretation of
the cosine metrics.

On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
sebastian.briesemeis...@unister.de> wrote:

> I am not quite sure whether this will solve the problem, though I will try
> it of course.
>
> I always thought that clustering documents based on their words is a
> common problem and is usually tackled in the word space, not in a
> reduced one.
> Besides, the distances look reasonable. Still, I end up with both very
> similar and very distant documents left unclustered in the middle of all
> clusters.
>
> So I think the problem lies in the clustering method, not in the distances.
>
>
>
> Dan Filimon <dangeorge.fili...@gmail.com> wrote:
>
> >So you're clustering 90K-dimensional data?
> >
> >I'm facing a very similar problem to yours (working on
> >StreamingKMeans for Mahout), and from what I've read [1], the problem
> >might be that in very high-dimensional spaces distances become
> >meaningless.
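> >
> >If you want to see the effect, here's a toy, untested sketch (plain
> >Java) that shows how the gap between the nearest and farthest of a
> >few random points shrinks as the dimension grows:
> >
> >    import java.util.Random;
> >
> >    // Toy demo of distance concentration: the relative contrast
> >    // (max - min) / min over the distances from one point to the
> >    // rest shrinks as d grows, so "nearest" loses its meaning.
> >    class ContrastDemo {
> >      public static void main(String[] args) {
> >        Random rnd = new Random(42);
> >        for (int d : new int[] {2, 100, 10000, 90000}) {
> >          double[][] pts = new double[50][d];
> >          for (double[] p : pts) {
> >            for (int j = 0; j < d; j++) {
> >              p[j] = rnd.nextDouble();
> >            }
> >          }
> >          double min = Double.MAX_VALUE;
> >          double max = 0.0;
> >          for (int a = 1; a < pts.length; a++) {
> >            double dist = 0.0;
> >            for (int j = 0; j < d; j++) {
> >              double diff = pts[0][j] - pts[a][j];
> >              dist += diff * diff;
> >            }
> >            dist = Math.sqrt(dist);
> >            min = Math.min(min, dist);
> >            max = Math.max(max, dist);
> >          }
> >          System.out.printf("d=%d: (max-min)/min = %.3f%n",
> >              d, (max - min) / min);
> >        }
> >      }
> >    }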
> >
> >I'm pretty sure this is the case, and I was considering implementing
> >the test mentioned in the paper (it also seems like a very useful
> >algorithm to have).
> >
> >In any case, since the vectors are so sparse, why not reduce their
> >dimensionality?
> >
> >You can try principal component analysis (taking the first k right
> >singular vectors from the singular value decomposition of the matrix
> >that has your vectors as rows). The class that does this is SSVDSolver
> >(there's also SingularValueDecomposition, but that builds dense
> >matrices, which might not fit in memory). I've never personally used
> >it, though.
> >Once you have the first k singular vectors of size n, make them the
> >rows of a matrix U and multiply each vector x you have by it (U x) to
> >get a reduced vector.
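> >
> >Something like this for the projection step (an untested sketch with
> >plain arrays rather than the Mahout API; the names are illustrative):
> >
> >    // Project a document vector x onto the top-k singular directions.
> >    // basis is a k x n matrix whose rows are the first k right
> >    // singular vectors of the document-term matrix.
> >    static double[] project(double[][] basis, double[] x) {
> >      double[] reduced = new double[basis.length];
> >      for (int i = 0; i < basis.length; i++) {
> >        double dot = 0.0;
> >        for (int j = 0; j < x.length; j++) {
> >          dot += basis[i][j] * x[j];  // row i of U dotted with x
> >        }
> >        reduced[i] = dot;             // i-th coordinate of U x
> >      }
> >      return reduced;
> >    }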
> >
> >Or, use random projections to reduce the size of the data set. You
> >want to create a matrix whose entries are sampled from a uniform
> >distribution on (0, 1) (Functions.random in
> >o.a.m.math.function.Functions), normalize its rows, and multiply each
> >vector x by it.
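> >
> >A sketch of building that matrix (untested; the literature more often
> >uses Gaussian or +/-1 entries, but this follows the uniform recipe
> >above):
> >
> >    import java.util.Random;
> >
> >    // Build a k x n random projection matrix with uniform entries
> >    // and unit-L2-norm rows; reduce each x via project(r, x) above.
> >    static double[][] randomProjection(int k, int n, long seed) {
> >      Random rnd = new Random(seed);
> >      double[][] r = new double[k][n];
> >      for (int i = 0; i < k; i++) {
> >        double norm = 0.0;
> >        for (int j = 0; j < n; j++) {
> >          r[i][j] = rnd.nextDouble();  // uniform in [0, 1)
> >          norm += r[i][j] * r[i][j];
> >        }
> >        norm = Math.sqrt(norm);
> >        for (int j = 0; j < n; j++) {
> >          r[i][j] /= norm;             // normalize the row
> >        }
> >      }
> >      return r;
> >    }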
> >
> >So, reduce the size of your vectors, thereby making the dimensionality
> >less of a problem, and you'll get a decent approximation (with SVD you
> >can actually quantify how good it is). From what I've seen, the
> >clusters separate at smaller dimensions, but there's the question of
> >how good an approximation of the uncompressed data you have.
> >
> >See if this helps; I need to do the same thing :)
> >
> >What do you think?
> >
> >[1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
> >
> >On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
> ><sebastian.briesemeis...@unister-gmbh.de> wrote:
> >> The dataset consists of about 4,000 documents and is encoded over a
> >> vocabulary of 90,000 words. However, each document usually contains
> >> only about 10 to 20 words; only some contain more than 1,000 words.
> >>
> >> For each document, I set a field in the corresponding vector to 1 if
> >> it contains a word. Then I normalize each vector using the L2 norm.
> >> Finally, I multiply each element (representing a word) in the vector
> >> by log(#documents/#documents_with_word).
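> >>
> >> In code, the weighting looks roughly like this (an untested sketch
> >> with plain arrays rather than Mahout's vector API):
> >>
> >>     // Binary term presence, then L2 normalization, then IDF
> >>     // weighting, in exactly this order.
> >>     static double[] weight(boolean[] present, int numDocs,
> >>                            int[] docFreq) {
> >>       double[] v = new double[present.length];
> >>       double norm = 0.0;
> >>       for (int j = 0; j < v.length; j++) {
> >>         v[j] = present[j] ? 1.0 : 0.0;  // 1 if the doc has word j
> >>         norm += v[j] * v[j];
> >>       }
> >>       norm = Math.sqrt(norm);  // L2 norm of the 0/1 vector
> >>       for (int j = 0; j < v.length; j++) {
> >>         if (v[j] != 0.0) {
> >>           v[j] = (v[j] / norm)
> >>               * Math.log((double) numDocs / docFreq[j]);
> >>         }
> >>       }
> >>       return v;
> >>     }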
> >>
> >> For clustering, I am using cosine similarity.
> >>
> >> Regards
> >> Sebastian
> >>
> >> On 26.03.2013 17:33, Dan Filimon wrote:
> >>> Hi,
> >>>
> >>> Could you tell us more about the kind of data you're clustering?
> >>> What distance measure are you using, and what is the dimensionality
> >>> of the data?
> >>>
> >>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
> >>> <sebastian.briesemeis...@unister-gmbh.de> wrote:
> >>>> Dear Mahout-users,
> >>>>
> >>>> I am facing two problems when clustering instances with Fuzzy
> >>>> c-Means clustering (cosine distance, random initial clustering):
> >>>>
> >>>> 1.) I always end up with one large set of rubbish instances. All of
> >>>> them have a uniform cluster probability distribution and are hence
> >>>> in the exact middle of the cluster space.
> >>>> The cosine distance between instances within this cluster ranges
> >>>> from 0 to 1.
> >>>>
> >>>> 2.) Some of my clusters have the same or a very similar center.
> >>>>
> >>>> Besides the problems described above, the clustering seems to work
> >>>> fine.
> >>>>
> >>>> Does somebody have an idea how my clustering could be improved?
> >>>>
> >>>> Regards
> >>>> Sebastian
> >>
>
> --
> This message was sent from my Android mobile phone with K-9 Mail.
