Thanks for your input.

The problem wasn't the high-dimensional space itself but the cluster
initialization. I validated the document cosine distances and they look
fairly well distributed.

I now use canopy in a pre-clustering step. Interestingly, canopy
suggests using a large number of clusters, which might make sense
since a lot of documents are unrelated due to their sparse word
vectors. If I reduce the number of clusters, a lot of documents remain
unclustered in the center of the cluster space.
Further, I would like to note that random cluster initialization
tends to choose initial centers that are close to each other. For some
reason this leads to overlapping or even identical clusters.

The problem of tuning canopy's parameters (T1 and T2) remains. However,
I assume there is no general strategy for choosing them.
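
For what it's worth, below is roughly how I picture the canopy pass
(a simplified, in-memory sketch, untested; Vector and
CosineDistanceMeasure are Mahout classes, the rest is made up for
illustration). In this simplified version T1 does not even appear,
since it only controls how loosely points get attached to a canopy;
the number of centers is driven by T2:

import java.util.ArrayList;
import java.util.List;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

public class CanopySketch {
  // Greedy canopy pass: a document seeds a new canopy only if it is
  // farther than t2 from every existing center. Smaller t2 -> more
  // canopies -> more initial centers for the fuzzy c-means step.
  public static List<Vector> canopyCenters(List<Vector> docs, double t2) {
    DistanceMeasure measure = new CosineDistanceMeasure();
    List<Vector> centers = new ArrayList<Vector>();
    for (Vector doc : docs) {
      boolean covered = false;
      for (Vector center : centers) {
        if (measure.distance(center, doc) < t2) {
          covered = true;   // too close to an existing center
          break;
        }
      }
      if (!covered) {
        centers.add(doc);   // doc becomes a new canopy center
      }
    }
    return centers;         // fed in as initial cluster centers
  }
}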

Cheers
Sebastian

On 27.03.2013 06:43, Dan Filimon wrote:
> Ah, so Ted, it looks like there's a bug with the mapreduce after all then.
>
> Pity, I liked the higher dimensionality argument but thinking it through, it 
> doesn't make that much sense.
>
> On Mar 27, 2013, at 6:52, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> Reducing to a lower dimensional space is a convenience, no more.
>>
>> Clustering in the original space is fine.  I still have trouble with your
>> normalizing before weighting, but I don't know what effect that will have
>> on anything.  It certainly will interfere with the interpretation of the
>> cosine metrics.
>>
>> On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
>> sebastian.briesemeis...@unister.de> wrote:
>>
>>> I am not quite sure whether this will solve the problem, though I will try
>>> it of course.
>>>
>>> I always thought that clustering documents based on their words is a
>>> common problem and that it is usually tackled in the word space, not
>>> in a reduced one.
>>> Besides, the distances look reasonable. Still, I end up with very
>>> similar and very distant documents unclustered in the middle of all
>>> clusters.
>>>
>>> So I think the problem lies in the clustering method not in the distances.
>>>
>>>
>>>
>>> Dan Filimon <dangeorge.fili...@gmail.com> schrieb:
>>>
>>>> So you're clustering 90K dimensional data?
>>>>
>>>> I'm facing a problem very similar to yours (working on
>>>> StreamingKMeans for Mahout), and from what I read [1], the problem
>>>> might be that in very high-dimensional spaces the distances become
>>>> meaningless.
>>>>
>>>> I'm pretty sure this is the case and I was considering implementing
>>>> the test mentioned in the paper (also I feel like it's a very useful
>>>> algorithm to have).
>>>>
>>>> In any case, since the vectors are so sparse, why not reduce their
>>>> dimension?
>>>>
>>>> You can try principal component analysis (just getting the first k
>>>> eigenvectors in the singular value decomposition of the matrix that
>>>> has your vectors as rows). The class that does this is SSVDSolver
>>>> (there's also SingularValueDecomposition, but that builds dense
>>>> matrices and those might not fit into memory). I've never personally
>>>> used it, though.
>>>> Once you have the first k eigenvectors of size n, make them rows of a
>>>> matrix U and multiply each vector x you have with it (U x), getting a
>>>> reduced vector.
>>>>
>>>> Or, use random projections to reduce the size of the data set. You
>>>> want to create a matrix whose entries are sampled from a uniform
>>>> distribution (0, 1) (Functions.random in
>>>> o.a.m.math.function.Functions), normalize its rows and multiply each
>>>> vector x with it.
>>>>
>>>> So, reduce the size of your vectors, thereby making the dimensionality
>>>> less of a problem, and you'll get a decent approximation (you can
>>>> actually quantify how good it is with the SVD). From what I've seen,
>>>> the clusters separate at smaller dimensions, but there's the question
>>>> of how good an approximation of the uncompressed data you have.
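>>>>
>>>> Roughly, the random projection variant could look like this (just a
>>>> sketch, untested; k is the target dimension, n the original ~90K):
>>>>
>>>> import org.apache.mahout.math.DenseMatrix;
>>>> import org.apache.mahout.math.DenseVector;
>>>> import org.apache.mahout.math.Matrix;
>>>> import org.apache.mahout.math.Vector;
>>>> import org.apache.mahout.math.function.Functions;
>>>>
>>>> // k x n matrix with uniform (0, 1) entries and unit-length rows.
>>>> public static Matrix randomProjection(int k, int n) {
>>>>   Matrix r = new DenseMatrix(k, n).assign(Functions.random());
>>>>   for (int i = 0; i < k; i++) {
>>>>     r.viewRow(i).assign(r.viewRow(i).normalize());
>>>>   }
>>>>   return r;
>>>> }
>>>>
>>>> // Project one (sparse) document vector x down to k dimensions.
>>>> public static Vector reduce(Matrix r, Vector x) {
>>>>   Vector reduced = new DenseVector(r.numRows());
>>>>   for (int i = 0; i < r.numRows(); i++) {
>>>>     reduced.set(i, r.viewRow(i).dot(x));  // cheap since x is sparse
>>>>   }
>>>>   return reduced;
>>>> }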
>>>>
>>>> See if this helps, I need to do the same thing :)
>>>>
>>>> What do you think?
>>>>
>>>> [1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
>>>>
>>>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
>>>> <sebastian.briesemeis...@unister-gmbh.de> wrote:
>>>>> The dataset consists of about 4000 documents and is encoded over a
>>>>> vocabulary of 90,000 words. However, each document usually contains
>>>>> only about 10 to 20 words. Only some contain more than 1000 words.
>>>>>
>>>>> For each document, I set the corresponding vector entry to 1 for
>>>>> every word it contains. Then I normalize each vector using the
>>>>> L2-norm. Finally, I multiply each element (representing a word) of
>>>>> the vector by log(#documents/#documents_with_word).
>>>>>
>>>>> For clustering, I am using cosine similarity.
>>>>>
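>>>>> In code, the construction is roughly this (simplified, untested
>>>>> sketch; wordIds, numDocs and docFreq stand in for my real counts):
>>>>>
>>>>> import org.apache.mahout.math.RandomAccessSparseVector;
>>>>> import org.apache.mahout.math.Vector;
>>>>>
>>>>> // wordIds: indices of the words in one document, numDocs: total
>>>>> // number of documents, docFreq[w]: number of documents containing w.
>>>>> static Vector buildDocVector(int[] wordIds, int numDocs, int[] docFreq) {
>>>>>   Vector v = new RandomAccessSparseVector(90000);
>>>>>   for (int w : wordIds) {
>>>>>     v.set(w, 1.0);                      // binary term presence
>>>>>   }
>>>>>   v = v.normalize();                    // L2-normalize first ...
>>>>>   for (int w : wordIds) {
>>>>>     // ... then weight by log(#documents / #documents_with_word)
>>>>>     v.set(w, v.get(w) * Math.log((double) numDocs / docFreq[w]));
>>>>>   }
>>>>>   return v;
>>>>> }
>>>>>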
>>>>> Regards
>>>>> Sebastian
>>>>>
>>>>> On 26.03.2013 17:33, Dan Filimon wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Could you tell us more about the kind of data you're clustering?
>>>>>> What distance measure are you using, and what is the dimensionality
>>>>>> of the data?
>>>>>>
>>>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
>>>>>> <sebastian.briesemeis...@unister-gmbh.de> wrote:
>>>>>>> Dear Mahout-users,
>>>>>>>
>>>>>>> I am facing two problems when clustering instances with Fuzzy c-Means
>>>>>>> clustering (cosine distance, random initial clustering):
>>>>>>>
>>>>>>> 1.) I always end up with one large set of rubbish instances. All of
>>>>>>> them have a uniform cluster probability distribution and are, hence,
>>>>>>> in the exact middle of the cluster space.
>>>>>>> The cosine distance between instances within this cluster ranges
>>>>>>> from 0 to 1.
>>>>>>>
>>>>>>> 2.) Some of my clusters have the same or a very similar center.
>>>>>>>
>>>>>>> Apart from the problems described above, the clustering seems to work
>>>>>>> fine.
>>>>>>> Does anybody have an idea how my clustering could be improved?
>>>>>>>
>>>>>>> Regards
>>>>>>> Sebastian
>>> --
>>> This message was sent from my Android mobile phone with K-9 Mail.
