I think you should change your vector preparation method. What kind of results do you get from non-fuzzy clustering?
What about from the streaming k-means stuff?

On Wed, Mar 27, 2013 at 5:02 PM, Sebastian Briesemeister <
sebastian.briesemeis...@unister-gmbh.de> wrote:

> Thanks for your input.
>
> The problem wasn't the high-dimensional space itself but the cluster
> initialization. I validated the document cosine distances and they look
> fairly well distributed.
>
> I now use canopy in a pre-clustering step. Interestingly, canopy
> suggests using a large number of clusters, which might make sense since
> a lot of the documents are unrelated due to their sparse word vectors.
> If I reduce the number of clusters, a lot of documents remain
> unclustered in the center of the cluster space.
> Further, I would like to note that the random cluster initialization
> tends to choose initial centers that are close to each other. For some
> reason, this leads to overlapping or even identical clusters.
>
> The problem of parameter tuning (T1 and T2) for canopy remains. However,
> I assume there is no general strategy for this problem.
>
> Cheers
> Sebastian
>
> On 27.03.2013 06:43, Dan Filimon wrote:
> > Ah, so Ted, it looks like there's a bug with the mapreduce after all
> > then.
> >
> > Pity, I liked the higher-dimensionality argument, but thinking it
> > through, it doesn't make that much sense.
> >
> > On Mar 27, 2013, at 6:52, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> >> Reducing to a lower-dimensional space is a convenience, no more.
> >>
> >> Clustering in the original space is fine. I still have trouble with
> >> your normalizing before weighting, but I don't know what effect that
> >> will have on anything. It certainly will interfere with the
> >> interpretation of the cosine metrics.
> >>
> >> On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
> >> sebastian.briesemeis...@unister.de> wrote:
> >>
> >>> I am not quite sure whether this will solve the problem, though I
> >>> will try it of course.
> >>>
> >>> I always thought that clustering documents based on their words is a
> >>> common problem and is usually tackled in the word space, not in a
> >>> reduced one.
> >>> Besides, the distances look reasonable. Still, I end up with very
> >>> similar and very distant documents unclustered in the middle of all
> >>> clusters.
> >>>
> >>> So I think the problem lies in the clustering method, not in the
> >>> distances.
> >>>
> >>> Dan Filimon <dangeorge.fili...@gmail.com> wrote:
> >>>
> >>>> So you're clustering 90K-dimensional data?
> >>>>
> >>>> I'm faced with a very similar problem to yours (working on
> >>>> StreamingKMeans for Mahout) and from what I read [1], the problem
> >>>> might be that in very high-dimensional spaces the distances become
> >>>> meaningless.
> >>>>
> >>>> I'm pretty sure this is the case, and I was considering implementing
> >>>> the test mentioned in the paper (also, I feel like it's a very
> >>>> useful algorithm to have).
> >>>>
> >>>> In any case, since the vectors are so sparse, why not reduce their
> >>>> dimension?
> >>>>
> >>>> You can try principal component analysis (just getting the first k
> >>>> eigenvectors in the singular value decomposition of the matrix that
> >>>> has your vectors as rows). The class that does this is SSVDSolver
> >>>> (there's also SingularValueDecomposition, but that tries making
> >>>> dense matrices and those might not fit into memory; I've never
> >>>> personally used it though).
> >>>> Once you have the first k eigenvectors of size n, make them rows in
> >>>> a matrix (U) and multiply each vector x you have with it (U x),
> >>>> getting a reduced vector.
> >>>>
> >>>> Or, use random projections to reduce the size of the data set. You
> >>>> want to create a matrix whose entries are sampled from a uniform
> >>>> distribution (0, 1) (Functions.random in
> >>>> o.a.m.math.function.Functions), normalize its rows, and multiply
> >>>> each vector x with it.
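For concreteness, a minimal sketch of that random projection in plain Java (deliberately avoiding Mahout types; the class name and layout are illustrative only, not an existing API):

import java.util.Random;

/**
 * Sketch of the scheme described above: a k x n matrix with entries
 * sampled uniformly from (0, 1), each row normalized to unit length,
 * applied to every n-dimensional document vector.
 */
public class RandomProjection {
    private final double[][] rows; // k rows of length n, L2-normalized

    public RandomProjection(int k, int n, long seed) {
        Random rnd = new Random(seed);
        rows = new double[k][n];
        for (int i = 0; i < k; i++) {
            double norm = 0.0;
            for (int j = 0; j < n; j++) {
                rows[i][j] = rnd.nextDouble();     // uniform (0, 1)
                norm += rows[i][j] * rows[i][j];
            }
            norm = Math.sqrt(norm);
            for (int j = 0; j < n; j++) {
                rows[i][j] /= norm;                // normalize the row
            }
        }
    }

    /** Reduce an n-dimensional vector x to k dimensions (computes U x). */
    public double[] project(double[] x) {
        double[] y = new double[rows.length];
        for (int i = 0; i < rows.length; i++) {
            double dot = 0.0;
            for (int j = 0; j < x.length; j++) {
                dot += rows[i][j] * x[j];
            }
            y[i] = dot;
        }
        return y;
    }
}

(Gaussian entries are the more common choice for random projections in the literature; the uniform variant above simply follows the recipe in the email.)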
> >>>> So, reduce the size of your vectors, thereby making the
> >>>> dimensionality less of a problem, and you'll get a decent
> >>>> approximation (you can actually quantify how good it is with SVD).
> >>>> From what I've seen, the clusters separate at smaller dimensions,
> >>>> but there's the question of how good an approximation of the
> >>>> uncompressed data you have.
> >>>>
> >>>> See if this helps, I need to do the same thing :)
> >>>>
> >>>> What do you think?
> >>>>
> >>>> [1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
> >>>>
> >>>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
> >>>> <sebastian.briesemeis...@unister-gmbh.de> wrote:
> >>>>> The dataset consists of about 4000 documents and is encoded by
> >>>>> 90,000 words. However, each document usually contains only about
> >>>>> 10 to 20 words. Only some contain more than 1000 words.
> >>>>>
> >>>>> For each document, I set a field in the corresponding vector to 1
> >>>>> if it contains a word. Then I normalize each vector using the
> >>>>> L2-norm. Finally, I multiply each element (representing a word) in
> >>>>> the vector by log(#documents/#documents_with_word).
> >>>>>
> >>>>> For clustering, I am using cosine similarity.
> >>>>>
> >>>>> Regards
> >>>>> Sebastian
> >>>>>
> >>>>> On 26.03.2013 17:33, Dan Filimon wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Could you tell us more about the kind of data you're clustering?
> >>>>>> What distance measure are you using, and what is the
> >>>>>> dimensionality of the data?
> >>>>>>
> >>>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
> >>>>>> <sebastian.briesemeis...@unister-gmbh.de> wrote:
> >>>>>>> Dear Mahout users,
> >>>>>>>
> >>>>>>> I am facing two problems when I am clustering instances with
> >>>>>>> Fuzzy c-Means clustering (cosine distance, random initial
> >>>>>>> clustering):
> >>>>>>>
> >>>>>>> 1.) I always end up with one large set of rubbish instances. All
> >>>>>>> of them have a uniform cluster probability distribution and are,
> >>>>>>> hence, in the exact middle of the cluster space.
> >>>>>>> The cosine distance between instances within this cluster ranges
> >>>>>>> from 0 to 1.
> >>>>>>>
> >>>>>>> 2.) Some of my clusters have the same or a very similar center.
> >>>>>>>
> >>>>>>> Besides the problems described above, the clustering seems to
> >>>>>>> work fine. Does anybody have an idea how my clustering can be
> >>>>>>> improved?
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> Sebastian
> >>>
> >>> --
> >>> This message was sent from my Android mobile phone with K-9 Mail.
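To tie the thread together, here is a minimal plain-Java sketch of the vector preparation Sebastian describes (binary term vector, L2 normalization, then IDF weighting) together with the cosine similarity used for the clustering. The class and method names are illustrative, not Mahout API; note Ted's caveat above that weighting after normalizing interferes with the cosine interpretation, the conventional order being to weight first and normalize last.

import java.util.Set;

/**
 * Sketch of the preparation described in the thread: set 1 for each
 * word a document contains, L2-normalize, then multiply each element
 * by log(#documents / #documents_with_word). As Ted points out, the
 * usual order is to apply the IDF weight first and normalize last.
 */
public class DocVectorPrep {

    public static double[] prepare(Set<Integer> termIds, int numTerms,
                                   int numDocs, int[] docFreq) {
        double[] v = new double[numTerms];
        for (int t : termIds) {
            v[t] = 1.0;                           // 1 if the word occurs
        }
        double norm = Math.sqrt(termIds.size());  // L2 norm of a 0/1 vector
        for (int t : termIds) {
            v[t] /= norm;                         // normalize (order used in the thread)
        }
        for (int t : termIds) {
            v[t] *= Math.log((double) numDocs / docFreq[t]); // IDF weight
        }
        return v;
    }

    /** Cosine similarity, the measure used for the clustering. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}

With roughly 4000 documents over 90,000 terms and only 10 to 20 words per document, most pairs of vectors share no terms at all and have cosine similarity 0, which is consistent with the large "rubbish" cluster of mutually distant documents described at the start of the thread.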