It makes hard cluster assignments, but that can be helpful in two ways:

a) it will help you diagnose data issues
b) it can produce good starting points for fuzzy k-means

On Thu, Mar 28, 2013 at 7:19 AM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:

> Sebastian, if you're interested I'd be glad to walk you through the main
> ideas, point you to the code and tell you how to run it.
> Testing it on more data would be very helpful to the project.
>
> But, it makes hard cluster assignments.
>
> On Mar 28, 2013, at 2:23, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > The streaming k-means stuff is what Dan has been working on.
> >
> > On Wed, Mar 27, 2013 at 6:14 PM, Sebastian Briesemeister <
> > sebastian.briesemeis...@unister.de> wrote:
> >
> >> I did change it as you suggested. Now I normalize after the frequency
> >> weighting.
> >>
> >> The results from non-fuzzy clustering are similar, but I require
> >> probabilities.
> >>
> >> Streaming k-means stuff? I don't get you here.
> >>
> >> Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>
> >>> I think you should change your vector preparation method.
> >>>
> >>> What kind of results do you get from non-fuzzy clustering?
> >>>
> >>> What about from the streaming k-means stuff?
> >>>
> >>> On Wed, Mar 27, 2013 at 5:02 PM, Sebastian Briesemeister <
> >>> sebastian.briesemeis...@unister-gmbh.de> wrote:
> >>>
> >>>> Thanks for your input.
> >>>>
> >>>> The problem wasn't the high-dimensional space itself but the cluster
> >>>> initialization. I validated the document cosine distances and they
> >>>> look fairly well distributed.
> >>>>
> >>>> I now use canopy in a pre-clustering step. Interestingly, canopy
> >>>> suggests using a large number of clusters, which might make sense
> >>>> since a lot of documents are unrelated due to their sparse word
> >>>> vectors. If I reduce the number of clusters, a lot of documents
> >>>> remain unclustered in the center of the cluster space.
> >>>> Further, I would like to note that the random cluster initialization
> >>>> tends to choose initial centers that are close to each other. For
> >>>> some reason this leads to overlapping or even identical clusters.
> >>>>
> >>>> The problem of parameter tuning (T1 and T2) for canopy remains.
> >>>> However, I assume there is no general strategy for this problem.
> >>>>
> >>>> Cheers
> >>>> Sebastian
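On the T1/T2 question above: the canopy construction is simple enough that the role of the two thresholds is easiest to see in code. Here is a minimal plain-Java sketch of the standard canopy loop, not Mahout's actual CanopyClusterer (the class name and the cosineDistance helper are illustrative only): points within the loose threshold T1 join a canopy, and points within the tight threshold T2 are consumed, so lowering T2 yields more canopies.

    import java.util.ArrayList;
    import java.util.List;

    public class CanopySketch {

        // Cosine distance between two dense vectors; in the real setting these
        // would be sparse document vectors, this helper is just for illustration.
        static double cosineDistance(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na += a[i] * a[i];
                nb += b[i] * b[i];
            }
            return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        // Core canopy loop (requires t1 > t2). Points within T1 of a seed join
        // its canopy loosely; points within T2 are consumed and can no longer
        // seed new canopies. Lowering T2 therefore produces more canopies.
        static List<List<double[]>> canopies(List<double[]> points, double t1, double t2) {
            List<List<double[]>> result = new ArrayList<>();
            List<double[]> pool = new ArrayList<>(points);
            while (!pool.isEmpty()) {
                double[] center = pool.remove(0);      // arbitrary seed point
                List<double[]> canopy = new ArrayList<>();
                for (double[] p : points) {
                    if (cosineDistance(center, p) < t1) {
                        canopy.add(p);                 // loose membership within T1
                    }
                }
                pool.removeIf(p -> cosineDistance(center, p) < t2);
                result.add(canopy);
            }
            return result;
        }
    }

In this form, tuning largely means sweeping T2 until the number of canopies is close to the cluster count you expect, with T1 kept somewhat larger; as Sebastian notes, there is no general closed-form strategy.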
> >>>> On 27.03.2013 06:43, Dan Filimon wrote:
> >>>>> Ah, so Ted, it looks like there's a bug with the mapreduce after all
> >>>>> then.
> >>>>>
> >>>>> Pity, I liked the higher-dimensionality argument, but thinking it
> >>>>> through, it doesn't make that much sense.
> >>>>>
> >>>>> On Mar 27, 2013, at 6:52, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>>>
> >>>>>> Reducing to a lower-dimensional space is a convenience, no more.
> >>>>>>
> >>>>>> Clustering in the original space is fine. I still have trouble with
> >>>>>> your normalizing before weighting, but I don't know what effect that
> >>>>>> will have on anything. It certainly will interfere with the
> >>>>>> interpretation of the cosine metrics.
> >>>>>>
> >>>>>> On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
> >>>>>> sebastian.briesemeis...@unister.de> wrote:
> >>>>>>
> >>>>>>> I am not quite sure whether this will solve the problem, though I
> >>>>>>> will try it of course.
> >>>>>>>
> >>>>>>> I always thought that clustering documents based on their words is
> >>>>>>> a common problem and is usually tackled in the word space, not in a
> >>>>>>> reduced one.
> >>>>>>> Besides, the distances look reasonable. Still, I end up with very
> >>>>>>> similar and very distant documents unclustered in the middle of all
> >>>>>>> clusters.
> >>>>>>>
> >>>>>>> So I think the problem lies in the clustering method, not in the
> >>>>>>> distances.
> >>>>>>>
> >>>>>>> Dan Filimon <dangeorge.fili...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> So you're clustering 90K-dimensional data?
> >>>>>>>>
> >>>>>>>> I'm faced with a very similar problem to yours (working on
> >>>>>>>> StreamingKMeans for Mahout) and from what I read [1], the problem
> >>>>>>>> might be that in very high-dimensional spaces the distances become
> >>>>>>>> meaningless.
> >>>>>>>>
> >>>>>>>> I'm pretty sure this is the case and I was considering
> >>>>>>>> implementing the test mentioned in the paper (also, I feel it's a
> >>>>>>>> very useful algorithm to have).
> >>>>>>>>
> >>>>>>>> In any case, since the vectors are so sparse, why not reduce their
> >>>>>>>> dimension?
> >>>>>>>>
> >>>>>>>> You can try principal component analysis (just getting the first k
> >>>>>>>> eigenvectors in the singular value decomposition of the matrix
> >>>>>>>> that has your vectors as rows). The class that does this is
> >>>>>>>> SSVDSolver (there's also SingularValueDecomposition, but that
> >>>>>>>> tries making dense matrices and those might not fit into memory;
> >>>>>>>> I've never personally used it though).
> >>>>>>>> Once you have the first k eigenvectors of size n, make them rows
> >>>>>>>> in a matrix (U) and multiply each vector x you have with it (U x),
> >>>>>>>> getting a reduced vector.
> >>>>>>>>
> >>>>>>>> Or, use random projections to reduce the size of the data set. You
> >>>>>>>> want to create a matrix whose entries are sampled from a uniform
> >>>>>>>> distribution (0, 1) (Functions.random in
> >>>>>>>> o.a.m.math.function.Functions), normalize its rows and multiply
> >>>>>>>> each vector x with it.
> >>>>>>>>
> >>>>>>>> So, reduce the size of your vectors, thereby making the
> >>>>>>>> dimensionality less of a problem, and you'll get a decent
> >>>>>>>> approximation (you can actually quantify how good it is with SVD).
> >>>>>>>> From what I've seen, the clusters separate at smaller dimensions,
> >>>>>>>> but there's the question of how good an approximation of the
> >>>>>>>> uncompressed data you have.
> >>>>>>>>
> >>>>>>>> See if this helps, I need to do the same thing :)
> >>>>>>>>
> >>>>>>>> What do you think?
> >>>>>>>>
> >>>>>>>> [1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
> >>>>>>>>
> >>>>>>>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
> >>>>>>>> <sebastian.briesemeis...@unister-gmbh.de> wrote:
> >>>>>>>>> The dataset consists of about 4000 documents and is encoded by
> >>>>>>>>> 90,000 words. However, each document usually contains only about
> >>>>>>>>> 10 to 20 words. Only some contain more than 1000 words.
> >>>>>>>>>
> >>>>>>>>> For each document, I set a field in the corresponding vector to 1
> >>>>>>>>> if it contains a word. Then I normalize each vector using the
> >>>>>>>>> L2-norm. Finally, I multiply each element (representing a word)
> >>>>>>>>> in the vector by log(#documents/#documents_with_word).
> >>>>>>>>>
> >>>>>>>>> For clustering, I am using cosine similarity.
> >>>>>>>>>
> >>>>>>>>> Regards
> >>>>>>>>> Sebastian
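The preparation order in this last message (binarize, normalize, then weight) is exactly what Ted objects to further up, and the fix adopted in the thread is to weight first and normalize last. A minimal plain-Java sketch of that corrected pipeline, with hash maps standing in for Mahout's sparse vectors (the class and method names here are illustrative, not Mahout API):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TfIdfSketch {

        // docs: each document as a list of its distinct words.
        static Map<String, Double> idf(List<List<String>> docs) {
            Map<String, Integer> df = new HashMap<>();
            for (List<String> doc : docs)
                for (String w : doc)
                    df.merge(w, 1, Integer::sum);          // document frequency
            Map<String, Double> idf = new HashMap<>();
            for (Map.Entry<String, Integer> e : df.entrySet())
                idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
            return idf;                                    // log(#docs / #docs_with_word)
        }

        // Weight FIRST (binary presence * idf), normalize LAST.
        static Map<String, Double> vectorize(List<String> doc, Map<String, Double> idf) {
            Map<String, Double> v = new HashMap<>();
            for (String w : doc)
                v.put(w, idf.get(w));                      // presence (1) times idf weight
            double norm = 0;
            for (double x : v.values())
                norm += x * x;
            norm = Math.sqrt(norm);
            if (norm > 0)
                for (Map.Entry<String, Double> e : v.entrySet())
                    e.setValue(e.getValue() / norm);       // L2-normalize after weighting
            return v;
        }

        // On unit-length vectors, cosine similarity is just the sparse dot
        // product, so it keeps its usual geometric interpretation.
        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0;
            for (Map.Entry<String, Double> e : a.entrySet())
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            return dot;
        }
    }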
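And for Dan's random-projection alternative, a sketch of the shape of the computation in plain Java (again illustrative only, not Mahout's API; Dan suggests uniform(0, 1) entries via Functions.random, though Gaussian entries are the more common Johnson-Lindenstrauss choice):

    import java.util.Random;

    public class RandomProjectionSketch {

        // Build a k x n projection matrix with row-normalized random entries.
        static double[][] projectionMatrix(int k, int n, long seed) {
            Random rnd = new Random(seed);
            double[][] u = new double[k][n];
            for (int i = 0; i < k; i++) {
                double norm = 0;
                for (int j = 0; j < n; j++) {
                    u[i][j] = rnd.nextDouble();            // uniform(0, 1), as suggested
                    norm += u[i][j] * u[i][j];
                }
                norm = Math.sqrt(norm);
                for (int j = 0; j < n; j++)
                    u[i][j] /= norm;                       // unit-length rows
            }
            return u;
        }

        // Reduce one n-dimensional document vector to k dimensions: y = U x.
        // (For 90K-dimensional sparse vectors you would only visit the handful
        // of non-zero entries of x rather than loop over all of it.)
        static double[] project(double[][] u, double[] x) {
            double[] y = new double[u.length];
            for (int i = 0; i < u.length; i++)
                for (int j = 0; j < x.length; j++)
                    y[i] += u[i][j] * x[j];
            return y;
        }
    }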
> >>>>>>>>> On 26.03.2013 17:33, Dan Filimon wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> Could you tell us more about the kind of data you're clustering?
> >>>>>>>>>> What distance measure are you using, and what is the
> >>>>>>>>>> dimensionality of the data?
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
> >>>>>>>>>> <sebastian.briesemeis...@unister-gmbh.de> wrote:
> >>>>>>>>>>> Dear Mahout-users,
> >>>>>>>>>>>
> >>>>>>>>>>> I am facing two problems when clustering instances with fuzzy
> >>>>>>>>>>> c-means clustering (cosine distance, random initial
> >>>>>>>>>>> clustering):
> >>>>>>>>>>>
> >>>>>>>>>>> 1.) I always end up with one large set of rubbish instances.
> >>>>>>>>>>> All of them have a uniform cluster probability distribution
> >>>>>>>>>>> and are, hence, in the exact middle of the cluster space.
> >>>>>>>>>>> The cosine distance between instances within this cluster
> >>>>>>>>>>> ranges from 0 to 1.
> >>>>>>>>>>>
> >>>>>>>>>>> 2.) Some of my clusters have the same or a very, very similar
> >>>>>>>>>>> center.
> >>>>>>>>>>>
> >>>>>>>>>>> Besides the above-described problems, the clustering seems to
> >>>>>>>>>>> work fine. Does somebody have an idea how my clustering can be
> >>>>>>>>>>> improved?
> >>>>>>>>>>>
> >>>>>>>>>>> Regards
> >>>>>>>>>>> Sebastian
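A closing note on problem 1.): the uniform memberships follow directly from the standard fuzzy c-means membership formula, u_i = 1 / sum_c (d_i/d_c)^(2/(m-1)). A self-contained sketch (textbook formula, not Mahout's implementation) showing that a document roughly equidistant from all centers necessarily receives memberships near 1/k:

    import java.util.Arrays;

    public class FuzzyMembershipSketch {

        // d[i] = distance from one document to center i; m > 1 is the
        // fuzziness exponent (commonly 2). Memberships sum to 1.
        static double[] memberships(double[] d, double m) {
            int k = d.length;
            double[] u = new double[k];
            double exp = 2.0 / (m - 1.0);
            for (int i = 0; i < k; i++) {
                double sum = 0;
                for (int c = 0; c < k; c++)
                    sum += Math.pow(d[i] / d[c], exp);     // each ratio ~1 when distances are equal
                u[i] = 1.0 / sum;                          // so each membership ~1/k
            }
            return u;
        }

        public static void main(String[] args) {
            // A document nearly equidistant from three centers:
            System.out.println(Arrays.toString(
                    memberships(new double[]{0.49, 0.50, 0.51}, 2.0)));
            // prints roughly [0.35, 0.33, 0.32] -- close to uniform
        }
    }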