You know, regarding the latest clustering with CosineDistance: how is the _mean_ distance larger than (or even close to) 1 if cos is in [-1, 1]? ...
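
One possible reading, sketched below: if the measure is defined as 1 - cos rather than cos itself (an assumption about how CosineDistance is computed here, not something confirmed in this thread), then its range is [0, 2] rather than [-1, 1]. A mean near 1 would then mean the points are roughly orthogonal to their centroid, and anything above 1 would need a negative cosine. The function name and the toy vectors are made up for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance under the ASSUMED definition 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors (hypothetical, not taken from the data discussed here).
parallel = cosine_distance([1.0, 0.0], [2.0, 0.0])      # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])    # nothing in common
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])     # opposite direction

print(parallel, orthogonal, opposite)
```

Under this definition the three cases land at the bottom, middle, and top of the [0, 2] range, so a cluster whose average sits around 1 is a very loose one.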
On Thu, Mar 28, 2013 at 10:29 PM, Dan Filimon <[email protected]> wrote:

> And I'll add that re-vectorizing the documents with my vectorizer yields
> essentially the same results (this is CosineDistance though):
>
> Average distance in cluster 0 [6]: 0.844053
> Average distance in cluster 1 [1047]: 0.988517
> Average distance in cluster 2 [26]: 0.889580
> Average distance in cluster 3 [19]: 0.922804
> Average distance in cluster 4 [2]: 0.414935
> Average distance in cluster 5 [9]: 0.777650
> Average distance in cluster 6 [4]: 0.791443
> Average distance in cluster 7 [17432]: 1.017289
> Average distance in cluster 8 [20]: 0.917523
> Average distance in cluster 9 [4]: 0.744159
> Average distance in cluster 10 [2]: 0.340740
> Average distance in cluster 11 [3]: 0.614734
> Average distance in cluster 12 [2]: 0.624274
> Average distance in cluster 13 [62]: 0.922437
> Average distance in cluster 14 [2]: 0.324862
> Average distance in cluster 15 [1]: 0.000000
> Average distance in cluster 16 [94]: 0.917509
> Average distance in cluster 17 [103]: 0.944392
> Average distance in cluster 18 [7]: 0.795449
> Average distance in cluster 19 [1]: 0.000000
> Num clusters: 20; maxDistance: 1.029701
>
>
> On Thu, Mar 28, 2013 at 6:45 PM, Dan Filimon <[email protected]> wrote:
>
>> You know what's even more odd? When I used Mahout's KMeans, everything
>> was assigned to one single cluster with mean distance 64.
>>
>>
>> On Thu, Mar 28, 2013 at 11:07 AM, Ted Dunning <[email protected]> wrote:
>>
>>> Hmm... looking at these outputs, it looks like the big cluster is really
>>> tight ... much tighter than cluster 3 or 4. That is very odd.
>>>
>>> On Thu, Mar 28, 2013 at 10:01 AM, Dan Filimon <[email protected]> wrote:
>>>
>>> > [Yes, it should be on the dev list. I got confused.]
>>> >
>>> > The thing is, it's happening when using just 1 mapper. The hypercube
>>> > tests indicate that the 3 versions of StreamingKMeans produce about
>>> > the same results.
>>> > I haven't tested them on the _unprojected_ vectors though.
>>> >
>>> > Average distance in cluster 0 [18773]: 68.237385
>>> > Average distance in cluster 1 [2]: 5.973227
>>> > Average distance in cluster 2 [1]: 0.000000
>>> > Average distance in cluster 3 [4]: 279.200390
>>> > Average distance in cluster 4 [5]: 394.101672
>>> > Average distance in cluster 5 [4]: 227.845612
>>> > Average distance in cluster 6 [1]: 0.000000
>>> > Average distance in cluster 7 [2]: 28.779806
>>> > Average distance in cluster 8 [1]: 0.000000
>>> > Average distance in cluster 9 [2]: 215.254876
>>> > Average distance in cluster 10 [3]: 128.501163
>>> > Average distance in cluster 11 [8]: 534.401649
>>> > Average distance in cluster 12 [1]: 0.000000
>>> > Average distance in cluster 13 [5]: 405.115140
>>> > Average distance in cluster 14 [1]: 0.000000
>>> > Average distance in cluster 15 [9]: 215.797289
>>> > Average distance in cluster 16 [1]: 0.000000
>>> > Average distance in cluster 17 [2]: 123.065677
>>> > Average distance in cluster 18 [1]: 0.000000
>>> > Average distance in cluster 19 [2]: 98.733778
>>> > Num clusters: 20; maxDistance: 762.326896
>>> >
>>> > On Thu, Mar 28, 2013 at 10:32 AM, Ted Dunning <[email protected]> wrote:
>>> > > I will have to think on this a bit.
>>> > >
>>> > > It should be possible to dump the sketches coming from each mapper
>>> > > and look at them for compatibility.
>>> > >
>>> > > Are the mappers seeing only docs from a single news group? That might
>>> > > produce some interesting and odd results.
>>> > >
>>> > > What happens with the sequential version when you specify as many
>>> > > threads as you have mappers in the MR version?
>>> > >
>>> > > Also, shouldn't this be on the dev list?
>>> > >
>>> > > On Thu, Mar 28, 2013 at 9:10 AM, Dan Filimon <[email protected]> wrote:
>>> > >
>>> > >> So no, apparently the problem's still there. With the most recent
>>> > >> code, I get:
>>> > >>
>>> > >> Average distance in cluster 0 [1]: 0.000000
>>> > >> Average distance in cluster 1 [18775]: 63.839819
>>> > >> Average distance in cluster 2 [11]: 448.706077
>>> > >> Average distance in cluster 3 [1]: 0.000000
>>> > >> Average distance in cluster 4 [8]: 213.629578
>>> > >> Average distance in cluster 5 [1]: 0.000000
>>> > >> Average distance in cluster 6 [10]: 369.592682
>>> > >> Average distance in cluster 7 [1]: 0.000000
>>> > >> Average distance in cluster 8 [2]: 31.061103
>>> > >> Average distance in cluster 9 [1]: 0.000000
>>> > >> Average distance in cluster 10 [2]: 309.934857
>>> > >> Average distance in cluster 11 [1]: 0.000000
>>> > >> Average distance in cluster 12 [1]: 0.000000
>>> > >> Average distance in cluster 13 [1]: 0.000000
>>> > >> Average distance in cluster 14 [1]: 0.000000
>>> > >> Average distance in cluster 15 [4]: 229.180504
>>> > >> Average distance in cluster 16 [1]: 0.000000
>>> > >> Average distance in cluster 17 [3]: 336.835246
>>> > >> Average distance in cluster 18 [2]: 76.485594
>>> > >> Average distance in cluster 19 [1]: 0.000000
>>> > >> Num clusters: 20; maxDistance: 724.060033
>>> > >>
>>> > >> I'll have to recheck. :/
>>> > >>
>>> > >> On Thu, Mar 28, 2013 at 2:25 AM, Ted Dunning <[email protected]> wrote:
>>> > >> > Hot damn!
>>> > >> >
>>> > >> > Well spotted.
>>> > >> >
>>> > >> > On Thu, Mar 28, 2013 at 12:08 AM, Dan Filimon <[email protected]> wrote:
>>> > >> >
>>> > >> >> Ted, remember we talked about this last week?
>>> > >> >>
>>> > >> >> The problem was (I think it's fixed now) that when I was asking
>>> > >> >> for 20 clusters, every mapper would give me 20 clusters (rather
>>> > >> >> than k log n ~ 200) and the points clumped together, resulting in
>>> > >> >> one cluster with the vast majority of the points, ~17K out of the
>>> > >> >> ~19K.
>>> > >> >>
>>> > >> >> Now that I fixed that, added more tests that seem to be confirming
>>> > >> >> all StreamingKMeans implementations get about the same results
>>> > >> >> (whether they're local or MapReduce) and the multiple restarts of
>>> > >> >> BallKMeans, I'm expecting it to be a lot better.
>>> > >> >>
>>> > >> >> Actual data tests coming soon (please check that new cluster
>>> > >> >> thread). :)
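
The "k log n ~ 200" figure quoted at the bottom of the thread can be checked with quick arithmetic. This is only a sketch: it assumes the natural logarithm and n = 19,000 points (the thread only says ~19K), and `sketch_size` is a hypothetical helper, not a Mahout function.

```python
import math


def sketch_size(k, n):
    """Rough k * log(n) estimate of per-mapper sketch size.

    The base of the logarithm is an ASSUMPTION (natural log); the thread
    does not say which base is meant.
    """
    return k * math.log(n)


k = 20        # requested number of clusters, from the thread
n = 19000     # assumed point count; the thread only says ~19K
estimate = sketch_size(k, n)
print(round(estimate))
```

With these inputs the estimate comes out within a few units of 200, which matches the order of magnitude mentioned in the thread and is about ten times the 20 centroids each mapper was returning before the fix.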
