[Yes, it should be on the dev list. I got confused.] The thing is, it's happening when using just 1 mapper. The hypercube tests indicate that the 3 versions of StreamingKMeans produce about the same results. I haven't tested them on the _unprojected_ vectors though.

Average distance in cluster 0 [18773]: 68.237385
Average distance in cluster 1 [2]: 5.973227
Average distance in cluster 2 [1]: 0.000000
Average distance in cluster 3 [4]: 279.200390
Average distance in cluster 4 [5]: 394.101672
Average distance in cluster 5 [4]: 227.845612
Average distance in cluster 6 [1]: 0.000000
Average distance in cluster 7 [2]: 28.779806
Average distance in cluster 8 [1]: 0.000000
Average distance in cluster 9 [2]: 215.254876
Average distance in cluster 10 [3]: 128.501163
Average distance in cluster 11 [8]: 534.401649
Average distance in cluster 12 [1]: 0.000000
Average distance in cluster 13 [5]: 405.115140
Average distance in cluster 14 [1]: 0.000000
Average distance in cluster 15 [9]: 215.797289
Average distance in cluster 16 [1]: 0.000000
Average distance in cluster 17 [2]: 123.065677
Average distance in cluster 18 [1]: 0.000000
Average distance in cluster 19 [2]: 98.733778
Num clusters: 20; maxDistance: 762.326896

On Thu, Mar 28, 2013 at 10:32 AM, Ted Dunning <[email protected]> wrote:
> I will have to think on this a bit.
>
> It should be possible to dump the sketches coming from each mapper and look
> at them for compatibility.
>
> Are the mappers seeing only docs from a single news group? That might
> produce some interesting and odd results.
>
> What happens with the sequential version when you specify as many threads
> as you have mappers in the MR version?
>
> Also, shouldn't this be on the dev list?
>
> On Thu, Mar 28, 2013 at 9:10 AM, Dan Filimon
> <[email protected]> wrote:
>
>> So no, apparently the problem's still there. With the most recent code, I
>> get:
>>
>> Average distance in cluster 0 [1]: 0.000000
>> Average distance in cluster 1 [18775]: 63.839819
>> Average distance in cluster 2 [11]: 448.706077
>> Average distance in cluster 3 [1]: 0.000000
>> Average distance in cluster 4 [8]: 213.629578
>> Average distance in cluster 5 [1]: 0.000000
>> Average distance in cluster 6 [10]: 369.592682
>> Average distance in cluster 7 [1]: 0.000000
>> Average distance in cluster 8 [2]: 31.061103
>> Average distance in cluster 9 [1]: 0.000000
>> Average distance in cluster 10 [2]: 309.934857
>> Average distance in cluster 11 [1]: 0.000000
>> Average distance in cluster 12 [1]: 0.000000
>> Average distance in cluster 13 [1]: 0.000000
>> Average distance in cluster 14 [1]: 0.000000
>> Average distance in cluster 15 [4]: 229.180504
>> Average distance in cluster 16 [1]: 0.000000
>> Average distance in cluster 17 [3]: 336.835246
>> Average distance in cluster 18 [2]: 76.485594
>> Average distance in cluster 19 [1]: 0.000000
>> Num clusters: 20; maxDistance: 724.060033
>>
>> I'll have to recheck. :/
>>
>> On Thu, Mar 28, 2013 at 2:25 AM, Ted Dunning <[email protected]>
>> wrote:
>> > Hot damn!
>> >
>> > Well spotted.
>> >
>> > On Thu, Mar 28, 2013 at 12:08 AM, Dan Filimon
>> > <[email protected]> wrote:
>> >
>> >> Ted, remember we talked about this last week?
>> >>
>> >> The problem was (I think it's fixed now) that when I was asking for 20
>> >> clusters, every mapper would give me 20 clusters (rather than k log n
>> >> ~ 200) and the points clumped together, resulting in one cluster with
>> >> the vast majority of the points (~17K out of the ~19K).
>> >>
>> >> Now that I fixed that and added more tests that seem to be confirming
>> >> all StreamingKMeans implementations get about the same results (whether
>> >> they're local or MapReduce), and with the multiple restarts of
>> >> BallKMeans, I'm expecting it to be a lot better.
>> >>
>> >> Actual data tests coming soon (please check that new cluster thread). :)
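
For anyone following along, here is a rough sketch of the arithmetic behind the "k log n ~ 200" figure and of how lopsided the reported clusterings are. It is not Mahout code: the function names are made up for illustration, the natural logarithm is an assumption (it is the base that gives roughly 200 for k = 20 and n of about 19K), and the regular expression only assumes the "Average distance in cluster i [size]: d" line format shown in the reports.

```python
import math
import re


def sketch_size(k, n):
    """Approximate number of centroids each mapper's sketch should keep.

    Assumes the "k log n" in the thread means the natural logarithm.
    """
    return math.ceil(k * math.log(n))


def cluster_sizes(report):
    """Pull the bracketed cluster sizes out of a report in the format above."""
    return [int(size) for size in re.findall(r"cluster \d+ \[(\d+)\]", report)]


def dominant_share(sizes):
    """Fraction of all points that ended up in the largest cluster."""
    return max(sizes) / sum(sizes)


if __name__ == "__main__":
    # First three lines of the most recent report, used as sample input.
    sample = (
        "Average distance in cluster 0 [18773]: 68.237385\n"
        "Average distance in cluster 1 [2]: 5.973227\n"
        "Average distance in cluster 2 [1]: 0.000000\n"
    )
    sizes = cluster_sizes(sample)
    print("sizes:", sizes)
    print("share of largest cluster: %.4f" % dominant_share(sizes))
    print("sketch size for k=20, n=18828:", sketch_size(20, 18828))
```

Feeding it a full report gives a quick one-number check of whether a run still collapses into a single giant cluster.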
