I will have to think on this a bit.

It should be possible to dump the sketches coming from each mapper and look
at them for compatibility.
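
A minimal sketch of such a check, assuming the per-mapper sketches have already been dumped and reduced to a centroid count per mapper. The mapper names, the counts, and the 0.5 tolerance are hypothetical, and nothing here is Mahout API. The target size uses the k log n figure quoted further down in the thread; the natural log is an assumption, though 20 * ln(19000) does come out near the ~200 mentioned there.

```python
import math


def expected_sketch_size(k, n):
    """Rough target number of sketch centroids: k * ln(n) (natural log assumed)."""
    return k * math.log(n)


def undersized_sketches(sketch_sizes, k, n, tolerance=0.5):
    """Return the mappers whose sketch has fewer than tolerance * k * ln(n) centroids."""
    threshold = tolerance * expected_sketch_size(k, n)
    return [mapper for mapper, size in sorted(sketch_sizes.items()) if size < threshold]


# Hypothetical centroid counts read back from the dumped sketches.
sizes = {"mapper-0": 20, "mapper-1": 195, "mapper-2": 210}

print(round(expected_sketch_size(20, 19000)))
print(undersized_sketches(sizes, k=20, n=19000))
```

A mapper that emits only k centroids instead of roughly k log n would show up in the second list. In a real run n should probably be the number of points that mapper actually saw, not the size of the whole data set.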

Are the mappers seeing only docs from a single news group?  That might
produce some interesting and odd results.

What happens with the sequential version when you specify as many threads
as you have mappers in the MR version?

Also, shouldn't this be on the dev list?

On Thu, Mar 28, 2013 at 9:10 AM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:

> So no, apparently the problem's still there. With the most recent code, I
> get:
>
> Average distance in cluster 0 [1]: 0.000000
> Average distance in cluster 1 [18775]: 63.839819
> Average distance in cluster 2 [11]: 448.706077
> Average distance in cluster 3 [1]: 0.000000
> Average distance in cluster 4 [8]: 213.629578
> Average distance in cluster 5 [1]: 0.000000
> Average distance in cluster 6 [10]: 369.592682
> Average distance in cluster 7 [1]: 0.000000
> Average distance in cluster 8 [2]: 31.061103
> Average distance in cluster 9 [1]: 0.000000
> Average distance in cluster 10 [2]: 309.934857
> Average distance in cluster 11 [1]: 0.000000
> Average distance in cluster 12 [1]: 0.000000
> Average distance in cluster 13 [1]: 0.000000
> Average distance in cluster 14 [1]: 0.000000
> Average distance in cluster 15 [4]: 229.180504
> Average distance in cluster 16 [1]: 0.000000
> Average distance in cluster 17 [3]: 336.835246
> Average distance in cluster 18 [2]: 76.485594
> Average distance in cluster 19 [1]: 0.000000
> Num clusters: 20; maxDistance: 724.060033
>
> I'll have to recheck. :/
>
> On Thu, Mar 28, 2013 at 2:25 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
> > Hot damn!
> >
> > Well spotted.
> >
> > On Thu, Mar 28, 2013 at 12:08 AM, Dan Filimon
> > <dangeorge.fili...@gmail.com> wrote:
> >
> >> Ted, remember we talked about this last week?
> >>
> > >> The problem was (I think it's fixed now) that when I was asking for 20
> > >> clusters, every mapper would give me 20 clusters (rather than k log n
> > >> ~ 200), and the points clumped together, resulting in one cluster with
> > >> the vast majority of the points: ~17K out of the ~19K.
> >>
> > >> Now that I've fixed that and added more tests (which seem to confirm that
> > >> all StreamingKMeans implementations get about the same results, whether
> > >> they're local or MapReduce), plus the multiple restarts of BallKMeans,
> > >> I'm expecting it to be a lot better.
> >>
> >> Actual data tests coming soon (please check that new cluster thread). :)
> >>
>
