So no, apparently the problem's still there. With the most recent code, I get:

Average distance in cluster 0 [1]: 0.000000
Average distance in cluster 1 [18775]: 63.839819
Average distance in cluster 2 [11]: 448.706077
Average distance in cluster 3 [1]: 0.000000
Average distance in cluster 4 [8]: 213.629578
Average distance in cluster 5 [1]: 0.000000
Average distance in cluster 6 [10]: 369.592682
Average distance in cluster 7 [1]: 0.000000
Average distance in cluster 8 [2]: 31.061103
Average distance in cluster 9 [1]: 0.000000
Average distance in cluster 10 [2]: 309.934857
Average distance in cluster 11 [1]: 0.000000
Average distance in cluster 12 [1]: 0.000000
Average distance in cluster 13 [1]: 0.000000
Average distance in cluster 14 [1]: 0.000000
Average distance in cluster 15 [4]: 229.180504
Average distance in cluster 16 [1]: 0.000000
Average distance in cluster 17 [3]: 336.835246
Average distance in cluster 18 [2]: 76.485594
Average distance in cluster 19 [1]: 0.000000
Num clusters: 20; maxDistance: 724.060033

I'll have to recheck. :/
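
For reference, here is a quick sanity check on the numbers above. It is a rough sketch, not Mahout code: it assumes the "log" in the k log n estimate quoted below is the natural log, and it only reuses the bracketed cluster sizes from the output. The variable names are mine.

```python
import math

# Cluster sizes copied from the bracketed counts in the output above.
sizes = [1, 18775, 11, 1, 8, 1, 10, 1, 2, 1, 2, 1, 1, 1, 1, 4, 1, 3, 2, 1]

k = len(sizes)   # number of requested clusters
n = sum(sizes)   # total number of points

# Assumption: natural log; a different base would change the estimate.
sketch_size = k * math.log(n)

# How lopsided the clustering is.
largest_share = max(sizes) / n
singletons = sum(1 for s in sizes if s == 1)

print(f"k = {k}, n = {n}")
print(f"k * ln(n) = {sketch_size:.1f}")
print(f"largest cluster holds {largest_share:.1%} of the points")
print(f"{singletons} of {k} clusters are singletons")
```

Under that assumption, k log n comes out just under 200, in line with the estimate quoted below, and nearly all of the points sit in cluster 1.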

On Thu, Mar 28, 2013 at 2:25 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Hot damn!
>
> Well spotted.
>
> On Thu, Mar 28, 2013 at 12:08 AM, Dan Filimon
> <dangeorge.fili...@gmail.com> wrote:
>
>> Ted, remember we talked about this last week?
>>
>> The problem was (I think it's fixed now) that when I was asking for 20
>> clusters, every mapper would give me 20 clusters (rather than k log n
>> ~ 200) and the points clumped together, resulting in one cluster with
>> the vast majority of the points, ~17K out of the ~19K.
>>
>> Now that I've fixed that and added more tests, which seem to confirm
>> that all StreamingKMeans implementations get about the same results
>> (whether they're local or MapReduce), and added the multiple restarts
>> of BallKMeans, I'm expecting it to be a lot better.
>>
>> Actual data tests coming soon (please check that new cluster thread). :)
>>
