Hello,

I found a severe bug in the function kmeanspp (k-means++ implementation) in the 
LICORS package 
(https://www.rdocumentation.org/packages/LICORS/versions/0.2.0/topics/kmeanspp) 
and need some advice on how to handle this (therefore posting here).
 
Since LICORS has not been updated since 2013, I am not sure if there is an 
active maintainer. The named maintainer works for Google since 2014 and may 
have abandoned the package.
 
On the other hand, this is one of the few implementations of k-means++ in R and 
coming up first when searching.
 
The bug leads to inferior results.  In its current form, the results were much 
worse than those of Hartigan-Wong (the default for k-means in the stats 
package) for all test problems I tried. However, after fixing the bug, kmeanspp 
found better results than Hartigan-Wong for all those problems. So anyone 
comparing those two algorithms based on the current implementations in R may 
have come to the wrong conclusions. BTW: The Hartigan-Wong implementation 
(Fortran) is *much* faster than LICOR/kmeanspp, which is written in pure R 
(before making one call to stats/kmeans at the end), but that is not the point 
here.
 
The bug concerns a distance computation which should be a matrix of distances 
of all data vectors and all current codebook vectors, but is not. The code and 
an example illustrating the problem is shown below. Basically, to subtract a 
vector from a matrix, one has to convert the vector into a matrix where all 
rows are just copies of the vector. Details are shown below. The fix is trivial.
 
I stumbled upon this because kmeanspp produced counterintuitive and very poor 
results.
 
Is this of any interest? Should kmeanspp in LICORS be fixed? I have neither the 
experience nor the time to take over an R-package. I could write a more formal 
bug report, if required, but I initially would like to know if this is 
considered relevant.

Suggestions/comments are welcome.
 
Kind Regards
Bernd Fritzke
 
The code 
            if (ndim == 1) {
                dists <- apply(cbind(data[center_ids, ]), 1, 
                  function(center) {
                    rowSums((data - center)^2)
                  })
            }
            else {
                dists <- apply(data[center_ids, ], 1, function(center) {
                  rowSums((data - center)^2)
                })
            }
 
should rather be (the two changed lines are marked as "fixed"):
 
            if (ndim == 1) {
                dists <- apply(cbind(data[center_ids, ]), 1, 
                  function(center) {
                    rowSums((data - rep(center,each=nrow(data) ) )^2) # fixed
                  })
            }
            else {
                dists <- apply(data[center_ids, ], 1, function(center) {
                  rowSums((data - rep(center,each=nrow(data) ) )^2) # fixed
                })
            }
 
Here is some example code illustrating the problem. The code should compute the 
square distances between the six two-dimensional vectors in "data" and three 
vectors which happen to be the elements 1, 2, and 4 in "data".
RSTUDIO

```{r}
data=cbind(1:6,rep(0,6))
print("data")
print(data)
center_ids=c(1,2,4)
print("codebook")
print(data[center_ids, ])

# fixed
dists <- apply(data[center_ids, ], 1, 
               function(center) {
                  rowSums((data - rep(center,each=nrow(data)))^2)
              }
)
print("dists (correct)")
print(dists)

# buggy
dists <- apply(data[center_ids, ], 1, 
               function(center) {
                  rowSums((data - center)^2)
              }
)
print("buggy dists from LICORS")
dists
```
Output:

[1] "data"
     [,1] [,2]
[1,]    1    0
[2,]    2    0
[3,]    3    0
[4,]    4    0
[5,]    5    0
[6,]    6    0
[1] "codebook"
     [,1] [,2]
[1,]    1    0
[2,]    2    0
[3,]    4    0
[1] "dists (correct)"
     [,1] [,2] [,3]
[1,]    0    1    9
[2,]    1    0    4
[3,]    4    1    1
[4,]    9    4    0
[5,]   16    9    1
[6,]   25   16    4
[1] "buggy dists from LICORS"
     [,1] [,2] [,3]
[1,]    1    5   25
[2,]    4    4    4
[3,]    5    5   17
[4,]   16   16   16
[5,]   17   13   17
[6,]   36   36   36
 
--
Dr.-Ing. Bernd Fritzke

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to