[ 
https://issues.apache.org/jira/browse/MATH-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruiqi Dong updated MATH-1683:
-----------------------------
    Priority: Minor  (was: Major)

> ElkanKMeansPlusPlusClusterer throws ArrayIndexOutOfBoundsException when all 
> input points are identical
> ------------------------------------------------------------------------------------------------------
>
>                 Key: MATH-1683
>                 URL: https://issues.apache.org/jira/browse/MATH-1683
>             Project: Commons Math
>          Issue Type: Bug
>          Components: legacy
>    Affects Versions: 4.0-beta1
>            Reporter: Ruiqi Dong
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The {{ElkanKMeansPlusPlusClusterer}} class throws an 
> {{ArrayIndexOutOfBoundsException}} when attempting to cluster a dataset where 
> all points are identical and k > 1. This occurs during the k-means++ 
> initialization phase in the {{seed()}} method.
> *Test Case:*
>  
> {code:java}
> @Test
> public void testAllPointsIdentical() {
>     // Create 4 identical points
>     List<DoublePoint> points = Arrays.asList(
>         new DoublePoint(new double[]{0.0, 0.0}),
>         new DoublePoint(new double[]{0.0, 0.0}),
>         new DoublePoint(new double[]{0.0, 0.0}),
>         new DoublePoint(new double[]{0.0, 0.0})
>     );
>     
>     // Attempt to cluster into 2 clusters
>     ElkanKMeansPlusPlusClusterer<DoublePoint> clusterer = 
>         new ElkanKMeansPlusPlusClusterer<>(2);
>     
>     // This throws ArrayIndexOutOfBoundsException
>     clusterer.cluster(points);
> }{code}
> *Test Result:*
>  
>  
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: -1
>     at java.util.ArrayList.elementData(ArrayList.java:424)
>     at java.util.ArrayList.get(ArrayList.java:437)
>     at 
> org.apache.commons.math4.legacy.ml.clustering.ElkanKMeansPlusPlusClusterer.seed(ElkanKMeansPlusPlusClusterer.java:244)
>     at 
> org.apache.commons.math4.legacy.ml.clustering.ElkanKMeansPlusPlusClusterer.cluster(ElkanKMeansPlusPlusClusterer.java:128)
>  {code}
> *Root Cause Analysis:*
> In the {{seed()}} method, the k-means++ initialization algorithm fails when 
> all points are identical:
>  # After selecting the first center, all distances to this center are 0 
> (since all points are identical)
>  # {{sumSqDist}} becomes 0 as all {{minDistances[i] = 0}}
>  # In subsequent iterations, {{p = sumSqDist * random.nextDouble()}} equals 0
>  # The loop {{for (double cdf = 0; cdf < p; next++)}} never executes because 
> {{0 < 0}} is false
>  # {{next}} remains 0, causing {{points.get(next - 1)}} to access index -1
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to