[ 
https://issues.apache.org/jira/browse/SPARK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-36553.
--------------------------------
    Fix Version/s: 3.1.3
                   3.3.0
                   3.2.2
         Assignee: zhengruifeng
       Resolution: Fixed

> KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 
> was introduced
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36553
>                 URL: https://issues.apache.org/jira/browse/SPARK-36553
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, PySpark
>    Affects Versions: 3.1.1
>            Reporter: Anders Rydbirk
>            Assignee: zhengruifeng
>            Priority: Major
>             Fix For: 3.1.3, 3.3.0, 3.2.2
>
>
> We are running KMeans on approximately 350M rows of x, y, z coordinates using 
> the following configuration:
> {code:python}
> KMeans(
>   featuresCol='features',
>   predictionCol='centroid_id',
>   k=50000,
>   initMode='k-means||',
>   initSteps=2,
>   tol=0.00005,
>   maxIter=20,
>   seed=SEED,
>   distanceMeasure='euclidean'
> )
> {code}
> With Spark 3.0.0 this worked fine, but after upgrading to 3.1.1 we 
> consistently get errors unless we reduce K.
> Stacktrace:
>  
> {code:java}
> An error occurred while calling o167.fit.: java.lang.NegativeArraySizeException: -897458648
>   at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:194)
>   at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:191)
>   at scala.Array$.ofDim(Array.scala:221)
>   at org.apache.spark.mllib.clustering.DistanceMeasure.computeStatistics(DistanceMeasure.scala:52)
>   at org.apache.spark.mllib.clustering.KMeans.runAlgorithmWithWeight(KMeans.scala:280)
>   at org.apache.spark.mllib.clustering.KMeans.runWithWeight(KMeans.scala:231)
>   at org.apache.spark.ml.clustering.KMeans.$anonfun$fit$1(KMeans.scala:354)
>   at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>   at scala.util.Try$.apply(Try.scala:213)
>   at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>   at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:329)
>   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.base/java.lang.reflect.Method.invoke(Unknown Source)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.base/java.lang.Thread.run(Unknown Source)
> {code}
>  
> The issue was introduced by 
> [#27758|#diff-725d4624ddf4db9cc51721c2ddaef50a1bc30e7b471e0439da28c5b5582efdfdR52]]
>  which significantly reduces the maximum value of K. Snippet of the line that 
> throws the error, from [DistanceMeasure.scala:|#L52]]
> {code:scala}
> val packedValues = Array.ofDim[Double](k * (k + 1) / 2)
> {code}
>  
> *What we have tried:*
>  * Reducing iterations
>  * Reducing input volume
>  * Reducing K
> Only reducing K has yielded success.
>  
> *Possible workaround:*
>  # Roll back to Spark 3.0.0 since a KMeansModel generated with 3.0.0 cannot 
> be loaded in 3.1.1.
>  # Reduce K. Currently trying with 45000.
>  
> *What we don't understand*:
> Given the line of code above, we do not understand why we would get an 
> integer overflow.
> For K=50,000, packedValues should be allocated with the size of 1,250,025,000 
> < (2^31) and not result in a negative array size.
>  
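One possible explanation for the question above, offered as a sketch rather than a confirmed diagnosis: if `k` is a 32-bit `Int` in the quoted Scala line (an assumption, though the negative size in the stack trace is consistent with it), then the intermediate product `k * (k + 1)` is evaluated before the division by 2. For K = 50,000 that product is 2,500,050,000, which exceeds 2^31 - 1 even though the final result would fit. The Python below emulates 32-bit wraparound; the helper names `to_int32`, `packed_size_int32`, and `max_safe_k` are hypothetical and not part of Spark.

```python
INT_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def to_int32(x: int) -> int:
    """Wrap an arbitrary Python int into the signed 32-bit range."""
    return (x + 2**31) % 2**32 - 2**31


def packed_size_int32(k: int) -> int:
    """Emulate `k * (k + 1) / 2` evaluated with 32-bit integer arithmetic."""
    product = to_int32(k * to_int32(k + 1))  # the multiplication happens first
    # JVM integer division truncates toward zero, unlike Python's floor division.
    half = abs(product) // 2
    return half if product >= 0 else -half


def max_safe_k() -> int:
    """Largest k whose intermediate product k * (k + 1) still fits in 32 bits."""
    k = 1
    while (k + 1) * (k + 2) <= INT_MAX:
        k += 1
    return k


if __name__ == "__main__":
    print(packed_size_int32(50000))  # -897458648, the size in the stack trace
    print(packed_size_int32(45000))  # 1012522500, no overflow
    print(max_safe_k())              # 46340
```

Under this assumption, the workaround of K = 45,000 stays below the overflow threshold, while any K above 46,340 would wrap around.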
> *Suggested resolution:*
> I'm not strong on the inner workings of KMeans, but my immediate thought 
> would be to add a fallback to the previous logic for K larger than a set 
> threshold if the optimisation is to stay in place, as it breaks compatibility 
> from 3.0.0 to 3.1.1 for edge cases.
>  
> Please let me know if more information is needed; this is my first time 
> raising a bug for an open-source project.


