Anders Rydbirk created SPARK-36553:
--------------------------------------

             Summary: KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced
                 Key: SPARK-36553
                 URL: https://issues.apache.org/jira/browse/SPARK-36553
             Project: Spark
          Issue Type: Bug
          Components: ML, MLlib, PySpark
    Affects Versions: 3.1.1
            Reporter: Anders Rydbirk
We are running KMeans on approximately 350M rows of x, y, z coordinates using the following configuration:
{code:java}
KMeans(
    featuresCol='features',
    predictionCol='centroid_id',
    k=50000,
    initMode='k-means||',
    initSteps=2,
    tol=0.00005,
    maxIter=20,
    seed=SEED,
    distanceMeasure='euclidean'
)
{code}
When using Spark 3.0.0 this worked fine, but after upgrading to 3.1.1 we consistently get errors unless we reduce K.

Stacktrace:
{code:java}
An error occurred while calling o167.fit.
: java.lang.NegativeArraySizeException: -897458648
	at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:194)
	at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:191)
	at scala.Array$.ofDim(Array.scala:221)
	at org.apache.spark.mllib.clustering.DistanceMeasure.computeStatistics(DistanceMeasure.scala:52)
	at org.apache.spark.mllib.clustering.KMeans.runAlgorithmWithWeight(KMeans.scala:280)
	at org.apache.spark.mllib.clustering.KMeans.runWithWeight(KMeans.scala:231)
	at org.apache.spark.ml.clustering.KMeans.$anonfun$fit$1(KMeans.scala:354)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
	at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:329)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Unknown Source)
{code}
The issue was introduced by [#27758|https://github.com/apache/spark/pull/27758/files#diff-725d4624ddf4db9cc51721c2ddaef50a1bc30e7b471e0439da28c5b5582efdfdR52], which significantly reduces the maximum usable value of K ([DistanceMeasure.scala|https://github.com/zhengruifeng/spark/blob/d31d488e0e48a82fd5b43c406f07b8c7d27dd53c/mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala#L52]):
{code:java}
val packedValues = Array.ofDim[Double](k * (k + 1) / 2)
{code}
*What we have tried:*
 * Reducing iterations
 * Reducing input volume
 * Reducing K

Only reducing K has yielded success.

*What we don't understand:*

Given the line of code above, we do not understand why we would get an integer overflow: for K = 50,000, packedValues should be allocated with a size of 1,250,025,000 < 2^31, which should not result in a negative array size. (A small arithmetic sketch is appended below.)

Please let me know if more information is needed; this is my first time raising a bug for an open-source project.
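For context, here is a minimal Scala sketch of the Int arithmetic on the line above (an illustration of the suspected overflow, not the actual Spark code). In 32-bit Int arithmetic, k * (k + 1) is evaluated before the division by 2, and for k = 50,000 the intermediate product already exceeds Int.MaxValue:

{code:scala}
object OverflowSketch extends App {
  val k = 50000

  // Int arithmetic, as on the DistanceMeasure.scala line:
  // 50000 * 50001 = 2,500,050,000 exceeds Int.MaxValue (2,147,483,647)
  // and wraps around to a negative value before the division happens.
  val product = k * (k + 1) // -1794917296 after overflow
  val size = product / 2    // -897458648, the exact size in the exception
  println(s"product = $product, size = $size")

  // Widening to Long before multiplying gives the intended size:
  val intended = k.toLong * (k + 1) / 2
  println(s"intended = $intended") // 1250025000, which does fit in an Int
}
{code}

The -897458648 printed here matches the value in the NegativeArraySizeException above, which suggests the overflow occurs in the intermediate product k * (k + 1) rather than in the final size.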