[ 
https://issues.apache.org/jira/browse/SPARK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anders Rydbirk updated SPARK-36553:
-----------------------------------
    Description: 
We are running KMeans on approximately 350M rows of x, y, z coordinates using 
the following configuration:
{code:java}
KMeans(
  featuresCol='features',
  predictionCol='centroid_id',
  k=50000,
  initMode='k-means||',
  initSteps=2,
  tol=0.00005,
  maxIter=20,
  seed=SEED,
  distanceMeasure='euclidean'
)
{code}
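For reference, a minimal Scala equivalent of the failing call (we drive it from PySpark in practice, hence the py4j frames in the trace below; the input DataFrame {{df}} with x, y, z columns is a stand-in for our actual data, and the seed is a placeholder):
{code:java}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the coordinate columns into the feature vector KMeans expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y", "z"))
  .setOutputCol("features")
val features = assembler.transform(df) // `df` stands in for the ~350M-row input

val kmeans = new KMeans()
  .setFeaturesCol("features")
  .setPredictionCol("centroid_id")
  .setK(50000)
  .setInitMode("k-means||")
  .setInitSteps(2)
  .setTol(0.00005)
  .setMaxIter(20)
  .setSeed(42L) // placeholder for SEED
  .setDistanceMeasure("euclidean")

val model = kmeans.fit(features) // throws NegativeArraySizeException on 3.1.1
{code}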
This worked fine on Spark 3.0.0, but after upgrading to 3.1.1 we consistently get errors unless we reduce K.

Stacktrace:
{code:java}
An error occurred while calling o167.fit.
: java.lang.NegativeArraySizeException: -897458648
	at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:194)
	at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:191)
	at scala.Array$.ofDim(Array.scala:221)
	at org.apache.spark.mllib.clustering.DistanceMeasure.computeStatistics(DistanceMeasure.scala:52)
	at org.apache.spark.mllib.clustering.KMeans.runAlgorithmWithWeight(KMeans.scala:280)
	at org.apache.spark.mllib.clustering.KMeans.runWithWeight(KMeans.scala:231)
	at org.apache.spark.ml.clustering.KMeans.$anonfun$fit$1(KMeans.scala:354)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
	at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:329)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Unknown Source)
{code}
 

The issue was introduced by [PR #27758|https://github.com/apache/spark/pull/27758], which significantly reduces the maximum workable value of K. Snippet of the line that throws the error, from DistanceMeasure.scala (line 52):
{code:java}
val packedValues = Array.ofDim[Double](k * (k + 1) / 2)
{code}
 

*What we have tried:*
 * Reducing iterations
 * Reducing input volume
 * Reducing K

Only reducing K has yielded success.

 

*Possible workarounds:*
 # Roll back to Spark 3.0.0. Since a KMeansModel generated with 3.0.0 cannot be loaded in 3.1.1, this means keeping the whole pipeline on 3.0.0.
 # Reduce K. Currently trying with 45,000.

 

*What we don't understand:*

Given the line of code above, we do not understand why we would get an integer overflow.

For K=50,000, packedValues should be allocated with a size of 1,250,025,000, which is less than 2^31 and should not result in a negative array size.
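
That said, our best guess (an assumption on our part, not verified against the compiled code) is that the overflow happens in the intermediate product: k is an Int, so k * (k + 1) is evaluated in 32-bit arithmetic and wraps around before the division by 2 is applied. A minimal Scala sketch of that arithmetic:
{code:java}
val k = 50000
// k * (k + 1) = 2,500,050,000 exceeds Int.MaxValue (2,147,483,647),
// so the 32-bit product wraps around to -1,794,917,296 before being halved.
val overflowed = k * (k + 1) / 2      // -897458648, the exact size in the exception
// Widening to Long before multiplying gives the intended size:
val intended = k.toLong * (k + 1) / 2 // 1250025000, which does fit in an Int
{code}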

 

*Suggested resolution:*

I'm not well versed in the inner workings of KMeans, but my immediate thought would be to fall back to the previous logic when K exceeds a set threshold, if the optimisation is to stay in place, since it currently breaks compatibility between 3.0.0 and 3.1.1 for these edge cases. A sketch of the size check itself follows below.
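
Either way, doing the size arithmetic in Long would avoid the silent wrap-around and allow either a clean fallback or a clear error message. A hypothetical sketch only, not a patch against the actual DistanceMeasure.computeStatistics code:
{code:java}
// Hypothetical guard: compute the packed triangular size in Long so the
// intermediate product cannot overflow, then check it fits in a JVM array.
val packedSize: Long = k.toLong * (k + 1) / 2
val packedValues: Array[Double] =
  if (packedSize <= Int.MaxValue) {
    Array.ofDim[Double](packedSize.toInt)
  } else {
    // e.g. fall back to the pre-3.1 code path here instead of throwing
    throw new IllegalArgumentException(
      s"k = $k needs a packed statistics array of $packedSize doubles, " +
        "which exceeds the maximum JVM array size")
  }
{code}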

 

Please let me know if more information is needed; this is my first time raising a bug for an open-source project.



> KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 
> was introduced
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36553
>                 URL: https://issues.apache.org/jira/browse/SPARK-36553
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, PySpark
>    Affects Versions: 3.1.1
>            Reporter: Anders Rydbirk
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
