[jira] [Commented] (SPARK-32569) Gaussian can not handle data close to MaxDouble

Tobias Haar (Jira) Mon, 24 Aug 2020 13:32:33 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-32569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183583#comment-17183583
 ]


Tobias Haar commented on SPARK-32569:
-------------------------------------

Thanks for your feedback! I have trimmed the stack trace and hopefully removed 
most of the noise.

Sorry for the confusion, I was indeed talking about GaussianMixtureModels. The 
values I was referring to are in the dataset linked in the description. 
Fitting the model on that data crashes because the conversion of the input 
data fails (starting from 
org.apache.spark.ml.clustering.GaussianMixture.fit(GaussianMixture.scala:374) 
in the stack trace).

I agree that k-means and GaussianMixtureModels are not the same and will 
therefore handle overflow/convergence problems differently. Real-world input 
will only rarely be that large, but in some physics applications extreme 
numerical values in the MaxDouble range are not uncommon. My point is that no 
matter how extreme the data is, there should not be crashes, and to me the 
stack trace above looks like extreme data does lead to a crash here (unless 
the NotConvergedException is thrown deliberately to handle this case). Please 
correct me if I am wrong.
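
To illustrate the kind of overflow I suspect here, below is a minimal Python 
sketch. It is not the Spark or Breeze code path, and I have not verified that 
this is what actually happens inside GaussianMixture. The sample values and 
the function names naive_variance and scaled_variance are made up for 
illustration. It only shows that squaring deviations of magnitude 10^306 
exceeds the double range, so a variance computed that way becomes infinite, 
while the same computation on rescaled values stays finite.

```python
import math


def naive_variance(values):
    """Population variance in plain double arithmetic, no rescaling."""
    n = len(values)
    mean = sum(values) / n
    # Each squared deviation is a float multiplication; it yields inf on overflow.
    return sum((v - mean) * (v - mean) for v in values) / n


def scaled_variance(values):
    """Variance of the values divided by their largest magnitude, plus that scale."""
    scale = max(abs(v) for v in values)
    scaled = [v / scale for v in values]
    return naive_variance(scaled), scale


# Hypothetical one-dimensional sample with magnitudes around 10^306 (assumption,
# not taken from the MaxDouble.arff dataset).
sample = [1.0e306, -1.0e306, 5.0e306, -5.0e306]

overflowed = naive_variance(sample)        # squared deviations overflow to inf
undefined = overflowed - overflowed        # inf - inf is NaN
rescaled, scale = scaled_variance(sample)  # finite, in units of scale**2

print(overflowed, undefined, rescaled)
```

An eigendecomposition of a covariance matrix that contains inf or NaN entries 
cannot be expected to converge, which would be consistent with the 
NotConvergedException, but that connection is an assumption on my part.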

> Gaussian can not handle data close to MaxDouble
> -----------------------------------------------
>
>                 Key: SPARK-32569
>                 URL: https://issues.apache.org/jira/browse/SPARK-32569
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 3.0.0
>         Environment: Running Spark in local mode within java application on 
> Windows 10
>            Reporter: Tobias Haar
>            Priority: Major
>
> Running Gaussian from Apache Spark MLlib with [this 
> dataset|https://user.informatik.uni-goettingen.de/~sherbol/MaxDouble.arff] 
> containing values close to MaxDouble (values >10^306) results in the error 
> below. KMeans and Bisecting KMeans can both handle the same dataset, which 
> raises the question of whether this is a bug or expected behavior.
> Stacktrace:
> org.apache.spark.SparkException: Failed to execute user defined 
> function(GaussianMixtureModel$$Lambda$2841/0x00000001003ab040: 
> (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1070)
>  at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
>  at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:83)
>  at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$17.applyOrElse(Optimizer.scala:1502)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286)
>  at 
> org.apache.spark.ml.clustering.ClusteringSummary.clusterSizes$lzycompute(ClusteringSummary.scala:49)
>  at 
> org.apache.spark.ml.clustering.GaussianMixture.fit(GaussianMixture.scala:374)
>  Caused by: breeze.linalg.NotConvergedException: 
>  at breeze.linalg.eigSym$.breeze$linalg$eigSym$$doEigSym(eig.scala:164)
>  at breeze.linalg.eigSym$EigSym_DM_Impl$.apply(eig.scala:111)
>  at breeze.linalg.eigSym$EigSym_DM_Impl$.apply(eig.scala:109)
>  at breeze.generic.UFunc.apply(UFunc.scala:46)
>  at breeze.generic.UFunc.apply$(UFunc.scala:45)
>  at breeze.linalg.eigSym$.apply(eig.scala:106)
>  at 
> org.apache.spark.ml.stat.distribution.MultivariateGaussian.calculateCovarianceConstants(MultivariateGaussian.scala:117)
>  at 
> org.apache.spark.ml.stat.distribution.MultivariateGaussian.x$1$lzycompute(MultivariateGaussian.scala:58)
>  at 
> org.apache.spark.ml.stat.distribution.MultivariateGaussian.x$1(MultivariateGaussian.scala:58)
>  at 
> org.apache.spark.ml.clustering.GaussianMixtureModel$.computeProbabilities(GaussianMixture.scala:287)
>  at 
> org.apache.spark.ml.clustering.GaussianMixtureModel.predictProbability(GaussianMixture.scala:171)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
