[ https://issues.apache.org/jira/browse/SPARK-38584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ruifeng Zheng updated SPARK-38584:
----------------------------------
    Priority: Major  (was: Minor)

> Unify the data validation
> -------------------------
>
>                 Key: SPARK-38584
>                 URL: https://issues.apache.org/jira/browse/SPARK-38584
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>             Fix For: 3.4.0
>
>
> 1. Input vector validation is missing in most algorithms. When the input
> dataset contains invalid values (NaN/Infinity):
> * training may run successfully and return a model containing invalid
> coefficients, as in LinearSVC
> * training may fail with an error message unrelated to the real cause, as in
> KMeans
>
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.classification._
> import org.apache.spark.ml.clustering._
>
> // Two rows, each with one invalid feature value (NaN / Infinity)
> val df = sc.parallelize(Seq(
>   LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
>   LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF()
>
> // LinearSVC: training succeeds, but the model contains only NaN
> val svc = new LinearSVC()
> val model = svc.fit(df)
> scala> model.intercept
> res0: Double = NaN
> scala> model.coefficients
> res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]
>
> // KMeans: training fails with a message that does not point at the input
> val km = new KMeans().setK(2)
> scala> km.fit(df)
> 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113)
> java.lang.IllegalArgumentException: requirement failed: Both norms should be greater or equal to 0.0, found norm1=NaN, norm2=Infinity
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
> {code}
>
> We should make ML algorithms fail fast if the input dataset is invalid.
>
> 2. Methods to validate input labels and weights already exist, but they are
> scattered across different files:
> * {{org.apache.spark.ml.functions}}
> * {{org.apache.spark.ml.util.DatasetUtils}}
> * {{org.apache.spark.ml.util.MetadataUtils}}
> * {{org.apache.spark.ml.Predictor}}
> * etc.
>
> I think it is time to unify the related methods into one source file.
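
For illustration only, the fail-fast behaviour requested in this issue could look roughly like the sketch below. It is plain Scala with no Spark dependency, and every name in it (ValidationSketch, checkFeatures, checkLabel, checkWeight) is hypothetical; it is not the code that was merged for SPARK-38584 and not an existing Spark API. It only assumes that a feature vector can be viewed as an Array[Double].

```scala
// Hypothetical sketch: one place that holds all input checks, so that every
// algorithm rejects invalid rows with the same clear message.
object ValidationSketch {

  // Fail fast if any feature value is NaN or +/-Infinity.
  def checkFeatures(features: Array[Double]): Unit = {
    features.indices.foreach { i =>
      val v = features(i)
      require(java.lang.Double.isFinite(v),
        s"Features must not contain NaN or Infinity, but found $v at index $i")
    }
  }

  // Labels must be finite numbers.
  def checkLabel(label: Double): Unit =
    require(java.lang.Double.isFinite(label),
      s"Label must be finite, but got $label")

  // Instance weights must be finite and non-negative (assumed rule).
  def checkWeight(weight: Double): Unit =
    require(java.lang.Double.isFinite(weight) && weight >= 0.0,
      s"Weight must be finite and non-negative, but got $weight")
}
```

With checks like these applied to each row before training, the two rows from the example above would be rejected with an IllegalArgumentException that names the offending value, instead of yielding NaN coefficients (LinearSVC) or a norm-related failure (KMeans).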