zhengruifeng created SPARK-38584:
------------------------------------

             Summary: Unify the data validation
                 Key: SPARK-38584
                 URL: https://issues.apache.org/jira/browse/SPARK-38584
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 3.4.0
            Reporter: zhengruifeng
1, input vector validation is missing in most algorithms. When the input dataset contains invalid values (NaN/Infinity):
* the training may run successfully but produce an invalid model, as with LinearSVC;
* the training may fail with an irrelevant error message, as with KMeans.

{code:java}
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.clustering._

val df = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
  LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF()

val svc = new LinearSVC()
val model = svc.fit(df)

scala> model.intercept
res0: Double = NaN

scala> model.coefficients
res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]

val km = new KMeans().setK(2)
scala> km.fit(df)
22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113)
java.lang.IllegalArgumentException: requirement failed: Both norms should be greater or equal to 0.0, found norm1=NaN, norm2=Infinity
        at scala.Predef$.require(Predef.scala:281)
        at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
{code}

2, related methods for validating the input dataset (like labels/weights) exist in {{org.apache.spark.ml.functions}}, {{org.apache.spark.ml.util.DatasetUtils}}, {{org.apache.spark.ml.util.MetadataUtils}}, etc. I think it is time to unify these related methods into one source file.
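A unified helper could look roughly like the sketch below. This is only an illustration of the idea, not the proposed API: the object name {{ValidationSketch}}, the method {{validateVectors}}, and the fail-fast-via-count strategy are all assumptions, and an actual implementation would likely live alongside the existing helpers in {{org.apache.spark.ml.util}}.

{code:java}
// Hypothetical sketch (names are illustrative, not the actual proposal):
// one shared helper that rejects feature vectors containing NaN/Infinity
// before training, so algorithms fail fast with a clear message instead of
// silently fitting an invalid model or throwing an unrelated error later.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.udf

object ValidationSketch {

  // true iff every element of the vector is finite
  private val vectorIsFinite = udf { v: Vector =>
    v.toArray.forall(d => !d.isNaN && !d.isInfinity)
  }

  def validateVectors(dataset: Dataset[Row], featuresCol: String): Unit = {
    val invalid = dataset.filter(!vectorIsFinite(dataset(featuresCol))).count()
    require(invalid == 0,
      s"Column $featuresCol contains $invalid row(s) with NaN/Infinity values")
  }
}
{code}

With a single entry point like this, each algorithm's {{fit}} could call the same check, rather than every estimator re-implementing (or omitting) its own validation.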