zhengruifeng created SPARK-38584:
------------------------------------

             Summary: Unify the data validation
                 Key: SPARK-38584
                 URL: https://issues.apache.org/jira/browse/SPARK-38584
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 3.4.0
            Reporter: zhengruifeng


1, Input vector validation is missing in most algorithms. When the input 
dataset contains invalid values (NaN/Infinity):
 * training may run successfully but produce an invalid model, like LinearSVC;
 * training may fail with an irrelevant error message, like KMeans.

 
{code:scala}
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.clustering._
val df = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
  LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF()

val svc = new LinearSVC()
val model = svc.fit(df)

scala> model.intercept
res0: Double = NaN

scala> model.coefficients
res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]

val km = new KMeans().setK(2)
scala> km.fit(df)
22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113)
java.lang.IllegalArgumentException: requirement failed: Both norms should be greater or equal to 0.0, found norm1=NaN, norm2=Infinity
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
{code}
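
The check itself is simple; below is a hedged sketch of the kind of fail-fast validation that could run before training (the helper name {{requireFiniteFeatures}} and its placement are hypothetical, not an existing Spark API):

{code:scala}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.udf

// Hypothetical helper: fail fast if the features column contains NaN/Infinity.
def requireFiniteFeatures(dataset: Dataset[_], featuresCol: String): Unit = {
  // Flag any vector containing a non-finite value
  val hasInvalid = udf { v: Vector =>
    v.toArray.exists(x => x.isNaN || x.isInfinity)
  }
  val badRows = dataset.filter(hasInvalid(dataset(featuresCol))).count()
  require(badRows == 0,
    s"Found $badRows row(s) with NaN/Infinity values in column '$featuresCol'")
}

// e.g. requireFiniteFeatures(df, "features") would throw before fit() is called
{code}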
 

2, Related methods to validate the input dataset (like labels/weights) already 
exist in {{org.apache.spark.ml.functions}}, {{org.apache.spark.ml.util.DatasetUtils}}, 
{{org.apache.spark.ml.util.MetadataUtils}}, etc.

 

I think it is time to unify these related methods into one source file.
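
A hypothetical sketch of what such a unified utility could look like (none of these names exist in Spark today; in practice it would presumably live under {{org.apache.spark.ml.util}}):

{code:scala}
import org.apache.spark.ml.linalg.Vector

// Hypothetical sketch only: one possible shape for a single shared
// validation utility, consolidating the checks currently spread across
// ml.functions, DatasetUtils and MetadataUtils.
object DataValidators {

  // Reject vectors containing NaN or +/-Infinity
  def checkVector(v: Vector): Unit =
    require(v.toArray.forall(x => !x.isNaN && !x.isInfinity),
      s"Vector values must be finite, but got $v")

  // Labels must be finite
  def checkLabel(label: Double): Unit =
    require(!label.isNaN && !label.isInfinity,
      s"Label must be finite, but got $label")

  // Weights must be finite and non-negative
  def checkWeight(weight: Double): Unit =
    require(weight >= 0.0 && !weight.isNaN && !weight.isInfinity,
      s"Weight must be finite and non-negative, but got $weight")
}
{code}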

 


