[ https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390434#comment-16390434 ]
yogesh garg commented on SPARK-23562:
-------------------------------------

The error in question can be reproduced with the following Scala code:

```
// Assumes a spark-shell session, where `spark` is already defined.
import org.apache.spark.ml.feature.RFormula
import spark.implicits._

val d1 = spark.createDataFrame(Seq(
  (1001, "a"),
  (1002, "b")
)).toDF("id1", "c1")

// Boxed java.lang.Long so that the numeric column can hold a null.
// The literals need the L suffix: an Int literal does not convert to java.lang.Long.
val seq: Seq[(java.lang.Long, String)] = Seq(
  (20001L, "x"),
  (20002L, "y"),
  (null, null)
)
val d2 = seq.toDF("id2", "c2")

val dataset = d1.crossJoin(d2)
d1.show()
d2.show()
dataset.show()

// Fit and apply an RFormula with the given handleInvalid mode.
def test(mode: String) = {
  val formula = new RFormula()
    .setFormula("c1 ~ id2")
    .setHandleInvalid(mode)
  val model = formula.fit(dataset)
  val output = model.transform(dataset)
  println(model)
  println(mode)
  output.select("features", "label").show(truncate = false)
}

List("skip", "keep", "error").foreach(test)
```

> RFormula handleInvalid should handle invalid values in non-string columns.
> --------------------------------------------------------------------------
>
>                 Key: SPARK-23562
>                 URL: https://issues.apache.org/jira/browse/SPARK-23562
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Bago Amirbekian
>            Priority: Major
>
> Currently when handleInvalid is set to 'keep' or 'skip' this only applies to
> String fields. Numeric fields that are null will either cause the transformer
> to fail or might be null in the resulting label column.
> I'm not sure what the semantics of keep might be for numeric columns with
> null values, but we should be able to at least support skip for these types.
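
Until 'skip' covers numeric columns, one possible workaround is to drop the rows with null numeric values by hand before fitting. The following is only a minimal sketch under stated assumptions, not the fix proposed for this issue: the object name and the local SparkSession settings are made up for illustration, and the data mirrors the reproduction in this comment.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

// Hypothetical standalone driver; the name is illustrative only.
object SkipNullNumericRows {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for a quick check.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-23562-workaround-sketch")
      .getOrCreate()
    import spark.implicits._

    val d1 = Seq((1001, "a"), (1002, "b")).toDF("id1", "c1")

    // Boxed java.lang.Long so that id2 can hold a null.
    val rows: Seq[(java.lang.Long, String)] = Seq(
      (java.lang.Long.valueOf(20001L), "x"),
      (java.lang.Long.valueOf(20002L), "y"),
      (null, null)
    )
    val d2 = rows.toDF("id2", "c2")
    val dataset = d1.crossJoin(d2)

    // Manual "skip": remove rows whose numeric feature column is null,
    // so RFormula never sees a null in id2.
    val cleaned = dataset.na.drop(Seq("id2"))

    val formula = new RFormula()
      .setFormula("c1 ~ id2")
      .setHandleInvalid("skip")
    val output = formula.fit(cleaned).transform(cleaned)
    output.select("features", "label").show(truncate = false)

    spark.stop()
  }
}
```

Listing the column names explicitly in `na.drop` keeps the filter limited to the columns used in the formula, so nulls in unrelated columns such as c2 do not remove additional rows.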