[ https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390434#comment-16390434 ]
yogesh garg edited comment on SPARK-23562 at 3/7/18 11:33 PM:
--------------------------------------------------------------

The error in question can be reproduced with the following Scala code:

{code:scala}
import org.apache.spark.ml.feature.RFormula

import spark.implicits._

val d1 = spark.createDataFrame(Seq(
  (1001, "a"),
  (1002, "b")
)).toDF("id1", "c1")

// java.lang.Long is used so that the numeric column can hold a null value
val seq: Seq[(java.lang.Long, String)] = Seq(
  (20001L, "x"),
  (20002L, "y"),
  (null, null)
)
val d2 = seq.toDF("id2", "c2")

val dataset = d1.crossJoin(d2)
d1.show()
d2.show()
dataset.show()

def test(mode: String) = {
  val formula = new RFormula()
    .setFormula("c1 ~ id2")
    .setHandleInvalid(mode)
  val model = formula.fit(dataset)
  val output = model.transform(dataset)
  println(model)
  println(mode)
  output.select("features", "label").show(truncate = false)
}

List("skip", "keep", "error").foreach(test)
{code}

{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task ** in stage ** failed ** times, most recent failure: Lost task ** in stage ** (TID **, **, executor **): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$3: (struct<id2_double_rFormula_1b829d1fadd6:double>) => vector)
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.
{code}

> RFormula handleInvalid should handle invalid values in non-string columns.
> --------------------------------------------------------------------------
>
>                 Key: SPARK-23562
>                 URL: https://issues.apache.org/jira/browse/SPARK-23562
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Bago Amirbekian
>            Priority: Major
>
> Currently, when handleInvalid is set to 'keep' or 'skip', this only applies to
> string fields. Numeric fields that are null will either cause the transformer
> to fail or end up as null in the resulting label column.
> I'm not sure what the semantics of 'keep' should be for numeric columns with
> null values, but we should at least be able to support 'skip' for these types.
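
Until RFormula itself handles this, one possible workaround is to approximate 'skip' semantics by dropping rows whose numeric feature columns are null before fitting. This is a minimal sketch, not the proposed fix; it assumes the null-containing numeric columns (here {{id2}} from the repro above) are known ahead of time:

{code:scala}
import org.apache.spark.ml.feature.RFormula

// Workaround sketch: emulate handleInvalid = "skip" for numeric columns by
// removing rows where the numeric feature column "id2" is null.
// Assumes `dataset` from the reproduction code above.
val cleaned = dataset.na.drop(Seq("id2"))

val formula = new RFormula()
  .setFormula("c1 ~ id2")
  .setHandleInvalid("skip")   // still only affects string columns today

val model = formula.fit(cleaned)
model.transform(cleaned).select("features", "label").show(truncate = false)
{code}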