[ https://issues.apache.org/jira/browse/SPARK-16164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348438#comment-15348438 ]
Xiangrui Meng commented on SPARK-16164:
---------------------------------------

[~lian cheng] See my last comment on GitHub: I didn't use a UDF explicitly in a filter expression. It was like the following:

{code}
filter b > 0
set a = udf(b)
filter a > 2
{code}

and the optimizer merged them into

{code}
filter udf(b) > 2 and b > 0
{code}

This could happen with any UDF that throws exceptions. Are we assuming that UDFs never throw exceptions?

> CombineFilters should keep the ordering in the logical plan
> ------------------------------------------------------------
>
>                 Key: SPARK-16164
>                 URL: https://issues.apache.org/jira/browse/SPARK-16164
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Dongjoon Hyun
>             Fix For: 2.0.1, 2.1.0
>
>
> [~cmccubbin] reported a bug when he used StringIndexer in an ML pipeline with additional filters. It seems that during filter pushdown, we changed the ordering in the logical plan. I'm not sure whether we should treat this as a bug.
> {code}
> val df1 = (0 until 3).map(_.toString).toDF
> val indexer = new StringIndexer()
>   .setInputCol("value")
>   .setOutputCol("idx")
>   .setHandleInvalid("skip")
>   .fit(df1)
> val df2 = (0 until 5).map(_.toString).toDF
> val predictions = indexer.transform(df2)
> predictions.show() // this is okay
> predictions.where('idx > 2).show() // this will throw an exception
> {code}
> Please see the notebook at
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html
> for error messages.
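For illustration, here is a minimal sketch of the reordering described in the comment above, without StringIndexer. The SparkSession named {{spark}}, the column name {{b}}, and the {{strictLog}} UDF are hypothetical names introduced for this sketch, not taken from the original report; whether the exception actually fires depends on the Spark version, since the fix went into 2.0.1 and 2.1.0.

{code}
// Minimal sketch (assumed names, not from the original report): a UDF that is
// only safe behind a preceding filter, followed by a filter on its output.
import org.apache.spark.sql.functions.udf
import spark.implicits._   // assumes an existing SparkSession named `spark`

val strictLog = udf { (b: Double) =>
  require(b > 0, s"log undefined for $b")   // throws for non-positive input
  math.log(b)
}

val df = Seq(-1.0, 0.5, 10.0, 100.0).toDF("b")

val guarded = df
  .filter($"b" > 0)                       // guard that should run first
  .withColumn("a", strictLog($"b"))
  .filter($"a" > 2)                       // may be pushed down past the projection

// If the optimizer rewrites this into filter(udf(b) > 2 AND b > 0) and evaluates
// the UDF predicate first, the require() above can fire for b = -1.0.
guarded.show()
{code}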