[ https://issues.apache.org/jira/browse/SPARK-12981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-12981:
------------------------------
    Assignee: Davies Liu

> Dataframe distinct() followed by a filter(udf) in pyspark throws a casting error
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-12981
>                 URL: https://issues.apache.org/jira/browse/SPARK-12981
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.6.0
>         Environment: Running on Mac OSX (El Capitan) with Spark 1.6 (Java 1.8)
>            Reporter: Tom Arnfeld
>            Assignee: Davies Liu
>            Priority: Critical
>             Fix For: 2.0.0
>
>
> We noticed a regression when testing out an upgrade of Spark 1.6 for our
> systems, where pyspark throws a casting exception when using `filter(udf)`
> after a `distinct` operation on a DataFrame. This does not occur on Spark 1.5.
> Here's a little notebook that demonstrates the exception clearly...
> https://gist.github.com/tarnfeld/ab9b298ae67f697894cd
> Though for the sake of here... the following code will throw an exception...
> {code}
> data.select(col("a")).distinct().filter(my_filter(col("a"))).count()
> {code}
> {code}
> java.lang.ClassCastException:
> org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to
> org.apache.spark.sql.catalyst.plans.logical.Aggregate
> {code}
> Whereas not using a UDF does not throw any errors...
> {code}
> data.select(col("a")).distinct().filter("a = 1").count()
> {code}
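
The snippets in the report leave `data` and `my_filter` undefined. A minimal self-contained sketch of the reproduction might look like the following. It assumes a local PySpark installation, a one-column DataFrame with a few sample rows, and a UDF that mirrors the `a = 1` string predicate. None of these details are taken from the linked notebook, and the names `is_one` and `run_queries` are hypothetical.

```python
# Hypothetical reproduction sketch for SPARK-12981.
# `data`, `my_filter` and the sample rows are assumptions, not the
# reporter's actual notebook contents.


def is_one(value):
    """Plain-Python predicate, wrapped as a UDF below (assumed to mirror `a = 1`)."""
    return value == 1


def run_queries():
    """Run the string-predicate and UDF-predicate queries and return both counts."""
    # Imported lazily so the predicate above can be checked without Spark.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import BooleanType

    sc = SparkContext("local[1]", "SPARK-12981-repro")
    try:
        sqlContext = SQLContext(sc)
        # Duplicate rows so that distinct() has something to collapse.
        data = sqlContext.createDataFrame([(1,), (1,), (2,)], ["a"])
        my_filter = udf(is_one, BooleanType())

        # String predicate after distinct(): reported to work.
        plain = data.select(col("a")).distinct().filter("a = 1").count()

        # UDF predicate after distinct(): reported to raise the
        # ClassCastException on Spark 1.6.0.
        with_udf = (
            data.select(col("a")).distinct().filter(my_filter(col("a"))).count()
        )
        return plain, with_udf
    finally:
        sc.stop()
```

Calling `run_queries()` on an affected build should surface the exception from the second query, while on an unaffected build the two counts would be expected to match.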