Tim Sell created SPARK-17100:
--------------------------------

             Summary: pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException
                 Key: SPARK-17100
                 URL: https://issues.apache.org/jira/browse/SPARK-17100
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.0
         Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3.
            Reporter: Tim Sell
In PySpark, filtering on a UDF-derived column after certain join types causes the optimized logical plan to fail with a java.lang.UnsupportedOperationException. I could not replicate this in Scala code from the shell, only in Python. It is a PySpark regression from Spark 1.6.2.

This can be replicated with: bin/spark-submit bug.py

{code:python:title=bug.py}
import pyspark.sql.functions as F
from pyspark.sql import Row, SparkSession


if __name__ == '__main__':
    spark = SparkSession.builder.appName("test").getOrCreate()

    left = spark.createDataFrame([Row(a=1)])
    right = spark.createDataFrame([Row(a=1)])

    df = left.join(right, on='a', how='left_outer')
    df = df.withColumn('b', F.udf(lambda x: 'x')(df.a))
    df = df.filter('b = "x"')

    df.explain(extended=True)
{code}

The output is:

{code}
== Parsed Logical Plan ==
'Filter ('b = x)
+- Project [a#0L, <lambda>(a#0L) AS b#8]
   +- Project [a#0L]
      +- Join LeftOuter, (a#0L = a#3L)
         :- LogicalRDD [a#0L]
         +- LogicalRDD [a#3L]

== Analyzed Logical Plan ==
a: bigint, b: string
Filter (b#8 = x)
+- Project [a#0L, <lambda>(a#0L) AS b#8]
   +- Project [a#0L]
      +- Join LeftOuter, (a#0L = a#3L)
         :- LogicalRDD [a#0L]
         +- LogicalRDD [a#3L]

== Optimized Logical Plan ==
java.lang.UnsupportedOperationException: Cannot evaluate expression: <lambda>(input[0, bigint, true])

== Physical Plan ==
java.lang.UnsupportedOperationException: Cannot evaluate expression: <lambda>(input[0, bigint, true])
{code}

It fails when the join is:
* how='outer', on=column expression
* how='left_outer', on=string or column expression
* how='right_outer', on=string or column expression

It passes when the join is:
* how='inner', on=string or column expression
* how='outer', on=string

I made some tests to demonstrate each of these; a sketch of the idea is shown below. Run with: bin/spark-submit test_bug.py
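The attached test_bug.py is not reproduced here, but a minimal sketch of such a test matrix could look like the following. The loop structure, the helper logic, and the pass/fail reporting are assumptions for illustration only, not the reporter's actual script:

{code:python}
# Sketch only (assumed structure; not the attached test_bug.py):
# run the filter-on-UDF-column-after-join pattern across the join variants
# listed above and report which combinations fail at execution time.
import pyspark.sql.functions as F
from pyspark.sql import Row, SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.appName("test").getOrCreate()

    for how in ('inner', 'outer', 'left_outer', 'right_outer'):
        for on_kind in ('string', 'column expression'):
            left = spark.createDataFrame([Row(a=1)])
            right = spark.createDataFrame([Row(a=1)])
            on = 'a' if on_kind == 'string' else (left.a == right.a)

            df = left.join(right, on=on, how=how)
            df = df.withColumn('b', F.udf(lambda x: 'x')(left.a))
            df = df.filter('b = "x"')

            try:
                df.collect()  # forces planning/execution; failing variants raise here
                result = 'passes'
            except Exception as exc:
                result = 'fails (%s)' % type(exc).__name__
            print('how=%s, on=%s: %s' % (how, on_kind, result))
{code}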