Baohe Zhang created SPARK-34545:
-----------------------------------

             Summary: PySpark Python UDF return inconsistent results when applying UDFs to 2 columns together
                 Key: SPARK-34545
                 URL: https://issues.apache.org/jira/browse/SPARK-34545
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.0.0
            Reporter: Baohe Zhang
A Python UDF returns inconsistent results between evaluating 2 columns together and evaluating them one by one. The issue appeared after we upgraded to Spark 3, so it doesn't seem to exist in Spark 2.

How to reproduce it?

{code:python}
df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), (1, "2"), (1, "3")]),
                            ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, "2"), (2, "3")]),
                            ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), (3, "3")])],
                           ['c1', 'c2'])

from pyspark.sql.functions import udf
from pyspark.sql.types import *

def getLastElementWithTimeMaster(data_type):
    def getLastElementWithTime(list_elm):
        """list_elm should be a list of (val, time); val can be a single element or a list"""
        y = sorted(list_elm, key=lambda x: x[1])  # default is ascending
        return y[-1][0]
    return udf(getLastElementWithTime, data_type)

# Add 2 columns which apply the Python UDF
df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))

# Show the results
df.select("c3").show()
df.select("c4").show()
df.select("c3", "c4").show()
{code}

Results:

{noformat}
>>> df.select("c3").show()
+---+
| c3|
+---+
|1.0|
|2.0|
|3.1|
+---+

>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
|  2|
|  3|
+---+

>>> df.select("c3", "c4").show()
+---+----+
| c3|  c4|
+---+----+
|1.0|{color:red}null{color}|
|2.0|{color:red}null{color}|
|3.1|   3|
+---+----+
{noformat}

The test was done on branch-3.1 in local mode.
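For completeness, a small sketch that surfaces the inconsistency programmatically via collect() rather than show(), so the unexpected nulls can be asserted in a test. This assumes the same df, c3, and c4 definitions from the reproduction above; the expected-vs-actual values in the comments are the ones reported there.

{code:python}
# Collect the two derived columns separately and together; the per-row values
# should match regardless of how the columns are selected.
separate = [(r1["c3"], r2["c4"])
            for r1, r2 in zip(df.select("c3").collect(), df.select("c4").collect())]
together = [(r["c3"], r["c4"]) for r in df.select("c3", "c4").collect()]

# On branch-3.1 local mode this assertion fails: `together` contains
# (1.0, None) and (2.0, None) while `separate` contains (1.0, 1) and (2.0, 2).
assert separate == together, f"Inconsistent UDF results: {separate} vs {together}"
{code}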