[ https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Takeshi Yamamuro updated SPARK-34545:
-------------------------------------
    Labels: correctness  (was: )

> PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34545
>                 URL: https://issues.apache.org/jira/browse/SPARK-34545
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Baohe Zhang
>            Priority: Blocker
>              Labels: correctness
>
> A Python UDF returns inconsistent results depending on whether two columns
> are evaluated together or one at a time.
> The issue appeared after we upgraded to Spark 3, so it does not seem to
> exist in Spark 2.
> How to reproduce it?
> {code:python}
> from pyspark.sql.functions import udf
> from pyspark.sql.types import DoubleType, IntegerType
>
> df = spark.createDataFrame([
>     ([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), (1, "2"), (1, "3")]),
>     ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, "2"), (2, "3")]),
>     ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), (3, "3")]),
> ], ['c1', 'c2'])
>
> def getLastElementWithTimeMaster(data_type):
>     def getLastElementWithTime(list_elm):
>         # list_elm should be a list of (val, time) pairs
>         y = sorted(list_elm, key=lambda x: x[1])  # default is ascending
>         return y[-1][0]
>     return udf(getLastElementWithTime, data_type)
>
> # Add 2 columns which apply the Python UDF
> df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
> df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))
>
> # Show the results
> df.select("c3").show()
> df.select("c4").show()
> df.select("c3", "c4").show()
> {code}
> Results:
> {noformat}
> >>> df.select("c3").show()
> +---+
> | c3|
> +---+
> |1.0|
> |2.0|
> |3.1|
> +---+
> >>> df.select("c4").show()
> +---+
> | c4|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> >>> df.select("c3", "c4").show()
> +---+----+
> | c3|  c4|
> +---+----+
> |1.0|null|
> |2.0|null|
> |3.1|   3|
> +---+----+
> {noformat}
> The test was done on branch-3.1 in local mode.
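
For comparison with the tables in the report, the UDF body can be run as plain Python over the same sample rows, with no Spark session involved. This is only a sketch: the names `get_last_element_with_time`, `rows`, `expected_c3` and `expected_c4` are illustrative and do not come from the report, and the check says nothing about where the discrepancy arises inside Spark.

```python
# Plain-Python check of the UDF body from the report; no Spark needed.
# All names below are illustrative, not part of the original reproduction.

def get_last_element_with_time(list_elm):
    # list_elm is a list of (value, time) pairs.
    # Sort ascending by time and return the value of the latest entry.
    y = sorted(list_elm, key=lambda x: x[1])
    return y[-1][0]

# Same sample data as the reproduction: each row is (c1, c2).
rows = [
    ([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), (1, "2"), (1, "3")]),
    ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, "2"), (2, "3")]),
    ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), (3, "3")]),
]

expected_c3 = [get_last_element_with_time(c1) for c1, _ in rows]
expected_c4 = [get_last_element_with_time(c2) for _, c2 in rows]

print(expected_c3)  # [1.0, 2.0, 3.1]
print(expected_c4)  # [1, 2, 3]
```

These values match the single-column `select("c3")` and `select("c4")` outputs, so the `null` entries for `c4` appear only in the combined `select("c3", "c4")` output.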