[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together

2021-03-09 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-34545:
---------------------------------
Fix Version/s: 3.0.3

> PySpark Python UDF return inconsistent results when applying 2 UDFs with 
> different return type to 2 columns together
> 
>
>                 Key: SPARK-34545
>                 URL: https://issues.apache.org/jira/browse/SPARK-34545
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Baohe Zhang
>            Assignee: Peter Toth
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> Python UDFs return inconsistent results depending on whether 2 columns are
> evaluated together or one by one.
> The issue appeared after we upgraded to Spark 3, so it does not seem to exist
> in Spark 2.
> How to reproduce it?
> {code:python}
> # Assumes an active SparkSession named `spark`, as in the pyspark shell.
> from pyspark.sql.functions import udf
> from pyspark.sql.types import DoubleType, IntegerType
>
> df = spark.createDataFrame(
>     [
>         ([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), (1, "2"), (1, "3")]),
>         ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, "2"), (2, "3")]),
>         ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), (3, "3")]),
>     ],
>     ['c1', 'c2'],
> )
>
> def getLastElementWithTimeMaster(data_type):
>     def getLastElementWithTime(list_elm):
>         # list_elm should be a list of (val, time) pairs
>         y = sorted(list_elm, key=lambda x: x[1])  # default is ascending
>         return y[-1][0]
>     return udf(getLastElementWithTime, data_type)
>
> # Add 2 columns which apply Python UDFs with different return types
> df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
> df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))
>
> # Show the results
> df.select("c3").show()
> df.select("c4").show()
> df.select("c3", "c4").show()
> {code}
> Results:
> {noformat}
> >>> df.select("c3").show()
> +---+
> | c3|
> +---+
> |1.0|
> |2.0|
> |3.1|
> +---+
> >>> df.select("c4").show()
> +---+
> | c4|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> >>> df.select("c3", "c4").show()
> +---+----+
> | c3|  c4|
> +---+----+
> |1.0|null|
> |2.0|null|
> |3.1|   3|
> +---+----+
> {noformat}
> The test was done in branch-3.1 local mode.
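
As a sanity check that does not involve Spark at all, the body of the UDF from the report can be run as plain Python over the same three rows. The sketch below is only illustrative: the names `last_element_with_time`, `rows`, `expected_c3` and `expected_c4` are introduced here and do not appear in the report. It shows the values each column should contain, which match the single-column selects in the report, so the nulls in the combined select would not appear to come from the UDF logic itself.

```python
# Plain-Python check of the UDF body from the report; no Spark involved.
# `last_element_with_time` and `rows` are illustrative names (assumptions),
# not identifiers taken from the report.

def last_element_with_time(list_elm):
    """Return the value of the (val, time) pair with the greatest time."""
    y = sorted(list_elm, key=lambda x: x[1])  # ascending by time
    return y[-1][0]

# The same data as in the report: each row is (c1, c2).
rows = [
    ([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), (1, "2"), (1, "3")]),
    ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, "2"), (2, "3")]),
    ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), (3, "3")]),
]

expected_c3 = [last_element_with_time(c1) for c1, _ in rows]
expected_c4 = [last_element_with_time(c2) for _, c2 in rows]

print(expected_c3)  # [1.0, 2.0, 3.1]
print(expected_c4)  # [1, 2, 3]
```

Note that in the first two rows the expected values of the two columns are numerically equal (1.0 and 1, 2.0 and 2), and these are exactly the rows where the combined select in the report shows null for c4.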



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together

2021-03-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-34545:
---------------------------------
Fix Version/s: 3.1.2

> PySpark Python UDF return inconsistent results when applying 2 UDFs with 
> different return type to 2 columns together
> 
>
>                 Key: SPARK-34545
>                 URL: https://issues.apache.org/jira/browse/SPARK-34545
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Baohe Zhang
>            Assignee: Peter Toth
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 3.2.0, 3.1.2
>
>



[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together

2021-02-28 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34545:
-------------------------------------
Labels: correctness  (was: )

> PySpark Python UDF return inconsistent results when applying 2 UDFs with 
> different return type to 2 columns together
> 
>
>                 Key: SPARK-34545
>                 URL: https://issues.apache.org/jira/browse/SPARK-34545
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Baohe Zhang
>            Priority: Blocker
>              Labels: correctness
>



[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together

2021-02-26 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34545:

Priority: Blocker  (was: Critical)

> PySpark Python UDF return inconsistent results when applying 2 UDFs with 
> different return type to 2 columns together
> 
>
>                 Key: SPARK-34545
>                 URL: https://issues.apache.org/jira/browse/SPARK-34545
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Baohe Zhang
>            Priority: Blocker
>



[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together

2021-02-25 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34545:

Summary: PySpark Python UDF return inconsistent results when applying 2 
UDFs with different return type to 2 columns together  (was: PySpark Python UDF 
return inconsistent results when applying UDFs to 2 columns together)

> PySpark Python UDF return inconsistent results when applying 2 UDFs with 
> different return type to 2 columns together
> 
>
>                 Key: SPARK-34545
>                 URL: https://issues.apache.org/jira/browse/SPARK-34545
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Baohe Zhang
>            Priority: Critical
>