subject:"\[jira\] \[Commented\] \(SPARK\-32635\) When pyspark.sql.functions.lit\(\) function is used with dataframe cache, it returns wrong result"

[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-18 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198256#comment-17198256
 ] 

Apache Spark commented on SPARK-32635:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/29805

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.0
>Reporter: Vinod KC
>Assignee: Peter Toth
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.2, 3.1.0
>
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> ++
> |col2|
> ++
> | 1|
> ++
> ++{code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached 
> ++
> |col2|
> ++
> | 2|
> ++
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-18 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198219#comment-17198219
 ] 

Apache Spark commented on SPARK-32635:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/29802

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.0
>Reporter: Vinod KC
>Assignee: Peter Toth
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.2, 3.1.0
>
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> ++
> |col2|
> ++
> | 1|
> ++
> ++{code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached 
> ++
> |col2|
> ++
> | 2|
> ++
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-16 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196910#comment-17196910
 ] 

Apache Spark commented on SPARK-32635:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/29771

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Vinod KC
>Priority: Major
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> ++
> |col2|
> ++
> | 1|
> ++
> ++{code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached 
> ++
> |col2|
> ++
> | 2|
> ++
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-09 Thread Peter Toth (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192740#comment-17192740
 ] 

Peter Toth commented on SPARK-32635:


I was able to repro the issue see the root cause. Started working on a fix.

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Vinod KC
>Priority: Major
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> ++
> |col2|
> ++
> | 1|
> ++
> ++{code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached 
> ++
> |col2|
> ++
> | 2|
> ++
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-08-21 Thread Vinod KC (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181800#comment-17181800
 ] 

Vinod KC commented on SPARK-32635:
--

[~sascha.baumanns], updated code section. Thanks

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Vinod KC
>Priority: Major
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> ++
> |col2|
> ++
> | 1|
> ++
> ++{code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached 
> ++
> |col2|
> ++
> | 2|
> ++
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-08-20 Thread Sascha Baumanns (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181076#comment-17181076
 ] 

Sascha Baumanns commented on SPARK-32635:
-

Please close the brackets in the first code section: final.show()

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Vinod KC
>Priority: Major
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show(
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> ++
> |col2|
> ++
> | 1|
> ++
> ++{code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached 
> ++
> |col2|
> ++
> | 2|
> ++
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

6 matches

Site Navigation

Mail list logo

Footer information