[jira] [Updated] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-41049:
---
Labels: correctness pull-request-available  (was: correctness)

> Nondeterministic expressions have unstable values if they are children of 
> CodegenFallback expressions
> -
>
> Key: SPARK-41049
> URL: https://issues.apache.org/jira/browse/SPARK-41049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Guy Boo
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness, pull-request-available
> Fix For: 3.4.0
>
>
> h2. Expectation
> For a given row, Nondeterministic expressions are expected to have stable 
> values.
> {code:scala}
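> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types.IntegerType
> val df = sparkContext.parallelize(1 to 5).toDF("x")
> // scale rand() to [0, 10000) before truncating to an integer
> val v1 = rand().*(lit(10000)).cast(IntegerType)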
> df.select(v1, v1).collect{code}
> Returns a set like this:
> |8777|8777|
> |1357|1357|
> |3435|3435|
> |9204|9204|
> |3870|3870|
> where both columns always have the same value, but what that value is changes 
> from row to row. This is different from the following:
> {code:scala}
> df.select(rand(), rand()).collect{code}
> In this case, because the rand() calls are distinct, the values in both 
> columns should be different.
> h2. Problem
> This expectation does not appear to hold when the Nondeterministic expression
> is a child of a CodegenFallback expression. This program:
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> val sparkSession = SparkSession.builder().getOrCreate()
> import sparkSession.implicits._ // provides toDF outside spark-shell
> val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
> // scale rand() to [0, 10000) before truncating to an integer
> val v1 = rand().*(lit(10000)).cast(IntegerType)
> val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
> df.select(v1, v1, v2, v2).collect {code}
> produces output like this:
> |8159|8159|8159|{color:#ff}2028{color}|
> |8320|8320|8320|{color:#ff}1640{color}|
> |7937|7937|7937|{color:#ff}769{color}|
> |436|436|436|{color:#ff}8924{color}|
> |8924|8924|2827|{color:#ff}2731{color}|
> Not sure why the first call via the CodegenFallback path should be correct 
> while subsequent calls aren't.
> h2. Workaround
> If the Nondeterministic expression is moved to a separate, earlier select() 
> call, so the CodegenFallback instead only refers to a column reference, then 
> the problem seems to go away. But this workaround may not be reliable if 
> optimization is ever able to restructure adjacent select()s.
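
The workaround described above can be sketched roughly as follows. This is a minimal sketch, not a confirmed fix: the object name `Spark41049Workaround`, the method `run`, the intermediate DataFrame `withRand`, and the column name `a` are illustrative names that do not appear in the report, and the caller is assumed to supply a SparkSession. As the report itself notes, the approach may stop working if the optimizer is ever able to merge the two select() calls.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

object Spark41049Workaround {
  // Returns rows of (a, a, to_csv(a), to_csv(a)); the intent is that all
  // four values agree within a row.
  def run(spark: SparkSession): Array[Row] = {
    import spark.implicits._ // provides toDF
    val df = spark.sparkContext.parallelize(1 to 5).toDF("x")

    // Step 1: evaluate the nondeterministic expression in its own, earlier
    // select() and give the result a column name.
    val withRand = df.select(
      col("x"),
      (rand() * lit(10000)).cast(IntegerType).as("a"))

    // Step 2: the CodegenFallback expression (to_csv) now has only a
    // column reference as its child, not the rand() expression itself.
    val csv = to_csv(struct(col("a")))
    withRand.select(col("a"), col("a"), csv, csv).collect()
  }
}
```

Calling `Spark41049Workaround.run(spark)` with a local session and comparing the four values in each row is one way to check whether the workaround holds on a given Spark version.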



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions

2023-01-06 Thread Guy Boo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guy Boo updated SPARK-41049:

Labels: correctness  (was: )

> Nondeterministic expressions have unstable values if they are children of 
> CodegenFallback expressions
> -
>
> Key: SPARK-41049
> URL: https://issues.apache.org/jira/browse/SPARK-41049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Guy Boo
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
> Fix For: 3.4.0
>
>






[jira] [Updated] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions

2022-11-08 Thread Guy Boo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guy Boo updated SPARK-41049:

Description: 
h2. Expectation

For a given row, Nondeterministic expressions are expected to have stable 
values.
{code:scala}
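import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
val df = sparkContext.parallelize(1 to 5).toDF("x")
// scale rand() to [0, 10000) before truncating to an integer
val v1 = rand().*(lit(10000)).cast(IntegerType)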
df.select(v1, v1).collect{code}
Returns a set like this:
|8777|8777|
|1357|1357|
|3435|3435|
|9204|9204|
|3870|3870|

where both columns always have the same value, but what that value is changes 
from row to row. This is different from the following:
{code:scala}
df.select(rand(), rand()).collect{code}
In this case, because the rand() calls are distinct, the values in both columns 
should be different.
h2. Problem

This expectation does not appear to hold when the Nondeterministic expression
is a child of a CodegenFallback expression. This program:
{code:scala}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._ // provides toDF outside spark-shell
val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
// scale rand() to [0, 10000) before truncating to an integer
val v1 = rand().*(lit(10000)).cast(IntegerType)
val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
df.select(v1, v1, v2, v2).collect {code}
produces output like this:
|8159|8159|8159|{color:#ff}2028{color}|
|8320|8320|8320|{color:#ff}1640{color}|
|7937|7937|7937|{color:#ff}769{color}|
|436|436|436|{color:#ff}8924{color}|
|8924|8924|2827|{color:#ff}2731{color}|

Not sure why the first call via the CodegenFallback path should be correct 
while subsequent calls aren't.
h2. Workaround

If the Nondeterministic expression is moved to a separate, earlier select() 
call, so the CodegenFallback instead only refers to a column reference, then 
the problem seems to go away. But this workaround may not be reliable if 
optimization is ever able to restructure adjacent select()s.

  was:
h2. Expectation

For a given row, Nondeterministic expressions are expected to have stable 
values.
{code:scala}
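import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
val df = sparkContext.parallelize(1 to 5).toDF("x")
// scale rand() to [0, 10000) before truncating to an integer
val v1 = rand().*(lit(10000)).cast(IntegerType)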
df.select(v1, v1).collect{code}
Returns a set like this:
|8777|8777|
|1357|1357|
|3435|3435|
|9204|9204|
|3870|3870|

where both columns always have the same value, but what that value is changes 
from row to row. This is different from the following:
{code:scala}
df.select(rand(), rand()).collect{code}
In this case, because the rand() calls are distinct, the values in both columns 
should be different.
h2. Problem

This expectation does not appear to hold when the Nondeterministic expression
is a child of a CodegenFallback expression. This program:
{code:scala}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._ // provides toDF outside spark-shell
val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
// scale rand() to [0, 10000) before truncating to an integer
val v1 = rand().*(lit(10000)).cast(IntegerType)
val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
df.select(v1, v1, v2, v2).collect {code}
produces output like this:
|8159|8159|8159|{color:#ff}2028{color}|
|8320|8320|8320|{color:#ff}1640{color}|
|7937|7937|7937|{color:#ff}769{color}|
|436|436|436|{color:#ff}8924{color}|
|8924|8924|2827|{color:#ff}2731{color}|

Not sure why the first call via the CodegenFallback path should be correct 
while subsequent calls aren't.
h2. Workaround

If the Nondeterministic expression is moved to a separate, earlier select() 
call, and the CodegenFallback instead only refers to a column reference, then 
the problem seems to go away. But I don't know if this workaround is expected 
to be reliable if optimization is ever able to restructure adjacent select()s.


> Nondeterministic expressions have unstable values if they are children of 
> CodegenFallback expressions
> -
>
> Key: SPARK-41049
> URL: https://issues.apache.org/jira/browse/SPARK-41049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Guy Boo
>Priority: Major
>

[jira] [Updated] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions

2022-11-08 Thread Guy Boo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guy Boo updated SPARK-41049:

Description: 
h2. Expectation

For a given row, Nondeterministic expressions are expected to have stable 
values.
{code:scala}
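import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
val df = sparkContext.parallelize(1 to 5).toDF("x")
// scale rand() to [0, 10000) before truncating to an integer
val v1 = rand().*(lit(10000)).cast(IntegerType)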
df.select(v1, v1).collect{code}
Returns a set like this:
|8777|8777|
|1357|1357|
|3435|3435|
|9204|9204|
|3870|3870|

where both columns always have the same value, but what that value is changes 
from row to row. This is different from the following:
{code:scala}
df.select(rand(), rand()).collect{code}
In this case, because the rand() calls are distinct, the values in both columns 
should be different.
h2. Problem

This expectation does not appear to hold when the Nondeterministic expression
is a child of a CodegenFallback expression. This program:
{code:scala}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._ // provides toDF outside spark-shell
val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
// scale rand() to [0, 10000) before truncating to an integer
val v1 = rand().*(lit(10000)).cast(IntegerType)
val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
df.select(v1, v1, v2, v2).collect {code}
produces output like this:
|8159|8159|8159|{color:#ff}2028{color}|
|8320|8320|8320|{color:#ff}1640{color}|
|7937|7937|7937|{color:#ff}769{color}|
|436|436|436|{color:#ff}8924{color}|
|8924|8924|2827|{color:#ff}2731{color}|

Not sure why the first call via the CodegenFallback path should be correct 
while subsequent calls aren't.
h2. Workaround

If the Nondeterministic expression is moved to a separate, earlier select() 
call, and the CodegenFallback instead only refers to a column reference, then 
the problem seems to go away. But I don't know if this workaround is expected 
to be reliable if optimization is ever able to restructure adjacent select()s.

  was:
h2. Expectation

For a given row, Nondeterministic expressions are expected to have stable 
values.
{code:scala}
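import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
val df = sparkContext.parallelize(1 to 5).toDF("x")
// scale rand() to [0, 10000) before truncating to an integer
val v1 = rand().*(lit(10000)).cast(IntegerType)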
df.select(v1, v1).collect{code}
Returns a set like this:
|8777|8777|
|1357|1357|
|3435|3435|
|9204|9204|
|3870|3870|

where both columns always have the same value, but what that value is changes 
from row to row. This is different from the following:
{code:scala}
df.select(rand(), rand()).collect{code}
In this case, because the rand() calls are distinct, the values in both columns 
should be different.
h2. Problem

This expectation does not appear to hold when the Nondeterministic expression
is a child of a CodegenFallback expression. This program:
{code:scala}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._ // provides toDF outside spark-shell
val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
// scale rand() to [0, 10000) before truncating to an integer
val v1 = rand().*(lit(10000)).cast(IntegerType)
val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
df.select(v1, v1, v2, v2).collect {code}
produces output like this:
|8159|8159|8159|{color:#ff}2028{color}|
|8320|8320|8320|{color:#ff}1640{color}|
|7937|7937|7937|{color:#ff}769{color}|
|436|436|436|{color:#ff}8924{color}|
|8924|8924|2827|{color:#ff}2731{color}|

Not sure why the first call via the CodegenFallback path should be correct 
while subsequent calls aren't.


> Nondeterministic expressions have unstable values if they are children of 
> CodegenFallback expressions
> -
>
> Key: SPARK-41049
> URL: https://issues.apache.org/jira/browse/SPARK-41049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Guy Boo
>Priority: Major
>

[jira] [Updated] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions

2022-11-08 Thread Guy Boo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guy Boo updated SPARK-41049:

Description: 
h2. Expectation

For a given row, Nondeterministic expressions are expected to have stable 
values.
{code:scala}
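import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
val df = sparkContext.parallelize(1 to 5).toDF("x")
// scale rand() to [0, 10000) before truncating to an integer
val v1 = rand().*(lit(10000)).cast(IntegerType)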
df.select(v1, v1).collect{code}
Returns a set like this:
|8777|8777|
|1357|1357|
|3435|3435|
|9204|9204|
|3870|3870|

where both columns always have the same value, but what that value is changes 
from row to row. This is different from the following:
{code:scala}
df.select(rand(), rand()).collect{code}
In this case, because the rand() calls are distinct, the values in both columns 
should be different.
h2. Problem

This expectation does not appear to hold when the Nondeterministic expression
is a child of a CodegenFallback expression. This program:
{code:scala}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._ // provides toDF outside spark-shell
val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
// scale rand() to [0, 10000) before truncating to an integer
val v1 = rand().*(lit(10000)).cast(IntegerType)
val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
df.select(v1, v1, v2, v2).collect {code}
produces output like this:
|8159|8159|8159|{color:#ff}2028{color}|
|8320|8320|8320|{color:#ff}1640{color}|
|7937|7937|7937|{color:#ff}769{color}|
|436|436|436|{color:#ff}8924{color}|
|8924|8924|2827|{color:#ff}2731{color}|

Not sure why the first call via the CodegenFallback path should be correct 
while subsequent calls aren't.

  was:
h2. Expectation

For a given row, Nondeterministic expressions should have stable values.
{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
val df = sparkContext.parallelize(1 to 5).toDF("x")
val v1 = rand().*(lit(1))
df.select(v1, v1).collect{code}
Should return a set where both columns always have the same value, but what 
that value is changes from row to row. This is true for composed expressions as 
well:
{code:scala}
df.select(v1.cast(IntegerType), v1.cast(IntegerType)).collect
{code}
should still have the same value in both columns. This is different from the 
following:
{code:scala}
df.select(rand(), rand()).collect{code}
Should always have different values in each column, because the two rand() 
calls refer to different invocations.
h2. Problem

This expectation does not appear to hold when the Nondeterministic expression
is a child of a CodegenFallback expression. This program:
{code:scala}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._ // provides toDF outside spark-shell
val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
// scale rand() to [0, 10000) before truncating to an integer
val v1 = rand().*(lit(10000)).cast(IntegerType)
val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
df.select(v1, v1, v2, v2).collect {code}
produces output like this:
|8159|8159|8159|{color:#FF}2028{color}|
|8320|8320|8320|{color:#FF}1640{color}|
|7937|7937|7937|{color:#FF}769{color}|
|436|436|436|{color:#FF}8924{color}|
|8924|8924|2827|{color:#FF}2731{color}|

Not sure why the first call via the CodegenFallback path should be correct 
while subsequent calls aren't.


> Nondeterministic expressions have unstable values if they are children of 
> CodegenFallback expressions
> -
>
> Key: SPARK-41049
> URL: https://issues.apache.org/jira/browse/SPARK-41049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Guy Boo
>Priority: Major
>