[jira] [Updated] (SPARK-48871) Fix INVALID_NON_DETERMINISTIC_EXPRESSIONS validation in CheckAnalysis

2024-07-11 Thread Carmen Kwan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-48871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carmen Kwan updated SPARK-48871:

Description: 
I encountered the following exception when attempting to use a non-deterministic UDF in my query.
{code:java}
[info] org.apache.spark.sql.catalyst.ExtendedAnalysisException: [INVALID_NON_DETERMINISTIC_EXPRESSIONS] The operator expects a deterministic expression, but the actual expression is "[some expression]".; line 2 pos 1
[info] [some logical plan]
[info] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:761)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:182)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:182)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:164)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:188)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:160)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:150)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:188)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:211)
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:208)
[info] at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77)
[info] at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
[info] at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219)
[info] at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
[info] at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219)
[info] at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
[info] at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218)
[info] at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77)
[info] at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
[info] at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66){code}
The non-deterministic expression can be safely allowed for my custom LogicalPlan, but it is rejected in the checkAnalysis phase. The CheckAnalysis rule is too strict, so reasonable use cases of non-deterministic expressions are also rejected.

To fix this, we could add a trait that logical plans can extend, with a method that decides whether non-deterministic expressions are allowed for the operator, and check that method in checkAnalysis. This delegates the validation to frameworks that extend Spark, so we can allow-list more than just the few explicitly named logical plans (e.g. `Project`, `Filter`).
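For illustration, a minimal sketch of what such a trait could look like; the names here (`SupportsNonDeterministicExpression`, `allowNonDeterministicExpression`, and the `MyCustomCommand` operator) are illustrative placeholders, not a settled API:
{code:java}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}

// Sketch only: an opt-in trait a logical plan mixes in to declare that it
// can safely host non-deterministic expressions.
trait SupportsNonDeterministicExpression extends LogicalPlan {
  def allowNonDeterministicExpression: Boolean
}

// A framework-specific operator outside the Spark repository could then opt
// in without CheckAnalysis having to name it explicitly.
case class MyCustomCommand(child: LogicalPlan)
  extends UnaryNode with SupportsNonDeterministicExpression {
  override def allowNonDeterministicExpression: Boolean = true
  override def output: Seq[Attribute] = child.output
  override protected def withNewChildInternal(newChild: LogicalPlan): MyCustomCommand =
    copy(child = newChild)
}
{code}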



[jira] [Created] (SPARK-48871) Fix INVALID_NON_DETERMINISTIC_EXPRESSIONS validation in CheckAnalysis

2024-07-11 Thread Carmen Kwan (Jira)
Carmen Kwan created SPARK-48871:
---

 Summary: Fix INVALID_NON_DETERMINISTIC_EXPRESSIONS validation in 
CheckAnalysis 
 Key: SPARK-48871
 URL: https://issues.apache.org/jira/browse/SPARK-48871
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0, 3.5.2, 3.4.4
Reporter: Carmen Kwan


I encountered the following exception when attempting to use a non-deterministic UDF in my query. The non-deterministic expression can be safely allowed for my custom LogicalPlan, but it is rejected in the checkAnalysis phase. The CheckAnalysis rule is too strict, so reasonable use cases of non-deterministic expressions are also rejected.

To fix this, we could add a trait that logical plans can extend, with a method that decides whether non-deterministic expressions are allowed for the operator, and check that method in checkAnalysis. This delegates the validation to frameworks that extend Spark, so we can allow-list more than just the few explicitly named logical plans (e.g. `Project`, `Filter`).
{code:java}
[info] org.apache.spark.sql.catalyst.ExtendedAnalysisException: [INVALID_NON_DETERMINISTIC_EXPRESSIONS] The operator expects a deterministic expression, but the actual expression is "[some expression]".; line 2 pos 1;
[info] [some logical plan]
[info] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:761)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:182)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:182)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:164)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:188)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:160)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:150)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:188)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:211)
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:208)
[info] at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77)
[info] at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
[info] at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219)
[info] at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
[info] at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219)
[info] at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
[info] at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218)
[info] at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77)
[info] at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
[info] at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66){code}






[jira] [Resolved] (SPARK-48473) CheckAnalysis should be more flexible

2024-07-10 Thread Carmen Kwan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-48473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carmen Kwan resolved SPARK-48473.
-
Fix Version/s: (was: 4.0.0)
   Resolution: Abandoned

> CheckAnalysis should be more flexible
> -
>
> Key: SPARK-48473
> URL: https://issues.apache.org/jira/browse/SPARK-48473
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Carmen Kwan
>Priority: Major
>
> CheckAnalysis should be more flexible.






[jira] [Updated] (SPARK-48473) Add extensible trait to allow-list non-deterministic expressions in operators in CheckAnalysis

2024-07-08 Thread Carmen Kwan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-48473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carmen Kwan updated SPARK-48473:

Fix Version/s: 4.0.0, 3.5.2

> Add extensible trait to allow-list non-deterministic expressions in operators 
> in CheckAnalysis
> --
>
> Key: SPARK-48473
> URL: https://issues.apache.org/jira/browse/SPARK-48473
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.2
>Reporter: Carmen Kwan
>Priority: Major
> Fix For: 4.0.0, 3.5.2
>
>
> CheckAnalysis throws an `INVALID_NON_DETERMINISTIC_EXPRESSIONS` exception 
> when there is a non-deterministic expression within an operator that is not 
> allow-listed in the case match check 
> [below|https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L773-L784]:
>  
> {code:java}
> case o if o.expressions.exists(!_.deterministic) &&
>     !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
>     !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] &&
>     !o.isInstanceOf[Expand] &&
>     !o.isInstanceOf[Generate] &&
>     // Lateral join is checked in checkSubqueryExpression.
>     !o.isInstanceOf[LateralJoin] =>
>   // The rule above is used to check Aggregate operator.
>   o.failAnalysis(
>     errorClass = "INVALID_NON_DETERMINISTIC_EXPRESSIONS",
>     messageParameters = Map("sqlExprs" ->
>       o.expressions.map(toSQLExpr(_)).mkString(", ")))
> {code}
>  
> It would be nice to add a generic trait/class to this case match's allow 
> list, so that when new non-deterministic expressions that live in other 
> repositories need to be allow-listed, we don't need to wait for a new Spark 
> release. For example, in Delta Lake, we want to allow-list a specific 
> non-deterministic expression for the DeltaMergeIntoMatchedUpdateClause 
> operator as part of Delta's [Identity Column 
> implementation|https://github.com/delta-io/delta/issues/1959]. It is cleaner 
> overall to add an abstract generic class there than to put Delta-specific 
> logic into this CheckAnalysis rule.
> It would be beneficial to backport this to Spark 3.5 so that we don't need 
> to wait for Spark 4 to benefit from this low-risk change.






[jira] [Created] (SPARK-48824) Add SQL syntax in create/replace table to create an identity column

2024-07-05 Thread Carmen Kwan (Jira)
Carmen Kwan created SPARK-48824:
---

 Summary: Add SQL syntax in create/replace table to create an 
identity column
 Key: SPARK-48824
 URL: https://issues.apache.org/jira/browse/SPARK-48824
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 4.0.0
Reporter: Carmen Kwan


Add SQL support for creating identity columns. The identity column syntax should be flexible, so that users can specify
 * whether identity values are always generated by the system
 * (optionally) the starting value of the column 
 * (optionally) the increment/step of the column

The SQL syntax should also allow flexible ordering of the increment and starting values, as both orderings are used in the wild by other systems (e.g. [PostgreSQL|https://www.postgresql.org/docs/current/sql-createsequence.html], [Oracle|https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/CREATE-SEQUENCE.html#GUID-E9C78A8C-615A-4757-B2A8-5E6EFB130571]). That is, we should allow both
{code:java}
START WITH <start> INCREMENT BY <step>{code}
and
{code:java}
INCREMENT BY <step> START WITH <start>{code}

For example, we should be able to define
{code:java}
CREATE TABLE default.example (
  id LONG GENERATED ALWAYS AS IDENTITY,
  id2 LONG GENERATED BY DEFAULT START WITH 0 INCREMENT BY -10,
  id3 LONG GENERATED ALWAYS AS IDENTITY INCREMENT BY 2 START WITH -8,
  value LONG
)
{code}
This will enable defining identity columns in Spark SQL for data sources that 
support it. 
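
As a hypothetical usage sketch (assuming the table above exists and its data source actually supports identity columns), an insert that omits the identity columns would have their values generated:
{code:java}
// Hypothetical usage: only `value` is supplied explicitly; id, id2 and id3
// would be generated according to their GENERATED ... AS IDENTITY settings.
spark.sql("INSERT INTO default.example (value) VALUES (42)")
spark.table("default.example").show()
{code}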






[jira] [Updated] (SPARK-48473) Add extensible trait to allow-list non-deterministic expressions in operators in CheckAnalysis

2024-05-30 Thread Carmen Kwan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-48473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carmen Kwan updated SPARK-48473:

Component/s: SQL (was: Spark Core)

> Add extensible trait to allow-list non-deterministic expressions in operators 
> in CheckAnalysis
> --
>
> Key: SPARK-48473
> URL: https://issues.apache.org/jira/browse/SPARK-48473
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.2
>Reporter: Carmen Kwan
>Priority: Major
>
> CheckAnalysis throws an `INVALID_NON_DETERMINISTIC_EXPRESSIONS` exception 
> when there is a non-deterministic expression within an operator that is not 
> allow-listed in the case match check 
> [below|https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L773-L784]:
>  
> {code:java}
> case o if o.expressions.exists(!_.deterministic) &&
>     !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
>     !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] &&
>     !o.isInstanceOf[Expand] &&
>     !o.isInstanceOf[Generate] &&
>     // Lateral join is checked in checkSubqueryExpression.
>     !o.isInstanceOf[LateralJoin] =>
>   // The rule above is used to check Aggregate operator.
>   o.failAnalysis(
>     errorClass = "INVALID_NON_DETERMINISTIC_EXPRESSIONS",
>     messageParameters = Map("sqlExprs" ->
>       o.expressions.map(toSQLExpr(_)).mkString(", ")))
> {code}
>  
> It would be nice to add a generic trait/class to this case match's allow 
> list, so that when new non-deterministic expressions that live in other 
> repositories need to be allow-listed, we don't need to wait for a new Spark 
> release. For example, in Delta Lake, we want to allow-list a specific 
> non-deterministic expression for the DeltaMergeIntoMatchedUpdateClause 
> operator as part of Delta's [Identity Column 
> implementation|https://github.com/delta-io/delta/issues/1959]. It is cleaner 
> overall to add an abstract generic class there than to put Delta-specific 
> logic into this CheckAnalysis rule.
> It would be beneficial to backport this to Spark 3.5 so that we don't need 
> to wait for Spark 4 to benefit from this low-risk change.






[jira] [Created] (SPARK-48473) Add extensible trait to allow-list non-deterministic expressions in operators in CheckAnalysis

2024-05-30 Thread Carmen Kwan (Jira)
Carmen Kwan created SPARK-48473:
---

 Summary: Add extensible trait to allow-list non-deterministic 
expressions in operators in CheckAnalysis
 Key: SPARK-48473
 URL: https://issues.apache.org/jira/browse/SPARK-48473
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0, 3.5.2
Reporter: Carmen Kwan


CheckAnalysis throws an `INVALID_NON_DETERMINISTIC_EXPRESSIONS` exception when there is a non-deterministic expression within an operator that is not allow-listed in the case match check [below|https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L773-L784]:
 
{code:java}
case o if o.expressions.exists(!_.deterministic) &&
    !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
    !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] &&
    !o.isInstanceOf[Expand] &&
    !o.isInstanceOf[Generate] &&
    // Lateral join is checked in checkSubqueryExpression.
    !o.isInstanceOf[LateralJoin] =>
  // The rule above is used to check Aggregate operator.
  o.failAnalysis(
    errorClass = "INVALID_NON_DETERMINISTIC_EXPRESSIONS",
    messageParameters = Map("sqlExprs" ->
      o.expressions.map(toSQLExpr(_)).mkString(", ")))
{code}
 

It would be nice to add a generic trait/class to this case match's allow list, so that when new non-deterministic expressions that live in other repositories need to be allow-listed, we don't need to wait for a new Spark release. For example, in Delta Lake, we want to allow-list a specific non-deterministic expression for the DeltaMergeIntoMatchedUpdateClause operator as part of Delta's [Identity Column implementation|https://github.com/delta-io/delta/issues/1959]. It is cleaner overall to add an abstract generic class there than to put Delta-specific logic into this CheckAnalysis rule.

It would be beneficial to backport this to Spark 3.5 so that we don't need to wait for Spark 4 to benefit from this low-risk change.
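
For illustration, a minimal sketch of how the catch-all could be relaxed, assuming a hypothetical opt-in trait (`SupportsNonDeterministicExpression` with a boolean `allowNonDeterministicExpression` method; the names are placeholders, not a settled API). The case would sit before the strict one above:
{code:java}
// Sketch only: an operator that opts in via the allow-list trait is
// accepted here instead of falling through to the strict case that
// raises INVALID_NON_DETERMINISTIC_EXPRESSIONS.
case o: SupportsNonDeterministicExpression if o.allowNonDeterministicExpression =>
  () // deliberately no failAnalysis: the operator vouches for its expressions
{code}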






[jira] [Created] (SPARK-40315) Non-deterministic hashCode() calculations for ArrayBasedMapData on equal objects

2022-09-02 Thread Carmen Kwan (Jira)
Carmen Kwan created SPARK-40315:
---

 Summary: Non-deterministic hashCode() calculations for 
ArrayBasedMapData on equal objects
 Key: SPARK-40315
 URL: https://issues.apache.org/jira/browse/SPARK-40315
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.2
Reporter: Carmen Kwan


There is no explicit `hashCode()` override for the `ArrayBasedMapData` class. As a result, the `hashCode()` computed for `ArrayBasedMapData` can differ for two equal objects (objects with equal keys and values).

This error is non-deterministic and hard to reproduce, as we don't control the 
default `hashCode()` function.

We should override the `hashCode` function so that it works exactly as we 
expect. We should also have an explicit `equals()` function for consistency 
with how `Literals` check for equality of `ArrayBasedMapData`.
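
For illustration, a minimal sketch of the kind of consistent `equals`/`hashCode` pair this calls for, using a simplified stand-in class rather than Spark's actual `ArrayBasedMapData`:
{code:java}
// Simplified stand-in, not Spark's real ArrayBasedMapData. The point:
// hashCode is derived from exactly the fields that equals compares, so
// two objects with equal keys and values always hash identically.
class SimpleMapData(val keys: Seq[Any], val values: Seq[Any]) {
  override def equals(other: Any): Boolean = other match {
    case that: SimpleMapData => keys == that.keys && values == that.values
    case _ => false
  }
  override def hashCode(): Int = 31 * keys.hashCode() + values.hashCode()
}

// Two equal objects now agree on hashCode, unlike with the JVM default
// identity-based hashCode:
//   new SimpleMapData(Seq(1), Seq("a")) == new SimpleMapData(Seq(1), Seq("a"))  // true
//   new SimpleMapData(Seq(1), Seq("a")).hashCode ==
//     new SimpleMapData(Seq(1), Seq("a")).hashCode                              // true
{code}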


