[jira] [Assigned] (SPARK-42870) Move `toCatalystValue` to connect-common
[ https://issues.apache.org/jira/browse/SPARK-42870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42870:
------------------------------------

    Assignee: Apache Spark

> Move `toCatalystValue` to connect-common
> ----------------------------------------
>
>                 Key: SPARK-42870
>                 URL: https://issues.apache.org/jira/browse/SPARK-42870
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Apache Spark
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42868) Support eliminate sorts in AQE Optimizer
[ https://issues.apache.org/jira/browse/SPARK-42868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702545#comment-17702545 ]

Apache Spark commented on SPARK-42868:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/40484

> Support eliminate sorts in AQE Optimizer
> ----------------------------------------
>
>                 Key: SPARK-42868
>                 URL: https://issues.apache.org/jira/browse/SPARK-42868
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Yuming Wang
>            Priority: Major
>
[jira] [Assigned] (SPARK-42868) Support eliminate sorts in AQE Optimizer
[ https://issues.apache.org/jira/browse/SPARK-42868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42868:
------------------------------------

    Assignee: Apache Spark

> Support eliminate sorts in AQE Optimizer
> ----------------------------------------
>
>                 Key: SPARK-42868
>                 URL: https://issues.apache.org/jira/browse/SPARK-42868
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Yuming Wang
>            Assignee: Apache Spark
>            Priority: Major
>
[jira] [Assigned] (SPARK-42868) Support eliminate sorts in AQE Optimizer
[ https://issues.apache.org/jira/browse/SPARK-42868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42868:
------------------------------------

    Assignee:     (was: Apache Spark)

> Support eliminate sorts in AQE Optimizer
> ----------------------------------------
>
>                 Key: SPARK-42868
>                 URL: https://issues.apache.org/jira/browse/SPARK-42868
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Yuming Wang
>            Priority: Major
>
[jira] [Commented] (SPARK-42809) Upgrade scala-maven-plugin from 4.8.0 to 4.8.1
[ https://issues.apache.org/jira/browse/SPARK-42809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702453#comment-17702453 ]

Apache Spark commented on SPARK-42809:
--------------------------------------

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40482

> Upgrade scala-maven-plugin from 4.8.0 to 4.8.1
> ----------------------------------------------
>
>                 Key: SPARK-42809
>                 URL: https://issues.apache.org/jira/browse/SPARK-42809
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.5.0
>            Reporter: BingKun Pan
>            Assignee: BingKun Pan
>            Priority: Minor
>             Fix For: 3.5.0
>
[jira] [Assigned] (SPARK-42827) Support `functions#array_prepend`
[ https://issues.apache.org/jira/browse/SPARK-42827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42827:
------------------------------------

    Assignee:     (was: Apache Spark)

> Support `functions#array_prepend`
> ---------------------------------
>
>                 Key: SPARK-42827
>                 URL: https://issues.apache.org/jira/browse/SPARK-42827
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect
>    Affects Versions: 3.5.0
>            Reporter: Yang Jie
>            Priority: Major
>
> Wait for SPARK-41233
[jira] [Assigned] (SPARK-42827) Support `functions#array_prepend`
[ https://issues.apache.org/jira/browse/SPARK-42827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42827:
------------------------------------

    Assignee: Apache Spark

> Support `functions#array_prepend`
> ---------------------------------
>
>                 Key: SPARK-42827
>                 URL: https://issues.apache.org/jira/browse/SPARK-42827
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect
>    Affects Versions: 3.5.0
>            Reporter: Yang Jie
>            Assignee: Apache Spark
>            Priority: Major
>
> Wait for SPARK-41233
[jira] [Commented] (SPARK-42827) Support `functions#array_prepend`
[ https://issues.apache.org/jira/browse/SPARK-42827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702398#comment-17702398 ]

Apache Spark commented on SPARK-42827:
--------------------------------------

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40481

> Support `functions#array_prepend`
> ---------------------------------
>
>                 Key: SPARK-42827
>                 URL: https://issues.apache.org/jira/browse/SPARK-42827
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect
>    Affects Versions: 3.5.0
>            Reporter: Yang Jie
>            Priority: Major
>
> Wait for SPARK-41233
[jira] [Commented] (SPARK-42508) Extract the common .ml classes to `mllib-common`
[ https://issues.apache.org/jira/browse/SPARK-42508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702396#comment-17702396 ]

Apache Spark commented on SPARK-42508:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/40480

> Extract the common .ml classes to `mllib-common`
> ------------------------------------------------
>
>                 Key: SPARK-42508
>                 URL: https://issues.apache.org/jira/browse/SPARK-42508
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, ML
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>             Fix For: 3.5.0
>
[jira] [Commented] (SPARK-42779) Allow V2 writes to indicate advisory partition size
[ https://issues.apache.org/jira/browse/SPARK-42779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702351#comment-17702351 ]

Apache Spark commented on SPARK-42779:
--------------------------------------

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/40478

> Allow V2 writes to indicate advisory partition size
> ---------------------------------------------------
>
>                 Key: SPARK-42779
>                 URL: https://issues.apache.org/jira/browse/SPARK-42779
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Anton Okolnychyi
>            Assignee: Anton Okolnychyi
>            Priority: Major
>             Fix For: 3.5.0
>
> Data sources may request a particular distribution and ordering of data for
> V2 writes. If AQE is enabled, the default session advisory partition size
> (64MB) will be used as guidance. Unfortunately, this default value can still
> lead to small files because the written data can be compressed nicely using
> columnar file formats. Spark should allow data sources to indicate the
> advisory shuffle partition size, just like it lets data sources request a
> particular number of partitions.
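The small-files problem described in this ticket is easy to see numerically. Below is a simplified sketch of the shuffle-coalescing arithmetic (the real AQE rule also respects minimum partition sizes and skew handling, so treat this purely as an approximation of the idea):

```python
import math

def coalesced_partitions(total_shuffle_bytes: int, advisory_bytes: int) -> int:
    """Approximate number of partitions AQE coalesces a shuffle to:
    just enough partitions that each holds roughly the advisory size."""
    return max(1, math.ceil(total_shuffle_bytes / advisory_bytes))

total = 10 * 1024**3  # 10 GiB of shuffle data feeding the write

# With the 64 MB session default, the write produces ~160 output files;
# a source-provided 256 MB advisory size cuts that to ~40 larger files.
print(coalesced_partitions(total, 64 * 1024**2))   # 160
print(coalesced_partitions(total, 256 * 1024**2))  # 40
```

With columnar compression, each of those 160 files can end up far below 64 MB on disk, which is why letting the data source raise the advisory size helps.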
[jira] [Assigned] (SPARK-42805) 'Conflicting attributes' exception is thrown when joining checkpointed dataframe
[ https://issues.apache.org/jira/browse/SPARK-42805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42805:
------------------------------------

    Assignee:     (was: Apache Spark)

> 'Conflicting attributes' exception is thrown when joining checkpointed
> dataframe
> ----------------------------------------------------------------------
>
>                 Key: SPARK-42805
>                 URL: https://issues.apache.org/jira/browse/SPARK-42805
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 3.3.2
>            Reporter: Maciej Smolenski
>            Priority: Major
>
> Performing a join using a checkpointed dataframe leads to an error in the
> prepared execution plan because column ids/names in the plan are not unique.
> This issue can be reproduced with this simple code (fails on 3.3.2, succeeds
> on 3.1.2):
> {code:java}
> import spark.implicits._
> spark.sparkContext.setCheckpointDir("file:///tmp/cdir")
> val df = spark.range(10).toDF("id")
> val cdf = df.checkpoint()
> cdf.join(df) // org.apache.spark.sql.AnalysisException thrown on 3.3.2
> {code}
> The failure message is:
> {noformat}
> org.apache.spark.sql.AnalysisException:
> Failure when resolving conflicting references in Join:
> 'Join Inner
> :- LogicalRDD [id#2L], false
> +- Project [id#0L AS id#2L]
>    +- Range (0, 10, step=1, splits=Some(16))
> Conflicting attributes: id#2L
> ;
> 'Join Inner
> :- LogicalRDD [id#2L], false
> +- Project [id#0L AS id#2L]
>    +- Range (0, 10, step=1, splits=Some(16))
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:57)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:56)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:188)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:540)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:102)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:367)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:102)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:97)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:188)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:214)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:211)
>   at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:76)
>   at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:185)
>   at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
>   at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:185)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>   at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:184)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:76)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:91)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
>   at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:3887)
>   at org.apache.spark.sql.Dataset.join(Dataset.scala:920)
>   ... 49 elided
> {noformat}
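The analyzer normally deduplicates attribute IDs when the same columns appear on both sides of a join; the bug report shows that the checkpointed `LogicalRDD` side is not rewritten, so `id#2L` survives on both sides. The invariant the deduplication must restore can be modeled in a few lines of Python (a toy illustration of the idea, not Spark's actual implementation):

```python
from itertools import count

_fresh = count(100)  # generator of fresh attribute ids

def dedup_right(left_attrs, right_attrs):
    """Re-alias right-side attributes whose ids collide with the left side,
    so the join's combined output contains only unique attribute ids."""
    left_ids = {attr_id for _, attr_id in left_attrs}
    return [(name, next(_fresh)) if attr_id in left_ids else (name, attr_id)
            for name, attr_id in right_attrs]

# Both join children expose id#2 -- the situation in SPARK-42805.
left = [("id", 2)]
right = [("id", 2)]

fixed_right = dedup_right(left, right)
# After deduplication, no id appears on both sides of the join.
assert {i for _, i in left}.isdisjoint({i for _, i in fixed_right})
```

In Spark terms, the fix must ensure this re-aliasing also reaches plans whose leaves (like a checkpointed `LogicalRDD`) cannot simply be re-analyzed.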
[jira] [Commented] (SPARK-42805) 'Conflicting attributes' exception is thrown when joining checkpointed dataframe
[ https://issues.apache.org/jira/browse/SPARK-42805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702306#comment-17702306 ]

Apache Spark commented on SPARK-42805:
--------------------------------------

User 'ming95' has created a pull request for this issue:
https://github.com/apache/spark/pull/40477

> 'Conflicting attributes' exception is thrown when joining checkpointed
> dataframe
> ----------------------------------------------------------------------
>
>                 Key: SPARK-42805
>                 URL: https://issues.apache.org/jira/browse/SPARK-42805
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 3.3.2
>            Reporter: Maciej Smolenski
>            Priority: Major
>
> Performing a join using a checkpointed dataframe leads to an error in the
> prepared execution plan because column ids/names in the plan are not unique.
[jira] [Commented] (SPARK-42805) 'Conflicting attributes' exception is thrown when joining checkpointed dataframe
[ https://issues.apache.org/jira/browse/SPARK-42805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702305#comment-17702305 ]

Apache Spark commented on SPARK-42805:
--------------------------------------

User 'ming95' has created a pull request for this issue:
https://github.com/apache/spark/pull/40477

> 'Conflicting attributes' exception is thrown when joining checkpointed
> dataframe
> ----------------------------------------------------------------------
>
>                 Key: SPARK-42805
>                 URL: https://issues.apache.org/jira/browse/SPARK-42805
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 3.3.2
>            Reporter: Maciej Smolenski
>            Priority: Major
>
> Performing a join using a checkpointed dataframe leads to an error in the
> prepared execution plan because column ids/names in the plan are not unique.
[jira] [Assigned] (SPARK-42805) 'Conflicting attributes' exception is thrown when joining checkpointed dataframe
[ https://issues.apache.org/jira/browse/SPARK-42805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42805:
------------------------------------

    Assignee: Apache Spark

> 'Conflicting attributes' exception is thrown when joining checkpointed
> dataframe
> ----------------------------------------------------------------------
>
>                 Key: SPARK-42805
>                 URL: https://issues.apache.org/jira/browse/SPARK-42805
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 3.3.2
>            Reporter: Maciej Smolenski
>            Assignee: Apache Spark
>            Priority: Major
>
> Performing a join using a checkpointed dataframe leads to an error in the
> prepared execution plan because column ids/names in the plan are not unique.
[jira] [Assigned] (SPARK-42853) Update the Spark Doc to match the new website style
[ https://issues.apache.org/jira/browse/SPARK-42853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42853:
------------------------------------

    Assignee:     (was: Apache Spark)

> Update the Spark Doc to match the new website style
> ---------------------------------------------------
>
>                 Key: SPARK-42853
>                 URL: https://issues.apache.org/jira/browse/SPARK-42853
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 3.4.0
>            Reporter: Martin Grund
>            Priority: Major
>
[jira] [Assigned] (SPARK-42853) Update the Spark Doc to match the new website style
[ https://issues.apache.org/jira/browse/SPARK-42853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42853:
------------------------------------

    Assignee: Apache Spark

> Update the Spark Doc to match the new website style
> ---------------------------------------------------
>
>                 Key: SPARK-42853
>                 URL: https://issues.apache.org/jira/browse/SPARK-42853
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 3.4.0
>            Reporter: Martin Grund
>            Assignee: Apache Spark
>            Priority: Major
>
[jira] [Commented] (SPARK-42853) Update the Spark Doc to match the new website style
[ https://issues.apache.org/jira/browse/SPARK-42853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702243#comment-17702243 ]

Apache Spark commented on SPARK-42853:
--------------------------------------

User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/40269

> Update the Spark Doc to match the new website style
> ---------------------------------------------------
>
>                 Key: SPARK-42853
>                 URL: https://issues.apache.org/jira/browse/SPARK-42853
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 3.4.0
>            Reporter: Martin Grund
>            Priority: Major
>
[jira] [Commented] (SPARK-42852) Revert NamedLambdaVariable related changes from EquivalentExpressions
[ https://issues.apache.org/jira/browse/SPARK-42852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702139#comment-17702139 ]

Apache Spark commented on SPARK-42852:
--------------------------------------

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/40475

> Revert NamedLambdaVariable related changes from EquivalentExpressions
> ---------------------------------------------------------------------
>
>                 Key: SPARK-42852
>                 URL: https://issues.apache.org/jira/browse/SPARK-42852
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.2, 3.4.0
>            Reporter: Peter Toth
>            Priority: Major
>
> See discussion
> https://github.com/apache/spark/pull/40473#issuecomment-1474848224
[jira] [Assigned] (SPARK-42852) Revert NamedLambdaVariable related changes from EquivalentExpressions
[ https://issues.apache.org/jira/browse/SPARK-42852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42852:
------------------------------------

    Assignee: Apache Spark

> Revert NamedLambdaVariable related changes from EquivalentExpressions
> ---------------------------------------------------------------------
>
>                 Key: SPARK-42852
>                 URL: https://issues.apache.org/jira/browse/SPARK-42852
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.2, 3.4.0
>            Reporter: Peter Toth
>            Assignee: Apache Spark
>            Priority: Major
>
> See discussion
> https://github.com/apache/spark/pull/40473#issuecomment-1474848224
[jira] [Assigned] (SPARK-42852) Revert NamedLambdaVariable related changes from EquivalentExpressions
[ https://issues.apache.org/jira/browse/SPARK-42852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42852:
------------------------------------

    Assignee:     (was: Apache Spark)

> Revert NamedLambdaVariable related changes from EquivalentExpressions
> ---------------------------------------------------------------------
>
>                 Key: SPARK-42852
>                 URL: https://issues.apache.org/jira/browse/SPARK-42852
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.2, 3.4.0
>            Reporter: Peter Toth
>            Priority: Major
>
> See discussion
> https://github.com/apache/spark/pull/40473#issuecomment-1474848224
[jira] [Commented] (SPARK-42849) Session variables
[ https://issues.apache.org/jira/browse/SPARK-42849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702069#comment-17702069 ]

Apache Spark commented on SPARK-42849:
--------------------------------------

User 'srielau' has created a pull request for this issue:
https://github.com/apache/spark/pull/40474

> Session variables
> -----------------
>
>                 Key: SPARK-42849
>                 URL: https://issues.apache.org/jira/browse/SPARK-42849
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Serge Rielau
>            Priority: Major
>
> Provide a type-safe, engine-controlled session variable:
> CREATE [ OR REPLACE ] TEMPORARY VARIABLE [ IF NOT EXISTS ] var_name [ type ] [ DEFAULT expression ]
> SET { variable = expression | ( variable [, ...] ) = ( subquery | expression [, ...] ) }
> DROP VARIABLE [ IF EXISTS ] variable_name
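The intent behind the proposed grammar (engine-controlled, type-safe variables with a DEFAULT, and `IF NOT EXISTS` / `IF EXISTS` modifiers) can be sketched as a toy store. This models the semantics described in the ticket only; it is not any actual Spark API, and the feature's final behavior is whatever the linked PR ships:

```python
class SessionVariables:
    """Toy model of type-safe session variables (illustrative only)."""

    def __init__(self):
        self._vars = {}  # name -> (python type, current value)

    def create(self, name, typ, default=None, replace=False, if_not_exists=False):
        # CREATE [ OR REPLACE ] TEMPORARY VARIABLE [ IF NOT EXISTS ] ...
        if name in self._vars and not replace:
            if if_not_exists:
                return  # silently keep the existing variable
            raise ValueError(f"variable {name} already exists")
        self._vars[name] = (typ, default)

    def set(self, name, value):
        # SET variable = expression -- the "engine" enforces the type
        typ, _ = self._vars[name]  # unknown variable raises KeyError
        if not isinstance(value, typ):
            raise TypeError(f"{name} expects {typ.__name__}")
        self._vars[name] = (typ, value)

    def get(self, name):
        return self._vars[name][1]

    def drop(self, name, if_exists=False):
        # DROP VARIABLE [ IF EXISTS ] ...
        if name not in self._vars and if_exists:
            return
        del self._vars[name]


sv = SessionVariables()
sv.create("threshold", int, default=10)
sv.set("threshold", 42)
print(sv.get("threshold"))  # 42
sv.drop("threshold")
```

The point of "engine controlled" is visible in `set`: assignments are validated against the declared type rather than being free-form session strings.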
[jira] [Assigned] (SPARK-42849) Session variables
[ https://issues.apache.org/jira/browse/SPARK-42849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42849:
------------------------------------

    Assignee: Apache Spark

> Session variables
> -----------------
>
>                 Key: SPARK-42849
>                 URL: https://issues.apache.org/jira/browse/SPARK-42849
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Serge Rielau
>            Assignee: Apache Spark
>            Priority: Major
>
> Provide a type-safe, engine-controlled session variable:
> CREATE [ OR REPLACE ] TEMPORARY VARIABLE [ IF NOT EXISTS ] var_name [ type ] [ DEFAULT expression ]
> SET { variable = expression | ( variable [, ...] ) = ( subquery | expression [, ...] ) }
> DROP VARIABLE [ IF EXISTS ] variable_name
[jira] [Commented] (SPARK-42849) Session variables
[ https://issues.apache.org/jira/browse/SPARK-42849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702068#comment-17702068 ] Apache Spark commented on SPARK-42849: -- User 'srielau' has created a pull request for this issue: https://github.com/apache/spark/pull/40474 > Session variables > - > > Key: SPARK-42849 > URL: https://issues.apache.org/jira/browse/SPARK-42849 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Serge Rielau >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42849) Session variables
[ https://issues.apache.org/jira/browse/SPARK-42849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42849: Assignee: (was: Apache Spark) > Session variables > - > > Key: SPARK-42849 > URL: https://issues.apache.org/jira/browse/SPARK-42849 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Serge Rielau >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42851) EquivalentExpressions methods need to be consistently guarded by supportedExpression
[ https://issues.apache.org/jira/browse/SPARK-42851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42851: Assignee: Apache Spark > EquivalentExpressions methods need to be consistently guarded by > supportedExpression > > > Key: SPARK-42851 > URL: https://issues.apache.org/jira/browse/SPARK-42851 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0 >Reporter: Kris Mok >Assignee: Apache Spark >Priority: Major > > SPARK-41468 tried to fix a bug but introduced a new regression. Its change to > {{EquivalentExpressions}} added a {{supportedExpression()}} guard to the > {{addExprTree()}} and {{getExprState()}} methods, but didn't add the same > guard to the other "add" entry point -- {{addExpr()}}. > As such, call sites that add single expressions to CSE via {{addExpr()}} may > succeed, but upon retrieval via {{getExprState()}} they'd inconsistently get a > {{None}} due to failing the guard. > We need to make sure the "add" and "get" methods are consistent. It could be > done by one of: > 1. Adding the same {{supportedExpression()}} guard to {{addExpr()}}, or > 2. Removing the guard from {{getExprState()}}, relying solely on the guard on > the "add" path to make sure only intended state is added. > (or other alternative refactorings to fuse the guard into various methods to > make it more efficient) > There are pros and cons to the two directions above: because {{addExpr()}} > used to allow more expressions (potentially incorrectly) to get CSE'd, making > it more restrictive may cause performance regressions (for the cases that > happened to work). 
> Example: > {code:sql} > select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) > from range(2) > {code} > Running this query on Spark 3.2 branch returns the correct value: > {code} > scala> spark.sql("select max(transform(array(id), x -> x)), > max(transform(array(id), x -> x)) from range(2)").collect > res0: Array[org.apache.spark.sql.Row] = > Array([WrappedArray(1),WrappedArray(1)]) > {code} > Here, {{transform(array(id), x -> x)}} is an {{AggregateExpression}} that was > (potentially unsafely) recognized by {{addExpr()}} as a common subexpression, > and {{getExprState()}} doesn't do extra guarding, so during physical > planning, in {{PhysicalAggregation}} this expression gets CSE'd in both the > aggregation expression list and the result expressions list. > {code} > AdaptiveSparkPlan isFinalPlan=false > +- SortAggregate(key=[], functions=[max(transform(array(id#0L), > lambdafunction(lambda x#1L, lambda x#1L, false)))]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11] > +- SortAggregate(key=[], functions=[partial_max(transform(array(id#0L), > lambdafunction(lambda x#1L, lambda x#1L, false)))]) > +- Range (0, 2, step=1, splits=16) > {code} > Running the same query on current master triggers an error when binding the > result expression to the aggregate expression in the Aggregate operators (for > a WSCG-enabled operator like {{HashAggregateExec}}, the same error would show > up during codegen): > {code} > ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 > (TID 16) (ip-10-110-16-93.us-west-2.compute.internal executor driver): > java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), > lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in > [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, > false)))#3] > at > 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:517) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1249) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1248) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:532) > at >
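The add/get asymmetry described in SPARK-42851 above can be reproduced in miniature. The sketch below is pure Python with invented names — it only mirrors the shape of the bug: the single-expression "add" path applies no guard, while the "get" path does, so a successfully added expression can still come back as None.

```python
def supported(expr: str) -> bool:
    # stand-in for supportedExpression(): reject lambda-bearing expressions
    return "lambda" not in expr

class EquivalenceMap:
    """Miniature of EquivalentExpressions' inconsistent guards (illustrative)."""

    def __init__(self):
        self._use_counts = {}

    def add_expr(self, expr: str) -> bool:
        # bug shape: no supported() guard on this "add" entry point
        self._use_counts[expr] = self._use_counts.get(expr, 0) + 1
        return self._use_counts[expr] > 1  # True once the expression is "common"

    def add_expr_tree(self, expr: str) -> None:
        # the other "add" entry point IS guarded
        if supported(expr):
            self.add_expr(expr)

    def get_expr_state(self, expr: str):
        # guarded "get": inconsistent with the unguarded add_expr()
        return self._use_counts.get(expr) if supported(expr) else None
```

With this shape, `add_expr("transform(array(id), lambda x: x)")` succeeds, yet `get_expr_state` on the same string returns `None` — the miniature analogue of the `Couldn't find max(transform(...))` binding failure quoted above. Either guarding `add_expr` (option 1) or unguarding `get_expr_state` (option 2) restores consistency.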
[jira] [Assigned] (SPARK-42851) EquivalentExpressions methods need to be consistently guarded by supportedExpression
[ https://issues.apache.org/jira/browse/SPARK-42851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42851: Assignee: (was: Apache Spark) > EquivalentExpressions methods need to be consistently guarded by > supportedExpression > > > Key: SPARK-42851 > URL: https://issues.apache.org/jira/browse/SPARK-42851 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0 >Reporter: Kris Mok >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42851) EquivalentExpressions methods need to be consistently guarded by supportedExpression
[ https://issues.apache.org/jira/browse/SPARK-42851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702033#comment-17702033 ] Apache Spark commented on SPARK-42851: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/40473 > EquivalentExpressions methods need to be consistently guarded by > supportedExpression > > > Key: SPARK-42851 > URL: https://issues.apache.org/jira/browse/SPARK-42851 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0 >Reporter: Kris Mok >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42247) Standardize `returnType` property of UserDefinedFunction
[ https://issues.apache.org/jira/browse/SPARK-42247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42247: Assignee: (was: Apache Spark) > Standardize `returnType` property of UserDefinedFunction > > > Key: SPARK-42247 > URL: https://issues.apache.org/jira/browse/SPARK-42247 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > There are checks -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42247) Standardize `returnType` property of UserDefinedFunction
[ https://issues.apache.org/jira/browse/SPARK-42247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702022#comment-17702022 ] Apache Spark commented on SPARK-42247: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40472 > Standardize `returnType` property of UserDefinedFunction > > > Key: SPARK-42247 > URL: https://issues.apache.org/jira/browse/SPARK-42247 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > There are checks -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42247) Standardize `returnType` property of UserDefinedFunction
[ https://issues.apache.org/jira/browse/SPARK-42247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42247: Assignee: Apache Spark > Standardize `returnType` property of UserDefinedFunction > > > Key: SPARK-42247 > URL: https://issues.apache.org/jira/browse/SPARK-42247 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > There are checks -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42850) Remove duplicated rule CombineFilters in Optimizer
[ https://issues.apache.org/jira/browse/SPARK-42850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42850: Assignee: Apache Spark (was: Gengliang Wang) > Remove duplicated rule CombineFilters in Optimizer > -- > > Key: SPARK-42850 > URL: https://issues.apache.org/jira/browse/SPARK-42850 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.1 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42850) Remove duplicated rule CombineFilters in Optimizer
[ https://issues.apache.org/jira/browse/SPARK-42850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42850: Assignee: Gengliang Wang (was: Apache Spark) > Remove duplicated rule CombineFilters in Optimizer > -- > > Key: SPARK-42850 > URL: https://issues.apache.org/jira/browse/SPARK-42850 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42850) Remove duplicated rule CombineFilters in Optimizer
[ https://issues.apache.org/jira/browse/SPARK-42850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702021#comment-17702021 ] Apache Spark commented on SPARK-42850: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/40471 > Remove duplicated rule CombineFilters in Optimizer > -- > > Key: SPARK-42850 > URL: https://issues.apache.org/jira/browse/SPARK-42850 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41843) Implement SparkSession.udf
[ https://issues.apache.org/jira/browse/SPARK-41843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702006#comment-17702006 ] Apache Spark commented on SPARK-41843: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40470 > Implement SparkSession.udf > -- > > Key: SPARK-41843 > URL: https://issues.apache.org/jira/browse/SPARK-41843 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 2331, in pyspark.sql.connect.functions.call_udf > Failed example: > _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType()) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in <module> > > _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType()) > AttributeError: 'SparkSession' object has no attribute 'udf'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
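The AttributeError quoted above (`'SparkSession' object has no attribute 'udf'`) is the missing piece SPARK-41843 adds: a `udf` property returning a registration object. A minimal pure-Python sketch of that property pattern follows — the class names here are illustrative stand-ins, not PySpark's actual classes.

```python
class UDFRegistration:
    """Illustrative stand-in for the object returned by spark.udf."""

    def __init__(self):
        self._udfs = {}

    def register(self, name, func, return_type=None):
        # mirrors the spark.udf.register(name, f, returnType) call shape
        self._udfs[name] = (func, return_type)
        return func


class SparkSessionSketch:
    def __init__(self):
        self._udf_registration = None

    @property
    def udf(self):
        # lazily create and cache one registration object per session, so
        # spark.udf.register(...) works instead of raising AttributeError
        if self._udf_registration is None:
            self._udf_registration = UDFRegistration()
        return self._udf_registration
```

Usage mirrors the failing doctest: `spark.udf.register("intX2", lambda i: i * 2, "int")` now resolves because `udf` is a property on the session rather than an absent attribute.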
[jira] [Commented] (SPARK-41818) Support DataFrameWriter.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702003#comment-17702003 ] Apache Spark commented on SPARK-41818: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40470 > Support DataFrameWriter.saveAsTable > --- > > Key: SPARK-41818 > URL: https://issues.apache.org/jira/browse/SPARK-41818 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", > line 369, in pyspark.sql.connect.readwriter.DataFrameWriter.insertInto > Failed example: > df.write.saveAsTable("tblA") > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "<doctest pyspark.sql.connect.readwriter.DataFrameWriter.insertInto[2]>", line 1, in <module> > > df.write.saveAsTable("tblA") > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", > line 350, in saveAsTable > > self._spark.client.execute_command(self._write.command(self._spark.client)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 459, in execute_command > self._execute(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 547, in _execute > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 623, in _handle_error > raise SparkConnectException(status.message, info.reason) from None > pyspark.sql.connect.client.SparkConnectException: > (java.lang.ClassNotFoundException) .DefaultSource{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - 
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41843) Implement SparkSession.udf
[ https://issues.apache.org/jira/browse/SPARK-41843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702004#comment-17702004 ] Apache Spark commented on SPARK-41843: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40470 > Implement SparkSession.udf > -- > > Key: SPARK-41843 > URL: https://issues.apache.org/jira/browse/SPARK-41843 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41843) Implement SparkSession.udf
[ https://issues.apache.org/jira/browse/SPARK-41843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702005#comment-17702005 ] Apache Spark commented on SPARK-41843: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40470 > Implement SparkSession.udf > -- > > Key: SPARK-41843 > URL: https://issues.apache.org/jira/browse/SPARK-41843 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41818) Support DataFrameWriter.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702002#comment-17702002 ] Apache Spark commented on SPARK-41818: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40470 > Support DataFrameWriter.saveAsTable > --- > > Key: SPARK-41818 > URL: https://issues.apache.org/jira/browse/SPARK-41818 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.4.0 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42848) Implement DataFrame.registerTempTable
[ https://issues.apache.org/jira/browse/SPARK-42848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701934#comment-17701934 ] Apache Spark commented on SPARK-42848: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40469 > Implement DataFrame.registerTempTable > - > > Key: SPARK-42848 > URL: https://issues.apache.org/jira/browse/SPARK-42848 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42848) Implement DataFrame.registerTempTable
[ https://issues.apache.org/jira/browse/SPARK-42848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701933#comment-17701933 ] Apache Spark commented on SPARK-42848: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40469 > Implement DataFrame.registerTempTable > - > > Key: SPARK-42848 > URL: https://issues.apache.org/jira/browse/SPARK-42848 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42848) Implement DataFrame.registerTempTable
[ https://issues.apache.org/jira/browse/SPARK-42848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42848: Assignee: Apache Spark > Implement DataFrame.registerTempTable > - > > Key: SPARK-42848 > URL: https://issues.apache.org/jira/browse/SPARK-42848 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42848) Implement DataFrame.registerTempTable
[ https://issues.apache.org/jira/browse/SPARK-42848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42848: Assignee: (was: Apache Spark) > Implement DataFrame.registerTempTable > - > > Key: SPARK-42848 > URL: https://issues.apache.org/jira/browse/SPARK-42848 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42833) Refactor `applyExtensions` in `SparkSession`
[ https://issues.apache.org/jira/browse/SPARK-42833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701903#comment-17701903 ] Apache Spark commented on SPARK-42833: -- User 'kazuyukitanimura' has created a pull request for this issue: https://github.com/apache/spark/pull/40465 > Refactor `applyExtensions` in `SparkSession` > > > Key: SPARK-42833 > URL: https://issues.apache.org/jira/browse/SPARK-42833 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Priority: Minor > > Refactor `applyExtensions` in `SparkSession` in order to reduce the > duplicated code -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42584) Improve output of Column.explain
[ https://issues.apache.org/jira/browse/SPARK-42584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42584: Assignee: Apache Spark > Improve output of Column.explain > > > Key: SPARK-42584 > URL: https://issues.apache.org/jira/browse/SPARK-42584 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Apache Spark >Priority: Major > > We currently display the structure of the proto in both the regular and > extended versions of explain. We should display a more compact SQL-like > string for the regular version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42584) Improve output of Column.explain
[ https://issues.apache.org/jira/browse/SPARK-42584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42584: Assignee: (was: Apache Spark) > Improve output of Column.explain > > > Key: SPARK-42584 > URL: https://issues.apache.org/jira/browse/SPARK-42584 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > We currently display the structure of the proto in both the regular and > extended versions of explain. We should display a more compact SQL-like > string for the regular version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42584) Improve output of Column.explain
[ https://issues.apache.org/jira/browse/SPARK-42584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701690#comment-17701690 ] Apache Spark commented on SPARK-42584: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40467 > Improve output of Column.explain > > > Key: SPARK-42584 > URL: https://issues.apache.org/jira/browse/SPARK-42584 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > We currently display the structure of the proto in both the regular and > extended versions of explain. We should display a more compact SQL-like > string for the regular version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42835) Add test cases for Column.explain
[ https://issues.apache.org/jira/browse/SPARK-42835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701623#comment-17701623 ] Apache Spark commented on SPARK-42835: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40466 > Add test cases for Column.explain > - > > Key: SPARK-42835 > URL: https://issues.apache.org/jira/browse/SPARK-42835 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42835) Add test cases for Column.explain
[ https://issues.apache.org/jira/browse/SPARK-42835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42835: Assignee: (was: Apache Spark) > Add test cases for Column.explain > - > > Key: SPARK-42835 > URL: https://issues.apache.org/jira/browse/SPARK-42835 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42835) Add test cases for Column.explain
[ https://issues.apache.org/jira/browse/SPARK-42835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701621#comment-17701621 ] Apache Spark commented on SPARK-42835: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40466 > Add test cases for Column.explain > - > > Key: SPARK-42835 > URL: https://issues.apache.org/jira/browse/SPARK-42835 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42835) Add test cases for Column.explain
[ https://issues.apache.org/jira/browse/SPARK-42835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42835: Assignee: Apache Spark > Add test cases for Column.explain > - > > Key: SPARK-42835 > URL: https://issues.apache.org/jira/browse/SPARK-42835 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42833) Refactor `applyExtensions` in `SparkSession`
[ https://issues.apache.org/jira/browse/SPARK-42833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42833: Assignee: Apache Spark > Refactor `applyExtensions` in `SparkSession` > > > Key: SPARK-42833 > URL: https://issues.apache.org/jira/browse/SPARK-42833 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Assignee: Apache Spark >Priority: Minor > > Refactor `applyExtensions` in `SparkSession` in order to reduce the > duplicated code -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42833) Refactor `applyExtensions` in `SparkSession`
[ https://issues.apache.org/jira/browse/SPARK-42833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42833: Assignee: (was: Apache Spark) > Refactor `applyExtensions` in `SparkSession` > > > Key: SPARK-42833 > URL: https://issues.apache.org/jira/browse/SPARK-42833 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Priority: Minor > > Refactor `applyExtensions` in `SparkSession` in order to reduce the > duplicated code -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42557) Add Broadcast to functions
[ https://issues.apache.org/jira/browse/SPARK-42557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701527#comment-17701527 ] Apache Spark commented on SPARK-42557: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40463 > Add Broadcast to functions > -- > > Key: SPARK-42557 > URL: https://issues.apache.org/jira/browse/SPARK-42557 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.1 > > > Add the {{broadcast}} function to functions.scala. Please check if we can get > the same semantics as the current implementation using unresolved hints. > https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1246-L1261 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42557) Add Broadcast to functions
[ https://issues.apache.org/jira/browse/SPARK-42557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701526#comment-17701526 ] Apache Spark commented on SPARK-42557: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40463 > Add Broadcast to functions > -- > > Key: SPARK-42557 > URL: https://issues.apache.org/jira/browse/SPARK-42557 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.1 > > > Add the {{broadcast}} function to functions.scala. Please check if we can get > the same semantics as the current implementation using unresolved hints. > https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1246-L1261 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
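[Editor's note] The `broadcast` hint discussed in SPARK-42557 asks the planner to materialize the small side of a join on every task and stream the large side against it. The following is a minimal pure-Python sketch of that broadcast-hash-join idea; the function name and sample data are illustrative, not Spark APIs.

```python
# Sketch of the broadcast-hash-join strategy the `broadcast` hint requests:
# the small side is built into a hash map once ("broadcast" to every task),
# and the large side is streamed against it. Illustrative only, not Spark code.

def broadcast_hash_join(large_rows, small_rows, key):
    # Build phase: hash the (small) broadcast side.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Probe phase: stream the large side, emitting merged rows on key match.
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}

orders = [{"id": 1, "item": "a"}, {"id": 2, "item": "b"}, {"id": 1, "item": "c"}]
users = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
joined = list(broadcast_hash_join(orders, users, "id"))
# → [{'id': 1, 'name': 'x', 'item': 'a'}, {'id': 2, 'name': 'y', 'item': 'b'},
#    {'id': 1, 'name': 'x', 'item': 'c'}]
```

This is why the hint only makes sense when one side is small: the lookup table must fit in memory on every task.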
[jira] [Commented] (SPARK-42832) Remove repartition if it is the child of LocalLimit
[ https://issues.apache.org/jira/browse/SPARK-42832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701478#comment-17701478 ] Apache Spark commented on SPARK-42832: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/40462 > Remove repartition if it is the child of LocalLimit > --- > > Key: SPARK-42832 > URL: https://issues.apache.org/jira/browse/SPARK-42832 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42832) Remove repartition if it is the child of LocalLimit
[ https://issues.apache.org/jira/browse/SPARK-42832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42832: Assignee: (was: Apache Spark) > Remove repartition if it is the child of LocalLimit > --- > > Key: SPARK-42832 > URL: https://issues.apache.org/jira/browse/SPARK-42832 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42832) Remove repartition if it is the child of LocalLimit
[ https://issues.apache.org/jira/browse/SPARK-42832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42832: Assignee: Apache Spark > Remove repartition if it is the child of LocalLimit > --- > > Key: SPARK-42832 > URL: https://issues.apache.org/jira/browse/SPARK-42832 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
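[Editor's note] The rewrite proposed in SPARK-42832 can be sketched as a tree transformation: a repartition directly under a local limit adds shuffle cost without changing which rows the limit may return, so it can be spliced out. This is a toy plan representation with made-up node names, not Spark's Catalyst classes or the actual rule.

```python
# Toy model of "remove repartition if it is the child of LocalLimit".
# Node names are illustrative stand-ins for Catalyst operators.

class Node:
    def __init__(self, name, child=None):
        self.name, self.child = name, child

def eliminate_repartition_under_limit(plan):
    if plan is None:
        return None
    if plan.name == "LocalLimit" and plan.child and plan.child.name == "Repartition":
        # Splice out the shuffle: the limit's semantics do not depend on it.
        plan = Node("LocalLimit", plan.child.child)
    if plan.child:
        plan.child = eliminate_repartition_under_limit(plan.child)
    return plan

plan = Node("LocalLimit", Node("Repartition", Node("Scan")))
optimized = eliminate_repartition_under_limit(plan)
# optimized is now LocalLimit -> Scan, with the Repartition removed
```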
[jira] [Commented] (SPARK-42831) Show result expressions in AggregateExec
[ https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701471#comment-17701471 ] Apache Spark commented on SPARK-42831: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/40461 > Show result expressions in AggregateExec > > > Key: SPARK-42831 > URL: https://issues.apache.org/jira/browse/SPARK-42831 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Minor > > If the result expressions in AggregateExec are not empty, we should display > them. Or we will get confused because some important expressions do not show > up in the DAG. > For example, the plan for query *SELECT sum(p) from values(cast(23.4 as > decimal(7,2))) t(p)* was incorrect because the result expression > *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed > Before > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > output=[sum#5L]) > +- LocalTableScan [p#0] > {code} > After > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > results=[sum#13L], output=[sum#13L]) > +- LocalTableScan [p#0] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42831) Show result expressions in AggregateExec
[ https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42831: Assignee: (was: Apache Spark) > Show result expressions in AggregateExec > > > Key: SPARK-42831 > URL: https://issues.apache.org/jira/browse/SPARK-42831 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Minor > > If the result expressions in AggregateExec are not empty, we should display > them. Or we will get confused because some important expressions do not show > up in the DAG. > For example, the plan for query *SELECT sum(p) from values(cast(23.4 as > decimal(7,2))) t(p)* was incorrect because the result expression > *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed > Before > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > output=[sum#5L]) > +- LocalTableScan [p#0] > {code} > After > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > results=[sum#13L], output=[sum#13L]) > +- LocalTableScan [p#0] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42831) Show result expressions in AggregateExec
[ https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701470#comment-17701470 ] Apache Spark commented on SPARK-42831: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/40461 > Show result expressions in AggregateExec > > > Key: SPARK-42831 > URL: https://issues.apache.org/jira/browse/SPARK-42831 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Minor > > If the result expressions in AggregateExec are not empty, we should display > them. Or we will get confused because some important expressions do not show > up in the DAG. > For example, the plan for query *SELECT sum(p) from values(cast(23.4 as > decimal(7,2))) t(p)* was incorrect because the result expression > *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed > Before > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > output=[sum#5L]) > +- LocalTableScan [p#0] > {code} > After > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > results=[sum#13L], output=[sum#13L]) > +- LocalTableScan [p#0] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42831) Show result expressions in AggregateExec
[ https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42831: Assignee: Apache Spark > Show result expressions in AggregateExec > > > Key: SPARK-42831 > URL: https://issues.apache.org/jira/browse/SPARK-42831 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Minor > > If the result expressions in AggregateExec are not empty, we should display > them. Or we will get confused because some important expressions do not show > up in the DAG. > For example, the plan for query *SELECT sum(p) from values(cast(23.4 as > decimal(7,2))) t(p)* was incorrect because the result expression > *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed > Before > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > output=[sum#5L]) > +- LocalTableScan [p#0] > {code} > After > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > results=[sum#13L], output=[sum#13L]) > +- LocalTableScan [p#0] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
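[Editor's note] The plans in SPARK-42831 sum decimals via their unscaled (integer) representation and only convert back at the end; that final conversion is exactly the hidden result expression `MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2)`. A small re-enactment of that arithmetic with Python's `decimal` module (the helper names mirror the Spark expressions but this is not Spark code):

```python
from decimal import Decimal

SCALE = 2  # p is decimal(7,2)

def unscaled(d):
    # UnscaledValue: decimal(7,2) -> long, e.g. 23.40 -> 2340
    return int(d.scaleb(SCALE))

def make_decimal(v, scale):
    # MakeDecimal: long -> decimal, e.g. 2340 -> 23.40
    return Decimal(v).scaleb(-scale)

values = [Decimal("23.40")]
partial = sum(unscaled(v) for v in values)  # what partial_sum computes, as a long
result = make_decimal(partial, SCALE)       # the result expression's job
# partial → 2340, result → Decimal('23.40')
```

Without showing the `results=[...]` clause, the plan appears to output the raw long sum, which is why the ticket calls the old rendering confusing.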
[jira] [Assigned] (SPARK-42828) PySpark type hint returns Any for methods on GroupedData
[ https://issues.apache.org/jira/browse/SPARK-42828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42828: Assignee: Apache Spark > PySpark type hint returns Any for methods on GroupedData > > > Key: SPARK-42828 > URL: https://issues.apache.org/jira/browse/SPARK-42828 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Joe Wang >Assignee: Apache Spark >Priority: Minor > > Since upgrading to PySpark 3.3.x, type hints for > {code:java} > df.groupBy(...).count(){code} > are now returning Any instead of DataFrame, causing type inference issues > downstream. This used to be correctly typed prior to 3.3.x. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42828) PySpark type hint returns Any for methods on GroupedData
[ https://issues.apache.org/jira/browse/SPARK-42828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42828: Assignee: (was: Apache Spark) > PySpark type hint returns Any for methods on GroupedData > > > Key: SPARK-42828 > URL: https://issues.apache.org/jira/browse/SPARK-42828 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Joe Wang >Priority: Minor > > Since upgrading to PySpark 3.3.x, type hints for > {code:java} > df.groupBy(...).count(){code} > are now returning Any instead of DataFrame, causing type inference issues > downstream. This used to be correctly typed prior to 3.3.x. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42828) PySpark type hint returns Any for methods on GroupedData
[ https://issues.apache.org/jira/browse/SPARK-42828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701345#comment-17701345 ] Apache Spark commented on SPARK-42828: -- User 'j03wang' has created a pull request for this issue: https://github.com/apache/spark/pull/40460 > PySpark type hint returns Any for methods on GroupedData > > > Key: SPARK-42828 > URL: https://issues.apache.org/jira/browse/SPARK-42828 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Joe Wang >Priority: Minor > > Since upgrading to PySpark 3.3.x, type hints for > {code:java} > df.groupBy(...).count(){code} > are now returning Any instead of DataFrame, causing type inference issues > downstream. This used to be correctly typed prior to 3.3.x. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
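[Editor's note] The bug class in SPARK-42828 is general: when a method's return annotation is missing (or resolves to `Any`), type checkers infer `Any` for every downstream call. A minimal stand-in with illustrative stub classes (not PySpark's real definitions) shows the difference:

```python
# Illustrative stubs: an unannotated method degrades inference to Any,
# an explicit `-> DataFrame` annotation restores it. Not PySpark code.
from typing import get_type_hints

class DataFrame: ...

class GroupedDataBroken:
    def count(self):                  # no annotation -> checkers see Any
        return DataFrame()

class GroupedDataFixed:
    def count(self) -> DataFrame:     # explicit annotation fixes inference
        return DataFrame()

broken_hints = get_type_hints(GroupedDataBroken.count)  # {}
fixed_hints = get_type_hints(GroupedDataFixed.count)    # {'return': DataFrame}
```

Static checkers such as mypy behave the same way: `GroupedDataBroken().count().columns` passes silently as `Any`, while the fixed variant is checked against `DataFrame`.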
[jira] [Assigned] (SPARK-42826) Add migration note for API changes
[ https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42826: Assignee: (was: Apache Spark) > Add migration note for API changes > -- > > Key: SPARK-42826 > URL: https://issues.apache.org/jira/browse/SPARK-42826 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We deprecate & remove some APIs from > https://issues.apache.org/jira/browse/SPARK-42593. to follow the pandas. > We should mention this in migration guide. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42826) Add migration note for API changes
[ https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42826: Assignee: Apache Spark > Add migration note for API changes > -- > > Key: SPARK-42826 > URL: https://issues.apache.org/jira/browse/SPARK-42826 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > We deprecate & remove some APIs from > https://issues.apache.org/jira/browse/SPARK-42593. to follow the pandas. > We should mention this in migration guide. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42826) Add migration note for API changes
[ https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701295#comment-17701295 ] Apache Spark commented on SPARK-42826: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/40459 > Add migration note for API changes > -- > > Key: SPARK-42826 > URL: https://issues.apache.org/jira/browse/SPARK-42826 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We deprecate & remove some APIs from > https://issues.apache.org/jira/browse/SPARK-42593. to follow the pandas. > We should mention this in migration guide. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.
[ https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42824: Assignee: Apache Spark > Provide a clear error message for unsupported JVM attributes. > - > > Key: SPARK-42824 > URL: https://issues.apache.org/jira/browse/SPARK-42824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > There are attributes, such as "_jvm", that were accessible in PySpark but > cannot be accessed in Spark Connect. We need to display appropriate error > messages for these cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.
[ https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42824: Assignee: (was: Apache Spark) > Provide a clear error message for unsupported JVM attributes. > - > > Key: SPARK-42824 > URL: https://issues.apache.org/jira/browse/SPARK-42824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > There are attributes, such as "_jvm", that were accessible in PySpark but > cannot be accessed in Spark Connect. We need to display appropriate error > messages for these cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.
[ https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701220#comment-17701220 ] Apache Spark commented on SPARK-42824: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/40458 > Provide a clear error message for unsupported JVM attributes. > - > > Key: SPARK-42824 > URL: https://issues.apache.org/jira/browse/SPARK-42824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > There are attributes, such as "_jvm", that were accessible in PySpark but > cannot be accessed in Spark Connect. We need to display appropriate error > messages for these cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41233) High-order function: array_prepend
[ https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41233: Assignee: (was: Apache Spark) > High-order function: array_prepend > -- > > Key: SPARK-41233 > URL: https://issues.apache.org/jira/browse/SPARK-41233 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html > 1, about the data type validation: > In Snowflake’s array_append, array_prepend and array_insert functions, the > element data type does not need to match the data type of the existing > elements in the array. > While in Spark, we want to leverage the same data type validation as > array_remove. > 2, about the NULL handling > Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in > different ways. > Existing functions array_contains, array_position and array_remove in > SparkSQL handle NULL in this way: if the input array and/or element is NULL, > they return NULL. However, array_prepend should break from this behavior. > We should implement the NULL handling in array_prepend in this way: > 2.1, if the array is NULL, return NULL; > 2.2, if the array is not NULL and the element is NULL, prepend the NULL value to > the array -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41233) High-order function: array_prepend
[ https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41233: Assignee: Apache Spark > High-order function: array_prepend > -- > > Key: SPARK-41233 > URL: https://issues.apache.org/jira/browse/SPARK-41233 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html > 1, about the data type validation: > In Snowflake’s array_append, array_prepend and array_insert functions, the > element data type does not need to match the data type of the existing > elements in the array. > While in Spark, we want to leverage the same data type validation as > array_remove. > 2, about the NULL handling > Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in > different ways. > Existing functions array_contains, array_position and array_remove in > SparkSQL handle NULL in this way: if the input array and/or element is NULL, > they return NULL. However, array_prepend should break from this behavior. > We should implement the NULL handling in array_prepend in this way: > 2.1, if the array is NULL, return NULL; > 2.2, if the array is not NULL and the element is NULL, prepend the NULL value to > the array -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
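[Editor's note] The NULL semantics proposed in SPARK-41233 can be modeled directly, with Python's `None` standing in for SQL NULL. This is a reference sketch of the ticket's rules 2.1 and 2.2, not Spark's implementation:

```python
# Reference model of the proposed array_prepend NULL handling
# (None stands in for SQL NULL). Illustrative only.

def array_prepend(arr, elem):
    if arr is None:
        return None          # 2.1: NULL array -> NULL result
    return [elem] + arr      # 2.2: a NULL element is prepended, not swallowed

# array_prepend(None, 1)      → None
# array_prepend([2, 3], None) → [None, 2, 3]
# array_prepend([2, 3], 1)    → [1, 2, 3]
```

The contrast with `array_contains`/`array_position`/`array_remove` is the second case: those functions would return NULL for a NULL element, while here the NULL is kept as an array entry.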
[jira] [Assigned] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization
[ https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42823: Assignee: Apache Spark > spark-sql shell supports multipart namespaces for initialization > > > Key: SPARK-42823 > URL: https://issues.apache.org/jira/browse/SPARK-42823 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization
[ https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42823: Assignee: (was: Apache Spark) > spark-sql shell supports multipart namespaces for initialization > > > Key: SPARK-42823 > URL: https://issues.apache.org/jira/browse/SPARK-42823 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization
[ https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701078#comment-17701078 ] Apache Spark commented on SPARK-42823: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/40457 > spark-sql shell supports multipart namespaces for initialization > > > Key: SPARK-42823 > URL: https://issues.apache.org/jira/browse/SPARK-42823 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42720) Refactor the withSequenceColumn
[ https://issues.apache.org/jira/browse/SPARK-42720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701051#comment-17701051 ] Apache Spark commented on SPARK-42720: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/40456 > Refactor the withSequenceColumn > --- > > Key: SPARK-42720 > URL: https://issues.apache.org/jira/browse/SPARK-42720 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42720) Refactor the withSequenceColumn
[ https://issues.apache.org/jira/browse/SPARK-42720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42720: Assignee: (was: Apache Spark) > Refactor the withSequenceColumn > --- > > Key: SPARK-42720 > URL: https://issues.apache.org/jira/browse/SPARK-42720 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42720) Refactor the withSequenceColumn
[ https://issues.apache.org/jira/browse/SPARK-42720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42720: Assignee: Apache Spark > Refactor the withSequenceColumn > --- > > Key: SPARK-42720 > URL: https://issues.apache.org/jira/browse/SPARK-42720 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42720) Refactor the withSequenceColumn
[ https://issues.apache.org/jira/browse/SPARK-42720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701050#comment-17701050 ] Apache Spark commented on SPARK-42720: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/40456 > Refactor the withSequenceColumn > --- > > Key: SPARK-42720 > URL: https://issues.apache.org/jira/browse/SPARK-42720 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming
[ https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700994#comment-17700994 ] Apache Spark commented on SPARK-42819: -- User 'anishshri-db' has created a pull request for this issue: https://github.com/apache/spark/pull/40455 > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > --- > > Key: SPARK-42819 > URL: https://issues.apache.org/jira/browse/SPARK-42819 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Priority: Major > > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > > We need these settings in order to control memory tuning for RocksDB. We > already expose settings for blockCache size. However, these 2 settings are > missing. This change proposes to add them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming
[ https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42819: Assignee: Apache Spark > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > --- > > Key: SPARK-42819 > URL: https://issues.apache.org/jira/browse/SPARK-42819 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Assignee: Apache Spark >Priority: Major > > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > > We need these settings in order to control memory tuning for RocksDB. We > already expose settings for blockCache size. However, these 2 settings are > missing. This change proposes to add them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming
[ https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42819: Assignee: (was: Apache Spark) > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > --- > > Key: SPARK-42819 > URL: https://issues.apache.org/jira/browse/SPARK-42819 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Priority: Major > > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > > We need these settings in order to control memory tuning for RocksDB. We > already expose settings for blockCache size. However, these 2 settings are > missing. This change proposes to add them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming
[ https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700993#comment-17700993 ] Apache Spark commented on SPARK-42819: -- User 'anishshri-db' has created a pull request for this issue: https://github.com/apache/spark/pull/40455 > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > --- > > Key: SPARK-42819 > URL: https://issues.apache.org/jira/browse/SPARK-42819 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Priority: Major > > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > > We need these settings in order to control memory tuning for RocksDB. We > already expose settings for blockCache size. However, these 2 settings are > missing. This change proposes to add them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
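The change described above adds two memory-tuning knobs alongside the already-exposed block-cache setting. The exact configuration key names are decided in the linked PR; the keys below are illustrative assumptions modeled on the naming style of Spark's existing RocksDB state-store confs:

```python
# Hypothetical conf key names in the style of Spark's RocksDB state-store settings;
# check the linked PR (apache/spark pull 40455) for the final names and units.
rocksdb_memory_confs = {
    "spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB": "64",     # already exposed
    "spark.sql.streaming.stateStore.rocksdb.maxWriteBufferNumber": "3",  # assumed new knob
    "spark.sql.streaming.stateStore.rocksdb.writeBufferSizeMB": "16",    # assumed new knob
}

# These would be applied like any other SQL conf, e.g.
# for key, value in rocksdb_memory_confs.items():
#     spark.conf.set(key, value)
```

Larger write buffers reduce flush frequency at the cost of memory; capping the buffer count bounds worst-case memory per state store instance.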
[jira] [Commented] (SPARK-42821) Remove unused parameters in splitFiles methods
[ https://issues.apache.org/jira/browse/SPARK-42821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700940#comment-17700940 ] Apache Spark commented on SPARK-42821: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40454 > Remove unused parameters in splitFiles methods > -- > > Key: SPARK-42821 > URL: https://issues.apache.org/jira/browse/SPARK-42821 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42821) Remove unused parameters in splitFiles methods
[ https://issues.apache.org/jira/browse/SPARK-42821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700939#comment-17700939 ] Apache Spark commented on SPARK-42821: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40454 > Remove unused parameters in splitFiles methods > -- > > Key: SPARK-42821 > URL: https://issues.apache.org/jira/browse/SPARK-42821 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42821) Remove unused parameters in splitFiles methods
[ https://issues.apache.org/jira/browse/SPARK-42821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42821: Assignee: Apache Spark > Remove unused parameters in splitFiles methods > -- > > Key: SPARK-42821 > URL: https://issues.apache.org/jira/browse/SPARK-42821 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42821) Remove unused parameters in splitFiles methods
[ https://issues.apache.org/jira/browse/SPARK-42821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42821: Assignee: (was: Apache Spark) > Remove unused parameters in splitFiles methods > -- > > Key: SPARK-42821 > URL: https://issues.apache.org/jira/browse/SPARK-42821 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42820) Update ORC to 1.8.3
[ https://issues.apache.org/jira/browse/SPARK-42820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700934#comment-17700934 ] Apache Spark commented on SPARK-42820: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40453 > Update ORC to 1.8.3 > --- > > Key: SPARK-42820 > URL: https://issues.apache.org/jira/browse/SPARK-42820 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.5.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42820) Update ORC to 1.8.3
[ https://issues.apache.org/jira/browse/SPARK-42820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42820: Assignee: (was: Apache Spark) > Update ORC to 1.8.3 > --- > > Key: SPARK-42820 > URL: https://issues.apache.org/jira/browse/SPARK-42820 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.5.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42820) Update ORC to 1.8.3
[ https://issues.apache.org/jira/browse/SPARK-42820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42820: Assignee: Apache Spark > Update ORC to 1.8.3 > --- > > Key: SPARK-42820 > URL: https://issues.apache.org/jira/browse/SPARK-42820 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.5.0 >Reporter: William Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42820) Update ORC to 1.8.3
[ https://issues.apache.org/jira/browse/SPARK-42820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700933#comment-17700933 ] Apache Spark commented on SPARK-42820: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40453 > Update ORC to 1.8.3 > --- > > Key: SPARK-42820 > URL: https://issues.apache.org/jira/browse/SPARK-42820 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.5.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42818) Implement DataFrameReader/Writer.jdbc
[ https://issues.apache.org/jira/browse/SPARK-42818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700925#comment-17700925 ] Apache Spark commented on SPARK-42818: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40451 > Implement DataFrameReader/Writer.jdbc > - > > Key: SPARK-42818 > URL: https://issues.apache.org/jira/browse/SPARK-42818 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42818) Implement DataFrameReader/Writer.jdbc
[ https://issues.apache.org/jira/browse/SPARK-42818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42818: Assignee: Apache Spark > Implement DataFrameReader/Writer.jdbc > - > > Key: SPARK-42818 > URL: https://issues.apache.org/jira/browse/SPARK-42818 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42818) Implement DataFrameReader/Writer.jdbc
[ https://issues.apache.org/jira/browse/SPARK-42818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42818: Assignee: (was: Apache Spark) > Implement DataFrameReader/Writer.jdbc > - > > Key: SPARK-42818 > URL: https://issues.apache.org/jira/browse/SPARK-42818 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42818) Implement DataFrameReader/Writer.jdbc
[ https://issues.apache.org/jira/browse/SPARK-42818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700877#comment-17700877 ] Apache Spark commented on SPARK-42818: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40450 > Implement DataFrameReader/Writer.jdbc > - > > Key: SPARK-42818 > URL: https://issues.apache.org/jira/browse/SPARK-42818 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
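For context, the `DataFrameReader.jdbc`/`DataFrameWriter.jdbc` methods being brought to Spark Connect follow the long-standing PySpark signatures (`url`, `table`, `properties`). A sketch of a typical round trip; it requires a live SparkSession and a JDBC driver on the classpath to actually run, so it is only defined here:

```python
def copy_table_over_jdbc(spark, src_url, dst_url, table, properties):
    """Read a table over JDBC and append it into another database.

    `spark` is a SparkSession (classic, or Spark Connect once SPARK-42818 lands);
    `properties` is the usual dict of JDBC options, e.g. {"user": ..., "driver": ...}.
    """
    df = spark.read.jdbc(url=src_url, table=table, properties=properties)
    df.write.jdbc(url=dst_url, table=table, mode="append", properties=properties)
```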
[jira] [Assigned] (SPARK-42791) Create golden file test framework for analysis
[ https://issues.apache.org/jira/browse/SPARK-42791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42791: Assignee: (was: Apache Spark) > Create golden file test framework for analysis > -- > > Key: SPARK-42791 > URL: https://issues.apache.org/jira/browse/SPARK-42791 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > > Here we track the work to add new golden file test support for the Spark > analyzer. Each golden file can contain a list of SQL queries followed by the > string representations of their analyzed logical plans. > > This can be similar to Spark's existing `SQLQueryTestSuite` [1], but stopping > after analysis and listing analyzed plans as the results instead of fully > executing queries end-to-end. As another example, ZetaSQL has analyzer-based > golden file testing like this as well [2]. > > This way, any changes to analysis will show up as test diffs, which are easy > to spot in review and also easy to update automatically. This could help the > community maintain the quality of Apache Spark's query analysis. > > [1] > [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala] > > [2] > [https://github.com/google/zetasql/blob/master/zetasql/analyzer/testdata/limit.test]. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42791) Create golden file test framework for analysis
[ https://issues.apache.org/jira/browse/SPARK-42791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700809#comment-17700809 ] Apache Spark commented on SPARK-42791: -- User 'dtenedor' has created a pull request for this issue: https://github.com/apache/spark/pull/40449 > Create golden file test framework for analysis > -- > > Key: SPARK-42791 > URL: https://issues.apache.org/jira/browse/SPARK-42791 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > > Here we track the work to add new golden file test support for the Spark > analyzer. Each golden file can contain a list of SQL queries followed by the > string representations of their analyzed logical plans. > > This can be similar to Spark's existing `SQLQueryTestSuite` [1], but stopping > after analysis and listing analyzed plans as the results instead of fully > executing queries end-to-end. As another example, ZetaSQL has analyzer-based > golden file testing like this as well [2]. > > This way, any changes to analysis will show up as test diffs, which are easy > to spot in review and also easy to update automatically. This could help the > community maintain the quality of Apache Spark's query analysis. > > [1] > [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala] > > [2] > [https://github.com/google/zetasql/blob/master/zetasql/analyzer/testdata/limit.test]. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42791) Create golden file test framework for analysis
[ https://issues.apache.org/jira/browse/SPARK-42791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42791: Assignee: Apache Spark > Create golden file test framework for analysis > -- > > Key: SPARK-42791 > URL: https://issues.apache.org/jira/browse/SPARK-42791 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Apache Spark >Priority: Major > > Here we track the work to add new golden file test support for the Spark > analyzer. Each golden file can contain a list of SQL queries followed by the > string representations of their analyzed logical plans. > > This can be similar to Spark's existing `SQLQueryTestSuite` [1], but stopping > after analysis and listing analyzed plans as the results instead of fully > executing queries end-to-end. As another example, ZetaSQL has analyzer-based > golden file testing like this as well [2]. > > This way, any changes to analysis will show up as test diffs, which are easy > to spot in review and also easy to update automatically. This could help the > community maintain the quality of Apache Spark's query analysis. > > [1] > [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala] > > [2] > [https://github.com/google/zetasql/blob/master/zetasql/analyzer/testdata/limit.test]. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42817) Spark driver logs are filled with Initializing service data for shuffle service using name
[ https://issues.apache.org/jira/browse/SPARK-42817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700808#comment-17700808 ] Apache Spark commented on SPARK-42817: -- User 'otterc' has created a pull request for this issue: https://github.com/apache/spark/pull/40448 > Spark driver logs are filled with Initializing service data for shuffle > service using name > -- > > Key: SPARK-42817 > URL: https://issues.apache.org/jira/browse/SPARK-42817 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Chandni Singh >Priority: Major > > With SPARK-34828, we added the ability to make the shuffle service name > configurable and we added a log > [here|https://github.com/apache/spark/blob/8860f69455e5a722626194c4797b4b42cccd4510/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala#L118] > that logs the shuffle service name. However, this log is printed in the > driver logs whenever a new executor is launched, polluting the log. > {code} > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > {code} > We can just log this once in the driver. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42817) Spark driver logs are filled with Initializing service data for shuffle service using name
[ https://issues.apache.org/jira/browse/SPARK-42817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42817: Assignee: (was: Apache Spark) > Spark driver logs are filled with Initializing service data for shuffle > service using name > -- > > Key: SPARK-42817 > URL: https://issues.apache.org/jira/browse/SPARK-42817 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Chandni Singh >Priority: Major > > With SPARK-34828, we added the ability to make the shuffle service name > configurable and we added a log > [here|https://github.com/apache/spark/blob/8860f69455e5a722626194c4797b4b42cccd4510/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala#L118] > that logs the shuffle service name. However, this log is printed in the > driver logs whenever a new executor is launched, polluting the log. > {code} > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > {code} > We can just log this once in the driver. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
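The proposed fix direction ("just log this once in the driver") is the standard log-once guard. A generic sketch, not the actual ExecutorRunnable change:

```python
import logging
import threading

class LogOnce:
    """Emit a given message at most once per process, even across threads."""

    def __init__(self, logger):
        self._logger = logger
        self._seen = set()
        self._lock = threading.Lock()

    def info(self, msg):
        with self._lock:
            if msg in self._seen:
                return False  # already logged once; drop the repeat
            self._seen.add(msg)
        self._logger.info(msg)
        return True
```

Launching N executors then calls `info(...)` N times, but the shuffle-service-name line reaches the driver log only once.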