[jira] [Updated] (SPARK-38584) Unify the data validation
[ https://issues.apache.org/jira/browse/SPARK-38584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-38584: -- Priority: Major (was: Minor) > Unify the data validation > - > > Key: SPARK-38584 > URL: https://issues.apache.org/jira/browse/SPARK-38584 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > 1, input vector validation is missing in most algorithms; when the input > dataset contains some invalid values (NaN/Infinity), then: > * the training may run successfully and return a model containing invalid > coefficients, like LinearSVC > * the training may fail with an irrelevant message, like KMeans > > {code:java} > import org.apache.spark.ml.feature._ > import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.classification._ > import org.apache.spark.ml.clustering._ > val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, > Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, > 2.0)))).toDF() > val svc = new LinearSVC() > val model = svc.fit(df) > scala> model.intercept > res0: Double = NaN > scala> model.coefficients > res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] > val km = new KMeans().setK(2) > scala> km.fit(df) > 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID > 113) > java.lang.IllegalArgumentException: requirement failed: Both norms should be > greater or equal to 0.0, found norm1=NaN, norm2=Infinity > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543) > {code} > > We should make ML algorithms fail fast if the input dataset is invalid. > > 2, there exist some methods to validate input labels and weights in > different files: > * {{org.apache.spark.ml.functions}} > * org.apache.spark.ml.util.DatasetUtils > * org.apache.spark.ml.util.MetadataUtils, > * org.apache.spark.ml.Predictor > * etc. > > I think it is time to unify the related methods into one source file. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
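A minimal sketch of the fail-fast check described above, assuming a hypothetical helper name and a default "features" column (illustrative only, not the unified API this ticket proposes):

{code:java}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

// Hypothetical helper: fail fast when any feature vector contains NaN/Infinity,
// instead of returning NaN coefficients (LinearSVC) or failing later with an
// unrelated error (KMeans).
def validateVectors(df: DataFrame, featuresCol: String = "features"): Unit = {
  val isInvalid = udf { v: Vector =>
    v != null && v.toArray.exists(d => d.isNaN || d.isInfinite)
  }
  val badCount = df.filter(isInvalid(df(featuresCol))).count()
  require(badCount == 0,
    s"Column '$featuresCol' contains $badCount row(s) with NaN/Infinity values.")
}
{code}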
[jira] [Commented] (SPARK-42584) Improve output of Column.explain
[ https://issues.apache.org/jira/browse/SPARK-42584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701498#comment-17701498 ] jiaan.geng commented on SPARK-42584: I will take a look! > Improve output of Column.explain > > > Key: SPARK-42584 > URL: https://issues.apache.org/jira/browse/SPARK-42584 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > We currently display the structure of the proto in both the regular and > extended versions of explain. We should display a more compact SQL-like > string for the regular version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
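For context, Column.explain can be exercised as below (standard API usage; under Spark Connect the regular output currently prints the proto structure that this ticket wants to compact):

{code:java}
import org.apache.spark.sql.functions._

// Column.explain prints the column's underlying expression; the regular and
// extended forms are what this ticket proposes to differentiate better.
val c = col("a") + lit(1)
c.explain(false) // regular version
c.explain(true)  // extended version
{code}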
[jira] [Commented] (SPARK-42778) QueryStageExec should respect supportsRowBased
[ https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701487#comment-17701487 ] Dongjoon Hyun commented on SPARK-42778: --- This is backported to branch-3.4 via https://github.com/apache/spark/pull/40417 > QueryStageExec should respect supportsRowBased > -- > > Key: SPARK-42778 > URL: https://issues.apache.org/jira/browse/SPARK-42778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42778) QueryStageExec should respect supportsRowBased
[ https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42778: -- Fix Version/s: 3.4.0 (was: 3.5.0) > QueryStageExec should respect supportsRowBased > -- > > Key: SPARK-42778 > URL: https://issues.apache.org/jira/browse/SPARK-42778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42778) QueryStageExec should respect supportsRowBased
[ https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42778: -- Affects Version/s: 3.4.0 (was: 3.5.0) > QueryStageExec should respect supportsRowBased > -- > > Key: SPARK-42778 > URL: https://issues.apache.org/jira/browse/SPARK-42778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42778) QueryStageExec should respect supportsRowBased
[ https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42778: - Assignee: XiDuo You > QueryStageExec should respect supportsRowBased > -- > > Key: SPARK-42778 > URL: https://issues.apache.org/jira/browse/SPARK-42778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42814) Upgrade some maven-plugins
[ https://issues.apache.org/jira/browse/SPARK-42814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42814: - Assignee: Yang Jie > Upgrade some maven-plugins > -- > > Key: SPARK-42814 > URL: https://issues.apache.org/jira/browse/SPARK-42814 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > maven-enforcer-plugin 3.0.0-M2 -> 3.2.1 > - [https://github.com/apache/maven-enforcer/releases/tag/enforcer-3.2.1] > - [https://github.com/apache/maven-enforcer/releases/tag/enforcer-3.1.0] > build-helper-maven-plugin 3.2.0 -> 3.3.0 > - > [https://github.com/mojohaus/build-helper-maven-plugin/releases/tag/build-helper-maven-plugin-3.3.0] > maven-compiler-plugin 3.10.1 -> 3.11.0 > - > [https://github.com/apache/maven-compiler-plugin/releases/tag/maven-compiler-plugin-3.11.0] > maven-surefire-plugin 3.0.0-M9 -> 3.0.0 > - [https://github.com/apache/maven-surefire/releases/tag/surefire-3.0.0] > maven-javadoc-plugin 3.4.1 -> 3.5.0 > - > [https://github.com/apache/maven-javadoc-plugin/releases/tag/maven-javadoc-plugin-3.5.0] > maven-deploy-plugin 3.0.0 -> 3.1.0 > - > [https://github.com/apache/maven-deploy-plugin/releases/tag/maven-deploy-plugin-3.1.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42814) Upgrade some maven-plugins
[ https://issues.apache.org/jira/browse/SPARK-42814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42814. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40445 [https://github.com/apache/spark/pull/40445] > Upgrade some maven-plugins > -- > > Key: SPARK-42814 > URL: https://issues.apache.org/jira/browse/SPARK-42814 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0 > > > maven-enforcer-plugin 3.0.0-M2 -> 3.2.1 > - [https://github.com/apache/maven-enforcer/releases/tag/enforcer-3.2.1] > - [https://github.com/apache/maven-enforcer/releases/tag/enforcer-3.1.0] > build-helper-maven-plugin 3.2.0 -> 3.3.0 > - > [https://github.com/mojohaus/build-helper-maven-plugin/releases/tag/build-helper-maven-plugin-3.3.0] > maven-compiler-plugin 3.10.1 -> 3.11.0 > - > [https://github.com/apache/maven-compiler-plugin/releases/tag/maven-compiler-plugin-3.11.0] > maven-surefire-plugin 3.0.0-M9 -> 3.0.0 > - [https://github.com/apache/maven-surefire/releases/tag/surefire-3.0.0] > maven-javadoc-plugin 3.4.1 -> 3.5.0 > - > [https://github.com/apache/maven-javadoc-plugin/releases/tag/maven-javadoc-plugin-3.5.0] > maven-deploy-plugin 3.0.0 -> 3.1.0 > - > [https://github.com/apache/maven-deploy-plugin/releases/tag/maven-deploy-plugin-3.1.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization
[ https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42823: -- Fix Version/s: 3.4.1 (was: 3.4.0) > spark-sql shell supports multipart namespaces for initialization > > > Key: SPARK-42823 > URL: https://issues.apache.org/jira/browse/SPARK-42823 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization
[ https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42823: - Assignee: Kent Yao > spark-sql shell supports multipart namespaces for initialization > > > Key: SPARK-42823 > URL: https://issues.apache.org/jira/browse/SPARK-42823 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization
[ https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42823. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40457 [https://github.com/apache/spark/pull/40457] > spark-sql shell supports multipart namespaces for initialization > > > Key: SPARK-42823 > URL: https://issues.apache.org/jira/browse/SPARK-42823 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42832) Remove repartition if it is the child of LocalLimit
[ https://issues.apache.org/jira/browse/SPARK-42832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701478#comment-17701478 ] Apache Spark commented on SPARK-42832: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/40462 > Remove repartition if it is the child of LocalLimit > --- > > Key: SPARK-42832 > URL: https://issues.apache.org/jira/browse/SPARK-42832 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42832) Remove repartition if it is the child of LocalLimit
[ https://issues.apache.org/jira/browse/SPARK-42832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42832: Assignee: (was: Apache Spark) > Remove repartition if it is the child of LocalLimit > --- > > Key: SPARK-42832 > URL: https://issues.apache.org/jira/browse/SPARK-42832 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42832) Remove repartition if it is the child of LocalLimit
[ https://issues.apache.org/jira/browse/SPARK-42832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42832: Assignee: Apache Spark > Remove repartition if it is the child of LocalLimit > --- > > Key: SPARK-42832 > URL: https://issues.apache.org/jira/browse/SPARK-42832 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42831) Show result expressions in AggregateExec
[ https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701471#comment-17701471 ] Apache Spark commented on SPARK-42831: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/40461 > Show result expressions in AggregateExec > > > Key: SPARK-42831 > URL: https://issues.apache.org/jira/browse/SPARK-42831 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Minor > > If the result expressions in AggregateExec are not empty, we should display > them. Or we will get confused because some important expressions do not show > up in the DAG. > For example, the plan for query *SELECT sum(p) from values(cast(23.4 as > decimal(7,2))) t(p)* was incorrect because the result expression > *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed > Before > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > output=[sum#5L]) > +- LocalTableScan [p#0] > {code} > After > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > results=[sum#13L], output=[sum#13L]) > +- LocalTableScan [p#0] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42831) Show result expressions in AggregateExec
[ https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42831: Assignee: (was: Apache Spark) > Show result expressions in AggregateExec > > > Key: SPARK-42831 > URL: https://issues.apache.org/jira/browse/SPARK-42831 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Minor > > If the result expressions in AggregateExec are not empty, we should display > them. Or we will get confused because some important expressions do not show > up in the DAG. > For example, the plan for query *SELECT sum(p) from values(cast(23.4 as > decimal(7,2))) t(p)* was incorrect because the result expression > *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed > Before > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > output=[sum#5L]) > +- LocalTableScan [p#0] > {code} > After > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > results=[sum#13L], output=[sum#13L]) > +- LocalTableScan [p#0] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42831) Show result expressions in AggregateExec
[ https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701470#comment-17701470 ] Apache Spark commented on SPARK-42831: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/40461 > Show result expressions in AggregateExec > > > Key: SPARK-42831 > URL: https://issues.apache.org/jira/browse/SPARK-42831 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Minor > > If the result expressions in AggregateExec are not empty, we should display > them. Or we will get confused because some important expressions do not show > up in the DAG. > For example, the plan for query *SELECT sum(p) from values(cast(23.4 as > decimal(7,2))) t(p)* was incorrect because the result expression > *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed > Before > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > output=[sum#5L]) > +- LocalTableScan [p#0] > {code} > After > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > results=[sum#13L], output=[sum#13L]) > +- LocalTableScan [p#0] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42831) Show result expressions in AggregateExec
[ https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42831: Assignee: Apache Spark > Show result expressions in AggregateExec > > > Key: SPARK-42831 > URL: https://issues.apache.org/jira/browse/SPARK-42831 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Minor > > If the result expressions in AggregateExec are not empty, we should display > them. Or we will get confused because some important expressions do not show > up in the DAG. > For example, the plan for query *SELECT sum(p) from values(cast(23.4 as > decimal(7,2))) t(p)* was incorrect because the result expression > *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed > Before > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > output=[sum#5L]) > +- LocalTableScan [p#0] > {code} > After > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > results=[sum#13L], output=[sum#13L]) > +- LocalTableScan [p#0] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42831) Show result expressions in AggregateExec
[ https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-42831: Description: If the result expressions in AggregateExec are not empty, we should display them. Or we will get confused because some important expressions do not show up in the DAG. For example, the plan for query *SELECT sum(p) from values(cast(23.4 as decimal(7,2))) t(p)* was incorrect because the result expression *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed Before {code:java} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], output=[sum(p)#2]) +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11] +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], output=[sum#5L]) +- LocalTableScan [p#0] {code} After {code:java} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], output=[sum(p)#2]) +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38] +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], results=[sum#13L], output=[sum#13L]) +- LocalTableScan [p#0] {code} was: If the result expressions in AggregateExec are non-empty, we should show them. Otherwise we will be confused because some important expressions are not shown in the DAG. For example, the plan of query SELECT sum(p) from values(cast(23.4 as decimal(7,2))) t(p) was not correct because the result expression MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2 was not showing Before {code:java} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], output=[sum(p)#2]) +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11] +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], output=[sum#5L]) +- LocalTableScan [p#0] {code} After {code:java} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], output=[sum(p)#2]) +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38] +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], results=[sum#13L], output=[sum#13L]) +- LocalTableScan [p#0] {code} > Show result expressions in AggregateExec > > > Key: SPARK-42831 > URL: https://issues.apache.org/jira/browse/SPARK-42831 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Minor > > If the result expressions in AggregateExec are not empty, we should display > them. Or we will get confused because some important expressions do not show > up in the DAG. > For example, the plan for query *SELECT sum(p) from values(cast(23.4 as > decimal(7,2))) t(p)* was incorrect because the result expression > *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed > Before > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > output=[sum#5L]) > +- LocalTableScan [p#0] > {code} > After > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > results=[sum#13L], output=[sum#13L]) > +- LocalTableScan [p#0] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
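The plans quoted above can be reproduced in a spark-shell session with a plain explain() call:

{code:java}
// Standard API usage; prints the physical plan discussed in this ticket.
spark.sql("SELECT sum(p) FROM values(cast(23.4 AS decimal(7,2))) t(p)").explain()
{code}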
[jira] [Created] (SPARK-42832) Remove repartition if it is the child of LocalLimit
Yuming Wang created SPARK-42832: --- Summary: Remove repartition if it is the child of LocalLimit Key: SPARK-42832 URL: https://issues.apache.org/jira/browse/SPARK-42832 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
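One way such a rule could look in Catalyst terms (an illustrative sketch only, not the actual pull request; RepartitionByExpression and other shuffle-introducing nodes would need similar handling):

{code:java}
import org.apache.spark.sql.catalyst.plans.logical.{LocalLimit, LogicalPlan, Repartition}
import org.apache.spark.sql.catalyst.rules.Rule

// A Repartition directly below a LocalLimit only reshuffles data without
// affecting the rows ultimately returned by the enclosing limit, so it can
// be dropped to avoid a shuffle.
object RemoveRepartitionBeforeLocalLimit extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp {
    case limit @ LocalLimit(_, r: Repartition) =>
      limit.copy(child = r.child)
  }
}
{code}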
[jira] [Updated] (SPARK-42831) Show result expressions in AggregateExec
[ https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-42831: Description: If the result expressions in AggregateExec are non-empty, we should show them. Otherwise we will be confused because some important expressions are not shown in the DAG. For example, the plan of query SELECT sum(p) from values(cast(23.4 as decimal(7,2))) t(p) was not correct because the result expression MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2 was not showing Before {code:java} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], output=[sum(p)#2]) +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11] +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], output=[sum#5L]) +- LocalTableScan [p#0] {code} After {code:java} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], output=[sum(p)#2]) +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38] +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], results=[sum#13L], output=[sum#13L]) +- LocalTableScan [p#0] {code} was: If the result expressions in AggregateExec are non-empty, we should show them. Otherwise we will be confused because some important expressions are not shown in the DAG. For example, the plan of query SELECT sum(p) from values(cast(23.4 as decimal(7,2))) t(p) was not correct because the result expression MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2 was not showing > Show result expressions in AggregateExec > > > Key: SPARK-42831 > URL: https://issues.apache.org/jira/browse/SPARK-42831 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Minor > > If the result expressions in AggregateExec are non-empty, we should show them. > Otherwise we will be confused because some important expressions are not shown in > the DAG. > For example, the plan of query SELECT sum(p) from values(cast(23.4 as > decimal(7,2))) t(p) was not correct because the result expression > MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2 was not showing > Before > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > output=[sum#5L]) > +- LocalTableScan [p#0] > {code} > After > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], > results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], > output=[sum(p)#2]) >+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38] > +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], > results=[sum#13L], output=[sum#13L]) > +- LocalTableScan [p#0] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42831) Show result expressions in AggregateExec
Wan Kun created SPARK-42831: --- Summary: Show result expressions in AggregateExec Key: SPARK-42831 URL: https://issues.apache.org/jira/browse/SPARK-42831 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Wan Kun If the result expressions in AggregateExec are non-empty, we should show them. Otherwise we will be confused because some important expressions are not shown in the DAG. For example, the plan of query SELECT sum(p) from values(cast(23.4 as decimal(7,2))) t(p) was not correct because the result expression MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2 was not showing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42415) The built-in dialects support OFFSET and paging query.
[ https://issues.apache.org/jira/browse/SPARK-42415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng resolved SPARK-42415. Resolution: Won't Fix > The built-in dialects support OFFSET and paging query. > -- > > Key: SPARK-42415 > URL: https://issues.apache.org/jira/browse/SPARK-42415 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.
[ https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42824: Assignee: Haejoon Lee > Provide a clear error message for unsupported JVM attributes. > - > > Key: SPARK-42824 > URL: https://issues.apache.org/jira/browse/SPARK-42824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > There are attributes, such as "_jvm", that were accessible in PySpark but > cannot be accessed in Spark Connect. We need to display appropriate error > messages for these cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.
[ https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42824. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40458 [https://github.com/apache/spark/pull/40458] > Provide a clear error message for unsupported JVM attributes. > - > > Key: SPARK-42824 > URL: https://issues.apache.org/jira/browse/SPARK-42824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > There are attributes, such as "_jvm", that were accessible in PySpark but > cannot be accessed in Spark Connect. We need to display appropriate error > messages for these cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases
[ https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-42551: Description: h1. *Design Sketch* * Get all common expressions from input expressions. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit_n* to indicate whether we have already evaluated the common expression, and reset it to *false* at the start of operator.consume() * Add a new wrapper subExpr function for common subexpression, and replace all the common subexpression with the wrapper function. {code:java} private void subExpr_n(${argList.mkString(", ")}) { if (!subExprInit_n) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } } {code} h1. *Newly supported subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* {code:java} SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v) {code} We can reuse the result of expression *v + 1* {code:java} SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c * h2. *Support subexpression elimination in FilterExec* {code:java} SELECT * FROM ( SELECT v * v + 1 v1 from values(1) as t2(v) ) t where v1 > 5 and v1 < 10 {code} We can reuse the result of expression v * v + 1 * h2. *Support subexpression elimination in JoinExec* {code:java} SELECT * FROM values(1, 1) as t1(a, b) join values(1, 2) as t2(x, y) ON b * y between 2 and 3 {code} We can reuse the result of expression b * y * h2. *Support subexpression elimination in ExpandExec* {code:java} SELECT a, count(b), count(distinct case when b > 1 then b + c else null end) as count_bc_1, count(distinct case when b < 0 then b + c else null end) as count_bc_2 FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c was: h1. *Design Sketch* * Get all common expressions from input expressions. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit_n* to indicate whether we have already evaluated the common expression, and reset it to *false* at the start of operator.consume() * Add a new wrapper subExpr function for common subexpression, and replace all the common subexpression with the wrapper function. |private void subExpr_n(${argList.mkString(", ")}) { if (!subExprInit_n) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| h1. *Newly supported subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* |SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v)| We can reuse the result of expression *v + 1* |SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc FROM values(1, 1, 1) as t(a, b, c) GROUP BY a| We can reuse the result of expression b + c * h2. *Support subexpression elimination in FilterExec* |SELECT * FROM ( SELECT v * v + 1 v1 from values(1) as t2(v) ) t where v1 > 5 and v1 < 10| We can reuse the result of expression v * v + 1 * h2. *Support subexpression elimination in JoinExec* |SELECT * FROM values(1, 1) as t1(a, b) join values(1, 2) as t2(x, y) ON b * y between 2 and 3| We can reuse the result of expression b * y * h2. *Support subexpression elimination in ExpandExec* |SELECT a, count(b), count(distinct case when b > 1 then b + c else null end) as count_bc_1, count(distinct case when b < 0 then b + c else null end) as count_bc_2 FROM values(1, 1, 1) as t(a, b, c) GROUP BY a| We can reuse the result of expression b + c > Support more subexpression elimination cases > > > Key: SPARK-42551 > URL: https://issues.apache.org/jira/browse/SPARK-42551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: Wan Kun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
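The detection step in the design sketch above can be illustrated with a small, self-contained helper (a sketch of the idea only, not Spark's actual EquivalentExpressions machinery):

{code:java}
import org.apache.spark.sql.catalyst.expressions.Expression

// Count every canonicalized subtree across the input expressions and keep the
// non-leaf subtrees that occur more than once; conditional branches are
// visited just like any other child, as the design sketch requires.
def commonSubexpressions(exprs: Seq[Expression]): Seq[Expression] = {
  val counts = scala.collection.mutable.Map.empty[Expression, Int].withDefaultValue(0)
  def visit(e: Expression): Unit = {
    counts(e.canonicalized) += 1
    e.children.foreach(visit)
  }
  exprs.foreach(visit)
  counts.collect { case (e, n) if n > 1 && e.children.nonEmpty => e }.toSeq
}
{code}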
[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases
[ https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-42551: Description: h1. *Design Sketch* * Get all common expressions from input expressions. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit_n* to indicate whether we have already evaluated the common expression, and reset it to *false* at the start of operator.consume() * Add a new wrapper subExpr function for common subexpression, and replace all the common subexpression with the wrapper function. |private void subExpr_n(${argList.mkString(", ")}) { if (!subExprInit_n) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| h1. *Newly supported subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* |SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v)| We can reuse the result of expression *v + 1* |SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc FROM values(1, 1, 1) as t(a, b, c) GROUP BY a| We can reuse the result of expression b + c * h2. *Support subexpression elimination in FilterExec* |SELECT * FROM ( SELECT v * v + 1 v1 from values(1) as t2(v) ) t where v1 > 5 and v1 < 10| We can reuse the result of expression v * v + 1 * h2. *Support subexpression elimination in JoinExec* |SELECT * FROM values(1, 1) as t1(a, b) join values(1, 2) as t2(x, y) ON b * y between 2 and 3| We can reuse the result of expression b * y * h2. *Support subexpression elimination in ExpandExec* |SELECT a, count(b), count(distinct case when b > 1 then b + c else null end) as count_bc_1, count(distinct case when b < 0 then b + c else null end) as count_bc_2 FROM values(1, 1, 1) as t(a, b, c) GROUP BY a| We can reuse the result of expression b + c was: h1. *Design Sketch* * Get all common expressions from input expressions. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit_n* to indicate whether we have already evaluated the common expression, and reset it to *false* at the start of operator.consume() * Add a new wrapper subExpr function for common subexpression, and replace all the common subexpression with the wrapper function. |private void subExpr_n(${argList.mkString(", ")}) { if (!subExprInit_n) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| h1. *Newly supported subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* |SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v)| We can reuse the result of expression *v + 1* |SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc FROM values(1, 1, 1) as t(a, b, c) GROUP BY a| We can reuse the result of expression b + c * h2. *Support subexpression elimination in FilterExec* |SELECT * FROM ( SELECT v * v + 1 v1 from values(1) as t2(v) ) t where v1 > 5 and v1 < 10| We can reuse the result of expression v * v + 1 * h2. *Support subexpression elimination in JoinExec* |WITH t1 ( SELECT * FROM values(1, 1) as t(a, b)), t2 ( SELECT * FROM values(1, 2) as t(x, y)) SELECT * FROM t1 join t2 ON b * y between 2 and 3| We can reuse the result of expression b * y * h2. *Support subexpression elimination in ExpandExec* |SELECT a, count(b), count(distinct case when b > 1 then b + c else null end) as count_bc_1, count(distinct case when b < 0 then b + c else null end) as count_bc_2 FROM values(1, 1, 1) as t(a, b, c) GROUP BY a| We can reuse the result of expression b + c > Support more subexpression elimination cases > > > Key: SPARK-42551 > URL: https://issues.apache.org/jira/browse/SPARK-42551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: Wan Kun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases
[ https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-42551: Description: h1. *Design Sketch* * Get all common expressions from input expressions. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit_n* to indicate whether we have already evaluated the common expression, and reset it to *false* at the start of operator.consume() * Add a new wrapper subExpr function for common subexpression, and replace all the common subexpression with the wrapper function. |private void subExpr_n(${argList.mkString(", ")}) { if (!subExprInit_n) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| h1. *Newly supported subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* |SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v)| We can reuse the result of expression *v + 1* |SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc FROM values(1, 1, 1) as t(a, b, c) GROUP BY a| We can reuse the result of expression b + c * h2. *Support subexpression elimination in FilterExec* |SELECT * FROM ( SELECT v * v + 1 v1 from values(1) as t2(v) ) t where v1 > 5 and v1 < 10| We can reuse the result of expression v * v + 1 * h2. *Support subexpression elimination in JoinExec* |WITH t1 ( SELECT * FROM values(1, 1) as t(a, b)), t2 ( SELECT * FROM values(1, 2) as t(x, y)) SELECT * FROM t1 join t2 ON b * y between 2 and 3| We can reuse the result of expression b * y * h2. *Support subexpression elimination in ExpandExec* |SELECT a, count(b), count(distinct case when b > 1 then b + c else null end) as count_bc_1, count(distinct case when b < 0 then b + c else null end) as count_bc_2 FROM values(1, 1, 1) as t(a, b, c) GROUP BY a| We can reuse the result of expression b + c was: Just like SPARK-33092, we can support subexpression elimination in FilterExec in Whole-stage codegen. For example: {code:java} SELECT * FROM ( SELECT v, v * v + 1 v1 from values(1) as t2(v) ) t where v > 0 and v1 > 5 and v1 < 10 {code} Codegen plan {code:java} *(1) Project [v#1, ((v#1 * v#1) + 1) AS v1#0] +- *(1) Filter (((v#1 > 0) AND (((v#1 * v#1) + 1) > 5)) AND (((v#1 * v#1) + 1) < 10)) +- *(1) LocalTableScan [v#1] {code} The subexpression *(v#1 * v#1) + 1* will be executed twice. > Support more subexpression elimination cases > > > Key: SPARK-42551 > URL: https://issues.apache.org/jira/browse/SPARK-42551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: Wan Kun >Priority: Major > > h1. *Design Sketch* > * Get all common expressions from input expressions. Recursively visits all > subexpressions regardless of whether the current expression is a conditional > expression. > * For each common expression: > * Add a new boolean variable *subExprInit_n* to indicate whether we have > already evaluated the common expression, and reset it to *false* at the start > of operator.consume() > * Add a new wrapper subExpr function for common subexpression, and replace > all the common subexpression with the wrapper function. > |private void subExpr_n(${argList.mkString(", ")}) { > if (!subExprInit_n) { > ${eval.code} > subExprInit_n = true; > subExprIsNull_n = ${eval.isNull}; > subExprValue_n = ${eval.value}; > } > }| > h1. *Newly supported subexpression elimination patterns* > * > h2. *Support subexpression elimination with conditional expressions* > |SELECT case when v + 2 > 1 then 1 > when v + 1 > 2 then 2 > when v + 1 > 3 then 3 END vv > FROM values(1) as t2(v)| > We can reuse the result of expression *v + 1* > > |SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) > min_bc > FROM values(1, 1, 1) as t(a, b, c) > GROUP BY a| > We can reuse the result of expression b + c > * > h2. *Support subexpression elimination in FilterExec* > > |SELECT * FROM ( > SELECT v * v + 1 v1 from values(1) as t2(v) > ) t > where v1 > 5 and v1 < 10| > We can reuse the result of expression v * v + 1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases
[ https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-42551: Summary: Support more subexpression elimination cases (was: Support subexpression elimination in FilterExec and JoinExec) > Support more subexpression elimination cases > > > Key: SPARK-42551 > URL: https://issues.apache.org/jira/browse/SPARK-42551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: Wan Kun >Priority: Major > > Just like SPARK-33092, we can support subexpression elimination in FilterExec > in Whole-stage codegen. > For example: > {code:java} > SELECT * FROM ( > SELECT v, v * v + 1 v1 from values(1) as t2(v) > ) t > where v > 0 and v1 > 5 and v1 < 10 > {code} > Codegen plan > {code:java} > *(1) Project [v#1, ((v#1 * v#1) + 1) AS v1#0] > +- *(1) Filter (((v#1 > 0) AND (((v#1 * v#1) + 1) > 5)) AND (((v#1 * v#1) + > 1) < 10)) >+- *(1) LocalTableScan [v#1] > {code} > The subexpression *(v#1 * v#1) + 1* will be executed twice. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42194) Allow `columns` parameter when creating DataFrame with Series.
[ https://issues.apache.org/jira/browse/SPARK-42194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-42194: - Parent: SPARK-39199 Issue Type: Sub-task (was: Bug) > Allow `columns` parameter when creating DataFrame with Series. > -- > > Key: SPARK-42194 > URL: https://issues.apache.org/jira/browse/SPARK-42194 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > pandas API on Spark doesn't allow creating a DataFrame from a Series when > specifying the `columns` parameter, as below: > {code:java} > >>> ps.DataFrame(psser, columns=["labels"]) > Traceback (most recent call last): > File "", line 1, in > File ".../spark/python/pyspark/pandas/frame.py", line 539, in __init__ > assert columns is None > AssertionError {code} > We should make it available. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38937) interpolate support param `limit_direction`
[ https://issues.apache.org/jira/browse/SPARK-38937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38937: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > interpolate support param `limit_direction` > --- > > Key: SPARK-38937 > URL: https://issues.apache.org/jira/browse/SPARK-38937 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39189) interpolate supports limit_area
[ https://issues.apache.org/jira/browse/SPARK-39189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39189: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > interpolate supports limit_area > --- > > Key: SPARK-39189 > URL: https://issues.apache.org/jira/browse/SPARK-39189 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38943) EWM support ignore_na
[ https://issues.apache.org/jira/browse/SPARK-38943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-38943: - Parent: SPARK-39199 Issue Type: Sub-task (was: Improvement) > EWM support ignore_na > - > > Key: SPARK-38943 > URL: https://issues.apache.org/jira/browse/SPARK-38943 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42826) Add migration notes for update to supported pandas version.
[ https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42826: Assignee: Haejoon Lee > Add migration notes for update to supported pandas version. > --- > > Key: SPARK-42826 > URL: https://issues.apache.org/jira/browse/SPARK-42826 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > We deprecated & removed some APIs in > https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas. > We should mention this in the migration guide. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42826) Add migration notes for update to supported pandas version.
[ https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42826. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40459 [https://github.com/apache/spark/pull/40459] > Add migration notes for update to supported pandas version. > --- > > Key: SPARK-42826 > URL: https://issues.apache.org/jira/browse/SPARK-42826 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > We deprecated & removed some APIs in > https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas. > We should mention this in the migration guide. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42817) Spark driver logs are filled with Initializing service data for shuffle service using name
[ https://issues.apache.org/jira/browse/SPARK-42817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42817. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40448 [https://github.com/apache/spark/pull/40448] > Spark driver logs are filled with Initializing service data for shuffle > service using name > -- > > Key: SPARK-42817 > URL: https://issues.apache.org/jira/browse/SPARK-42817 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Fix For: 3.4.0 > > > With SPARK-34828, we added the ability to make the shuffle service name > configurable and we added a log > [here|https://github.com/apache/spark/blob/8860f69455e5a722626194c4797b4b42cccd4510/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala#L118] > that logs the shuffle service name. However, this log line is printed in the > driver logs whenever a new executor is launched, which pollutes the log: > {code} > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > ... (the same line repeated once per executor launch) > {code} > We can just log this once in the driver. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
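The fix amounts to a guard that emits the message a single time. The actual change is in the Scala ExecutorRunnable on the YARN side; the sketch below is only a minimal, language-agnostic illustration in Python, with a hypothetical function name and a module-level flag standing in for the real patch:

{code:python}
import logging
import threading

logger = logging.getLogger("ExecutorRunnable")

_shuffle_name_logged = False
_lock = threading.Lock()

def start_executor(shuffle_service_name: str) -> None:
    """Hypothetical per-executor launch hook on the driver."""
    global _shuffle_name_logged
    with _lock:
        if not _shuffle_name_logged:
            # Emitted once, instead of once per executor launch.
            logger.info("Initializing service data for shuffle service "
                        "using name '%s'", shuffle_service_name)
            _shuffle_name_logged = True
    # ... remainder of the per-executor setup continues here ...
{code}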
[jira] [Assigned] (SPARK-42817) Spark driver logs are filled with Initializing service data for shuffle service using name
[ https://issues.apache.org/jira/browse/SPARK-42817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42817: - Assignee: Chandni Singh > Spark driver logs are filled with Initializing service data for shuffle > service using name > -- > > Key: SPARK-42817 > URL: https://issues.apache.org/jira/browse/SPARK-42817 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > > > With SPARK-34828, we added the ability to make the shuffle service name > configurable and we added a log > [here|https://github.com/apache/spark/blob/8860f69455e5a722626194c4797b4b42cccd4510/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala#L118] > that logs the shuffle service name. However, this log line is printed in the > driver logs whenever a new executor is launched, which pollutes the log: > {code} > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for > shuffle service using name 'spark_shuffle_311' > ... (the same line repeated once per executor launch) > {code} > We can just log this once in the driver. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42830) Link skipped stages on Spark UI
Yian Liou created SPARK-42830: - Summary: Link skipped stages on Spark UI Key: SPARK-42830 URL: https://issues.apache.org/jira/browse/SPARK-42830 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.3.2 Reporter: Yian Liou Add a link to the skipped Spark stages so that it's easier to find the execution details on the UI. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42828) PySpark type hint returns Any for methods on GroupedData
[ https://issues.apache.org/jira/browse/SPARK-42828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42828: Assignee: Apache Spark > PySpark type hint returns Any for methods on GroupedData > > > Key: SPARK-42828 > URL: https://issues.apache.org/jira/browse/SPARK-42828 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Joe Wang >Assignee: Apache Spark >Priority: Minor > > Since upgrading to PySpark 3.3.x, type hints for > {code:java} > df.groupBy(...).count(){code} > are now returning Any instead of DataFrame, causing type inference issues > downstream. This used to be correctly typed prior to 3.3.x. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42828) PySpark type hint returns Any for methods on GroupedData
[ https://issues.apache.org/jira/browse/SPARK-42828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42828: Assignee: (was: Apache Spark) > PySpark type hint returns Any for methods on GroupedData > > > Key: SPARK-42828 > URL: https://issues.apache.org/jira/browse/SPARK-42828 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Joe Wang >Priority: Minor > > Since upgrading to PySpark 3.3.x, type hints for > {code:java} > df.groupBy(...).count(){code} > are now returning Any instead of DataFrame, causing type inference issues > downstream. This used to be correctly typed prior to 3.3.x. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42828) PySpark type hint returns Any for methods on GroupedData
[ https://issues.apache.org/jira/browse/SPARK-42828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701345#comment-17701345 ] Apache Spark commented on SPARK-42828: -- User 'j03wang' has created a pull request for this issue: https://github.com/apache/spark/pull/40460 > PySpark type hint returns Any for methods on GroupedData > > > Key: SPARK-42828 > URL: https://issues.apache.org/jira/browse/SPARK-42828 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Joe Wang >Priority: Minor > > Since upgrading to PySpark 3.3.x, type hints for > {code:java} > df.groupBy(...).count(){code} > are now returning Any instead of DataFrame, causing type inference issues > downstream. This used to be correctly typed prior to 3.3.x. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42829) Added Identifier to the cached RDD operator on the Stages page
Yian Liou created SPARK-42829: - Summary: Added Identifier to the cached RDD operator on the Stages page Key: SPARK-42829 URL: https://issues.apache.org/jira/browse/SPARK-42829 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.3.2 Reporter: Yian Liou On the Stages page in the Web UI, there is no way to distinguish which cached RDD is being executed in a particular stage. This Jira aims to add a repeat identifier to distinguish which cached RDD is being executed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42826) Add migration notes for update to supported pandas version.
[ https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42826: Summary: Add migration notes for update to supported pandas version. (was: Add migration note for API changes) > Add migration notes for update to supported pandas version. > --- > > Key: SPARK-42826 > URL: https://issues.apache.org/jira/browse/SPARK-42826 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We deprecated & removed some APIs in > https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas. > We should mention this in the migration guide. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42826) Add migration note for API changes
[ https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42826: Assignee: (was: Apache Spark) > Add migration note for API changes > -- > > Key: SPARK-42826 > URL: https://issues.apache.org/jira/browse/SPARK-42826 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We deprecated & removed some APIs in > https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas. > We should mention this in the migration guide. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42826) Add migration note for API changes
[ https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42826: Assignee: Apache Spark > Add migration note for API changes > -- > > Key: SPARK-42826 > URL: https://issues.apache.org/jira/browse/SPARK-42826 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > We deprecated & removed some APIs in > https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas. > We should mention this in the migration guide. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42826) Add migration note for API changes
[ https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701295#comment-17701295 ] Apache Spark commented on SPARK-42826: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/40459 > Add migration note for API changes > -- > > Key: SPARK-42826 > URL: https://issues.apache.org/jira/browse/SPARK-42826 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We deprecated & removed some APIs in > https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas. > We should mention this in the migration guide. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42828) PySpark type hint returns Any for methods on GroupedData
Joe Wang created SPARK-42828: Summary: PySpark type hint returns Any for methods on GroupedData Key: SPARK-42828 URL: https://issues.apache.org/jira/browse/SPARK-42828 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.3.2, 3.3.1, 3.3.0 Reporter: Joe Wang Since upgrading to PySpark 3.3.x, type hints for {code:java} df.groupBy(...).count(){code} are now returning Any instead of DataFrame, causing type inference issues downstream. This used to be correctly typed prior to 3.3.x. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
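For concreteness, a minimal reproduction of the regression, assuming a static type checker such as mypy or pyright is run over the snippet (the reveal_type notes describe inferred types, not runtime behavior):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2)], ["key", "value"])

counts = df.groupBy("key").count()
# With the 3.3.x annotations a checker infers:
#   reveal_type(counts)  ->  Any
# whereas before 3.3.x (and per the runtime behavior) it should be:
#   reveal_type(counts)  ->  DataFrame
# With Any, downstream mistakes such as a misspelled method on
# `counts` are no longer flagged by the checker.
counts.show()
{code}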
[jira] [Created] (SPARK-42827) Support `functions#array_prepend`
Yang Jie created SPARK-42827: Summary: Support `functions#array_prepend` Key: SPARK-42827 URL: https://issues.apache.org/jira/browse/SPARK-42827 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Yang Jie Wait for SPARK-41233 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42826) Add migration note for API changes
[ https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701277#comment-17701277 ] Haejoon Lee commented on SPARK-42826: - I'm working on it > Add migration note for API changes > -- > > Key: SPARK-42826 > URL: https://issues.apache.org/jira/browse/SPARK-42826 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We deprecated & removed some APIs in > https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas. > We should mention this in the migration guide. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42826) Add migration note for API changes
Haejoon Lee created SPARK-42826: --- Summary: Add migration note for API changes Key: SPARK-42826 URL: https://issues.apache.org/jira/browse/SPARK-42826 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We deprecated & removed some APIs in https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas. We should mention this in the migration guide. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42693) API Auditing
[ https://issues.apache.org/jira/browse/SPARK-42693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701240#comment-17701240 ] Dongjoon Hyun commented on SPARK-42693: --- Thank you! > API Auditing > > > Key: SPARK-42693 > URL: https://issues.apache.org/jira/browse/SPARK-42693 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Blocker > > Audit user-facing API of Spark 3.4. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.
[ https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42824: Assignee: Apache Spark > Provide a clear error message for unsupported JVM attributes. > - > > Key: SPARK-42824 > URL: https://issues.apache.org/jira/browse/SPARK-42824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > There are attributes, such as "_jvm", that were accessible in PySpark but > cannot be accessed in Spark Connect. We need to display appropriate error > messages for these cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.
[ https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42824: Assignee: (was: Apache Spark) > Provide a clear error message for unsupported JVM attributes. > - > > Key: SPARK-42824 > URL: https://issues.apache.org/jira/browse/SPARK-42824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > There are attributes, such as "_jvm", that were accessible in PySpark but > cannot be accessed in Spark Connect. We need to display appropriate error > messages for these cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.
[ https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701220#comment-17701220 ] Apache Spark commented on SPARK-42824: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/40458 > Provide a clear error message for unsupported JVM attributes. > - > > Key: SPARK-42824 > URL: https://issues.apache.org/jira/browse/SPARK-42824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > There are attributes, such as "_jvm", that were accessible in PySpark but > cannot be accessed in Spark Connect. We need to display appropriate error > messages for these cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42824) Show proper error message for unsupported JVM attribute.
[ https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42824: Summary: Show proper error message for unsupported JVM attribute. (was: Show error messages for attributes that cannot be accessed) > Show proper error message for unsupported JVM attribute. > > > Key: SPARK-42824 > URL: https://issues.apache.org/jira/browse/SPARK-42824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > There are attributes, such as "_jvm", that were accessible in PySpark but > cannot be accessed in Spark Connect. We need to display appropriate error > messages for these cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.
[ https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42824: Summary: Provide a clear error message for unsupported JVM attributes. (was: Show proper error message for unsupported JVM attribute.) > Provide a clear error message for unsupported JVM attributes. > - > > Key: SPARK-42824 > URL: https://issues.apache.org/jira/browse/SPARK-42824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > There are attributes, such as "_jvm", that were accessible in PySpark but > cannot be accessed in Spark Connect. We need to display appropriate error > messages for these cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42825) setParams() only sets explicitly named params. Is this intentional or a bug?
[ https://issues.apache.org/jira/browse/SPARK-42825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lucas Partridge updated SPARK-42825: Description: The Python signature/docstring of the setParams() method for the estimators and transformers under pyspark.ml imply that if you don't set any of the named params then they will be reset to their default values. Example from [https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.GaussianMixture.html#pyspark.ml.clustering.GaussianMixture.setParams] : {code:java} setParams(self, \*, featuresCol="features", predictionCol="prediction", k=2, probabilityCol="probability", tol=0.01, maxIter=100, seed=None, aggregationDepth=2, weightCol=None){code} In the extreme this would imply that if you called setParams() with no args then _all_ the params would be reset to their default values. But what actually happens is that _only_ the params passed in the call get changed; the values of any other params aren't affected. So if you call setParams() with no args then _no_ params get changed! So is this behavior by design? I guess it is from the name of the method. But it is counter-intuitive from its docstring. So if this behavior is intentional then perhaps the default docstring should make this explicit by saying something like: "Sets the named params. The values of other params are not affected." was: The Python signature/docstring of the setParams() method for the estimators and transformers under pyspark.ml imply that if you don't set any of the named params then they will be reset to their default values. Example from [https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.GaussianMixture.html#pyspark.ml.clustering.GaussianMixture.setParams] : {code:java} setParams(self, \*, featuresCol="features", predictionCol="prediction", k=2, probabilityCol="probability", tol=0.01, maxIter=100, seed=None, aggregationDepth=2, weightCol=None){code} In the extreme this would imply that if you called setParams() with no args then _all_ the params would be reset to their default values. But what actually happens is that _only_ the params passed in the call get changed; the values of any other params aren't affected. So if you call setParams() with no args then _no_ params get changed! So is this behavior by design? I guess it is from the name of the method. But it is counter-intuitive from its docstring. So if this behavior is intentional then perhaps the default docstring should make this explicit by saying something like: "Sets the named params. The values of other params are not affected." > setParams() only sets explicitly named params. Is this intentional or a bug? > > > Key: SPARK-42825 > URL: https://issues.apache.org/jira/browse/SPARK-42825 > Project: Spark > Issue Type: Question > Components: ML, PySpark >Affects Versions: 3.3.2 >Reporter: Lucas Partridge >Priority: Minor > > The Python signature/docstring of the setParams() method for the estimators > and transformers under pyspark.ml imply that if you don't set any of the > named params then they will be reset to their default values. 
> Example from > [https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.GaussianMixture.html#pyspark.ml.clustering.GaussianMixture.setParams] > : > {code:java} > setParams(self, \*, featuresCol="features", predictionCol="prediction", k=2, > probabilityCol="probability", tol=0.01, maxIter=100, seed=None, > aggregationDepth=2, weightCol=None){code} > In the extreme this would imply that if you called setParams() with no args > then _all_ the params would be reset to their default values. > But what actually happens is that _only_ the params passed in the call get > changed; the values of any other params aren't affected. So if you call > setParams() with no args then _no_ params get changed! > So is this behavior by design? I guess it is from the name of the method. But > it is counter-intuitive from its docstring. So if this behavior is > intentional then perhaps the default docstring should make this explicit by > saying something like: > "Sets the named params. The values of other params are not affected." -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42825) setParams() only sets explicitly named params. Is this intentional or a bug?
Lucas Partridge created SPARK-42825: --- Summary: setParams() only sets explicitly named params. Is this intentional or a bug? Key: SPARK-42825 URL: https://issues.apache.org/jira/browse/SPARK-42825 Project: Spark Issue Type: Question Components: ML, PySpark Affects Versions: 3.3.2 Reporter: Lucas Partridge The Python signature/docstring of the setParams() method for the estimators and transformers under pyspark.ml imply that if you don't set any of the named params then they will be reset to their default values. Example from [https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.GaussianMixture.html#pyspark.ml.clustering.GaussianMixture.setParams] : {code:java} setParams(self, \*, featuresCol="features", predictionCol="prediction", k=2, probabilityCol="probability", tol=0.01, maxIter=100, seed=None, aggregationDepth=2, weightCol=None){code} In the extreme this would imply that if you called setParams() with no args then _all_ the params would be reset to their default values. But what actually happens is that _only_ the params passed in the call get changed; the values of any other params aren't affected. So if you call setParams() with no args then _no_ params get changed! So is this behavior by design? I guess it is from the name of the method. But it is counter-intuitive from its docstring. So if this behavior is intentional then perhaps the default docstring should make this explicit by saying something like: "Sets the named params. The values of other params are not affected." -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
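A small sketch of the reported behavior, assuming an active SparkSession and using GaussianMixture from the linked docs; the printed values follow the observed only-named-params semantics rather than the reset-to-default reading of the docstring:

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.clustering import GaussianMixture

spark = SparkSession.builder.getOrCreate()

gm = GaussianMixture(k=5, maxIter=200)

# Only the explicitly named params change; nothing is reset to a default.
gm.setParams(k=3)
print(gm.getK())        # 3   -- updated
print(gm.getMaxIter())  # 200 -- not reset to the default of 100

# Calling setParams() with no arguments therefore changes nothing.
gm.setParams()
print(gm.getK())        # still 3
{code}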
[jira] [Assigned] (SPARK-41233) High-order function: array_prepend
[ https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41233: Assignee: (was: Apache Spark) > High-order function: array_prepend > -- > > Key: SPARK-41233 > URL: https://issues.apache.org/jira/browse/SPARK-41233 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html > 1, about the data type validation: > In Snowflake’s array_append, array_prepend and array_insert functions, the > element data type does not need to match the data type of the existing > elements in the array. > While in Spark, we want to leverage the same data type validation as > array_remove. > 2, about the NULL handling > Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in > different ways. > Existing functions array_contains, array_position and array_remove in > SparkSQL handle NULL in this way, if the input array or/and element is NULL, > returns NULL. However, this behavior should be broken. > We should implement the NULL handling in array_prepend in this way: > 2.1, if the array is NULL, returns NULL; > 2.2 if the array is not NULL, the element is NULL, append the NULL value into > the array -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41233) High-order function: array_prepend
[ https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41233: Assignee: Apache Spark > High-order function: array_prepend > -- > > Key: SPARK-41233 > URL: https://issues.apache.org/jira/browse/SPARK-41233 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html > 1, about the data type validation: > In Snowflake’s array_append, array_prepend and array_insert functions, the > element data type does not need to match the data type of the existing > elements in the array. > While in Spark, we want to leverage the same data type validation as > array_remove. > 2, about the NULL handling > Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in > different ways. > Existing functions array_contains, array_position and array_remove in > SparkSQL handle NULL in this way, if the input array or/and element is NULL, > returns NULL. However, this behavior should be broken. > We should implement the NULL handling in array_prepend in this way: > 2.1, if the array is NULL, returns NULL; > 2.2 if the array is not NULL, the element is NULL, append the NULL value into > the array -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-41233) High-order function: array_prepend
[ https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-41233: -- > High-order function: array_prepend > -- > > Key: SPARK-41233 > URL: https://issues.apache.org/jira/browse/SPARK-41233 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html > 1, about the data type validation: > In Snowflake’s array_append, array_prepend and array_insert functions, the > element data type does not need to match the data type of the existing > elements in the array. > While in Spark, we want to leverage the same data type validation as > array_remove. > 2, about the NULL handling > Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in > different ways. > Existing functions array_contains, array_position and array_remove in > SparkSQL handle NULL in this way, if the input array or/and element is NULL, > returns NULL. However, this behavior should be broken. > We should implement the NULL handling in array_prepend in this way: > 2.1, if the array is NULL, returns NULL; > 2.2 if the array is not NULL, the element is NULL, append the NULL value into > the array -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41233) High-order function: array_prepend
[ https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701105#comment-17701105 ] Hyukjin Kwon commented on SPARK-41233: -- Reverted at https://github.com/apache/spark/commit/baf90206d04738e63ea71f63d86668a7dc7c8f9a > High-order function: array_prepend > -- > > Key: SPARK-41233 > URL: https://issues.apache.org/jira/browse/SPARK-41233 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html > 1, about the data type validation: > In Snowflake’s array_append, array_prepend and array_insert functions, the > element data type does not need to match the data type of the existing > elements in the array. > While in Spark, we want to leverage the same data type validation as > array_remove. > 2, about the NULL handling > Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in > different ways. > Existing functions array_contains, array_position and array_remove in > SparkSQL handle NULL in this way, if the input array or/and element is NULL, > returns NULL. However, this behavior should be broken. > We should implement the NULL handling in array_prepend in this way: > 2.1, if the array is NULL, returns NULL; > 2.2 if the array is not NULL, the element is NULL, append the NULL value into > the array -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41233) High-order function: array_prepend
[ https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41233: - Fix Version/s: (was: 3.5.0) > High-order function: array_prepend > -- > > Key: SPARK-41233 > URL: https://issues.apache.org/jira/browse/SPARK-41233 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html > 1, about the data type validation: > In Snowflake’s array_append, array_prepend and array_insert functions, the > element data type does not need to match the data type of the existing > elements in the array. > While in Spark, we want to leverage the same data type validation as > array_remove. > 2, about the NULL handling > Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in > different ways. > Existing functions array_contains, array_position and array_remove in > SparkSQL handle NULL in this way, if the input array or/and element is NULL, > returns NULL. However, this behavior should be broken. > We should implement the NULL handling in array_prepend in this way: > 2.1, if the array is NULL, returns NULL; > 2.2 if the array is not NULL, the element is NULL, append the NULL value into > the array -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
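For reference, NULL-handling rules 2.1 and 2.2 above translate to the following expected results. This is a sketch assuming a build in which array_prepend is available; as the revert above shows, availability depends on the snapshot:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rule 2.1: a NULL array yields NULL.
spark.sql("SELECT array_prepend(CAST(NULL AS ARRAY<INT>), 1)").show()
# expected: NULL

# Rule 2.2: a NULL element is still prepended into a non-NULL array.
spark.sql("SELECT array_prepend(array(1, 2), NULL)").show()
# expected: [NULL, 1, 2]
{code}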
[jira] [Commented] (SPARK-42824) Show error messages for attributes that cannot be accessed
[ https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701104#comment-17701104 ] Haejoon Lee commented on SPARK-42824: - I'm working on it > Show error messages for attributes that cannot be accessed > -- > > Key: SPARK-42824 > URL: https://issues.apache.org/jira/browse/SPARK-42824 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > There are attributes, such as "_jvm", that were accessible in PySpark but > cannot be accessed in Spark Connect. We need to display appropriate error > messages for these cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42824) Show error messages for attributes that cannot be accessed
Haejoon Lee created SPARK-42824: --- Summary: Show error messages for attributes that cannot be accessed Key: SPARK-42824 URL: https://issues.apache.org/jira/browse/SPARK-42824 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Haejoon Lee There are attributes, such as "_jvm", that were accessible in PySpark but cannot be accessed in Spark Connect. We need to display appropriate error messages for these cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
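One shape an error guard for these attributes could take, as a sketch only; the class name and attribute set below are illustrative, not the actual patch:

{code:python}
class SparkConnectSessionStub:
    """Toy stand-in for a Connect session; names are illustrative only."""

    _UNSUPPORTED_JVM_ATTRS = {"_jvm", "_jsc", "_jconf"}

    def __getattr__(self, name: str):
        if name in self._UNSUPPORTED_JVM_ATTRS:
            raise AttributeError(
                f"'{name}' is not supported in Spark Connect because the "
                "client runs without a JVM in the local process."
            )
        raise AttributeError(name)

session = SparkConnectSessionStub()
try:
    session._jvm
except AttributeError as e:
    print(e)  # explicit, actionable message instead of a bare failure
{code}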
[jira] [Resolved] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming
[ https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-42819. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40455 [https://github.com/apache/spark/pull/40455] > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > --- > > Key: SPARK-42819 > URL: https://issues.apache.org/jira/browse/SPARK-42819 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Fix For: 3.5.0 > > > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > > We need these settings in order to control memory tuning for RocksDB. We > already expose settings for blockCache size. However, these 2 settings are > missing. This change proposes to add them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
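In configuration terms, the new knobs would sit next to the existing block cache setting. A sketch of a session setup, where the two new conf names are assumptions inferred from the ticket and the existing spark.sql.streaming.stateStore.rocksdb.* naming pattern:

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Use the RocksDB-backed state store for structured streaming.
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )
    # Existing memory knob referenced in the ticket:
    .config("spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB", "64")
    # The two new knobs (names assumed to follow the same pattern):
    .config("spark.sql.streaming.stateStore.rocksdb.maxWriteBufferNumber", "3")
    .config("spark.sql.streaming.stateStore.rocksdb.writeBufferSizeMB", "64")
    .getOrCreate()
)
{code}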
[jira] [Assigned] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming
[ https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-42819: Assignee: Anish Shrigondekar > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > --- > > Key: SPARK-42819 > URL: https://issues.apache.org/jira/browse/SPARK-42819 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > > We need these settings in order to control memory tuning for RocksDB. We > already expose settings for blockCache size. However, these 2 settings are > missing. This change proposes to add them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization
[ https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42823: Assignee: Apache Spark > spark-sql shell supports multipart namespaces for initialization > > > Key: SPARK-42823 > URL: https://issues.apache.org/jira/browse/SPARK-42823 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization
[ https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42823: Assignee: (was: Apache Spark) > spark-sql shell supports multipart namespaces for initialization > > > Key: SPARK-42823 > URL: https://issues.apache.org/jira/browse/SPARK-42823 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization
[ https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701078#comment-17701078 ] Apache Spark commented on SPARK-42823: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/40457 > spark-sql shell supports multipart namespaces for initialization > > > Key: SPARK-42823 > URL: https://issues.apache.org/jira/browse/SPARK-42823 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41233) High-order function: array_prepend
[ https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41233: - Assignee: (was: Ruifeng Zheng) > High-order function: array_prepend > -- > > Key: SPARK-41233 > URL: https://issues.apache.org/jira/browse/SPARK-41233 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > Fix For: 3.5.0 > > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html > 1, about the data type validation: > In Snowflake’s array_append, array_prepend and array_insert functions, the > element data type does not need to match the data type of the existing > elements in the array. > While in Spark, we want to leverage the same data type validation as > array_remove. > 2, about the NULL handling > Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in > different ways. > Existing functions array_contains, array_position and array_remove in > SparkSQL handle NULL in this way, if the input array or/and element is NULL, > returns NULL. However, this behavior should be broken. > We should implement the NULL handling in array_prepend in this way: > 2.1, if the array is NULL, returns NULL; > 2.2 if the array is not NULL, the element is NULL, append the NULL value into > the array -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41233) High-order function: array_prepend
[ https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41233: - Assignee: Ruifeng Zheng > High-order function: array_prepend > -- > > Key: SPARK-41233 > URL: https://issues.apache.org/jira/browse/SPARK-41233 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.5.0 > > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html > 1, about the data type validation: > In Snowflake’s array_append, array_prepend and array_insert functions, the > element data type does not need to match the data type of the existing > elements in the array. > While in Spark, we want to leverage the same data type validation as > array_remove. > 2, about the NULL handling > Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in > different ways. > Existing functions array_contains, array_position and array_remove in > SparkSQL handle NULL in this way, if the input array or/and element is NULL, > returns NULL. However, this behavior should be broken. > We should implement the NULL handling in array_prepend in this way: > 2.1, if the array is NULL, returns NULL; > 2.2 if the array is not NULL, the element is NULL, append the NULL value into > the array -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41233) High-order function: array_prepend
[ https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41233. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 38947 [https://github.com/apache/spark/pull/38947] > High-order function: array_prepend > -- > > Key: SPARK-41233 > URL: https://issues.apache.org/jira/browse/SPARK-41233 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > Fix For: 3.5.0 > > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html > 1, about the data type validation: > In Snowflake’s array_append, array_prepend and array_insert functions, the > element data type does not need to match the data type of the existing > elements in the array. > While in Spark, we want to leverage the same data type validation as > array_remove. > 2, about the NULL handling > Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in > different ways. > Existing functions array_contains, array_position and array_remove in > SparkSQL handle NULL in this way, if the input array or/and element is NULL, > returns NULL. However, this behavior should be broken. > We should implement the NULL handling in array_prepend in this way: > 2.1, if the array is NULL, returns NULL; > 2.2 if the array is not NULL, the element is NULL, append the NULL value into > the array -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42720) Refactor the withSequenceColumn
[ https://issues.apache.org/jira/browse/SPARK-42720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701051#comment-17701051 ] Apache Spark commented on SPARK-42720: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/40456 > Refactor the withSequenceColumn > --- > > Key: SPARK-42720 > URL: https://issues.apache.org/jira/browse/SPARK-42720 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42720) Refactor the withSequenceColumn
[ https://issues.apache.org/jira/browse/SPARK-42720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42720: Assignee: (was: Apache Spark) > Refactor the withSequenceColumn > --- > > Key: SPARK-42720 > URL: https://issues.apache.org/jira/browse/SPARK-42720 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42720) Refactor the withSequenceColumn
[ https://issues.apache.org/jira/browse/SPARK-42720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42720: Assignee: Apache Spark > Refactor the withSequenceColumn > --- > > Key: SPARK-42720 > URL: https://issues.apache.org/jira/browse/SPARK-42720 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42720) Refactor the withSequenceColumn
[ https://issues.apache.org/jira/browse/SPARK-42720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701050#comment-17701050 ] Apache Spark commented on SPARK-42720: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/40456 > Refactor the withSequenceColumn > --- > > Key: SPARK-42720 > URL: https://issues.apache.org/jira/browse/SPARK-42720 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization
Kent Yao created SPARK-42823: Summary: spark-sql shell supports multipart namespaces for initialization Key: SPARK-42823 URL: https://issues.apache.org/jira/browse/SPARK-42823 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42804) when target table format is textfile using `insert into select` will got error
[ https://issues.apache.org/jira/browse/SPARK-42804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701010#comment-17701010 ] kevinshin commented on SPARK-42804: --- @[~yumwang] below is my step by step reproduce this issue : hive version is HDP 3.1.0.3.1.4.0-315 [bigtop@hdpdev243 spark3]$ {color:#4c9aff}cat conf/spark-defaults.conf{color} # Generated by Apache Ambari. Tue Apr 27 11:19:24 2021 spark.sql.hive.convertMetastoreOrc true spark.sql.orc.filterPushdown true spark.sql.orc.impl native spark.sql.legacy.createHiveTableByDefault false [bigtop@hdpdev243 spark3]$ {color:#4c9aff}bin/spark-sql{color} 23/03/16 15:03:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.materializedview.rewriting.incremental does not exist 23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.metastore.event.db.notification.api.auth does not exist 23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.server2.webui.cors.allowed.headers does not exist 23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.hook.proto.base-directory does not exist 23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.load.data.owner does not exist 23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.service.metrics.codahale.reporter.classes does not exist 23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.strict.managed.tables does not exist 23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.create.as.insert.only does not exist 23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.metastore.db.type does not exist 23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.tez.cartesian-product.enabled does not exist 23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.metastore.warehouse.external.dir does not exist 23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.heapsize does not exist 23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.server2.webui.enable.cors does not exist 23/03/16 15:03:29 WARN HiveClientImpl: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic 23/03/16 15:03:30 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. Spark master: local[*], Application Id: local-1678950211606 spark-sql> select version(); 3.2.3 b53c341e0fefbb33d115ab630369a18765b7763d Time taken: 3.956 seconds, Fetched 1 row(s) spark-sql> {color:#4c9aff}create table test.tex_t1(name string, address string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;{color} 23/03/16 15:03:51 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. Time taken: 0.753 seconds spark-sql> {color:#4c9aff}create table test.tex_t2(name string, address string);{color} Time taken: 0.326 seconds spark-sql> {color:#4c9aff}insert into test.tex_t2 select 'a', 'b';{color} Time taken: 2.011 seconds spark-sql> {color:#4c9aff}insert into test.tex_t1 select 'a', 'b';{color} 23/03/16 15:04:13 WARN HdfsUtils: Unable to inherit permissions for file hdfs://nsdev/warehouse/tablespace/managed/hive/test.db/tex_t1/part-0-57c15f7a-7462-4101-af5d-9f4a22cf69df-c000 from file hdfs://nsdev/warehouse/tablespace/man aged/hive/test.db/tex_t1 23/03/16 15:04:13 WARN RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect (1 of 24) after 5s. 
fireListenerEvent org.apache.thrift.transport.TTransportException at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:425) at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:321) at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:225) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_fire_listener_event(ThriftHiveMetastore.java:4977) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.fire_listener_event(ThriftHiveMetastore.java:4964) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.fireListenerEvent(HiveMetaStoreClient.java:2296) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) at com.sun.proxy.$Proxy21.fireListenerEvent(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[jira] [Assigned] (SPARK-42800) Implement ml function {array_to_vector, vector_to_array}
[ https://issues.apache.org/jira/browse/SPARK-42800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-42800: - Assignee: Ruifeng Zheng > Implement ml function {array_to_vector, vector_to_array} > > > Key: SPARK-42800 > URL: https://issues.apache.org/jira/browse/SPARK-42800 > Project: Spark > Issue Type: Sub-task > Components: Connect, ML, PySpark >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42804) when target table format is textfile using `insert into select` will got error
[ https://issues.apache.org/jira/browse/SPARK-42804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701006#comment-17701006 ] Yuming Wang commented on SPARK-42804: - I can't reproduce it. Did you set any configs? > when target table format is textfile using `insert into select` will got error > -- > > Key: SPARK-42804 > URL: https://issues.apache.org/jira/browse/SPARK-42804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 >Reporter: kevinshin >Priority: Major > > *create* *table* test.tex_t1(name string, address string) *ROW* FORMAT > DELIMITED FIELDS TERMINATED *BY* ',' STORED *AS* TEXTFILE; > *insert* *into* test.tex_t1 *select* 'a', 'b'; > this produces a lot of messages like: > WARN RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to > reconnect (24 of 24) after 5s. fireListenerEvent > org.apache.thrift.transport.TTransportException > > But the data is actually written to the table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42800) Implement ml function {array_to_vector, vector_to_array}
[ https://issues.apache.org/jira/browse/SPARK-42800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-42800. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40432 [https://github.com/apache/spark/pull/40432] > Implement ml function {array_to_vector, vector_to_array} > > > Key: SPARK-42800 > URL: https://issues.apache.org/jira/browse/SPARK-42800 > Project: Spark > Issue Type: Sub-task > Components: Connect, ML, PySpark >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
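For context, the two functions being exposed through Connect mirror the existing org.apache.spark.ml.functions API. A minimal sketch of the classic (non-Connect) usage they are being aligned with, assuming a spark-shell session:
{code:java}
import spark.implicits._
import org.apache.spark.ml.functions.{array_to_vector, vector_to_array}
import org.apache.spark.ml.linalg.Vectors

val df = Seq((Vectors.dense(1.0, 2.0), Array(3.0, 4.0))).toDF("vec", "arr")

// vector_to_array: ML Vector column -> array<double> column
// array_to_vector: array<numeric> column -> ML Vector column
df.select(
  vector_to_array($"vec").as("vec_as_arr"),
  array_to_vector($"arr").as("arr_as_vec")
).show(false)
{code}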
[jira] [Comment Edited] (SPARK-42804) when the target table format is textfile, using `insert into select` will get an error
[ https://issues.apache.org/jira/browse/SPARK-42804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700926#comment-17700926 ] kevinshin edited comment on SPARK-42804 at 3/16/23 6:47 AM: ORC and Parquet tables don't have this problem. Connecting to Hive directly with beeline also shows no problem. was (Author: JIRAUSER281772): orc and parquet table won't have this problem. > when the target table format is textfile, using `insert into select` will get an error > -- > > Key: SPARK-42804 > URL: https://issues.apache.org/jira/browse/SPARK-42804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 >Reporter: kevinshin >Priority: Major > > CREATE TABLE test.tex_t1(name string, address string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; > INSERT INTO test.tex_t1 SELECT 'a', 'b'; > will log a lot of messages like: > WARN RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to > reconnect (24 of 24) after 5s. fireListenerEvent > org.apache.thrift.transport.TTransportException > > But the data is actually written to the table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42693) API Auditing
[ https://issues.apache.org/jira/browse/SPARK-42693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701003#comment-17701003 ] Xinrong Meng commented on SPARK-42693: -- Thanks [~dongjoon], I've just started on it and will keep sharing progress. > API Auditing > > > Key: SPARK-42693 > URL: https://issues.apache.org/jira/browse/SPARK-42693 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Blocker > > Audit user-facing API of Spark 3.4. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming
[ https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700994#comment-17700994 ] Apache Spark commented on SPARK-42819: -- User 'anishshri-db' has created a pull request for this issue: https://github.com/apache/spark/pull/40455 > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > --- > > Key: SPARK-42819 > URL: https://issues.apache.org/jira/browse/SPARK-42819 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Priority: Major > > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > > We need these settings in order to control memory tuning for RocksDB. We > already expose settings for blockCache size. However, these 2 settings are > missing. This change proposes to add them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
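To illustrate the kind of knob this adds: the RocksDB state store is already tuned through spark.sql.streaming.stateStore.rocksdb.* settings such as blockCacheSizeMB, so the sketch below follows that naming convention. The two new key names are assumptions inferred from the issue title, not confirmed against the PR.
{code:java}
// Hedged sketch of RocksDB state store tuning for a streaming query.
// providerClass and blockCacheSizeMB exist today; the maxWriteBufferNumber
// and writeBufferSizeMB key names are assumed from the issue title.
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB", "64")
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.maxWriteBufferNumber", "4")
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.writeBufferSizeMB", "64")
{code}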
[jira] [Assigned] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming
[ https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42819: Assignee: Apache Spark > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > --- > > Key: SPARK-42819 > URL: https://issues.apache.org/jira/browse/SPARK-42819 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Assignee: Apache Spark >Priority: Major > > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > > We need these settings in order to control memory tuning for RocksDB. We > already expose settings for blockCache size. However, these 2 settings are > missing. This change proposes to add them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming
[ https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42819: Assignee: (was: Apache Spark) > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > --- > > Key: SPARK-42819 > URL: https://issues.apache.org/jira/browse/SPARK-42819 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Priority: Major > > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > > We need these settings in order to control memory tuning for RocksDB. We > already expose settings for blockCache size. However, these 2 settings are > missing. This change proposes to add them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming
[ https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700993#comment-17700993 ] Apache Spark commented on SPARK-42819: -- User 'anishshri-db' has created a pull request for this issue: https://github.com/apache/spark/pull/40455 > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > --- > > Key: SPARK-42819 > URL: https://issues.apache.org/jira/browse/SPARK-42819 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Priority: Major > > Add support for setting max_write_buffer_number and write_buffer_size for > RocksDB used in streaming > > We need these settings in order to control memory tuning for RocksDB. We > already expose settings for blockCache size. However, these 2 settings are > missing. This change proposes to add them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org