[jira] [Updated] (SPARK-46303) Remove unused code in `pyspark.pandas.tests.series.*`
[ https://issues.apache.org/jira/browse/SPARK-46303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46303:
Labels: pull-request-available (was: )

> Remove unused code in `pyspark.pandas.tests.series.*`
>
> Key: SPARK-46303
> URL: https://issues.apache.org/jira/browse/SPARK-46303
> Project: Spark
> Issue Type: Test
> Components: PS, Tests
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Minor
> Labels: pull-request-available

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46303) Remove unused code in `pyspark.pandas.tests.series.*`
Ruifeng Zheng created SPARK-46303:

Summary: Remove unused code in `pyspark.pandas.tests.series.*`
Key: SPARK-46303
URL: https://issues.apache.org/jira/browse/SPARK-46303
Project: Spark
Issue Type: Test
Components: PS, Tests
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng
[jira] [Updated] (SPARK-46302) Fix maven daily testing
[ https://issues.apache.org/jira/browse/SPARK-46302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46302:
Labels: pull-request-available (was: )

> Fix maven daily testing
>
> Key: SPARK-46302
> URL: https://issues.apache.org/jira/browse/SPARK-46302
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
[jira] [Created] (SPARK-46302) Fix maven daily testing
BingKun Pan created SPARK-46302:

Summary: Fix maven daily testing
Key: SPARK-46302
URL: https://issues.apache.org/jira/browse/SPARK-46302
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 4.0.0
Reporter: BingKun Pan
[jira] [Resolved] (SPARK-46300) Test missing test coverage for Column (pyspark.sql.column)
[ https://issues.apache.org/jira/browse/SPARK-46300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-46300.
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 44228
[https://github.com/apache/spark/pull/44228]

> Test missing test coverage for Column (pyspark.sql.column)
>
> Key: SPARK-46300
> URL: https://issues.apache.org/jira/browse/SPARK-46300
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/sql/column.py
[jira] [Assigned] (SPARK-46300) Test missing test coverage for Column (pyspark.sql.column)
[ https://issues.apache.org/jira/browse/SPARK-46300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-46300:
Assignee: Hyukjin Kwon

> Test missing test coverage for Column (pyspark.sql.column)
>
> Key: SPARK-46300
> URL: https://issues.apache.org/jira/browse/SPARK-46300
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
>
> https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/sql/column.py
[jira] [Assigned] (SPARK-46298) Test catalog error classes (pyspark.sql.catalog)
[ https://issues.apache.org/jira/browse/SPARK-46298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-46298:
Assignee: Hyukjin Kwon

> Test catalog error classes (pyspark.sql.catalog)
>
> Key: SPARK-46298
> URL: https://issues.apache.org/jira/browse/SPARK-46298
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
>
> See https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/sql/catalog.py
[jira] [Resolved] (SPARK-46298) Test catalog error classes (pyspark.sql.catalog)
[ https://issues.apache.org/jira/browse/SPARK-46298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-46298.
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 44226
[https://github.com/apache/spark/pull/44226]

> Test catalog error classes (pyspark.sql.catalog)
>
> Key: SPARK-46298
> URL: https://issues.apache.org/jira/browse/SPARK-46298
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> See https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/sql/catalog.py
[jira] [Updated] (SPARK-46058) [CORE] Add separate flag for privateKeyPassword
[ https://issues.apache.org/jira/browse/SPARK-46058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan updated SPARK-46058:
Labels: pull-request-available (was: pull-request-available releasenotes)

> [CORE] Add separate flag for privateKeyPassword
>
> Key: SPARK-46058
> URL: https://issues.apache.org/jira/browse/SPARK-46058
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Hasnain Lakhani
> Assignee: Hasnain Lakhani
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Right now, with config inheritance, we support:
> * JKS with password A, PEM with password B
> * JKS with no password, PEM with password A
> * JKS and PEM with no password
>
> But we do not support the case where the JKS has a password and the PEM does not: if keyPassword is set we will attempt to use it for both, and `spark.ssl.rpc.keyPassword` cannot be set to null. So let's make it a separate flag as the easiest workaround.
>
> This was noticed while migrating some existing deployments to the RPC SSL support, where we use openssl support for RPC with a key that has no password.
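The unsupported combination described in the report can be sketched as a config fragment. This is illustrative only: apart from `spark.ssl.rpc.keyPassword`, which the report names, the property names below are assumptions inferred from the issue title, not confirmed Spark configuration keys.

```properties
# A passworded JKS keystore alongside a password-less PEM private key.
# With config inheritance, the single keyPassword is also applied when
# opening the PEM key, and it cannot be set to null -- so this fails today.
spark.ssl.rpc.keyStorePassword=jks-secret
spark.ssl.rpc.keyPassword=jks-secret

# Proposed direction (hypothetical flag name matching the issue title):
# a dedicated password setting for the PEM key, where leaving it unset
# means "no password" regardless of keyPassword.
spark.ssl.rpc.privateKeyPassword=
```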
[jira] [Updated] (SPARK-46058) [CORE] Add separate flag for privateKeyPassword
[ https://issues.apache.org/jira/browse/SPARK-46058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan updated SPARK-46058:
Labels: pull-request-available releasenotes (was: pull-request-available)

> [CORE] Add separate flag for privateKeyPassword
>
> Key: SPARK-46058
> URL: https://issues.apache.org/jira/browse/SPARK-46058
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Hasnain Lakhani
> Assignee: Hasnain Lakhani
> Priority: Major
> Labels: pull-request-available, releasenotes
> Fix For: 4.0.0
>
> Right now, with config inheritance, we support:
> * JKS with password A, PEM with password B
> * JKS with no password, PEM with password A
> * JKS and PEM with no password
>
> But we do not support the case where the JKS has a password and the PEM does not: if keyPassword is set we will attempt to use it for both, and `spark.ssl.rpc.keyPassword` cannot be set to null. So let's make it a separate flag as the easiest workaround.
>
> This was noticed while migrating some existing deployments to the RPC SSL support, where we use openssl support for RPC with a key that has no password.
[jira] [Assigned] (SPARK-46301) Support `spark.worker.(initial|max)RegistrationRetries`
[ https://issues.apache.org/jira/browse/SPARK-46301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-46301:
Assignee: Dongjoon Hyun

> Support `spark.worker.(initial|max)RegistrationRetries`
>
> Key: SPARK-46301
> URL: https://issues.apache.org/jira/browse/SPARK-46301
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Resolved] (SPARK-46296) Test captured errors (pyspark.errors.exceptions.captured)
[ https://issues.apache.org/jira/browse/SPARK-46296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-46296.
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 44224
[https://github.com/apache/spark/pull/44224]

> Test captured errors (pyspark.errors.exceptions.captured)
>
> Key: SPARK-46296
> URL: https://issues.apache.org/jira/browse/SPARK-46296
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark, Tests
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/errors/exceptions/captured.py
[jira] [Created] (SPARK-46301) Support `spark.worker.(initial|max)RegistrationRetries`
Dongjoon Hyun created SPARK-46301:

Summary: Support `spark.worker.(initial|max)RegistrationRetries`
Key: SPARK-46301
URL: https://issues.apache.org/jira/browse/SPARK-46301
Project: Spark
Issue Type: Sub-task
Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-46301) Support `spark.worker.(initial|max)RegistrationRetries`
[ https://issues.apache.org/jira/browse/SPARK-46301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46301:
Labels: pull-request-available (was: )

> Support `spark.worker.(initial|max)RegistrationRetries`
>
> Key: SPARK-46301
> URL: https://issues.apache.org/jira/browse/SPARK-46301
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Assigned] (SPARK-46296) Test captured errors (pyspark.errors.exceptions.captured)
[ https://issues.apache.org/jira/browse/SPARK-46296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-46296:
Assignee: Hyukjin Kwon

> Test captured errors (pyspark.errors.exceptions.captured)
>
> Key: SPARK-46296
> URL: https://issues.apache.org/jira/browse/SPARK-46296
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark, Tests
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
>
> https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/errors/exceptions/captured.py
[jira] [Updated] (SPARK-46300) Test missing test coverage for Column (pyspark.sql.column)
[ https://issues.apache.org/jira/browse/SPARK-46300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46300:
Labels: pull-request-available (was: )

> Test missing test coverage for Column (pyspark.sql.column)
>
> Key: SPARK-46300
> URL: https://issues.apache.org/jira/browse/SPARK-46300
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
>
> https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/sql/column.py
[jira] [Created] (SPARK-46300) Test missing test coverage for Column (pyspark.sql.column)
Hyukjin Kwon created SPARK-46300:

Summary: Test missing test coverage for Column (pyspark.sql.column)
Key: SPARK-46300
URL: https://issues.apache.org/jira/browse/SPARK-46300
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon

https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/sql/column.py
[jira] [Updated] (SPARK-46299) Make `spark.deploy.recovery*` documentation up-to-date
[ https://issues.apache.org/jira/browse/SPARK-46299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-46299:
Summary: Make `spark.deploy.recovery*` documentation up-to-date (was: Make `spark.deploy.recovery*` up-to-date)

> Make `spark.deploy.recovery*` documentation up-to-date
>
> Key: SPARK-46299
> URL: https://issues.apache.org/jira/browse/SPARK-46299
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-45580) Subquery changes the output schema of the outer query
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45580:
Fix Version/s: 3.3.4

> Subquery changes the output schema of the outer query
>
> Key: SPARK-45580
> URL: https://issues.apache.org/jira/browse/SPARK-45580
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.3, 3.4.1, 3.5.0
> Reporter: Bruce Robbins
> Assignee: Bruce Robbins
> Priority: Blocker
> Labels: correctness, pull-request-available
> Fix For: 4.0.0, 3.5.1, 3.3.4, 3.4.3
>
> A query can have an incorrect output schema because of a subquery.
> Assume this data:
> {noformat}
> create or replace temp view t1(a) as values (1), (2), (3), (7);
> create or replace temp view t2(c1) as values (1), (2), (3);
> create or replace temp view t3(col1) as values (3), (9);
> cache table t1;
> cache table t2;
> cache table t3;
> {noformat}
> When run in {{spark-sql}}, the following query has a superfluous boolean column:
> {noformat}
> select *
> from t1
> where exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> 1  false
> 2  false
> 3  true
> {noformat}
> The result should be:
> {noformat}
> 1
> 2
> 3
> {noformat}
> When executed via the {{Dataset}} API, you don't see the incorrect result, because the Dataset API truncates the right side of the rows based on the analyzed plan's schema (it's the optimized plan's schema that goes wrong).
> However, even with the {{Dataset}} API, this query goes wrong:
> {noformat}
> select (
>   select *
>   from t1
>   where exists (
>     select c1
>     from t2
>     where a = c1
>     or a in (select col1 from t3)
>   )
>   limit 1
> )
> from range(1);
> java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
>   at scala.Predef$.assert(Predef.scala:279)
>   at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
>   ...
> {noformat}
> Other queries that have the wrong schema:
> {noformat}
> select *
> from t1
> where a in (
>   select c1
>   from t2
>   where a in (select col1 from t3)
> );
> {noformat}
> and
> {noformat}
> select *
> from t1
> where not exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> {noformat}
[jira] [Updated] (SPARK-45580) Subquery changes the output schema of the outer query
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45580:
Fix Version/s: 3.4.3

> Subquery changes the output schema of the outer query
>
> Key: SPARK-45580
> URL: https://issues.apache.org/jira/browse/SPARK-45580
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.3, 3.4.1, 3.5.0
> Reporter: Bruce Robbins
> Assignee: Bruce Robbins
> Priority: Blocker
> Labels: correctness, pull-request-available
> Fix For: 4.0.0, 3.5.1, 3.4.3
>
> A query can have an incorrect output schema because of a subquery.
> Assume this data:
> {noformat}
> create or replace temp view t1(a) as values (1), (2), (3), (7);
> create or replace temp view t2(c1) as values (1), (2), (3);
> create or replace temp view t3(col1) as values (3), (9);
> cache table t1;
> cache table t2;
> cache table t3;
> {noformat}
> When run in {{spark-sql}}, the following query has a superfluous boolean column:
> {noformat}
> select *
> from t1
> where exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> 1  false
> 2  false
> 3  true
> {noformat}
> The result should be:
> {noformat}
> 1
> 2
> 3
> {noformat}
> When executed via the {{Dataset}} API, you don't see the incorrect result, because the Dataset API truncates the right side of the rows based on the analyzed plan's schema (it's the optimized plan's schema that goes wrong).
> However, even with the {{Dataset}} API, this query goes wrong:
> {noformat}
> select (
>   select *
>   from t1
>   where exists (
>     select c1
>     from t2
>     where a = c1
>     or a in (select col1 from t3)
>   )
>   limit 1
> )
> from range(1);
> java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
>   at scala.Predef$.assert(Predef.scala:279)
>   at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
>   ...
> {noformat}
> Other queries that have the wrong schema:
> {noformat}
> select *
> from t1
> where a in (
>   select c1
>   from t2
>   where a in (select col1 from t3)
> );
> {noformat}
> and
> {noformat}
> select *
> from t1
> where not exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> {noformat}
[jira] [Resolved] (SPARK-46299) Make `spark.deploy.recovery*` up-to-date
[ https://issues.apache.org/jira/browse/SPARK-46299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-46299.
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 44227
[https://github.com/apache/spark/pull/44227]

> Make `spark.deploy.recovery*` up-to-date
>
> Key: SPARK-46299
> URL: https://issues.apache.org/jira/browse/SPARK-46299
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Created] (SPARK-46299) Make `spark.deploy.recovery*` up-to-date
Dongjoon Hyun created SPARK-46299:

Summary: Make `spark.deploy.recovery*` up-to-date
Key: SPARK-46299
URL: https://issues.apache.org/jira/browse/SPARK-46299
Project: Spark
Issue Type: Sub-task
Components: Documentation
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-46299) Make `spark.deploy.recovery*` up-to-date
[ https://issues.apache.org/jira/browse/SPARK-46299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46299:
Labels: pull-request-available (was: )

> Make `spark.deploy.recovery*` up-to-date
>
> Key: SPARK-46299
> URL: https://issues.apache.org/jira/browse/SPARK-46299
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (SPARK-46298) Test catalog error classes (pyspark.sql.catalog)
[ https://issues.apache.org/jira/browse/SPARK-46298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46298:
Labels: pull-request-available (was: )

> Test catalog error classes (pyspark.sql.catalog)
>
> Key: SPARK-46298
> URL: https://issues.apache.org/jira/browse/SPARK-46298
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
>
> See https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/sql/catalog.py
[jira] [Assigned] (SPARK-46297) Exclude generated files from the code coverage report
[ https://issues.apache.org/jira/browse/SPARK-46297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-46297:
Assignee: Hyukjin Kwon

> Exclude generated files from the code coverage report
>
> Key: SPARK-46297
> URL: https://issues.apache.org/jira/browse/SPARK-46297
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra, PySpark
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
>
> We should exclude https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/tree/python/pyspark/sql/connect/proto
[jira] [Resolved] (SPARK-46297) Exclude generated files from the code coverage report
[ https://issues.apache.org/jira/browse/SPARK-46297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-46297.
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 44225
[https://github.com/apache/spark/pull/44225]

> Exclude generated files from the code coverage report
>
> Key: SPARK-46297
> URL: https://issues.apache.org/jira/browse/SPARK-46297
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra, PySpark
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> We should exclude https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/tree/python/pyspark/sql/connect/proto
[jira] [Created] (SPARK-46298) Test catalog error classes (pyspark.sql.catalog)
Hyukjin Kwon created SPARK-46298:

Summary: Test catalog error classes (pyspark.sql.catalog)
Key: SPARK-46298
URL: https://issues.apache.org/jira/browse/SPARK-46298
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon

See https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/sql/catalog.py
[jira] [Updated] (SPARK-46296) Test captured errors (pyspark.errors.exceptions.captured)
[ https://issues.apache.org/jira/browse/SPARK-46296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-46296:
Summary: Test captured errors (pyspark.errors.exceptions.captured) (was: Test captured errors of TestResult (pyspark.errors.exceptions.captured))

> Test captured errors (pyspark.errors.exceptions.captured)
>
> Key: SPARK-46296
> URL: https://issues.apache.org/jira/browse/SPARK-46296
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark, Tests
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
>
> https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/errors/exceptions/captured.py
[jira] [Updated] (SPARK-46297) Exclude generated files from the code coverage report
[ https://issues.apache.org/jira/browse/SPARK-46297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46297:
Labels: pull-request-available (was: )

> Exclude generated files from the code coverage report
>
> Key: SPARK-46297
> URL: https://issues.apache.org/jira/browse/SPARK-46297
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra, PySpark
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
>
> We should exclude https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/tree/python/pyspark/sql/connect/proto
[jira] [Created] (SPARK-46297) Exclude generated files from the code coverage report
Hyukjin Kwon created SPARK-46297:

Summary: Exclude generated files from the code coverage report
Key: SPARK-46297
URL: https://issues.apache.org/jira/browse/SPARK-46297
Project: Spark
Issue Type: Sub-task
Components: Project Infra, PySpark
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon

We should exclude https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/tree/python/pyspark/sql/connect/proto
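One way to express such an exclusion, assuming the project reports coverage through Codecov and configures it via a `codecov.yml` at the repository root (a sketch of the approach, not the actual change made for this ticket):

```yaml
# Codecov's top-level `ignore` list drops matching paths from coverage reports.
ignore:
  - "python/pyspark/sql/connect/proto"   # generated protobuf stubs
```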
[jira] [Updated] (SPARK-46296) Test captured errors of TestResult (pyspark.errors.exceptions.captured)
[ https://issues.apache.org/jira/browse/SPARK-46296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46296:
Labels: pull-request-available (was: )

> Test captured errors of TestResult (pyspark.errors.exceptions.captured)
>
> Key: SPARK-46296
> URL: https://issues.apache.org/jira/browse/SPARK-46296
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark, Tests
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
>
> https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/errors/exceptions/captured.py
[jira] [Created] (SPARK-46296) Test captured errors of TestResult (pyspark.errors.exceptions.captured)
Hyukjin Kwon created SPARK-46296:

Summary: Test captured errors of TestResult (pyspark.errors.exceptions.captured)
Key: SPARK-46296
URL: https://issues.apache.org/jira/browse/SPARK-46296
Project: Spark
Issue Type: Sub-task
Components: PySpark, Tests
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon

https://app.codecov.io/gh/apache/spark/commit/1a651753f4e760643d719add3b16acd311454c76/blob/python/pyspark/errors/exceptions/captured.py
[jira] [Resolved] (SPARK-46058) [CORE] Add separate flag for privateKeyPassword
[ https://issues.apache.org/jira/browse/SPARK-46058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan resolved SPARK-46058.
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 43998
[https://github.com/apache/spark/pull/43998]

> [CORE] Add separate flag for privateKeyPassword
>
> Key: SPARK-46058
> URL: https://issues.apache.org/jira/browse/SPARK-46058
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Hasnain Lakhani
> Assignee: Hasnain Lakhani
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Right now, with config inheritance, we support:
> * JKS with password A, PEM with password B
> * JKS with no password, PEM with password A
> * JKS and PEM with no password
>
> But we do not support the case where the JKS has a password and the PEM does not: if keyPassword is set we will attempt to use it for both, and `spark.ssl.rpc.keyPassword` cannot be set to null. So let's make it a separate flag as the easiest workaround.
>
> This was noticed while migrating some existing deployments to the RPC SSL support, where we use openssl support for RPC with a key that has no password.
[jira] [Assigned] (SPARK-46058) [CORE] Add separate flag for privateKeyPassword
[ https://issues.apache.org/jira/browse/SPARK-46058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-46058: --- Assignee: Hasnain Lakhani > [CORE] Add separate flag for privateKeyPassword > --- > > Key: SPARK-46058 > URL: https://issues.apache.org/jira/browse/SPARK-46058 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Hasnain Lakhani >Assignee: Hasnain Lakhani >Priority: Major > Labels: pull-request-available > > Right now with config inheritance we support: > * JKS with password A, PEM with password B > * JKS with no password, PEM with password A > * JKS and PEM with no password > > But we do not support the case where JKS has a password and PEM does not. If > we set keyPassword we will attempt to use it, and cannot set > `spark.ssl.rpc.keyPassword` to null. So let's make it a separate flag as the > easiest workaround. > > This was noticed while migrating some existing deployments to the RPC SSL > support where we use openssl support for RPC and use a key with no password -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46292) Show a summary of workers in MasterPage
[ https://issues.apache.org/jira/browse/SPARK-46292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46292: - Assignee: Dongjoon Hyun > Show a summary of workers in MasterPage > --- > > Key: SPARK-46292 > URL: https://issues.apache.org/jira/browse/SPARK-46292 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Web UI >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46292) Show a summary of workers in MasterPage
[ https://issues.apache.org/jira/browse/SPARK-46292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46292. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44218 [https://github.com/apache/spark/pull/44218] > Show a summary of workers in MasterPage > --- > > Key: SPARK-46292 > URL: https://issues.apache.org/jira/browse/SPARK-46292 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Web UI >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46290) Change saveMode to overwrite for DataSourceWriter constructor
[ https://issues.apache.org/jira/browse/SPARK-46290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-46290: Assignee: Allison Wang > Change saveMode to overwrite for DataSourceWriter constructor > - > > Key: SPARK-46290 > URL: https://issues.apache.org/jira/browse/SPARK-46290 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46290) Change saveMode to overwrite for DataSourceWriter constructor
[ https://issues.apache.org/jira/browse/SPARK-46290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-46290. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44216 [https://github.com/apache/spark/pull/44216] > Change saveMode to overwrite for DataSourceWriter constructor > - > > Key: SPARK-46290 > URL: https://issues.apache.org/jira/browse/SPARK-46290 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46295) TPCDS q39a and a39b have correctness issues with broadcast hash join and shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-46295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuyuki Tanimura updated SPARK-46295: -- Affects Version/s: 3.4.2 (was: 3.4.1) > TPCDS q39a and a39b have correctness issues with broadcast hash join and > shuffled hash join > --- > > Key: SPARK-46295 > URL: https://issues.apache.org/jira/browse/SPARK-46295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.2, 3.5.0, 4.0.0 >Reporter: Kazuyuki Tanimura >Priority: Major > Labels: correctness > > {{TPCDSQueryTestSuite}} fails for q39a and a39b with > {{broadcastHashJoinConf}} and {{shuffledHashJoinConf}}. It works fine with > {{sortMergeJoinConf}} > {code}SPARK_TPCDS_DATA= build/sbt "~sql/testOnly > *TPCDSQueryTestSuite -- -z q39a"{code} > {code} > [info] - q39a *** FAILED *** (19 seconds, 139 milliseconds) > [info] java.lang.Exception: Expected "...25 1.022382911080458[8 ..." but > got "...25 1.022382911080458[5 ..." > {code} > {code}SPARK_TPCDS_DATA= build/sbt "~sql/testOnly > *TPCDSQueryTestSuite -- -z q39b"{code} > {code} > [info] - q39b *** FAILED *** (19 seconds, 351 milliseconds) > [info] java.lang.Exception: Expected "...34 1.563403519178623[3 3 > 10427 2 381.25 1.0623056061004696 > [info] 3 33151 271.75 1.555976998814345 3 3315 > 2 393.75 1.0196319345405949 > [info] 3 33931 260.0 1.5009563026568116 3 3393 > 2 470.25 1.129275872154205 > [info] 4 16211 1 257.7 1.6381074811154002] > 4 16211 2 352.25 1", but got "...34 1.563403519178623[5 > 3 10427 2 381.25 1.0623056061004696 > [info] 3 33151 271.75 1.555976998814345 3 3315 > 2 393.75 1.0196319345405949 > [info] 3 33931 260.0 1.5009563026568118 3 3393 > 2 470.25 1.129275872154205 > [info] 4 16211 1 257.7 1.6381074811154] > 4 16211 2 352.25 1" Result did not match > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46295) TPCDS q39a and a39b have correctness issues with broadcast hash join and shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-46295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuyuki Tanimura updated SPARK-46295: -- Labels: correctness (was: ) > TPCDS q39a and a39b have correctness issues with broadcast hash join and > shuffled hash join > --- > > Key: SPARK-46295 > URL: https://issues.apache.org/jira/browse/SPARK-46295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0, 4.0.0 >Reporter: Kazuyuki Tanimura >Priority: Major > Labels: correctness > > {{TPCDSQueryTestSuite}} fails for q39a and a39b with > {{broadcastHashJoinConf}} and {{shuffledHashJoinConf}}. It works fine with > {{sortMergeJoinConf}} > {code}SPARK_TPCDS_DATA= build/sbt "~sql/testOnly > *TPCDSQueryTestSuite -- -z q39a"{code} > {code} > [info] - q39a *** FAILED *** (19 seconds, 139 milliseconds) > [info] java.lang.Exception: Expected "...25 1.022382911080458[8 ..." but > got "...25 1.022382911080458[5 ..." > {code} > {code}SPARK_TPCDS_DATA= build/sbt "~sql/testOnly > *TPCDSQueryTestSuite -- -z q39b"{code} > {code} > [info] - q39b *** FAILED *** (19 seconds, 351 milliseconds) > [info] java.lang.Exception: Expected "...34 1.563403519178623[3 3 > 10427 2 381.25 1.0623056061004696 > [info] 3 33151 271.75 1.555976998814345 3 3315 > 2 393.75 1.0196319345405949 > [info] 3 33931 260.0 1.5009563026568116 3 3393 > 2 470.25 1.129275872154205 > [info] 4 16211 1 257.7 1.6381074811154002] > 4 16211 2 352.25 1", but got "...34 1.563403519178623[5 > 3 10427 2 381.25 1.0623056061004696 > [info] 3 33151 271.75 1.555976998814345 3 3315 > 2 393.75 1.0196319345405949 > [info] 3 33931 260.0 1.5009563026568118 3 3393 > 2 470.25 1.129275872154205 > [info] 4 16211 1 257.7 1.6381074811154] > 4 16211 2 352.25 1" Result did not match > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46294) Clean up initValue vs zeroValue semantics in SQLMetrics
[ https://issues.apache.org/jira/browse/SPARK-46294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46294: --- Labels: pull-request-available (was: ) > Clean up initValue vs zeroValue semantics in SQLMetrics > --- > > Key: SPARK-46294 > URL: https://issues.apache.org/jira/browse/SPARK-46294 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Davin Tjong >Priority: Minor > Labels: pull-request-available > > The semantics of initValue and _zeroValue in SQLMetrics are a little > confusing, since they effectively mean the same thing. Changing it to the > following would be clearer, especially in terms of defining what an "invalid" > metric is. > > proposed definitions: > > initValue is the starting value for a SQLMetric. If a metric has value equal > to its initValue, then it should be filtered out before aggregating with > SQLMetrics.stringValue(). > > zeroValue defines the lowest value considered valid. If a SQLMetric is > invalid, it is set to zeroValue upon receiving any updates, and it also > reports zeroValue as its value to avoid exposing it to the user > programmatically (a concern previously addressed in SPARK-41442). > For many SQLMetrics, we use initValue = -1 and zeroValue = 0 to indicate that > the metric is by default invalid. At the end of a task, we will update the > metric making it valid, and the invalid metrics will be filtered out when > calculating min, max, etc. as a workaround for SPARK-11013. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46295) TPCDS q39a and a39b have correctness issues with broadcast hash join and shuffled hash join
Kazuyuki Tanimura created SPARK-46295: - Summary: TPCDS q39a and a39b have correctness issues with broadcast hash join and shuffled hash join Key: SPARK-46295 URL: https://issues.apache.org/jira/browse/SPARK-46295 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0, 3.4.1, 4.0.0 Reporter: Kazuyuki Tanimura {{TPCDSQueryTestSuite}} fails for q39a and a39b with {{broadcastHashJoinConf}} and {{shuffledHashJoinConf}}. It works fine with {{sortMergeJoinConf}} {code}SPARK_TPCDS_DATA= build/sbt "~sql/testOnly *TPCDSQueryTestSuite -- -z q39a"{code} {code} [info] - q39a *** FAILED *** (19 seconds, 139 milliseconds) [info] java.lang.Exception: Expected "...25 1.022382911080458[8 ..." but got "...25 1.022382911080458[5 ..." {code} {code}SPARK_TPCDS_DATA= build/sbt "~sql/testOnly *TPCDSQueryTestSuite -- -z q39b"{code} {code} [info] - q39b *** FAILED *** (19 seconds, 351 milliseconds) [info] java.lang.Exception: Expected "...34 1.563403519178623[3 3 10427 2 381.25 1.0623056061004696 [info] 3 3315 1 271.75 1.555976998814345 3 3315 2 393.75 1.0196319345405949 [info] 3 3393 1 260.0 1.5009563026568116 3 3393 2 470.25 1.129275872154205 [info] 4 16211 1 257.7 1.6381074811154002] 4 16211 2 352.25 1", but got "...34 1.563403519178623[5 3 10427 2 381.25 1.0623056061004696 [info] 3 3315 1 271.75 1.555976998814345 3 3315 2 393.75 1.0196319345405949 [info] 3 3393 1 260.0 1.5009563026568118 3 3393 2 470.25 1.129275872154205 [info] 4 16211 1 257.7 1.6381074811154] 4 16211 2 352.25 1" Result did not match {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
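As context for the diffs above (an observation, not the ticket's conclusion): the expected and actual values differ only in their final digits, which is the signature of floating-point addition being non-associative. A different physical plan (e.g. broadcast hash join instead of sort-merge join) can aggregate the same rows in a different order and land on a result that differs in the last ulp:

```python
# Floating-point addition is not associative: summing identical values in
# a different order can change the last digits of the result.
a = (0.1 + 0.2) + 0.3  # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)  # 0.6

assert a != b                 # the two orders disagree...
assert abs(a - b) < 1e-15     # ...but only in the final ulp
```

This is why TPC-DS result checking that compares decimal strings exactly can flag a plan change as a "correctness" difference even when both results are within rounding of each other.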
[jira] [Created] (SPARK-46294) Clean up initValue vs zeroValue semantics in SQLMetrics
Davin Tjong created SPARK-46294: --- Summary: Clean up initValue vs zeroValue semantics in SQLMetrics Key: SPARK-46294 URL: https://issues.apache.org/jira/browse/SPARK-46294 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.0 Reporter: Davin Tjong The semantics of initValue and _zeroValue in SQLMetrics are a little confusing, since they effectively mean the same thing. Changing it to the following would be clearer, especially in terms of defining what an "invalid" metric is. proposed definitions: initValue is the starting value for a SQLMetric. If a metric has value equal to its initValue, then it should be filtered out before aggregating with SQLMetrics.stringValue(). zeroValue defines the lowest value considered valid. If a SQLMetric is invalid, it is set to zeroValue upon receiving any updates, and it also reports zeroValue as its value to avoid exposing it to the user programmatically (a concern previously addressed in SPARK-41442). For many SQLMetrics, we use initValue = -1 and zeroValue = 0 to indicate that the metric is by default invalid. At the end of a task, we will update the metric making it valid, and the invalid metrics will be filtered out when calculating min, max, etc. as a workaround for SPARK-11013. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46294) Clean up initValue vs zeroValue semantics in SQLMetrics
[ https://issues.apache.org/jira/browse/SPARK-46294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davin Tjong updated SPARK-46294: Component/s: SQL (was: Spark Core) > Clean up initValue vs zeroValue semantics in SQLMetrics > --- > > Key: SPARK-46294 > URL: https://issues.apache.org/jira/browse/SPARK-46294 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Davin Tjong >Priority: Minor > > The semantics of initValue and _zeroValue in SQLMetrics are a little > confusing, since they effectively mean the same thing. Changing it to the > following would be clearer, especially in terms of defining what an "invalid" > metric is. > > proposed definitions: > > initValue is the starting value for a SQLMetric. If a metric has value equal > to its initValue, then it should be filtered out before aggregating with > SQLMetrics.stringValue(). > > zeroValue defines the lowest value considered valid. If a SQLMetric is > invalid, it is set to zeroValue upon receiving any updates, and it also > reports zeroValue as its value to avoid exposing it to the user > programmatically (a concern previously addressed in SPARK-41442). > For many SQLMetrics, we use initValue = -1 and zeroValue = 0 to indicate that > the metric is by default invalid. At the end of a task, we will update the > metric making it valid, and the invalid metrics will be filtered out when > calculating min, max, etc. as a workaround for SPARK-11013. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
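The proposed initValue/zeroValue definitions can be sketched in a few lines. The following is a hypothetical Python model of the semantics described in the ticket (class and method names are illustrative), not Spark's actual Scala SQLMetric implementation:

```python
# Hypothetical sketch of the proposed initValue / zeroValue semantics.
# Not Spark's real SQLMetric; names here are illustrative only.

class SQLMetricSketch:
    def __init__(self, init_value=-1, zero_value=0):
        self.init_value = init_value  # sentinel: metric starts out "invalid"
        self.zero_value = zero_value  # lowest value considered valid
        self._value = init_value

    @property
    def is_valid(self):
        return self._value != self.init_value

    def add(self, v):
        # An invalid metric is reset to zero_value on its first update.
        if not self.is_valid:
            self._value = self.zero_value
        self._value += v

    @property
    def value(self):
        # Report zero_value instead of the sentinel, so callers never
        # observe -1 programmatically (the SPARK-41442 concern).
        return self._value if self.is_valid else self.zero_value


def aggregate(metrics):
    # Metrics still at initValue are filtered out before min/max/etc.,
    # mirroring the SPARK-11013 workaround described above.
    valid = [m.value for m in metrics if m.is_valid]
    return (min(valid), max(valid)) if valid else None
```

With `initValue = -1` and `zeroValue = 0`, an untouched metric reports 0 but is excluded from aggregation, while any update first snaps it to 0 and then accumulates normally.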
[jira] [Updated] (SPARK-44976) Preserve full principal user name on executor side
[ https://issues.apache.org/jira/browse/SPARK-44976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44976: --- Labels: pull-request-available (was: ) > Preserve full principal user name on executor side > -- > > Key: SPARK-44976 > URL: https://issues.apache.org/jira/browse/SPARK-44976 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.3, 3.3.3, 3.4.1 >Reporter: YUBI LEE >Priority: Major > Labels: pull-request-available > > SPARK-6558 changes the behavior of {{Utils.getCurrentUserName()}} to use > shortname instead of full principal name. > Due to this, it doesn't respect {{hadoop.security.auth_to_local}} rule on the > side of non-kerberized hdfs namenode. > For example, I use 2 hdfs cluster. One is kerberized, the other one is not > kerberized. > I make a rule to add some prefix to username on the non-kerberized cluster if > some one access it from the kerberized cluster. > {code} > > hadoop.security.auth_to_local > > RULE:[1:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/ > RULE:[2:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/ > DEFAULT > > {code} > However, if I submit spark job with keytab & principal option, hdfs directory > and files ownership is not coherent. > (I change some words for privacy.) 
> {code} > $ hdfs dfs -ls hdfs:///user/eub/some/path/20230510/23 > Found 52 items > -rw-rw-rw- 3 _ex_eub hdfs 0 2023-05-11 00:16 > hdfs:///user/eub/some/path/20230510/23/_SUCCESS > -rw-r--r-- 3 eub hdfs 134418857 2023-05-11 00:15 > hdfs:///user/eub/some/path/20230510/23/part-0-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz > -rw-r--r-- 3 eub hdfs 153410049 2023-05-11 00:16 > hdfs:///user/eub/some/path/20230510/23/part-1-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz > -rw-r--r-- 3 eub hdfs 157260989 2023-05-11 00:16 > hdfs:///user/eub/some/path/20230510/23/part-2-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz > -rw-r--r-- 3 eub hdfs 156222760 2023-05-11 00:16 > hdfs:///user/eub/some/path/20230510/23/part-3-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz > {code} > Another interesting point is that if I submit a spark job without the keytab and > principal options but with kerberos authentication via {{kinit}}, it will not > follow the {{hadoop.security.auth_to_local}} rule completely. > {code} > $ hdfs dfs -ls hdfs:///user/eub/output/ > Found 3 items > -rw-rw-r--+ 3 eub hdfs 0 2023-08-25 12:31 > hdfs:///user/eub/output/_SUCCESS > -rw-rw-r--+ 3 eub hdfs 512 2023-08-25 12:31 > hdfs:///user/eub/output/part-0.gz > -rw-rw-r--+ 3 eub hdfs 574 2023-08-25 12:31 > hdfs:///user/eub/output/part-1.gz > {code} > I finally found that if I submit a spark job with the {{--principal}} and > {{--keytab}} options, the ugi will be different. > (refer to > https://github.com/apache/spark/blob/2583bd2c16a335747895c0843f438d0966f47ecd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L905). > Only the file ({{_SUCCESS}}) and output directory created by the driver (application master side) will respect {{hadoop.security.auth_to_local}} on the > non-kerberized namenode, and only if the {{--principal}} and {{--keytab}} options are > provided.
> No matter how hdfs files or directory are created by executor or driver, > those should respect {{hadoop.security.auth_to_local}} rule and should be the > same. > Workaround is to pass additional argument to change {{SPARK_USER}} on the > executor side. > e.g. {{--conf spark.executorEnv.SPARK_USER=_ex_eub}} > {{--conf spark.yarn.appMasterEnv.SPARK_USER=_ex_eub}} will make an error. > There are some logics to append environment value with {{:}} (colon) as a > separator. > - > https://github.com/apache/spark/blob/4748d858b4478ea7503b792050d4735eae83b3cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L893 > - > https://github.com/apache/spark/blob/4748d858b4478ea7503b792050d4735eae83b3cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L52 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
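For readers unfamiliar with {{hadoop.security.auth_to_local}}, the rule quoted above matches a Kerberos principal against a pattern and then applies a sed-style substitution to derive the local user name. A simplified Python illustration of that match-then-substitute mechanic (Hadoop's real parser handles the full RULE grammar, numbered components, and DEFAULT fallback; this sketch only mimics the final step):

```python
import re

# Simplified model of one auth_to_local rule such as
#   RULE:[1:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
# Hadoop's actual implementation is more elaborate; this is illustrative.

def apply_rule(principal, match_pattern, sed_pattern, replacement):
    """Return the rewritten local name, or None if the rule doesn't apply."""
    if re.fullmatch(match_pattern, principal):
        return re.sub(sed_pattern, replacement, principal)
    return None

# A principal from the kerberized cluster gets the "_ex_" prefix:
short = apply_rule("eub@EXAMPLE.COM", r".*@EXAMPLE.COM", r"(.+)@.*", r"_ex_\1")
# short == "_ex_eub"
```

The bug is that the executor side resolves only the short name ("eub") and never runs this mapping against the non-kerberized namenode's rules, which is why file ownership ends up split between "eub" and "_ex_eub".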
[jira] [Updated] (SPARK-46293) Add protobuf to required dependency for Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-46293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46293: --- Labels: pull-request-available (was: ) > Add protobuf to required dependency for Spark Connect > - > > Key: SPARK-46293 > URL: https://issues.apache.org/jira/browse/SPARK-46293 > Project: Spark > Issue Type: Bug > Components: Connect, Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Add missing required package for docs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46293) Add protobuf to required dependency for Spark Connect
Haejoon Lee created SPARK-46293: --- Summary: Add protobuf to required dependency for Spark Connect Key: SPARK-46293 URL: https://issues.apache.org/jira/browse/SPARK-46293 Project: Spark Issue Type: Bug Components: Connect, Documentation, PySpark Affects Versions: 4.0.0 Reporter: Haejoon Lee Add missing required package for docs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45580) Subquery changes the output schema of the outer query
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45580: -- Fix Version/s: 3.5.1 > Subquery changes the output schema of the outer query > - > > Key: SPARK-45580 > URL: https://issues.apache.org/jira/browse/SPARK-45580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Blocker > Labels: correctness, pull-request-available > Fix For: 4.0.0, 3.5.1 > > > A query can have an incorrect output schema because of a subquery. > Assume this data: > {noformat} > create or replace temp view t1(a) as values (1), (2), (3), (7); > create or replace temp view t2(c1) as values (1), (2), (3); > create or replace temp view t3(col1) as values (3), (9); > cache table t1; > cache table t2; > cache table t3; > {noformat} > When run in {{spark-sql}}, the following query has a superfluous boolean > column: > {noformat} > select * > from t1 > where exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ); > 1 false > 2 false > 3 true > {noformat} > The result should be: > {noformat} > 1 > 2 > 3 > {noformat} > When executed via the {{Dataset}} API, you don't see the incorrect result, > because the Dataset API truncates the right-side of the rows based on the > analyzed plan's schema (it's the optimized plan's schema that goes wrong). 
> However, even with the {{Dataset}} API, this query goes wrong: > {noformat} > select ( > select * > from t1 > where exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ) > limit 1 > ) > from range(1); > java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; > something went wrong in analysis > at scala.Predef$.assert(Predef.scala:279) > at > org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > at scala.collection.AbstractIterable.foreach(Iterable.scala:933) > ... > {noformat} > Other queries that have the wrong schema: > {noformat} > select * > from t1 > where a in ( > select c1 > from t2 > where a in (select col1 from t3) > ); > {noformat} > and > {noformat} > select * > from t1 > where not exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ); > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46274) Range operator computeStats() proper long conversions
[ https://issues.apache.org/jira/browse/SPARK-46274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-46274: --- Assignee: Kelvin Jiang > Range operator computeStats() proper long conversions > - > > Key: SPARK-46274 > URL: https://issues.apache.org/jira/browse/SPARK-46274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Kelvin Jiang >Assignee: Kelvin Jiang >Priority: Major > Labels: pull-request-available > > Range operator's `computeStats()` function unsafely casts from `BigInt` to > `Long` and causes issues downstream with statistics estimation. Adds bounds > checking to avoid crashing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46274) Range operator computeStats() proper long conversions
[ https://issues.apache.org/jira/browse/SPARK-46274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46274. - Fix Version/s: 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 44191 [https://github.com/apache/spark/pull/44191] > Range operator computeStats() proper long conversions > - > > Key: SPARK-46274 > URL: https://issues.apache.org/jira/browse/SPARK-46274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Kelvin Jiang >Assignee: Kelvin Jiang >Priority: Major > Labels: pull-request-available > Fix For: 3.5.1, 4.0.0 > > > Range operator's `computeStats()` function unsafely casts from `BigInt` to > `Long` and causes issues downstream with statistics estimation. Adds bounds > checking to avoid crashing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
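The fix described above, bounds checking the `BigInt` to `Long` conversion, amounts to saturating at the 64-bit limits instead of letting the cast wrap around to a negative value. A small Python sketch of the idea (illustrative names, not the actual patch):

```python
# Sketch of a bounds-checked big-integer -> 64-bit-long conversion, the
# kind of guard described for Range.computeStats(). Names are illustrative.

LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1

def to_long_clamped(n):
    # Saturate at the Long bounds: a huge row-count estimate becomes
    # Long.MaxValue instead of wrapping to a nonsense negative number.
    return max(LONG_MIN, min(LONG_MAX, n))

def wrapping_cast(n):
    # What an unchecked narrowing cast does: keep only the low 64 bits,
    # reinterpreted as two's complement -- the source of the bug.
    m = n & (2**64 - 1)
    return m - 2**64 if m >= 2**63 else m
```

For example, `wrapping_cast(2**63)` comes out negative, which would poison downstream statistics estimation, while `to_long_clamped(2**63)` saturates at `Long.MaxValue`.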
[jira] [Created] (SPARK-46292) Show a summary of workers in MasterPage
Dongjoon Hyun created SPARK-46292: - Summary: Show a summary of workers in MasterPage Key: SPARK-46292 URL: https://issues.apache.org/jira/browse/SPARK-46292 Project: Spark Issue Type: Sub-task Components: Spark Core, Web UI Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46292) Show a summary of workers in MasterPage
[ https://issues.apache.org/jira/browse/SPARK-46292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46292: --- Labels: pull-request-available (was: ) > Show a summary of workers in MasterPage > --- > > Key: SPARK-46292 > URL: https://issues.apache.org/jira/browse/SPARK-46292 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Web UI >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46290) Change saveMode to overwrite for DataSourceWriter constructor
[ https://issues.apache.org/jira/browse/SPARK-46290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46290: --- Labels: pull-request-available (was: ) > Change saveMode to overwrite for DataSourceWriter constructor > - > > Key: SPARK-46290 > URL: https://issues.apache.org/jira/browse/SPARK-46290 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46291) Koalas Testing Migration
[ https://issues.apache.org/jira/browse/SPARK-46291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-46291: - Description: Test migration from Koalas to Spark repository, including setting up the testing environment and dependencies, and CI jobs. > Koalas Testing Migration > > > Key: SPARK-46291 > URL: https://issues.apache.org/jira/browse/SPARK-46291 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > Test migration from Koalas to Spark repository, including setting up the > testing environment and dependencies, and CI jobs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46291) Koalas Testing Migration
[ https://issues.apache.org/jira/browse/SPARK-46291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-46291: - Summary: Koalas Testing Migration (was: Testing migration) > Koalas Testing Migration > > > Key: SPARK-46291 > URL: https://issues.apache.org/jira/browse/SPARK-46291 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46291) Testing migration
[ https://issues.apache.org/jira/browse/SPARK-46291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng reassigned SPARK-46291: Assignee: Xinrong Meng > Testing migration > - > > Key: SPARK-46291 > URL: https://issues.apache.org/jira/browse/SPARK-46291 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46291) Testing migration
[ https://issues.apache.org/jira/browse/SPARK-46291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-46291. -- Resolution: Done > Testing migration > - > > Key: SPARK-46291 > URL: https://issues.apache.org/jira/browse/SPARK-46291 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34999) Consolidate PySpark testing utils
[ https://issues.apache.org/jira/browse/SPARK-34999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-34999: - Parent Issue: SPARK-46291 (was: SPARK-34849) > Consolidate PySpark testing utils > - > > Key: SPARK-34999 > URL: https://issues.apache.org/jira/browse/SPARK-34999 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > `python/pyspark/pandas/testing` holds test utilities for pandas-on-spark, and > `python/pyspark/testing` contains test utilities for pyspark. Consolidating > them makes the code cleaner and easier to maintain. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35012) Port Koalas DataFrame related unit tests into PySpark
[ https://issues.apache.org/jira/browse/SPARK-35012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35012: - Parent Issue: SPARK-46291 (was: SPARK-34849) > Port Koalas DataFrame related unit tests into PySpark > - > > Key: SPARK-35012 > URL: https://issues.apache.org/jira/browse/SPARK-35012 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > This JIRA aims to port Koalas DataFrame related unit tests to [PySpark > tests|https://github.com/apache/spark/tree/master/python/pyspark/tests]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35300) Standardize module name in install.rst
[ https://issues.apache.org/jira/browse/SPARK-35300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35300: - Parent Issue: SPARK-46291 (was: SPARK-34849) > Standardize module name in install.rst > -- > > Key: SPARK-35300 > URL: https://issues.apache.org/jira/browse/SPARK-35300 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > We should use the full names of modules in install.rst. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35034) Port Koalas miscellaneous unit tests into PySpark
[ https://issues.apache.org/jira/browse/SPARK-35034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35034: - Parent Issue: SPARK-46291 (was: SPARK-34849) > Port Koalas miscellaneous unit tests into PySpark > - > > Key: SPARK-35034 > URL: https://issues.apache.org/jira/browse/SPARK-35034 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > This JIRA aims to port Koalas miscellaneous unit tests to [PySpark > tests|https://github.com/apache/spark/tree/master/python/pyspark/tests]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35035) Port Koalas internal implementation unit tests into PySpark
[ https://issues.apache.org/jira/browse/SPARK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35035: - Parent Issue: SPARK-46291 (was: SPARK-34849) > Port Koalas internal implementation unit tests into PySpark > --- > > Key: SPARK-35035 > URL: https://issues.apache.org/jira/browse/SPARK-35035 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > This JIRA aims to port Koalas internal implementation related unit tests to > [PySpark > tests|https://github.com/apache/spark/tree/master/python/pyspark/tests]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35040) Remove Spark-version related codes from test codes.
[ https://issues.apache.org/jira/browse/SPARK-35040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35040: - Parent Issue: SPARK-46291 (was: SPARK-34849) > Remove Spark-version related codes from test codes. > --- > > Key: SPARK-35040 > URL: https://issues.apache.org/jira/browse/SPARK-35040 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > There are several places that check the PySpark version to switch tests, > but those checks are no longer necessary. > We should remove them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35098) Revisit pandas-on-Spark test cases that are disabled because of pandas nondeterministic return values
[ https://issues.apache.org/jira/browse/SPARK-35098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35098: - Parent Issue: SPARK-46291 (was: SPARK-34849) > Revisit pandas-on-Spark test cases that are disabled because of pandas > nondeterministic return values > - > > Key: SPARK-35098 > URL: https://issues.apache.org/jira/browse/SPARK-35098 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > Some test cases have been disabled in the places as shown below because of > pandas nondeterministic return values: > * pandas returns `None` or `nan` randomly > python/pyspark/pandas/tests/test_series.py test_astype > * pandas returns `True` or `False` randomly > python/pyspark/pandas/tests/indexes/test_base.py test_monotonic > We should revisit them later. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35033) Port Koalas plot unit tests into PySpark
[ https://issues.apache.org/jira/browse/SPARK-35033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35033: - Parent Issue: SPARK-46291 (was: SPARK-34849) > Port Koalas plot unit tests into PySpark > > > Key: SPARK-35033 > URL: https://issues.apache.org/jira/browse/SPARK-35033 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > This JIRA aims to port Koalas plot unit tests to [PySpark > tests|https://github.com/apache/spark/tree/master/python/pyspark/tests]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35032) Port Koalas Index unit tests into PySpark
[ https://issues.apache.org/jira/browse/SPARK-35032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35032: - Parent Issue: SPARK-46291 (was: SPARK-34849) > Port Koalas Index unit tests into PySpark > - > > Key: SPARK-35032 > URL: https://issues.apache.org/jira/browse/SPARK-35032 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > This JIRA aims to port Koalas Index unit tests to [PySpark > tests|https://github.com/apache/spark/tree/master/python/pyspark/tests]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35031) Port Koalas operations on different frames tests into PySpark
[ https://issues.apache.org/jira/browse/SPARK-35031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35031: - Parent Issue: SPARK-46291 (was: SPARK-34849) > Port Koalas operations on different frames tests into PySpark > - > > Key: SPARK-35031 > URL: https://issues.apache.org/jira/browse/SPARK-35031 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > This JIRA aims to port Koalas operations on different frames related unit > tests to [PySpark > tests|https://github.com/apache/spark/tree/master/python/pyspark/tests]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34996) Port Koalas Series related unit tests into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-34996: - Parent Issue: SPARK-46291 (was: SPARK-34849) > Port Koalas Series related unit tests into PySpark > -- > > Key: SPARK-34996 > URL: https://issues.apache.org/jira/browse/SPARK-34996 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > This JIRA aims to port Koalas Series related unit tests to [PySpark > tests|https://github.com/apache/spark/tree/master/python/pyspark/tests]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34887) Port/integrate Koalas dependencies into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-34887: - Parent Issue: SPARK-46291 (was: SPARK-34849) > Port/integrate Koalas dependencies into PySpark > --- > > Key: SPARK-34887 > URL: https://issues.apache.org/jira/browse/SPARK-34887 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > This JIRA aims to port Koalas dependencies appropriately to PySpark > dependencies. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34886) Port/integrate Koalas DataFrame unit test into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-34886: - Parent Issue: SPARK-46291 (was: SPARK-34849) > Port/integrate Koalas DataFrame unit test into PySpark > -- > > Key: SPARK-34886 > URL: https://issues.apache.org/jira/browse/SPARK-34886 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > This JIRA aims to port [Koalas DataFrame > test|https://github.com/databricks/koalas/tree/master/databricks/koalas/tests/test_dataframe.py] > appropriately to [PySpark > tests|https://github.com/apache/spark/tree/master/python/pyspark/tests]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46291) Testing migration
Xinrong Meng created SPARK-46291: Summary: Testing migration Key: SPARK-46291 URL: https://issues.apache.org/jira/browse/SPARK-46291 Project: Spark Issue Type: Umbrella Components: PySpark Affects Versions: 3.2.0 Reporter: Xinrong Meng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46275) Protobuf: Permissive mode should return null rather than struct with null fields
[ https://issues.apache.org/jira/browse/SPARK-46275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46275: --- Labels: pull-request-available (was: ) > Protobuf: Permissive mode should return null rather than struct with null > fields > > > Key: SPARK-46275 > URL: https://issues.apache.org/jira/browse/SPARK-46275 > Project: Spark > Issue Type: Bug > Components: Protobuf, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.1 > > > Consider a protobuf message with two fields: {{message Person { string name = 1; int32 > id = 2; }}} > * The struct returned by {{from_protobuf("Person")}} looks like this: > ** STRUCT<name: string, id: int> > * If the underlying binary record fails to deserialize, it results in an > exception and the query fails. > * But if the option {{mode}} is set to {{PERMISSIVE}}, malformed records > are tolerated and {{null}} is returned. > ** {*}BUT{*}: the returned struct looks like this: \{"name": null, "id": null} > ** This is not convenient for the user. *Ideally,* {{from_protobuf()}} *should return* {{null}}. > ** {{from_protobuf()}} borrowed the current behavior from the {{from_avro()}} > implementation. It is not clear what the motivation was. > I think we should update the implementation to return {{null}} rather than a > struct with null fields inside. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
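The two PERMISSIVE-mode behaviors discussed in SPARK-46275 can be sketched in plain Python, independent of Spark. Here `decode_person` is a hypothetical stand-in for protobuf deserialization, not the real `from_protobuf` implementation:

```python
# Sketch of current vs. proposed PERMISSIVE-mode semantics (no Spark needed).
# decode_person is a hypothetical decoder that accepts only b'name:id' payloads.

def decode_person(data: bytes) -> dict:
    """Pretend protobuf decoder: raises on malformed input."""
    name, sep, id_ = data.partition(b":")
    if not sep:
        raise ValueError("malformed record")
    return {"name": name.decode(), "id": int(id_)}

def from_protobuf_current(data: bytes):
    """Current behavior: a struct whose fields are all null."""
    try:
        return decode_person(data)
    except Exception:
        return {"name": None, "id": None}

def from_protobuf_proposed(data: bytes):
    """Proposed behavior: the whole struct is null."""
    try:
        return decode_person(data)
    except Exception:
        return None
```

For a malformed payload the current behavior yields a dict of nulls, while the proposed behavior yields `None` outright, which is what makes malformed rows easy to filter with a simple `IS NULL` predicate.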
[jira] [Resolved] (SPARK-45580) Subquery changes the output schema of the outer query
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45580. --- Fix Version/s: 4.0.0 Resolution: Fixed > Subquery changes the output schema of the outer query > - > > Key: SPARK-45580 > URL: https://issues.apache.org/jira/browse/SPARK-45580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Blocker > Labels: correctness, pull-request-available > Fix For: 4.0.0 > > > A query can have an incorrect output schema because of a subquery. > Assume this data: > {noformat} > create or replace temp view t1(a) as values (1), (2), (3), (7); > create or replace temp view t2(c1) as values (1), (2), (3); > create or replace temp view t3(col1) as values (3), (9); > cache table t1; > cache table t2; > cache table t3; > {noformat} > When run in {{spark-sql}}, the following query has a superfluous boolean > column: > {noformat} > select * > from t1 > where exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ); > 1 false > 2 false > 3 true > {noformat} > The result should be: > {noformat} > 1 > 2 > 3 > {noformat} > When executed via the {{Dataset}} API, you don't see the incorrect result, > because the Dataset API truncates the right-side of the rows based on the > analyzed plan's schema (it's the optimized plan's schema that goes wrong). 
> However, even with the {{Dataset}} API, this query goes wrong: > {noformat} > select ( > select * > from t1 > where exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ) > limit 1 > ) > from range(1); > java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; > something went wrong in analysis > at scala.Predef$.assert(Predef.scala:279) > at > org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > at scala.collection.AbstractIterable.foreach(Iterable.scala:933) > ... > {noformat} > Other queries that have the wrong schema: > {noformat} > select * > from t1 > where a in ( > select c1 > from t2 > where a in (select col1 from t3) > ); > {noformat} > and > {noformat} > select * > from t1 > where not exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ); > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
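The reason the Dataset API masks the superfluous boolean column can be sketched in plain Python. This is a deliberate simplification of Dataset row collection, not Spark's actual code: rows are cut down to the width of the analyzed plan's schema, so an extra trailing column produced by the optimized plan is silently dropped:

```python
# Simplified model of why the Dataset API hides the extra column in SPARK-45580:
# collected rows are truncated to the analyzed plan's schema width.

analyzed_schema = ["a"]  # the single column the query is supposed to return
# The optimized plan leaks an extra boolean column into each row:
optimized_rows = [(1, False), (2, False), (3, True)]

def truncate_to_schema(rows, schema):
    """Keep only as many leading fields as the schema declares."""
    return [row[: len(schema)] for row in rows]

visible_rows = truncate_to_schema(optimized_rows, analyzed_schema)
```

After truncation only `(1,)`, `(2,)`, `(3,)` remain, matching the correct `spark-sql` output; the scalar-subquery case fails instead because `ScalarSubquery.updateResult` asserts on the untruncated field count.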
[jira] [Assigned] (SPARK-46230) Migrate RetriesExceeded into PySpark error.
[ https://issues.apache.org/jira/browse/SPARK-46230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46230: - Assignee: Haejoon Lee > Migrate RetriesExceeded into PySpark error. > --- > > Key: SPARK-46230 > URL: https://issues.apache.org/jira/browse/SPARK-46230 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46230) Migrate RetriesExceeded into PySpark error.
[ https://issues.apache.org/jira/browse/SPARK-46230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46230. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44147 [https://github.com/apache/spark/pull/44147] > Migrate RetriesExceeded into PySpark error. > --- > > Key: SPARK-46230 > URL: https://issues.apache.org/jira/browse/SPARK-46230 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46270) Use java14 instanceof expressions to replace the java8 instanceof statement
[ https://issues.apache.org/jira/browse/SPARK-46270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46270. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44187 [https://github.com/apache/spark/pull/44187] > Use java14 instanceof expressions to replace the java8 instanceof statement > --- > > Key: SPARK-46270 > URL: https://issues.apache.org/jira/browse/SPARK-46270 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46290) Change saveMode to overwrite for DataSourceWriter constructor
Allison Wang created SPARK-46290: Summary: Change saveMode to overwrite for DataSourceWriter constructor Key: SPARK-46290 URL: https://issues.apache.org/jira/browse/SPARK-46290 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45580) Subquery changes the output schema of the outer query
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45580: - Assignee: Bruce Robbins > Subquery changes the output schema of the outer query > - > > Key: SPARK-45580 > URL: https://issues.apache.org/jira/browse/SPARK-45580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Blocker > Labels: correctness, pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45580) Subquery changes the output schema of the outer query
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45580: -- Target Version/s: 3.3.4 > Subquery changes the output schema of the outer query > - > > Key: SPARK-45580 > URL: https://issues.apache.org/jira/browse/SPARK-45580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Priority: Blocker > Labels: correctness, pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45580) Subquery changes the output schema of the outer query
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45580: -- Labels: correctness pull-request-available (was: pull-request-available) > Subquery changes the output schema of the outer query > - > > Key: SPARK-45580 > URL: https://issues.apache.org/jira/browse/SPARK-45580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness, pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45580) Subquery changes the output schema of the outer query
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45580: -- Priority: Blocker (was: Major) > Subquery changes the output schema of the outer query > - > > Key: SPARK-45580 > URL: https://issues.apache.org/jira/browse/SPARK-45580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Priority: Blocker > Labels: correctness, pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45580) Subquery changes the output schema of the outer query
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793889#comment-17793889 ] Dongjoon Hyun commented on SPARK-45580: --- I raised this issue to the blocker for Apache Spark 3.3.4. > Subquery changes the output schema of the outer query > - > > Key: SPARK-45580 > URL: https://issues.apache.org/jira/browse/SPARK-45580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Priority: Blocker > Labels: correctness, pull-request-available > > A query can have an incorrect output schema because of a subquery. > Assume this data: > {noformat} > create or replace temp view t1(a) as values (1), (2), (3), (7); > create or replace temp view t2(c1) as values (1), (2), (3); > create or replace temp view t3(col1) as values (3), (9); > cache table t1; > cache table t2; > cache table t3; > {noformat} > When run in {{spark-sql}}, the following query has a superfluous boolean > column: > {noformat} > select * > from t1 > where exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ); > 1 false > 2 false > 3 true > {noformat} > The result should be: > {noformat} > 1 > 2 > 3 > {noformat} > When executed via the {{Dataset}} API, you don't see the incorrect result, > because the Dataset API truncates the right-side of the rows based on the > analyzed plan's schema (it's the optimized plan's schema that goes wrong). 
> However, even with the {{Dataset}} API, this query goes wrong: > {noformat} > select ( > select * > from t1 > where exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ) > limit 1 > ) > from range(1); > java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; > something went wrong in analysis > at scala.Predef$.assert(Predef.scala:279) > at > org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > at scala.collection.AbstractIterable.foreach(Iterable.scala:933) > ... > {noformat} > Other queries that have the wrong schema: > {noformat} > select * > from t1 > where a in ( > select c1 > from t2 > where a in (select col1 from t3) > ); > {noformat} > and > {noformat} > select * > from t1 > where not exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ); > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
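The schema leak described in SPARK-45580 can be illustrated without Spark. The sketch below is a hedged, plain-Python analogue (all names are hypothetical, not Spark internals): an EXISTS predicate is typically rewritten into an "existence join" that appends a boolean flag column, and if the plan keeps that flag instead of projecting it away, the extra column surfaces in the output schema, much like the superfluous boolean column shown above.

```python
# Hypothetical sketch (plain Python, not Spark internals): an existence
# join appends a boolean flag per left row; the correct plan filters on
# the flag and then projects it away, while the buggy plan keeps it.

t1 = [(1,), (2,), (3,), (7,)]
t2 = [(1,), (2,), (3,)]
t3 = [(3,), (9,)]

def existence_join(left, right, t3_rows):
    # Append a flag: does any right row satisfy (a = c1 OR a IN t3)?
    out = []
    for (a,) in left:
        flag = any(a == c1 or any(a == col1 for (col1,) in t3_rows)
                   for (c1,) in right)
        out.append((a, flag))
    return out

joined = existence_join(t1, t2, t3)

# Correct plan: filter on the flag, then project it away -> 1 field.
correct = [(a,) for (a, flag) in joined if flag]

# Buggy plan: filter but keep the flag column -> 2 fields, not 1,
# which is exactly the kind of mismatch the assertion above reports.
buggy = [(a, flag) for (a, flag) in joined if flag]

print(correct)  # [(1,), (2,), (3,)]
print(buggy)    # [(1, True), (2, True), (3, True)]
```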
[jira] [Commented] (SPARK-45580) Subquery changes the output schema of the outer query
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793888#comment-17793888 ] Dongjoon Hyun commented on SPARK-45580: --- Thank you, [~bersprockets]. > Subquery changes the output schema of the outer query > - > > Key: SPARK-45580 > URL: https://issues.apache.org/jira/browse/SPARK-45580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: pull-request-available > > A query can have an incorrect output schema because of a subquery. > Assume this data: > {noformat} > create or replace temp view t1(a) as values (1), (2), (3), (7); > create or replace temp view t2(c1) as values (1), (2), (3); > create or replace temp view t3(col1) as values (3), (9); > cache table t1; > cache table t2; > cache table t3; > {noformat} > When run in {{spark-sql}}, the following query has a superfluous boolean > column: > {noformat} > select * > from t1 > where exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ); > 1 false > 2 false > 3 true > {noformat} > The result should be: > {noformat} > 1 > 2 > 3 > {noformat} > When executed via the {{Dataset}} API, you don't see the incorrect result, > because the Dataset API truncates the right-side of the rows based on the > analyzed plan's schema (it's the optimized plan's schema that goes wrong). 
> However, even with the {{Dataset}} API, this query goes wrong: > {noformat} > select ( > select * > from t1 > where exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ) > limit 1 > ) > from range(1); > java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; > something went wrong in analysis > at scala.Predef$.assert(Predef.scala:279) > at > org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > at scala.collection.AbstractIterable.foreach(Iterable.scala:933) > ... > {noformat} > Other queries that have the wrong schema: > {noformat} > select * > from t1 > where a in ( > select c1 > from t2 > where a in (select col1 from t3) > ); > {noformat} > and > {noformat} > select * > from t1 > where not exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ); > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46283) Avoid testing the `streaming-kinesis-asl` module in the daily tests of branch-3.x.
[ https://issues.apache.org/jira/browse/SPARK-46283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46283. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44204 [https://github.com/apache/spark/pull/44204] > Avoid testing the `streaming-kinesis-asl` module in the daily tests of > branch-3.x. > -- > > Key: SPARK-46283 > URL: https://issues.apache.org/jira/browse/SPARK-46283 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > After the merge of https://github.com/apache/spark/pull/43736, the master > branch began testing the `streaming-kinesis-asl` module. > At the same time, because the daily test will reuse `build_and_test.yml`, the > daily test of branch-3.x also began testing `streaming-kinesis-asl`. > However, in branch-3.x, the env `ENABLE_KINESIS_TESTS` is hard-coded as 1 in > `dev/sparktestsupport/modules.py`: > https://github.com/apache/spark/blob/1321b4e64deaa1e58bf297c25b72319083056568/dev/sparktestsupport/modules.py#L332-L346 > which leads to the failure of the daily test of branch-3.x: > - branch-3.3: https://github.com/apache/spark/actions/runs/7111246311 > - branch-3.4: https://github.com/apache/spark/actions/runs/7098435892 > - branch-3.5: https://github.com/apache/spark/actions/runs/7099811235 > ``` > [info] > org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite *** > ABORTED *** (1 second, 14 milliseconds) > [info] java.lang.Exception: Kinesis tests enabled using environment > variable ENABLE_KINESIS_TESTS > [info] but could not find AWS credentials. Please follow instructions in AWS > documentation > [info] to set the credentials in your system such that the > DefaultAWSCredentialsProviderChain > [info] can find the credentials. 
> [info] at > org.apache.spark.streaming.kinesis.KinesisTestUtils$.getAWSCredentials(KinesisTestUtils.scala:258) > [info] at > org.apache.spark.streaming.kinesis.KinesisTestUtils.kinesisClient$lzycompute(KinesisTestUtils.scala:58) > [info] at > org.apache.spark.streaming.kinesis.KinesisTestUtils.kinesisClient(KinesisTestUtils.scala:57) > [info] at > org.apache.spark.streaming.kinesis.KinesisTestUtils.describeStream(KinesisTestUtils.scala:168) > [info] at > org.apache.spark.streaming.kinesis.KinesisTestUtils.findNonExistentStreamName(KinesisTestUtils.scala:181) > [info] at > org.apache.spark.streaming.kinesis.KinesisTestUtils.createStream(KinesisTestUtils.scala:84) > [info] at > org.apache.spark.streaming.kinesis.KinesisStreamTests.$anonfun$beforeAll$1(KinesisStreamSuite.scala:61) > [info] at > org.apache.spark.streaming.kinesis.KinesisFunSuite.runIfTestsEnabled(KinesisFunSuite.scala:41) > [info] at > org.apache.spark.streaming.kinesis.KinesisFunSuite.runIfTestsEnabled$(KinesisFunSuite.scala:39) > [info] at > org.apache.spark.streaming.kinesis.KinesisStreamTests.runIfTestsEnabled(KinesisStreamSuite.scala:42) > [info] at > org.apache.spark.streaming.kinesis.KinesisStreamTests.beforeAll(KinesisStreamSuite.scala:59) > [info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > [info] at > org.apache.spark.streaming.kinesis.KinesisStreamTests.org$scalatest$BeforeAndAfter$$super$run(KinesisStreamSuite.scala:42) > [info] at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:273) > [info] at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:271) > [info] at > org.apache.spark.streaming.kinesis.KinesisStreamTests.run(KinesisStreamSuite.scala:42) > [info] at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321) > [info] at > 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517) > [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414) > [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) > [info] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [info] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [info] at java.lang.Thread.run(Thread.java:750) > [info] Test run > org.apache.spark.streaming.kinesis.JavaKinesisInputDStreamBuilderSuite started > [info] Test > org.apache.spark.streaming.kinesis.JavaKinesisInputDStreamBuilderSuite.testJavaKinesisDStreamBuilderOldApi > started > [info] Test >
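The fix direction implied above, gating the Kinesis suites on an explicit opt-in plus discoverable credentials rather than a hard-coded flag, can be sketched in plain Python. This is a hypothetical helper for illustration, not Spark's actual `modules.py` or `runIfTestsEnabled` logic; the credential check shown is a simplified stand-in for the `DefaultAWSCredentialsProviderChain` lookup the error message mentions.

```python
# Hypothetical sketch: run Kinesis tests only when explicitly enabled
# AND credentials are present, so an inherited hard-coded flag cannot
# abort the suite on branches without AWS credentials.

import os

def should_run_kinesis_tests(env=os.environ):
    enabled = env.get("ENABLE_KINESIS_TESTS") == "1"
    # Simplified stand-in for a real credential-chain lookup.
    has_creds = ("AWS_ACCESS_KEY_ID" in env
                 and "AWS_SECRET_ACCESS_KEY" in env)
    return enabled and has_creds

# Flag set but no credentials: skip instead of aborting.
print(should_run_kinesis_tests({"ENABLE_KINESIS_TESTS": "1"}))  # False

# Flag set and credentials present: run.
print(should_run_kinesis_tests({
    "ENABLE_KINESIS_TESTS": "1",
    "AWS_ACCESS_KEY_ID": "x",
    "AWS_SECRET_ACCESS_KEY": "y",
}))  # True
```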
[jira] [Resolved] (SPARK-46286) Document spark.io.compression.zstd.bufferPool.enabled
[ https://issues.apache.org/jira/browse/SPARK-46286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46286. --- Fix Version/s: 3.3.4 3.4.3 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 44207 [https://github.com/apache/spark/pull/44207] > Document spark.io.compression.zstd.bufferPool.enabled > - > > Key: SPARK-46286 > URL: https://issues.apache.org/jira/browse/SPARK-46286 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 3.3.4, 3.4.3, 3.5.1, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46286) Document spark.io.compression.zstd.bufferPool.enabled
[ https://issues.apache.org/jira/browse/SPARK-46286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46286: - Assignee: Kent Yao > Document spark.io.compression.zstd.bufferPool.enabled > - > > Key: SPARK-46286 > URL: https://issues.apache.org/jira/browse/SPARK-46286 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46287) DataFrame.isEmpty should work with all datatypes
[ https://issues.apache.org/jira/browse/SPARK-46287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46287. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44209 [https://github.com/apache/spark/pull/44209] > DataFrame.isEmpty should work with all datatypes > > > Key: SPARK-46287 > URL: https://issues.apache.org/jira/browse/SPARK-46287 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46288) Remove unused code in `pyspark.pandas.tests.frame.*`
[ https://issues.apache.org/jira/browse/SPARK-46288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46288: - Assignee: Ruifeng Zheng > Remove unused code in `pyspark.pandas.tests.frame.*` > > > Key: SPARK-46288 > URL: https://issues.apache.org/jira/browse/SPARK-46288 > Project: Spark > Issue Type: Test > Components: PS, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46288) Remove unused code in `pyspark.pandas.tests.frame.*`
[ https://issues.apache.org/jira/browse/SPARK-46288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46288. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44212 [https://github.com/apache/spark/pull/44212] > Remove unused code in `pyspark.pandas.tests.frame.*` > > > Key: SPARK-46288 > URL: https://issues.apache.org/jira/browse/SPARK-46288 > Project: Spark > Issue Type: Test > Components: PS, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46273) Support INSERT INTO/OVERWRITE using DSv2 sources
[ https://issues.apache.org/jira/browse/SPARK-46273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46273: --- Labels: pull-request-available (was: ) > Support INSERT INTO/OVERWRITE using DSv2 sources > > > Key: SPARK-46273 > URL: https://issues.apache.org/jira/browse/SPARK-46273 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46289) Exception when ordering by UDT in interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-46289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-46289: -- Affects Version/s: 3.3.3 > Exception when ordering by UDT in interpreted mode > -- > > Key: SPARK-46289 > URL: https://issues.apache.org/jira/browse/SPARK-46289 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.2, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > In interpreted mode, ordering by a UDT will result in an exception. For > example: > {noformat} > import org.apache.spark.ml.linalg.{DenseVector, Vector} > val df = Seq.tabulate(30) { x => > (x, x + 1, x + 2, new DenseVector(Array((x/100.0).toDouble, ((x + > 1)/100.0).toDouble, ((x + 3)/100.0).toDouble))) > }.toDF("id", "c1", "c2", "c3") > df.createOrReplaceTempView("df") > // this works > sql("select * from df order by c3").collect > sql("set spark.sql.codegen.wholeStage=false") > sql("set spark.sql.codegen.factoryMode=NO_CODEGEN") > // this gets an error > sql("select * from df order by c3").collect > {noformat} > The second {{collect}} action results in the following exception: > {noformat} > org.apache.spark.SparkIllegalArgumentException: Type > UninitializedPhysicalType does not support ordered operations. 
> at > org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:348) > at > org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:332) > at > org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:329) > at > org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:60) > at > org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:39) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254) > {noformat} > Note: You don't get an error if you use {{show}} rather than {{collect}}. > This is because {{show}} will implicitly add a {{limit}}, in which case the > ordering is performed by {{TakeOrderedAndProject}} rather than > {{UnsafeExternalRowSorter}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
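The failure mode above can be sketched with a minimal plain-Python analogue (hypothetical names, not Spark's actual classes) of an interpreted row comparator: the comparator looks up a per-type ordering at compare time, roughly the way `InterpretedOrdering` consults `PhysicalDataType.ordering`, so a type with no registered ordering fails only when a comparison is actually attempted rather than at analysis time.

```python
# Hypothetical sketch (plain Python, not Spark internals): per-type
# orderings are resolved lazily, so an unregistered type errors only
# when two rows are actually compared.

ORDERINGS = {
    "int": lambda a, b: (a > b) - (a < b),
    "string": lambda a, b: (a > b) - (a < b),
    # A UDT's underlying type must have an ordering registered here,
    # otherwise comparison fails at runtime.
}

def compare_rows(row_a, row_b, schema):
    # schema: list of (type_name, column_index) pairs, in sort order.
    for (col_type, idx) in schema:
        try:
            ordering = ORDERINGS[col_type]
        except KeyError:
            raise TypeError(
                f"Type {col_type} does not support ordered operations.")
        c = ordering(row_a[idx], row_b[idx])
        if c != 0:
            return c
    return 0

# Registered types compare fine; ties fall through to later columns.
print(compare_rows((1, "a"), (1, "b"), [("int", 0), ("string", 1)]))  # -1
```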
[jira] [Created] (SPARK-46289) Exception when ordering by UDT in interpreted mode
Bruce Robbins created SPARK-46289: - Summary: Exception when ordering by UDT in interpreted mode Key: SPARK-46289 URL: https://issues.apache.org/jira/browse/SPARK-46289 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0, 3.4.2 Reporter: Bruce Robbins In interpreted mode, ordering by a UDT will result in an exception. For example: {noformat} import org.apache.spark.ml.linalg.{DenseVector, Vector} val df = Seq.tabulate(30) { x => (x, x + 1, x + 2, new DenseVector(Array((x/100.0).toDouble, ((x + 1)/100.0).toDouble, ((x + 3)/100.0).toDouble))) }.toDF("id", "c1", "c2", "c3") df.createOrReplaceTempView("df") // this works sql("select * from df order by c3").collect sql("set spark.sql.codegen.wholeStage=false") sql("set spark.sql.codegen.factoryMode=NO_CODEGEN") // this gets an error sql("select * from df order by c3").collect {noformat} The second {{collect}} action results in the following exception: {noformat} org.apache.spark.SparkIllegalArgumentException: Type UninitializedPhysicalType does not support ordered operations. at org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:348) at org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:332) at org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:329) at org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:60) at org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:39) at org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254) {noformat} Note: You don't get an error if you use {{show}} rather than {{collect}}. This is because {{show}} will implicitly add a {{limit}}, in which case the ordering is performed by {{TakeOrderedAndProject}} rather than {{UnsafeExternalRowSorter}}. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46173) Skipping trimAll call in stringToDate functions to avoid needless string copy
[ https://issues.apache.org/jira/browse/SPARK-46173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-46173: --- Assignee: Aleksandar Tomic > Skipping trimAll call in stringToDate functions to avoid needless string copy > - > > Key: SPARK-46173 > URL: https://issues.apache.org/jira/browse/SPARK-46173 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Aleksandar Tomic >Assignee: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > > In the StringToDate function we currently first call trimAll to remove any > whitespace and ISO control characters. Trimming copies the input > string, which is not really needed, given that we can do all the parsing in > place by just skipping the whitespace/ISO control characters. > Given that customers complain about the speed of the stringToDate > function, especially when the input string is long or potentially malformed, the proposal > is to skip the trimAll call and do the parsing in place. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46173) Skipping trimAll call in stringToDate functions to avoid needless string copy
[ https://issues.apache.org/jira/browse/SPARK-46173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46173. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44110 [https://github.com/apache/spark/pull/44110] > Skipping trimAll call in stringToDate functions to avoid needless string copy > - > > Key: SPARK-46173 > URL: https://issues.apache.org/jira/browse/SPARK-46173 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Aleksandar Tomic >Assignee: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > In the StringToDate function we currently first call trimAll to remove any > whitespace and ISO control characters. Trimming copies the input > string, which is not really needed, given that we can do all the parsing in > place by just skipping the whitespace/ISO control characters. > Given that customers complain about the speed of the stringToDate > function, especially when the input string is long or potentially malformed, the proposal > is to skip the trimAll call and do the parsing in place. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
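The trim-in-place idea behind SPARK-46173 is easy to sketch. Below is a hedged, plain-Python illustration, not the actual Spark patch, using a toy yyyy-MM-dd parser: leading and trailing whitespace and control characters are skipped with index arithmetic, so no trimmed copy of the (possibly long) input string is materialized before parsing begins.

```python
# Hypothetical sketch (plain Python, not the Spark patch): skip
# trimmable characters with two indices instead of building a trimmed
# copy of the input first, then parse between those indices.

def parse_date_in_place(s: str):
    def is_trimmable(ch):
        # Whitespace or ISO control character.
        return ch.isspace() or ord(ch) < 0x20

    i, j = 0, len(s)
    while i < j and is_trimmable(s[i]):
        i += 1
    while j > i and is_trimmable(s[j - 1]):
        j -= 1

    # Toy yyyy-MM-dd parse between i and j; the small int() slices are
    # for brevity -- the point is that the full string is never copied.
    parts = []
    start = i
    for k in range(i, j):
        if s[k] == '-':
            parts.append(int(s[start:k]))
            start = k + 1
    parts.append(int(s[start:j]))
    if len(parts) != 3:
        raise ValueError(f"not a date: {s!r}")
    return tuple(parts)

print(parse_date_in_place("  \t2023-12-07\n "))  # (2023, 12, 7)
```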
[jira] [Assigned] (SPARK-45888) Apply error class framework to state data source & state metadata data source
[ https://issues.apache.org/jira/browse/SPARK-45888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-45888: Assignee: Jungtaek Lim > Apply error class framework to state data source & state metadata data source > - > > Key: SPARK-45888 > URL: https://issues.apache.org/jira/browse/SPARK-45888 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Blocker > Labels: pull-request-available > > Intended to be a blocker issue for the release of state data source reader. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45888) Apply error class framework to state data source & state metadata data source
[ https://issues.apache.org/jira/browse/SPARK-45888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-45888. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44025 [https://github.com/apache/spark/pull/44025] > Apply error class framework to state data source & state metadata data source > - > > Key: SPARK-45888 > URL: https://issues.apache.org/jira/browse/SPARK-45888 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Blocker > Labels: pull-request-available > Fix For: 4.0.0 > > > Intended to be a blocker issue for the release of state data source reader. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46288) Remove unused code in `pyspark.pandas.tests.frame.*`
[ https://issues.apache.org/jira/browse/SPARK-46288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46288: --- Labels: pull-request-available (was: ) > Remove unused code in `pyspark.pandas.tests.frame.*` > > > Key: SPARK-46288 > URL: https://issues.apache.org/jira/browse/SPARK-46288 > Project: Spark > Issue Type: Test > Components: PS, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46288) Remove unused code in `pyspark.pandas.tests.frame.*`
Ruifeng Zheng created SPARK-46288: - Summary: Remove unused code in `pyspark.pandas.tests.frame.*` Key: SPARK-46288 URL: https://issues.apache.org/jira/browse/SPARK-46288 Project: Spark Issue Type: Test Components: PS, Tests Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45720) Upgrade AWS SDK to v2 for Spark Kinesis connector
[ https://issues.apache.org/jira/browse/SPARK-45720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45720: --- Labels: pull-request-available (was: ) > Upgrade AWS SDK to v2 for Spark Kinesis connector > - > > Key: SPARK-45720 > URL: https://issues.apache.org/jira/browse/SPARK-45720 > Project: Spark > Issue Type: Sub-task > Components: Connect Contrib >Affects Versions: 3.5.0 >Reporter: Lantao Jin >Priority: Major > Labels: pull-request-available > > Sub-task of [SPARK-44124|https://issues.apache.org/jira/browse/SPARK-44124]. > In this issue, we focus on the AWS SDK v2 upgrade in Kinesis connector -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46287) DataFrame.isEmpty should work with all datatypes
[ https://issues.apache.org/jira/browse/SPARK-46287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46287: --- Labels: pull-request-available (was: ) > DataFrame.isEmpty should work with all datatypes > > > Key: SPARK-46287 > URL: https://issues.apache.org/jira/browse/SPARK-46287 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org