[jira] [Updated] (SPARK-45621) Add feature to evaluate subquery before push down filter Optimizer rule
[ https://issues.apache.org/jira/browse/SPARK-45621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maytas Monsereenusorn updated SPARK-45621: -- Summary: Add feature to evaluate subquery before push down filter Optimizer rule (was: Add feature to evaluate subquery before Optimizer rule to push down filter) > Add feature to evaluate subquery before push down filter Optimizer rule > --- > > Key: SPARK-45621 > URL: https://issues.apache.org/jira/browse/SPARK-45621 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.2 >Reporter: Maytas Monsereenusorn >Priority: Major > > Some queries can benefit from having their scalar subqueries in the filter > evaluated during planning so that the scalar result (from the subquery) can be > pushed down. > This adds a new feature (disabled by default to preserve current > behavior) that evaluates scalar subqueries in the Optimizer before the > filter-pushdown rule. > For example, a query like > {code:java} > select * from t2 where b > (select max(a) from t1) {code} > where t1 is a small table but t2 is a very large table can benefit if we > first evaluate the subquery and then push the result down into the pushed filter > (instead of leaving the subquery in the post-scan filter) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45621) Add feature to evaluate subquery before Optimizer rule to push down filter
Maytas Monsereenusorn created SPARK-45621: - Summary: Add feature to evaluate subquery before Optimizer rule to push down filter Key: SPARK-45621 URL: https://issues.apache.org/jira/browse/SPARK-45621 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.2 Reporter: Maytas Monsereenusorn Some queries can benefit from having their scalar subqueries in the filter evaluated during planning so that the scalar result (from the subquery) can be pushed down. This adds a new feature (disabled by default to preserve current behavior) that evaluates scalar subqueries in the Optimizer before the filter-pushdown rule. For example, a query like {code:java} select * from t2 where b > (select max(a) from t1) {code} where t1 is a small table but t2 is a very large table can benefit if we first evaluate the subquery and then push the result down into the pushed filter (instead of leaving the subquery in the post-scan filter)
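The intended rewrite can be sketched outside of Spark (plain Python with sqlite3, purely for illustration; the table contents are made up): evaluate the small scalar subquery once during planning, then substitute the literal result into the outer filter so the data source could apply it at scan time instead of post-scan.

```python
import sqlite3

# Hypothetical illustration (not Spark code): pre-evaluate the scalar
# subquery once, then push the resulting literal into the main filter,
# as the proposed optimizer feature would do for data-source pushdown.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1(a INTEGER);
    INSERT INTO t1 VALUES (1), (5), (3);
    CREATE TABLE t2(b INTEGER);
    INSERT INTO t2 VALUES (2), (6), (9), (4);
""")

# Step 1: eagerly evaluate the small subquery during "planning".
(max_a,) = conn.execute("SELECT max(a) FROM t1").fetchone()

# Step 2: rewrite the outer filter with the literal result, so the
# scan can apply it directly (b > 5 here) rather than keeping the
# subquery in a post-scan filter.
rows = conn.execute("SELECT b FROM t2 WHERE b > ?", (max_a,)).fetchall()
result = sorted(b for (b,) in rows)
print(result)  # [6, 9]
```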
[jira] [Updated] (SPARK-43778) RewriteCorrelatedScalarSubquery should handle duplicate attributes
[ https://issues.apache.org/jira/browse/SPARK-43778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43778: --- Labels: pull-request-available (was: ) > RewriteCorrelatedScalarSubquery should handle duplicate attributes > -- > > Key: SPARK-43778 > URL: https://issues.apache.org/jira/browse/SPARK-43778 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Andrey Gubichev >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > This is a correctness problem caused by the fact that the decorrelation rule > does not dedup join attributes properly. This leads to the join on (c1 = c1), > which is simplified to True and the join becomes a cross product. > > Example query: > > {code:java} > create view t(c1, c2) as values (0, 1), (0, 2), (1, 2) > select c1, c2, (select count(*) cnt from t t2 where t1.c1 = t2.c1 having cnt > = 0) from t t1 > -- Correct answer: [(0, 1, null), (0, 2, null), (1, 2, null)] > +---+---+--+ > |c1 |c2 |scalarsubquery(c1)| > +---+---+--+ > |0 |1 |null | > |0 |1 |null | > |0 |2 |null | > |0 |2 |null | > |1 |2 |null | > |1 |2 |null | > +---+---+--+ {code}
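To make the expected semantics concrete, here is a small hand-evaluation of the correlated scalar subquery (a plain-Python sketch, not Spark code): every c1 group in the view is non-empty, so HAVING cnt = 0 filters the group out and the subquery returns NULL for every outer row — three rows total, not the six produced by the accidental cross product.

```python
# Hypothetical illustration (not Spark code): evaluate
#   (select count(*) cnt from t t2 where t1.c1 = t2.c1 having cnt = 0)
# by hand for each outer row. A non-empty group always has cnt >= 1,
# so HAVING cnt = 0 discards the group and the scalar subquery yields
# NULL (None) everywhere.
t = [(0, 1), (0, 2), (1, 2)]  # view t(c1, c2), self-referenced as t1/t2

def scalar_subquery(outer_c1):
    cnt = sum(1 for (c1, _) in t if c1 == outer_c1)  # count(*) per group
    return cnt if cnt == 0 else None  # HAVING cnt = 0 keeps only cnt == 0

expected = [(c1, c2, scalar_subquery(c1)) for (c1, c2) in t]
print(expected)  # [(0, 1, None), (0, 2, None), (1, 2, None)]
```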
[jira] [Updated] (SPARK-45620) Fix user-facing APIs related to Python UDTF to use camelCase.
[ https://issues.apache.org/jira/browse/SPARK-45620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45620: --- Labels: pull-request-available (was: ) > Fix user-facing APIs related to Python UDTF to use camelCase. > - > > Key: SPARK-45620 > URL: https://issues.apache.org/jira/browse/SPARK-45620 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-45620) Fix user-facing APIs related to Python UDTF to use camelCase.
Takuya Ueshin created SPARK-45620: - Summary: Fix user-facing APIs related to Python UDTF to use camelCase. Key: SPARK-45620 URL: https://issues.apache.org/jira/browse/SPARK-45620 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: Takuya Ueshin
[jira] [Resolved] (SPARK-45523) Return useful error message if UDTF returns None for non-nullable column
[ https://issues.apache.org/jira/browse/SPARK-45523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-45523. --- Fix Version/s: 4.0.0 Assignee: Daniel Resolution: Fixed Issue resolved by pull request 43356 https://github.com/apache/spark/pull/43356 > Return useful error message if UDTF returns None for non-nullable column > > > Key: SPARK-45523 > URL: https://issues.apache.org/jira/browse/SPARK-45523 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Updated] (SPARK-45619) Apply the observed metrics to Observation object.
[ https://issues.apache.org/jira/browse/SPARK-45619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45619: --- Labels: pull-request-available (was: ) > Apply the observed metrics to Observation object. > - > > Key: SPARK-45619 > URL: https://issues.apache.org/jira/browse/SPARK-45619 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-45619) Apply the observed metrics to Observation object.
Takuya Ueshin created SPARK-45619: - Summary: Apply the observed metrics to Observation object. Key: SPARK-45619 URL: https://issues.apache.org/jira/browse/SPARK-45619 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Takuya Ueshin
[jira] [Updated] (SPARK-45617) Upgrade Apache Commons Crypto 1.2.0
[ https://issues.apache.org/jira/browse/SPARK-45617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45617: --- Labels: pull-request-available (was: ) > Upgrade Apache Commons Crypto 1.2.0 > --- > > Key: SPARK-45617 > URL: https://issues.apache.org/jira/browse/SPARK-45617 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: L. C. Hsieh >Priority: Minor > Labels: pull-request-available > > The currently used 1.1.0 was released more than 3 years ago (2020-08-28). We > should upgrade the library to the latest 1.2.0.
[jira] [Updated] (SPARK-45618) Remove BaseErrorHandler
[ https://issues.apache.org/jira/browse/SPARK-45618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45618: --- Labels: pull-request-available (was: ) > Remove BaseErrorHandler > --- > > Key: SPARK-45618 > URL: https://issues.apache.org/jira/browse/SPARK-45618 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: L. C. Hsieh >Priority: Minor > Labels: pull-request-available > > We can remove the workaround trait BaseErrorHandler, which was added long ago > (SPARK-25535) for CRYPTO-141, an issue fixed 5 years ago.
[jira] [Created] (SPARK-45618) Remove BaseErrorHandler
L. C. Hsieh created SPARK-45618: --- Summary: Remove BaseErrorHandler Key: SPARK-45618 URL: https://issues.apache.org/jira/browse/SPARK-45618 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: L. C. Hsieh We can remove the workaround trait BaseErrorHandler, which was added long ago (SPARK-25535) for CRYPTO-141, an issue fixed 5 years ago.
[jira] [Created] (SPARK-45617) Upgrade Apache Commons Crypto 1.2.0
L. C. Hsieh created SPARK-45617: --- Summary: Upgrade Apache Commons Crypto 1.2.0 Key: SPARK-45617 URL: https://issues.apache.org/jira/browse/SPARK-45617 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: L. C. Hsieh The currently used 1.1.0 was released more than 3 years ago (2020-08-28). We should upgrade the library to the latest 1.2.0.
[jira] [Commented] (SPARK-45023) SPIP: Python Stored Procedures
[ https://issues.apache.org/jira/browse/SPARK-45023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1954#comment-1954 ] Allison Wang commented on SPARK-45023: -- [~abhinavofficial] this proposal is on hold, given the feedback received from the SPIP. > SPIP: Python Stored Procedures > -- > > Key: SPARK-45023 > URL: https://issues.apache.org/jira/browse/SPARK-45023 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Stored procedures are an extension of the ANSI SQL standard. They play a > crucial role in improving the capabilities of SQL by encapsulating complex > logic into reusable routines. > This proposal aims to extend Spark SQL by introducing support for stored > procedures, starting with Python as the procedural language. This addition > will allow users to execute procedural programs, leveraging programming > constructs of Python to perform tasks with complex logic. Additionally, users > can persist these procedural routines in catalogs such as HMS for future > reuse. By providing this functionality, we intend to seamlessly empower Spark > users to integrate with Python routines within their SQL workflows. > {*}SPIP{*}: > [https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing] >
[jira] [Resolved] (SPARK-45023) SPIP: Python Stored Procedures
[ https://issues.apache.org/jira/browse/SPARK-45023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang resolved SPARK-45023. -- Resolution: Won't Do > SPIP: Python Stored Procedures > -- > > Key: SPARK-45023 > URL: https://issues.apache.org/jira/browse/SPARK-45023 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Stored procedures are an extension of the ANSI SQL standard. They play a > crucial role in improving the capabilities of SQL by encapsulating complex > logic into reusable routines. > This proposal aims to extend Spark SQL by introducing support for stored > procedures, starting with Python as the procedural language. This addition > will allow users to execute procedural programs, leveraging programming > constructs of Python to perform tasks with complex logic. Additionally, users > can persist these procedural routines in catalogs such as HMS for future > reuse. By providing this functionality, we intend to seamlessly empower Spark > users to integrate with Python routines within their SQL workflows. > {*}SPIP{*}: > [https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing] >
[jira] [Updated] (SPARK-45616) Usages of ParVector are unsafe because it does not propagate ThreadLocals or SparkSession
[ https://issues.apache.org/jira/browse/SPARK-45616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45616: --- Labels: pull-request-available (was: ) > Usages of ParVector are unsafe because it does not propagate ThreadLocals or > SparkSession > - > > Key: SPARK-45616 > URL: https://issues.apache.org/jira/browse/SPARK-45616 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL, Tests >Affects Versions: 3.5.0 >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Minor > Labels: pull-request-available > > CastSuiteBase and ExpressionInfoSuite use ParVector.foreach() to run Spark > SQL queries in parallel. They incorrectly assume that each parallel operation > will inherit the main thread’s active SparkSession. This is only true when > these parallel operations run in freshly-created threads. However, when other > code has already run some parallel operations before Spark was started, then > there may be existing threads that do not have an active SparkSession. In > that case, these tests fail with NullPointerExceptions when creating > SparkPlans or running SQL queries. > The fix is to use the existing method ThreadUtils.parmap(). This method > creates fresh threads that inherit the current active SparkSession, and it > propagates the Spark ThreadLocals. > We should also add a scalastyle warning against use of ParVector.
[jira] [Created] (SPARK-45616) Usages of ParVector are unsafe because it does not propagate ThreadLocals or SparkSession
Ankur Dave created SPARK-45616: -- Summary: Usages of ParVector are unsafe because it does not propagate ThreadLocals or SparkSession Key: SPARK-45616 URL: https://issues.apache.org/jira/browse/SPARK-45616 Project: Spark Issue Type: Bug Components: Spark Core, SQL, Tests Affects Versions: 3.5.0 Reporter: Ankur Dave Assignee: Ankur Dave CastSuiteBase and ExpressionInfoSuite use ParVector.foreach() to run Spark SQL queries in parallel. They incorrectly assume that each parallel operation will inherit the main thread’s active SparkSession. This is only true when these parallel operations run in freshly-created threads. However, when other code has already run some parallel operations before Spark was started, then there may be existing threads that do not have an active SparkSession. In that case, these tests fail with NullPointerExceptions when creating SparkPlans or running SQL queries. The fix is to use the existing method ThreadUtils.parmap(). This method creates fresh threads that inherit the current active SparkSession, and it propagates the Spark ThreadLocals. We should also add a scalastyle warning against use of ParVector.
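The failure mode can be mimicked with a plain thread-local (a Python sketch, not Spark code; Python's threading.local never inherits values across threads, so it only loosely models Scala's inheritable thread-locals): a worker thread that is not explicitly seeded with the main thread's session sees nothing, while a parmap-style fresh thread that is handed the captured session works.

```python
import threading

# Hypothetical illustration (not Spark code): a "session" stored in a
# thread-local is invisible to other threads unless each worker is
# explicitly seeded with it -- loosely analogous to pool threads that
# existed before the SparkSession was set. The parmap-style fix is to
# capture the current session and hand it to freshly created workers.
context = threading.local()
context.session = "active-spark-session"  # set in the main thread

def read_session():
    return getattr(context, "session", None)

# Unseeded worker (like a reused pool thread): no session visible.
unseeded = []
t1 = threading.Thread(target=lambda: unseeded.append(read_session()))
t1.start(); t1.join()

# Seeded fresh worker (parmap-style): capture, then propagate explicitly.
captured = context.session
seeded = []
def seeded_task():
    context.session = captured  # propagate the captured session
    seeded.append(read_session())
t2 = threading.Thread(target=seeded_task)
t2.start(); t2.join()

print(unseeded, seeded)  # [None] ['active-spark-session']
```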
[jira] [Resolved] (SPARK-30848) Remove manual backport of Murmur3 MurmurHash3.productHash fix from Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-30848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-30848. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43161 [https://github.com/apache/spark/pull/43161] > Remove manual backport of Murmur3 MurmurHash3.productHash fix from Scala 2.13 > - > > Key: SPARK-30848 > URL: https://issues.apache.org/jira/browse/SPARK-30848 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > SPARK-30847 introduced a manual backport to work around a Scala issue in hash > implementation. Once we drop Scala 2.12, we can remove the fix.
[jira] [Assigned] (SPARK-30848) Remove manual backport of Murmur3 MurmurHash3.productHash fix from Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-30848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-30848: Assignee: BingKun Pan > Remove manual backport of Murmur3 MurmurHash3.productHash fix from Scala 2.13 > - > > Key: SPARK-30848 > URL: https://issues.apache.org/jira/browse/SPARK-30848 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > SPARK-30847 introduced a manual backport to work around a Scala issue in hash > implementation. Once we drop Scala 2.12, we can remove the fix.
[jira] [Updated] (SPARK-30848) Remove manual backport of Murmur3 MurmurHash3.productHash fix from Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-30848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-30848: --- Labels: pull-request-available (was: ) > Remove manual backport of Murmur3 MurmurHash3.productHash fix from Scala 2.13 > - > > Key: SPARK-30848 > URL: https://issues.apache.org/jira/browse/SPARK-30848 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > > SPARK-30847 introduced a manual backport to work around a Scala issue in hash > implementation. Once we drop Scala 2.12, we can remove the fix.
[jira] [Resolved] (SPARK-45583) Spark SQL returning incorrect values for full outer join on keys with the same name.
[ https://issues.apache.org/jira/browse/SPARK-45583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins resolved SPARK-45583. --- Resolution: Fixed > Spark SQL returning incorrect values for full outer join on keys with the > same name. > > > Key: SPARK-45583 > URL: https://issues.apache.org/jira/browse/SPARK-45583 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Huw >Priority: Major > Fix For: 3.5.0 > > > The following query gives the wrong results: > {code:java} > WITH people as ( > SELECT * FROM (VALUES > (1, 'Peter'), > (2, 'Homer'), > (3, 'Ned'), > (3, 'Jenny') > ) AS Idiots(id, FirstName) > ), location as ( > SELECT * FROM (VALUES > (1, 'sample0'), > (1, 'sample1'), > (2, 'sample2') > ) as Locations(id, address) > ) > SELECT * > FROM people > FULL OUTER JOIN location > ON people.id = location.id {code} > We find the following table: > ||id: integer||FirstName: string||id: integer||address: string|| > |2|Homer|2|sample2| > |null|Ned|null|null| > |null|Jenny|null|null| > |1|Peter|1|sample0| > |1|Peter|1|sample1| > But clearly the first `id` column is wrong; the nulls should be 3. > If we rename the id column in (only) the person table to pid, we get the > correct results: > ||pid: integer||FirstName: string||id: integer||address: string|| > |2|Homer|2|sample2| > |3|Ned|null|null| > |3|Jenny|null|null| > |1|Peter|1|sample0| > |1|Peter|1|sample1|
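For reference, a hand-rolled full outer join over the same two value lists (a plain-Python sketch, not Spark code) reproduces the correct expected output: the unmatched Ned/Jenny rows keep their left-side id of 3 instead of becoming null.

```python
# Hypothetical illustration (not Spark code): a naive full outer join
# over the two value lists from the report, showing the correct result.
people = [(1, "Peter"), (2, "Homer"), (3, "Ned"), (3, "Jenny")]
location = [(1, "sample0"), (1, "sample1"), (2, "sample2")]

rows = []
matched_right = set()
for pid, name in people:
    hits = [(lid, addr) for lid, addr in location if lid == pid]
    if hits:
        for lid, addr in hits:
            rows.append((pid, name, lid, addr))
            matched_right.add((lid, addr))
    else:
        rows.append((pid, name, None, None))  # unmatched left row keeps its id
for lid, addr in location:
    if (lid, addr) not in matched_right:
        rows.append((None, None, lid, addr))  # unmatched right row

print(rows)
```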
[jira] [Resolved] (SPARK-45602) Replace `s.c.MapOps.filterKeys` with `s.c.MapOps.view.filterKeys`
[ https://issues.apache.org/jira/browse/SPARK-45602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-45602. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43445 [https://github.com/apache/spark/pull/43445] > Replace `s.c.MapOps.filterKeys` with `s.c.MapOps.view.filterKeys` > - > > Key: SPARK-45602 > URL: https://issues.apache.org/jira/browse/SPARK-45602 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core, SQL, YARN >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {code:java} > /** Filters this map by retaining only keys satisfying a predicate. > * @param p the predicate used to test keys > * @return an immutable map consisting only of those key value pairs of > this map where the key satisfies > * the predicate `p`. The resulting map wraps the original map > without copying any elements. > */ > @deprecated("Use .view.filterKeys(f). A future version will include a strict > version of this method (for now, .view.filterKeys(p).toMap).", "2.13.0") > def filterKeys(p: K => Boolean): MapView[K, V] = new MapView.FilterKeys(this, > p) {code}
[jira] [Resolved] (SPARK-45595) Expose SQLSTATE in error message
[ https://issues.apache.org/jira/browse/SPARK-45595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-45595. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43438 [https://github.com/apache/spark/pull/43438] > Expose SQLSTATE in error message > > > Key: SPARK-45595 > URL: https://issues.apache.org/jira/browse/SPARK-45595 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Assignee: Serge Rielau >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > When using spark.sql.error.messageFormat in MINIMAL or STANDARD mode, the > SQLSTATE is exposed. > We want to extend this to PRETTY mode, now that all errors have SQLSTATEs. > We propose to trail the SQLSTATE after the text message, so it does not take > away from the reading experience of the message, while still being easily > found by tooling or humans: > [<error class>] <message> SQLSTATE: <SQLSTATE> > > Example: > {{[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor > being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to > "false" to bypass this error. SQLSTATE: 22013}} > {{== SQL(line 1, position 8) ==}} > {{SELECT 1/0}} > {{ ^^^}} > Other options considered have been: > {{[DIVIDE_BY_ZERO](22013) Division by zero. Use `try_divide` to tolerate > divisor being 0 and return NULL instead. If necessary set > "spark.sql.ansi.enabled" to "false" to bypass this error.}} > {{== SQL(line 1, position 8) ==}} > {{SELECT 1/0}} > {{ ^^^}} > and > {{[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor > being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to > "false" to bypass this error.}} > {{== SQL(line 1, position 8) ==}} > {{SELECT 1/0}} > {{ ^^^}} > {{SQLSTATE: 22013}}
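The proposed PRETTY-mode layout can be sketched as follows (a hypothetical illustration, not the actual Spark implementation; the function name is invented):

```python
# Hypothetical sketch (not the actual Spark implementation): trail the
# SQLSTATE after the message text, as the proposal above describes.
def pretty_error(error_class: str, message: str, sqlstate: str) -> str:
    return f"[{error_class}] {message} SQLSTATE: {sqlstate}"

msg = pretty_error(
    "DIVIDE_BY_ZERO",
    "Division by zero. Use `try_divide` to tolerate divisor being 0 "
    "and return NULL instead.",
    "22013",
)
print(msg)
```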
[jira] [Assigned] (SPARK-45595) Expose SQLSTATE in error message
[ https://issues.apache.org/jira/browse/SPARK-45595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-45595: --- Assignee: Serge Rielau > Expose SQLSTATE in error message > > > Key: SPARK-45595 > URL: https://issues.apache.org/jira/browse/SPARK-45595 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Assignee: Serge Rielau >Priority: Major > Labels: pull-request-available > > When using spark.sql.error.messageFormat in MINIMAL or STANDARD mode, the > SQLSTATE is exposed. > We want to extend this to PRETTY mode, now that all errors have SQLSTATEs. > We propose to trail the SQLSTATE after the text message, so it does not take > away from the reading experience of the message, while still being easily > found by tooling or humans: > [<error class>] <message> SQLSTATE: <SQLSTATE> > > Example: > {{[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor > being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to > "false" to bypass this error. SQLSTATE: 22013}} > {{== SQL(line 1, position 8) ==}} > {{SELECT 1/0}} > {{ ^^^}} > Other options considered have been: > {{[DIVIDE_BY_ZERO](22013) Division by zero. Use `try_divide` to tolerate > divisor being 0 and return NULL instead. If necessary set > "spark.sql.ansi.enabled" to "false" to bypass this error.}} > {{== SQL(line 1, position 8) ==}} > {{SELECT 1/0}} > {{ ^^^}} > and > {{[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor > being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to > "false" to bypass this error.}} > {{== SQL(line 1, position 8) ==}} > {{SELECT 1/0}} > {{ ^^^}} > {{SQLSTATE: 22013}}
[jira] [Updated] (SPARK-45509) Investigate the behavior difference in self-join
[ https://issues.apache.org/jira/browse/SPARK-45509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45509: --- Labels: pull-request-available (was: ) > Investigate the behavior difference in self-join > > > Key: SPARK-45509 > URL: https://issues.apache.org/jira/browse/SPARK-45509 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > > SPARK-45220 discovers a behavior difference for a self-join scenario between > classic Spark and Spark Connect. > For instance, here is a query that works without Spark Connect: > {code:java} > df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)]) > df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", > height=85)]){code} > {code:java} > joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) > joined.show(){code} > But in Spark Connect, it throws this exception: > {code:java} > pyspark.errors.exceptions.connect.AnalysisException: > [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter > with name `name` cannot be resolved. Did you mean one of the following? > [`name`, `name`, `age`, `height`].; > 'Sort ['name DESC NULLS LAST], true > +- Join FullOuter, (name#64 = name#78) >:- LocalRelation [name#64, age#65L] >+- LocalRelation [name#78, height#79L] > {code} > > On the other hand, this query fails in classic Spark: > {code:java} > df.join(df, df.name == df.name, "outer").select(df.name).show() {code} > {code:java} > pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are > ambiguous... {code} > > but it works with Spark Connect. > We need to investigate the behavior difference and fix it.
[jira] [Updated] (SPARK-45610) Fix "Auto-application to `()` is deprecated."
[ https://issues.apache.org/jira/browse/SPARK-45610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-45610: - Summary: Fix "Auto-application to `()` is deprecated." (was: Handle "Auto-application to `()` is deprecated.") > Fix "Auto-application to `()` is deprecated." > - > > Key: SPARK-45610 > URL: https://issues.apache.org/jira/browse/SPARK-45610 > Project: Spark > Issue Type: Sub-task > Components: GraphX, MLlib, Spark Core, SQL, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > > For the following case, a compile warning will be issued in Scala 2.13: > {code:java} > Welcome to Scala 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.8). > Type in expressions for evaluation. Or try :help. > scala> class Foo { > | def isEmpty(): Boolean = true > | def isTrue(x: Boolean): Boolean = x > | } > class Foo > scala> val foo = new Foo > val foo: Foo = Foo@7061622 > scala> val ret = foo.isEmpty > ^ > warning: Auto-application to `()` is deprecated. Supply the empty > argument list `()` explicitly to invoke method isEmpty, > or remove the empty argument list from its definition (Java-defined > methods are exempt). > In Scala 3, an unapplied method like this will be eta-expanded into a > function. [quickfixable] > val ret: Boolean = true {code} > But for Scala 3, it is a compile error: > {code:java} > Welcome to Scala 3.3.1 (17.0.8, Java OpenJDK 64-Bit Server VM). > Type in expressions for evaluation. Or try :help. > scala> class Foo { > | def isEmpty(): Boolean = true > | def isTrue(x: Boolean): Boolean = x > | } > // defined class Foo > scala> val foo = new Foo > val foo: Foo = Foo@591f6f83 > scala> val ret = foo.isEmpty > -- [E100] Syntax Error: > 1 |val ret = foo.isEmpty > | ^^^ > | method isEmpty in class Foo must be called with () argument > | > | longer explanation available when compiling with `-explain` > 1 error found {code}
[jira] [Created] (SPARK-45615) Remove redundant "Auto-application to `()` is deprecated" compile suppression rules.
Yang Jie created SPARK-45615: Summary: Remove redundant "Auto-application to `()` is deprecated" compile suppression rules. Key: SPARK-45615 URL: https://issues.apache.org/jira/browse/SPARK-45615 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Yang Jie Due to the issue https://github.com/scalatest/scalatest/issues/2297, we need to wait until we upgrade to a newer scalatest version (perhaps 3.2.18) before we can remove these suppression rules. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
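For context, the suppression rules being discussed are scalac `-Wconf` filters carried in Spark's build. A hypothetical sbt-style sketch of what such a rule looks like (the exact regex and its location in Spark's build files are assumptions, not quoted from the repository):

```scala
// Hypothetical -Wconf rule: silence ("s") any warning whose message matches
// the auto-application deprecation text. Removing the rule re-enables the
// warning once scalatest no longer triggers it.
scalacOptions += "-Wconf:msg=Auto-application to \\`\\(\\)\\` is deprecated:s"
```

Once a scalatest release containing the fix for the linked issue is adopted, dropping such filters lets the deprecation surface again as a normal compile warning.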
[jira] [Updated] (SPARK-39910) DataFrameReader API cannot read files from hadoop archives (.har)
[ https://issues.apache.org/jira/browse/SPARK-39910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-39910: --- Labels: DataFrameReader pull-request-available (was: DataFrameReader) > DataFrameReader API cannot read files from hadoop archives (.har) > - > > Key: SPARK-39910 > URL: https://issues.apache.org/jira/browse/SPARK-39910 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2 >Reporter: Christophe Préaud >Priority: Minor > Labels: DataFrameReader, pull-request-available > > Reading a file from a Hadoop archive using the DataFrameReader API returns > an empty Dataset: > {code:java} > scala> val df = > spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719") > df: org.apache.spark.sql.Dataset[String] = [value: string] > scala> df.count > res7: Long = 0 {code} > > On the other hand, reading the same file from the same Hadoop archive, but > using the RDD API, yields the correct result: > {code:java} > scala> val df = > sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value") > df: org.apache.spark.sql.DataFrame = [value: string] > scala> df.count > res8: Long = 5589 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug
[ https://issues.apache.org/jira/browse/SPARK-45592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45592: -- Assignee: (was: Apache Spark) > AQE and InMemoryTableScanExec correctness bug > - > > Key: SPARK-45592 > URL: https://issues.apache.org/jira/browse/SPARK-45592 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Emil Ejbyfeldt >Priority: Major > Labels: pull-request-available > > The following query should return 100 > {code:java} > import org.apache.spark.storage.StorageLevel > val df = spark.range(0, 100, 1, 5).map(l => (l, l)) > val ee = df.select($"_1".as("src"), $"_2".as("dst")) > .persist(StorageLevel.MEMORY_AND_DISK) > ee.count() > val minNbrs1 = ee > .groupBy("src").agg(min(col("dst")).as("min_number")) > .persist(StorageLevel.MEMORY_AND_DISK) > val join = ee.join(minNbrs1, "src") > join.count(){code} > but on Spark 3.5.0 there is a correctness bug causing it to return `104800` > or some other smaller value. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug
[ https://issues.apache.org/jira/browse/SPARK-45592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45592: -- Assignee: Apache Spark > AQE and InMemoryTableScanExec correctness bug > - > > Key: SPARK-45592 > URL: https://issues.apache.org/jira/browse/SPARK-45592 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Emil Ejbyfeldt >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > The following query should return 100 > {code:java} > import org.apache.spark.storage.StorageLevel > val df = spark.range(0, 100, 1, 5).map(l => (l, l)) > val ee = df.select($"_1".as("src"), $"_2".as("dst")) > .persist(StorageLevel.MEMORY_AND_DISK) > ee.count() > val minNbrs1 = ee > .groupBy("src").agg(min(col("dst")).as("min_number")) > .persist(StorageLevel.MEMORY_AND_DISK) > val join = ee.join(minNbrs1, "src") > join.count(){code} > but on Spark 3.5.0 there is a correctness bug causing it to return `104800` > or some other smaller value. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45609) Include SqlState in SparkThrowable proto message
[ https://issues.apache.org/jira/browse/SPARK-45609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45609. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43457 [https://github.com/apache/spark/pull/43457] > Include SqlState in SparkThrowable proto message > > > Key: SPARK-45609 > URL: https://issues.apache.org/jira/browse/SPARK-45609 > Project: Spark > Issue Type: Test > Components: Connect >Affects Versions: 4.0.0 >Reporter: Yihong He >Assignee: Yihong He >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45609) Include SqlState in SparkThrowable proto message
[ https://issues.apache.org/jira/browse/SPARK-45609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45609: Assignee: Yihong He > Include SqlState in SparkThrowable proto message > > > Key: SPARK-45609 > URL: https://issues.apache.org/jira/browse/SPARK-45609 > Project: Spark > Issue Type: Test > Components: Connect >Affects Versions: 4.0.0 >Reporter: Yihong He >Assignee: Yihong He >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43851) Support LCA in grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-43851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1591#comment-1591 ] Yuming Wang commented on SPARK-43851: - The resolution should be unresolved. > Support LCA in grouping expressions > --- > > Key: SPARK-43851 > URL: https://issues.apache.org/jira/browse/SPARK-43851 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > > Teradata supports it: > {code:sql} > create table t1(a int) using parquet; > select a + 1 as a1, a1 + 1 as a2 from t1 group by a1, a2; > {code} > {noformat} > [UNSUPPORTED_FEATURE.LATERAL_COLUMN_ALIAS_IN_GROUP_BY] The feature is not > supported: Referencing a lateral column alias via GROUP BY alias/ALL is not > supported yet. > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-43851) Support LCA in grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-43851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reopened SPARK-43851: - Assignee: (was: Jia Fan) > Support LCA in grouping expressions > --- > > Key: SPARK-43851 > URL: https://issues.apache.org/jira/browse/SPARK-43851 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.5.0 > > > Teradata supports it: > {code:sql} > create table t1(a int) using parquet; > select a + 1 as a1, a1 + 1 as a2 from t1 group by a1, a2; > {code} > {noformat} > [UNSUPPORTED_FEATURE.LATERAL_COLUMN_ALIAS_IN_GROUP_BY] The feature is not > supported: Referencing a lateral column alias via GROUP BY alias/ALL is not > supported yet. > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43851) Support LCA in grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-43851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43851: Fix Version/s: (was: 3.5.0) > Support LCA in grouping expressions > --- > > Key: SPARK-43851 > URL: https://issues.apache.org/jira/browse/SPARK-43851 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > > Teradata supports it: > {code:sql} > create table t1(a int) using parquet; > select a + 1 as a1, a1 + 1 as a2 from t1 group by a1, a2; > {code} > {noformat} > [UNSUPPORTED_FEATURE.LATERAL_COLUMN_ALIAS_IN_GROUP_BY] The feature is not > supported: Referencing a lateral column alias via GROUP BY alias/ALL is not > supported yet. > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45613) Expose DeterministicLevel as a DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-45613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45613. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43461 [https://github.com/apache/spark/pull/43461] > Expose DeterministicLevel as a DeveloperApi > --- > > Key: SPARK-45613 > URL: https://issues.apache.org/jira/browse/SPARK-45613 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0, 3.5.0, 4.0.0 >Reporter: Mridul Muralidharan >Assignee: Mridul Muralidharan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {{RDD.getOutputDeterministicLevel}} is a {{DeveloperApi}} which users can > override to specify the {{DeterministicLevel}} of the {{RDD}}. > Unfortunately, {{DeterministicLevel}} itself is {{private[spark]}}. > Expose {{DeterministicLevel}} to allow users to use this method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45613) Expose DeterministicLevel as a DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-45613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45613: Assignee: Mridul Muralidharan > Expose DeterministicLevel as a DeveloperApi > --- > > Key: SPARK-45613 > URL: https://issues.apache.org/jira/browse/SPARK-45613 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0, 3.5.0, 4.0.0 >Reporter: Mridul Muralidharan >Assignee: Mridul Muralidharan >Priority: Major > Labels: pull-request-available > > {{RDD.getOutputDeterministicLevel}} is a {{DeveloperApi}} which users can > override to specify the {{DeterministicLevel}} of the {{RDD}}. > Unfortunately, {{DeterministicLevel}} itself is {{private[spark]}}. > Expose {{DeterministicLevel}} to allow users to use this method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org