[jira] [Updated] (SPARK-45467) Replace `Proxy.getProxyClass()` with `Proxy.newProxyInstance().getClass`
[ https://issues.apache.org/jira/browse/SPARK-45467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45467: --- Labels: pull-request-available (was: ) > Replace `Proxy.getProxyClass()` with `Proxy.newProxyInstance().getClass` > > > Key: SPARK-45467 > URL: https://issues.apache.org/jira/browse/SPARK-45467 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > > {code:java} > * @deprecated Proxy classes generated in a named module are encapsulated > * and not accessible to code outside its module. > * {@link Constructor#newInstance(Object...) Constructor.newInstance} > * will throw {@code IllegalAccessException} when it is called on > * an inaccessible proxy class. > * Use {@link #newProxyInstance(ClassLoader, Class[], InvocationHandler)} > * to create a proxy instance instead. > * > * @see Package and Module Membership of Proxy Class > * @revised 9 > */ > @Deprecated > @CallerSensitive > public static Class<?> getProxyClass(ClassLoader loader, > Class<?>... interfaces) > throws IllegalArgumentException {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
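The migration the issue title describes can be sketched in plain JDK code. This is a minimal illustration, not the actual Spark change; the `Runnable` interface and the no-op handler are chosen only for the example:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class ProxyMigration {
    public static void main(String[] args) {
        // A no-op handler, just so a proxy instance can be built.
        InvocationHandler handler = (proxy, method, methodArgs) -> null;

        // Before (deprecated since JDK 9):
        //   Class<?> proxyClass = Proxy.getProxyClass(loader, Runnable.class);
        // After: create an instance via newProxyInstance and take its class.
        Runnable instance = (Runnable) Proxy.newProxyInstance(
                Runnable.class.getClassLoader(),
                new Class<?>[] {Runnable.class},
                handler);
        Class<?> proxyClass = instance.getClass();

        System.out.println(Proxy.isProxyClass(proxyClass));
    }
}
```

This avoids the `IllegalAccessException` the deprecation note warns about for proxy classes generated in named modules, since the instance is created through the supported entry point.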
[jira] [Updated] (SPARK-45464) [CORE] Fix yarn distribution build
[ https://issues.apache.org/jira/browse/SPARK-45464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45464: --- Labels: pull-request-available (was: ) > [CORE] Fix yarn distribution build > -- > > Key: SPARK-45464 > URL: https://issues.apache.org/jira/browse/SPARK-45464 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 4.0.0 >Reporter: Hasnain Lakhani >Priority: Major > Labels: pull-request-available > > [https://github.com/apache/spark/pull/43164] introduced a regression in: > > ``` > ./dev/make-distribution.sh --tgz -Phive -Phive-thriftserver -Pyarn > ``` > > This needs to be fixed.
[jira] [Updated] (SPARK-45467) Replace `Proxy.getProxyClass()` with `Proxy.newProxyInstance().getClass`
[ https://issues.apache.org/jira/browse/SPARK-45467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-45467: - Description: {code:java} * @deprecated Proxy classes generated in a named module are encapsulated * and not accessible to code outside its module. * {@link Constructor#newInstance(Object...) Constructor.newInstance} * will throw {@code IllegalAccessException} when it is called on * an inaccessible proxy class. * Use {@link #newProxyInstance(ClassLoader, Class[], InvocationHandler)} * to create a proxy instance instead. * * @see Package and Module Membership of Proxy Class * @revised 9 */ @Deprecated @CallerSensitive public static Class<?> getProxyClass(ClassLoader loader, Class<?>... interfaces) throws IllegalArgumentException {code} > Replace `Proxy.getProxyClass()` with `Proxy.newProxyInstance().getClass` > > > Key: SPARK-45467 > URL: https://issues.apache.org/jira/browse/SPARK-45467 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > * @deprecated Proxy classes generated in a named module are encapsulated > * and not accessible to code outside its module. > * {@link Constructor#newInstance(Object...) Constructor.newInstance} > * will throw {@code IllegalAccessException} when it is called on > * an inaccessible proxy class. > * Use {@link #newProxyInstance(ClassLoader, Class[], InvocationHandler)} > * to create a proxy instance instead. > * > * @see Package and Module Membership of Proxy Class > * @revised 9 > */ > @Deprecated > @CallerSensitive > public static Class<?> getProxyClass(ClassLoader loader, > Class<?>... interfaces) > throws IllegalArgumentException {code}
[jira] [Created] (SPARK-45467) Replace `Proxy.getProxyClass()` with `Proxy.newProxyInstance().getClass`
Yang Jie created SPARK-45467: Summary: Replace `Proxy.getProxyClass()` with `Proxy.newProxyInstance().getClass` Key: SPARK-45467 URL: https://issues.apache.org/jira/browse/SPARK-45467 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Yang Jie
[jira] [Updated] (SPARK-45466) VectorAssembler should validate the vector values
[ https://issues.apache.org/jira/browse/SPARK-45466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45466: --- Labels: pull-request-available (was: ) > VectorAssembler should validate the vector values > - > > Key: SPARK-45466 > URL: https://issues.apache.org/jira/browse/SPARK-45466 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-45466) VectorAssembler should validate the vector values
Ruifeng Zheng created SPARK-45466: - Summary: VectorAssembler should validate the vector values Key: SPARK-45466 URL: https://issues.apache.org/jira/browse/SPARK-45466 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 4.0.0 Reporter: Ruifeng Zheng
[jira] [Resolved] (SPARK-45465) Upgrade kubernetes-client to 6.9.0 for K8s 1.28
[ https://issues.apache.org/jira/browse/SPARK-45465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45465. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43284 [https://github.com/apache/spark/pull/43284] > Upgrade kubernetes-client to 6.9.0 for K8s 1.28 > --- > > Key: SPARK-45465 > URL: https://issues.apache.org/jira/browse/SPARK-45465 > Project: Spark > Issue Type: Sub-task > Components: Build, Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0
[jira] [Commented] (SPARK-45428) Add Matomo analytics to all released docs pages
[ https://issues.apache.org/jira/browse/SPARK-45428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17773090#comment-17773090 ] BingKun Pan commented on SPARK-45428: - Okay, let me investigate it. > Add Matomo analytics to all released docs pages > --- > > Key: SPARK-45428 > URL: https://issues.apache.org/jira/browse/SPARK-45428 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Priority: Major > > Matomo analytics has been added to some pages of the Spark website. Here is > Sean's initial PR: > [https://github.com/apache/spark-website/pull/479] > You can find analytics for the Spark website here: https://analytics.apache.org > We need to add this to all API pages. This is very important for us to > prioritize documentation improvements and search engine optimization.
[jira] [Commented] (SPARK-44729) Add canonical links to the PySpark docs page
[ https://issues.apache.org/jira/browse/SPARK-44729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17773088#comment-17773088 ] BingKun Pan commented on SPARK-44729: - The historical versions include: 3.1.1 3.1.2 3.1.3 3.2.0 3.2.1 3.2.2 3.2.3 3.2.4 3.3.0 3.3.1 3.3.2 3.3.3 3.4.0 3.4.1 > Add canonical links to the PySpark docs page > > > Key: SPARK-44729 > URL: https://issues.apache.org/jira/browse/SPARK-44729 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: BingKun Pan >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0, 4.0.0 > > > We should add the canonical link to the PySpark docs page > [https://spark.apache.org/docs/latest/api/python/index.html] so that the > search engine can return the latest PySpark docs. > Then, we need to update all released documentation pages to add the canonical > link pointing to the latest Spark documentation of the API (such as group > by). Currently, if you Google pyspark groupby, Google will return the docs > page from 3.1.1, which is not ideal.
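For reference, a canonical link is a single tag in each versioned page's head pointing at the latest docs. A minimal sketch (the target URL here is illustrative of the pattern, not the exact tag added by the PR):

```html
<!-- In a versioned page such as /docs/3.1.1/api/python/index.html -->
<link rel="canonical" href="https://spark.apache.org/docs/latest/api/python/index.html"/>
```

Search engines then consolidate ranking signals onto the `latest` URL instead of the old versioned pages.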
[jira] [Commented] (SPARK-45428) Add Matomo analytics to all released docs pages
[ https://issues.apache.org/jira/browse/SPARK-45428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17773078#comment-17773078 ] Ruifeng Zheng commented on SPARK-45428: --- [~panbingkun] would you mind taking a look? thanks! > Add Matomo analytics to all released docs pages > --- > > Key: SPARK-45428 > URL: https://issues.apache.org/jira/browse/SPARK-45428 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Priority: Major > > Matomo analytics has been added to some pages of the Spark website. Here is > Sean's initial PR: > [https://github.com/apache/spark-website/pull/479] > You can find analytics for the Spark website here: https://analytics.apache.org > We need to add this to all API pages. This is very important for us to > prioritize documentation improvements and search engine optimization.
[jira] [Commented] (SPARK-45443) Revisit TableCacheQueryStage to avoid replicated InMemoryRelation materialization
[ https://issues.apache.org/jira/browse/SPARK-45443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17773074#comment-17773074 ] XiDuo You commented on SPARK-45443: --- > Can this increase probability of concurrent IMR materialization for same IMR > instance? I think they are the same. TableCacheQueryStage is more like a barrier that reports some metrics to the AQE framework. The gap introduced by eager materialization is very small. > For queries using AQE, can introducing TableCacheQueryStage into physical > plan once per unique IMR instance help I did not see the difference. One idea is to introduce something like `ReusedTableCacheQueryStage`. The `ReusedTableCacheQueryStage` only holds an empty future which waits for the first TableCacheQueryStage materialization, so that we can make sure the cached RDD is only executed once. But this idea only works within one query; if there are multiple queries which reference the same cached RDD (e.g., in the thriftserver), the issue still exists. > Revisit TableCacheQueryStage to avoid replicated InMemoryRelation > materialization > - > > Key: SPARK-45443 > URL: https://issues.apache.org/jira/browse/SPARK-45443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Eren Avsarogullari >Priority: Major > Attachments: IMR Materialization - Stage 2.png, IMR Materialization - > Stage 3.png > > > TableCacheQueryStage is created per InMemoryTableScanExec by > AdaptiveSparkPlanExec and it materializes InMemoryTableScanExec output > (cached RDD) to provide runtime stats in order to apply AQE optimizations > into remaining physical plan stages. TableCacheQueryStage materializes > InMemoryTableScanExec eagerly by submitting a job per TableCacheQueryStage > instance.
For example, if there are 2 TableCacheQueryStage instances > referencing the same IMR instance (cached RDD) and the first InMemoryTableScanExec's > materialization takes longer, the following logic will return false > (inMemoryTableScan.isMaterialized => false) and this may cause replicated IMR > materialization. This behavior can be more visible when cached RDD size is > high. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281] > Would like to get community feedback. Thanks in advance. > cc [~ulysses] [~cloud_fan] > *Sample Query to simulate the problem:* > // Both join legs use the same IMR instance > {code:java} > import spark.implicits._ > val arr = (1 to 12).map { i => { > val index = i % 5 > (index, s"Employee_$index", s"Department_$index") > } > } > val df = arr.toDF("id", "name", "department") > .filter('id >= 0) > .sort("id") > .groupBy('id, 'name, 'department) > .count().as("count") > df.persist() > val df2 = df.sort("count").filter('count <= 2) > val df3 = df.sort("count").filter('count >= 3) > val df4 = df2.join(df3, Seq("id", "name", "department"), "fullouter") > df4.show() {code} > *Physical Plan:* > {code:java} > == Physical Plan == > AdaptiveSparkPlan (31) > +- == Final Plan == > CollectLimit (21) > +- * Project (20) > +- * SortMergeJoin FullOuter (19) > :- * Sort (10) > : +- * Filter (9) > : +- TableCacheQueryStage (8), Statistics(sizeInBytes=210.0 B, > rowCount=5) > : +- InMemoryTableScan (1) > : +- InMemoryRelation (2) > : +- AdaptiveSparkPlan (7) > : +- HashAggregate (6) > : +- Exchange (5) > : +- HashAggregate (4) > : +- LocalTableScan (3) > +- * Sort (18) > +- * Filter (17) > +- TableCacheQueryStage (16), Statistics(sizeInBytes=210.0 B, > rowCount=5) > +- InMemoryTableScan (11) > +- InMemoryRelation (12) > +- AdaptiveSparkPlan (15) > +- HashAggregate (14) > +- Exchange (13) > +- HashAggregate (4) > +- LocalTableScan (3) {code} > *Stages DAGs materializing the same IMR instance:* > !IMR Materialization - Stage 2.png|width=303,height=134! > !IMR Materialization - Stage 3.png|width=303,height=134!
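The `ReusedTableCacheQueryStage` idea from the comment above (a stage that holds no work of its own, only the first stage's future) can be sketched with plain futures. This is a hypothetical illustration of the sharing mechanism, not Spark code; all names are invented for the example:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

public class ReusedStageSketch {
    public static void main(String[] args) {
        // Counts how many times the expensive materialization actually runs.
        AtomicInteger materializations = new AtomicInteger();

        // First stage: kicks off materialization of the cached data once.
        CompletableFuture<String> firstStage = CompletableFuture.supplyAsync(() -> {
            materializations.incrementAndGet();
            return "materialized-cache";
        });

        // "Reused" stage: performs no work of its own, it only holds the
        // first stage's future, so joining it never re-runs the work.
        CompletableFuture<String> reusedStage = firstStage;

        System.out.println(firstStage.join());
        System.out.println(reusedStage.join());
        System.out.println(materializations.get());
    }
}
```

Both joins observe the same result while the supplier body executes exactly once, which is the property the proposed stage would need; as the comment notes, this sharing would still be scoped to a single query.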
[jira] [Updated] (SPARK-44729) Add canonical links to the PySpark docs page
[ https://issues.apache.org/jira/browse/SPARK-44729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44729: --- Labels: pull-request-available (was: ) > Add canonical links to the PySpark docs page > > > Key: SPARK-44729 > URL: https://issues.apache.org/jira/browse/SPARK-44729 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: BingKun Pan >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0, 4.0.0 > > > We should add the canonical link to the PySpark docs page > [https://spark.apache.org/docs/latest/api/python/index.html] so that the > search engine can return the latest PySpark docs. > Then, we need to update all released documentation pages to add the canonical > link pointing to the latest spark documentation of the API (such as group > by). Currently, if you Google pyspark groupby, Google will return the docs > page from 3.1.1, which is not ideal.
[jira] [Updated] (SPARK-42716) DataSourceV2 cannot report KeyGroupedPartitioning with multiple keys per partition
[ https://issues.apache.org/jira/browse/SPARK-42716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-42716: --- Labels: pull-request-available (was: ) > DataSourceV2 cannot report KeyGroupedPartitioning with multiple keys per > partition > -- > > Key: SPARK-42716 > URL: https://issues.apache.org/jira/browse/SPARK-42716 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.4.0, 3.4.1 >Reporter: Enrico Minack >Priority: Major > Labels: pull-request-available > > From Spark 3.0.0 until 3.2.3, a DataSourceV2 could report its partitioning as > {{KeyGroupedPartitioning}} via {{SupportsReportPartitioning}}, even if > multiple keys belong to a partition. > With SPARK-37377, only if all partitions implement {{HasPartitionKey}}, the > partition information reported through {{SupportsReportPartitioning}} is > considered by catalyst. But this limits the number of keys per partition to 1. > Spark should continue to support the more general situation of > {{KeyGroupedPartitioning}} with multiple keys per partition, like > {{HashPartitioning}}.
[jira] [Resolved] (SPARK-44527) Simplify BinaryComparison if its children contain ScalarSubquery with empty output
[ https://issues.apache.org/jira/browse/SPARK-44527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44527. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42129 [https://github.com/apache/spark/pull/42129] > Simplify BinaryComparison if its children contain ScalarSubquery with empty > output > -- > > Key: SPARK-44527 > URL: https://issues.apache.org/jira/browse/SPARK-44527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0
[jira] [Updated] (SPARK-44527) Simplify BinaryComparison if its children contain ScalarSubquery with empty output
[ https://issues.apache.org/jira/browse/SPARK-44527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44527: --- Labels: pull-request-available (was: ) > Simplify BinaryComparison if its children contain ScalarSubquery with empty > output > -- > > Key: SPARK-44527 > URL: https://issues.apache.org/jira/browse/SPARK-44527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-44527) Replace ScalarSubquery with null if its maxRows is 0
[ https://issues.apache.org/jira/browse/SPARK-44527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44527: -- Summary: Replace ScalarSubquery with null if its maxRows is 0 (was: Simplify BinaryComparison if its children contain ScalarSubquery with empty output) > Replace ScalarSubquery with null if its maxRows is 0 > > > Key: SPARK-44527 > URL: https://issues.apache.org/jira/browse/SPARK-44527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0
[jira] [Assigned] (SPARK-44527) Simplify BinaryComparison if its children contain ScalarSubquery with empty output
[ https://issues.apache.org/jira/browse/SPARK-44527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44527: - Assignee: Yuming Wang > Simplify BinaryComparison if its children contain ScalarSubquery with empty > output > -- > > Key: SPARK-44527 > URL: https://issues.apache.org/jira/browse/SPARK-44527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major >
[jira] [Assigned] (SPARK-45465) Upgrade kubernetes-client to 6.9.0 for K8s 1.28
[ https://issues.apache.org/jira/browse/SPARK-45465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45465: - Assignee: Dongjoon Hyun > Upgrade kubernetes-client to 6.9.0 for K8s 1.28 > --- > > Key: SPARK-45465 > URL: https://issues.apache.org/jira/browse/SPARK-45465 > Project: Spark > Issue Type: Sub-task > Components: Build, Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-45465) Upgrade kubernetes-client to 6.9.0 for K8s 1.28
[ https://issues.apache.org/jira/browse/SPARK-45465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45465: --- Labels: pull-request-available (was: ) > Upgrade kubernetes-client to 6.9.0 for K8s 1.28 > --- > > Key: SPARK-45465 > URL: https://issues.apache.org/jira/browse/SPARK-45465 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-45465) Upgrade kubernetes-client to 6.9.0 for K8s 1.28
[ https://issues.apache.org/jira/browse/SPARK-45465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45465: -- Parent: SPARK-44111 Issue Type: Sub-task (was: Improvement) > Upgrade kubernetes-client to 6.9.0 for K8s 1.28 > --- > > Key: SPARK-45465 > URL: https://issues.apache.org/jira/browse/SPARK-45465 > Project: Spark > Issue Type: Sub-task > Components: Build, Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-45465) Upgrade kubernetes-client to 6.9.0 for K8s 1.28
Dongjoon Hyun created SPARK-45465: - Summary: Upgrade kubernetes-client to 6.9.0 for K8s 1.28 Key: SPARK-45465 URL: https://issues.apache.org/jira/browse/SPARK-45465 Project: Spark Issue Type: Improvement Components: Build, Kubernetes Affects Versions: 4.0.0 Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-45457) Surface sc.setLocalProperty() value=NULL param meaning
[ https://issues.apache.org/jira/browse/SPARK-45457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45457. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43269 [https://github.com/apache/spark/pull/43269] > Surface sc.setLocalProperty() value=NULL param meaning > -- > > Key: SPARK-45457 > URL: https://issues.apache.org/jira/browse/SPARK-45457 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Khalid Mammadov >Assignee: Khalid Mammadov >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > sc.setLocalProperty() has a special behavior: when the value is null, it > removes the property associated with the key parameter. This is only mentioned in the > Fair Scheduler section of the documentation. > It would be nice to document this on the APIs as well for users.
[jira] [Resolved] (SPARK-45460) Replace `scala.collection.convert.ImplicitConversions` to `scala.jdk.CollectionConverters`
[ https://issues.apache.org/jira/browse/SPARK-45460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45460. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43275 [https://github.com/apache/spark/pull/43275] > Replace `scala.collection.convert.ImplicitConversions` to > `scala.jdk.CollectionConverters` > -- > > Key: SPARK-45460 > URL: https://issues.apache.org/jira/browse/SPARK-45460 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0
[jira] [Comment Edited] (SPARK-45443) Revisit TableCacheQueryStage to avoid replicated InMemoryRelation materialization
[ https://issues.apache.org/jira/browse/SPARK-45443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17773050#comment-17773050 ] Eren Avsarogullari edited comment on SPARK-45443 at 10/8/23 8:42 PM: - Hi [~ulysses], Firstly, thanks for the reply. For the above sample query, if the TableCacheQueryStage flow is disabled, IMR materialization will be triggered by ShuffleQueryStage (introduced by Sort's Exchange node). Both ShuffleQueryStage nodes will also need to materialize the same IMR instance in this case, so I believe the same issue may also occur in the previous flow. TableCacheQueryStage materializes IMR eagerly, as different from the previous flow. Can this increase the probability of concurrent IMR materialization for the same IMR instance? I think this behavior is not visible when IMR cached data size is low. However, replicated IMR materialization can be expensive and can introduce potential regression when IMR cached data size is high (e.g., observing this behavior when IMR needs to read high shuffle data size). Also, the queries can have multiple IMR instances by referencing multiple replicated IMR instances; this can also increase the probability of concurrent IMR materialization for the same IMR instance. Thinking about potential solution options (if this makes sense): For queries using AQE, can introducing TableCacheQueryStage into the physical plan once per unique IMR instance help? IMR instances can be compared for equivalence before their TableCacheQueryStage instances are created by AdaptiveSparkPlanExec, and TableCacheQueryStage can materialize each unique IMR instance once. was (Author: erenavsarogullari): Hi [~ulysses], Firstly, thanks for reply. For above sample query, if TableCacheQueryStage flow is disabled, IMR materialization will be triggered by ShuffleQueryStage (introduced by Sort' s Exchange node). Both ShuffleQueryStage nodes will also need to materialize same IMR instance in this case so i believe same issue may also occur in previous flow.
TableCacheQueryStage materializes IMR eagerly as different from previous flow. Can this increase probability of concurrent IMR materialization for same IMR instance? I think this behavior is not visible when IMR cached data size is low. However, replicated IMR materialization can be expensive and can introduce potential regression when IMR cached data size is high (e.g: observing this behavior when IMR needs to read high shuffle data size). Also, the queries can have multiple IMR instances by referencing multiple replicated IMR instances, this can also increase probability of concurrent IMR materialization for same IMR instance. Thinking on potential solutions options (if makes sense): For queries using AQE, can introducing TableCacheQueryStage into PhysicalPlan once per unique IMR instance help? IMR instances can be compared if they are equivalent before TableCacheQueryStage is created by AdaptiveSparkPlanExec and TableCacheQueryStage can materialize unique IMR instance once. > Revisit TableCacheQueryStage to avoid replicated InMemoryRelation > materialization > - > > Key: SPARK-45443 > URL: https://issues.apache.org/jira/browse/SPARK-45443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Eren Avsarogullari >Priority: Major > Attachments: IMR Materialization - Stage 2.png, IMR Materialization - > Stage 3.png > > > TableCacheQueryStage is created per InMemoryTableScanExec by > AdaptiveSparkPlanExec and it materializes InMemoryTableScanExec output > (cached RDD) to provide runtime stats in order to apply AQE optimizations > into remaining physical plan stages. TableCacheQueryStage materializes > InMemoryTableScanExec eagerly by submitting job per TableCacheQueryStage > instance. 
For example, if there are 2 TableCacheQueryStage instances > referencing same IMR instance (cached RDD) and first InMemoryTableScanExec' s > materialization takes longer, following logic will return false > (inMemoryTableScan.isMaterialized => false) and this may cause replicated IMR > materialization. This behavior can be more visible when cached RDD size is > high. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281] > Would like to get community feedback. Thanks in advance. > cc [~ulysses] [~cloud_fan] > *Sample Query to simulate the problem:* > // Both join legs uses same IMR instance > {code:java} > import spark.implicits._ > val arr = (1 to 12).map { i => { > val index = i % 5 > (index, s"Employee_$index", s"Department_$index") > } > } > val df = arr.toDF("id", "name", "department") > .filter('id >= 0) > .sort("id") > .groupBy('id, 'name, 'department) >
[jira] [Comment Edited] (SPARK-45443) Revisit TableCacheQueryStage to avoid replicated InMemoryRelation materialization
[ https://issues.apache.org/jira/browse/SPARK-45443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17773050#comment-17773050 ] Eren Avsarogullari edited comment on SPARK-45443 at 10/8/23 8:40 PM: - Hi [~ulysses], Firstly, thanks for reply. For above sample query, if TableCacheQueryStage flow is disabled, IMR materialization will be triggered by ShuffleQueryStage (introduced by Sort' s Exchange node). Both ShuffleQueryStage nodes will also need to materialize same IMR instance in this case so i believe same issue may also occur in previous flow. TableCacheQueryStage materializes IMR eagerly as different from previous flow. Can this increase probability of concurrent IMR materialization for same IMR instance? I think this behavior is not visible when IMR cached data size is low. However, replicated IMR materialization can be expensive and can introduce potential regression when IMR cached data size is high (e.g: observing this behavior when IMR needs to read high shuffle data size). Also, the queries can have multiple IMR instances by referencing multiple replicated IMR instances, this can also increase probability of concurrent IMR materialization for same IMR instance. Thinking on potential solutions options (if makes sense): For queries using AQE, can introducing TableCacheQueryStage into PhysicalPlan once per unique IMR instance help? IMR instances can be compared if they are equivalent before TableCacheQueryStage is created by AdaptiveSparkPlanExec and TableCacheQueryStage can materialize unique IMR instance once. was (Author: erenavsarogullari): Hi [~ulysses], Firstly, thanks for reply. For above sample query, if TableCacheQueryStage flow is disabled, IMR materialization will be triggered by ShuffleQueryStage (introduced by Sort' s Exchange node). Both ShuffleQueryStage nodes will also need to materialize same IMR instance in this case so i believe same issue may also occur in previous flow. 
TableCacheQueryStage materializes IMR eagerly as different from previous flow. Can this increase probability of concurrent IMR materialization for same IMR instance? I think this behavior is not visible when IMR cached data size is low. However, replicated IMR materialization can be expensive and can introduce potential regression when IMR cached data size is high (e.g: observing this behavior when IMR needs to read high shuffle data size). Also, the queries can have multiple IMR instances by referencing multiple replicated IMR instances, this can also increase probability of concurrent IMR materialization for same IMR instance. Thinking on potential solutions options if makes sense: For queries using AQE, can introducing TableCacheQueryStage into PhysicalPlan once per unique IMR instance help? IMR instances can be compared if they are equivalent before TableCacheQueryStage is created by AdaptiveSparkPlanExec and TableCacheQueryStage can materialize unique IMR instance once. > Revisit TableCacheQueryStage to avoid replicated InMemoryRelation > materialization > - > > Key: SPARK-45443 > URL: https://issues.apache.org/jira/browse/SPARK-45443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Eren Avsarogullari >Priority: Major > Attachments: IMR Materialization - Stage 2.png, IMR Materialization - > Stage 3.png > > > TableCacheQueryStage is created per InMemoryTableScanExec by > AdaptiveSparkPlanExec and it materializes InMemoryTableScanExec output > (cached RDD) to provide runtime stats in order to apply AQE optimizations > into remaining physical plan stages. TableCacheQueryStage materializes > InMemoryTableScanExec eagerly by submitting job per TableCacheQueryStage > instance. 
For example, if there are 2 TableCacheQueryStage instances > referencing same IMR instance (cached RDD) and first InMemoryTableScanExec' s > materialization takes longer, following logic will return false > (inMemoryTableScan.isMaterialized => false) and this may cause replicated IMR > materialization. This behavior can be more visible when cached RDD size is > high. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281] > Would like to get community feedback. Thanks in advance. > cc [~ulysses] [~cloud_fan] > *Sample Query to simulate the problem:* > // Both join legs uses same IMR instance > {code:java} > import spark.implicits._ > val arr = (1 to 12).map { i => { > val index = i % 5 > (index, s"Employee_$index", s"Department_$index") > } > } > val df = arr.toDF("id", "name", "department") > .filter('id >= 0) > .sort("id") > .groupBy('id, 'name, 'department) >
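The race the thread describes can be sketched without Spark at all. Below is a minimal, self-contained Java analogy (all names are illustrative, not Spark internals): two threads race to check-then-build the same cached relation, and `ConcurrentHashMap.computeIfAbsent` serves as the "materialize once per instance" guard that a bare `isMaterialized` check alone does not provide.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Standalone analogy (not Spark code): `builds` counts how many times the
// expensive "cache build" actually runs for the same relation key.
public class MaterializeOnce {
    static final AtomicInteger builds = new AtomicInteger(0);
    static final ConcurrentHashMap<String, List<Integer>> cache = new ConcurrentHashMap<>();

    static List<Integer> materialize(String relationKey) {
        // computeIfAbsent runs the mapping function at most once per key,
        // even when two callers arrive concurrently.
        return cache.computeIfAbsent(relationKey, k -> {
            builds.incrementAndGet();          // count expensive builds
            return List.of(1, 2, 3, 4, 5);     // stand-in for cached partitions
        });
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch start = new CountDownLatch(1);
        Runnable task = () -> {
            try { start.await(); } catch (InterruptedException e) { return; }
            materialize("imr-1");
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        start.countDown();                     // release both threads together
        t1.join(); t2.join();
        System.out.println("builds = " + builds.get());  // prints "builds = 1"
    }
}
```

Running `main` always prints `builds = 1`: `computeIfAbsent` is atomic per key, so a second concurrent caller blocks and reuses the first build instead of launching a replicated one.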
[jira] [Created] (SPARK-45464) [CORE] Fix yarn distribution build
Hasnain Lakhani created SPARK-45464: --- Summary: [CORE] Fix yarn distribution build Key: SPARK-45464 URL: https://issues.apache.org/jira/browse/SPARK-45464 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 4.0.0 Reporter: Hasnain Lakhani [https://github.com/apache/spark/pull/43164] introduced a regression in: ``` ./dev/make-distribution.sh --tgz -Phive -Phive-thriftserver -Pyarn ``` this needs to be fixed -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45443) Revisit TableCacheQueryStage to avoid replicated InMemoryRelation materialization
[ https://issues.apache.org/jira/browse/SPARK-45443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17773050#comment-17773050 ] Eren Avsarogullari commented on SPARK-45443: Hi [~ulysses], Firstly, thanks for the reply. For queries using AQE, if the TableCacheQueryStage flow is disabled, IMR materialization will be triggered by ShuffleQueryStage (introduced by Sort's Exchange node). Both ShuffleQueryStage nodes will still need to materialize the same IMR instance in that case, so I believe the same issue may also occur in the previous flow. Unlike the previous flow, TableCacheQueryStage materializes the IMR eagerly. Can this increase the probability of concurrent materialization of the same IMR instance? This behavior is not visible when the IMR cached data size is low. However, replicated IMR materialization can be expensive and can introduce a regression when the IMR cached data size is high (e.g., when the IMR needs to read a large amount of shuffle data). Also, a query can reference multiple replicated IMR instances, which further increases the probability of concurrent materialization of the same IMR instance. A potential solution option, if it makes sense: for queries using AQE, could introducing TableCacheQueryStage into the physical plan once per unique IMR instance help? IMR instances could be compared for equivalence before TableCacheQueryStage is created by AdaptiveSparkPlanExec. 
> Revisit TableCacheQueryStage to avoid replicated InMemoryRelation > materialization > - > > Key: SPARK-45443 > URL: https://issues.apache.org/jira/browse/SPARK-45443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Eren Avsarogullari >Priority: Major > Attachments: IMR Materialization - Stage 2.png, IMR Materialization - > Stage 3.png > > > TableCacheQueryStage is created per InMemoryTableScanExec by > AdaptiveSparkPlanExec and it materializes InMemoryTableScanExec output > (cached RDD) to provide runtime stats in order to apply AQE optimizations > into remaining physical plan stages. TableCacheQueryStage materializes > InMemoryTableScanExec eagerly by submitting job per TableCacheQueryStage > instance. For example, if there are 2 TableCacheQueryStage instances > referencing same IMR instance (cached RDD) and first InMemoryTableScanExec' s > materialization takes longer, following logic will return false > (inMemoryTableScan.isMaterialized => false) and this may cause replicated IMR > materialization. This behavior can be more visible when cached RDD size is > high. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281] > Would like to get community feedback. Thanks in advance. 
> cc [~ulysses] [~cloud_fan] > *Sample Query to simulate the problem:* > // Both join legs uses same IMR instance > {code:java} > import spark.implicits._ > val arr = (1 to 12).map { i => { > val index = i % 5 > (index, s"Employee_$index", s"Department_$index") > } > } > val df = arr.toDF("id", "name", "department") > .filter('id >= 0) > .sort("id") > .groupBy('id, 'name, 'department) > .count().as("count") > df.persist() > val df2 = df.sort("count").filter('count <= 2) > val df3 = df.sort("count").filter('count >= 3) > val df4 = df2.join(df3, Seq("id", "name", "department"), "fullouter") > df4.show() {code} > *Physical Plan:* > {code:java} > == Physical Plan == > AdaptiveSparkPlan (31) > +- == Final Plan == > CollectLimit (21) > +- * Project (20) > +- * SortMergeJoin FullOuter (19) > :- * Sort (10) > : +- * Filter (9) > : +- TableCacheQueryStage (8), Statistics(sizeInBytes=210.0 B, > rowCount=5) > : +- InMemoryTableScan (1) > : +- InMemoryRelation (2) > : +- AdaptiveSparkPlan (7) > : +- HashAggregate (6) > : +- Exchange (5) > : +- HashAggregate (4) > : +- LocalTableScan (3) > +- * Sort (18) > +- * Filter (17) > +- TableCacheQueryStage (16), Statistics(sizeInBytes=210.0 B, > rowCount=5) > +- InMemoryTableScan (11) > +- InMemoryRelation (12) > +- AdaptiveSparkPlan (15) > +- HashAggregate (14) > +- Exchange (13) > +- HashAggregate (4) > +- LocalTableScan (3) {code} > *Stages DAGs
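The proposed direction above (creating TableCacheQueryStage once per unique IMR instance) amounts to deduplicating stages by a canonical key at planning time. A minimal Java sketch of that idea follows; every name here is hypothetical, not Spark's API — real plan canonicalization is far more involved.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plan one "cache stage" per unique cached relation,
// keyed by a canonical form, so two scans of the same relation share a
// single materialization stage instead of each getting their own.
public class DedupeCacheStages {
    record CachedRelation(int id, List<String> schema) {
        // Stand-in for plan canonicalization: same id + schema => same cache.
        String canonicalKey() { return id + ":" + String.join(",", schema); }
    }

    record CacheStage(CachedRelation relation) {}

    static Map<String, CacheStage> plan(List<CachedRelation> scans) {
        Map<String, CacheStage> stages = new LinkedHashMap<>();
        for (CachedRelation r : scans) {
            // First scan of a relation creates the stage; later scans reuse it.
            stages.computeIfAbsent(r.canonicalKey(), k -> new CacheStage(r));
        }
        return stages;
    }
}
```

With the sample query above, both join legs would map to the same canonical key, so `plan` would emit one stage for the two `InMemoryTableScan` nodes rather than two racing ones.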
[jira] [Resolved] (SPARK-45461) Introduce a mapper for StorageLevel
[ https://issues.apache.org/jira/browse/SPARK-45461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45461. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43278 [https://github.com/apache/spark/pull/43278] > Introduce a mapper for StorageLevel > --- > > Key: SPARK-45461 > URL: https://issues.apache.org/jira/browse/SPARK-45461 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Currently, StorageLevel provides fromString to obtain a StorageLevel instance from the StorageLevel's name, so developers and users have to copy the string literal of the StorageLevel's name to get or set an instance. This forces developers to keep those literals consistent by hand; it is error-prone and reduces development efficiency. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
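The "mapper" being introduced can be pictured as a single, typed lookup point for level names. Below is a hedged Java sketch of that idea only; it is illustrative and deliberately not Spark's actual `StorageLevel` API or its set of levels.

```java
// Illustrative sketch (not Spark's API): an enum centralizes the level
// names so callers never hand-copy string literals, and one lookup point
// fails fast on a typo instead of silently diverging.
public class StorageLevels {
    enum Level {
        MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY;

        // Mirrors the fromString idea with a clearer error message.
        static Level fromString(String name) {
            try {
                return Level.valueOf(name);
            } catch (IllegalArgumentException e) {
                throw new IllegalArgumentException("Unknown storage level: " + name, e);
            }
        }
    }
}
```

Callers then reference `Level.DISK_ONLY` directly (or round-trip via `fromString(level.name())`), so the name exists in exactly one place.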
[jira] [Assigned] (SPARK-43704) Enable IndexesParityTests.test_to_series
[ https://issues.apache.org/jira/browse/SPARK-43704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43704: - Assignee: Haejoon Lee > Enable IndexesParityTests.test_to_series > > > Key: SPARK-43704 > URL: https://issues.apache.org/jira/browse/SPARK-43704 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43704) Enable IndexesParityTests.test_to_series
[ https://issues.apache.org/jira/browse/SPARK-43704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43704. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43228 [https://github.com/apache/spark/pull/43228] > Enable IndexesParityTests.test_to_series > > > Key: SPARK-43704 > URL: https://issues.apache.org/jira/browse/SPARK-43704 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45462) Show `Duration` in `ApplicationPage`
[ https://issues.apache.org/jira/browse/SPARK-45462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45462. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43279 [https://github.com/apache/spark/pull/43279] > Show `Duration` in `ApplicationPage` > > > Key: SPARK-45462 > URL: https://issues.apache.org/jira/browse/SPARK-45462 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45462) Show `Duration` in `ApplicationPage`
[ https://issues.apache.org/jira/browse/SPARK-45462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45462: - Assignee: Dongjoon Hyun > Show `Duration` in `ApplicationPage` > > > Key: SPARK-45462 > URL: https://issues.apache.org/jira/browse/SPARK-45462 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45413) Add warning for prepare drop LevelDB support
[ https://issues.apache.org/jira/browse/SPARK-45413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45413. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43217 [https://github.com/apache/spark/pull/43217] > Add warning for prepare drop LevelDB support > > > Key: SPARK-45413 > URL: https://issues.apache.org/jira/browse/SPARK-45413 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Jia Fan >Assignee: Jia Fan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Add a warning in preparation for dropping LevelDB support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45413) Add warning for prepare drop LevelDB support
[ https://issues.apache.org/jira/browse/SPARK-45413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45413: - Assignee: Jia Fan > Add warning for prepare drop LevelDB support > > > Key: SPARK-45413 > URL: https://issues.apache.org/jira/browse/SPARK-45413 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Jia Fan >Assignee: Jia Fan >Priority: Major > Labels: pull-request-available > > Add a warning in preparation for dropping LevelDB support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45454) Set the table's default owner to current_user
[ https://issues.apache.org/jira/browse/SPARK-45454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45454. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43264 [https://github.com/apache/spark/pull/43264] > Set the table's default owner to current_user > - > > Key: SPARK-45454 > URL: https://issues.apache.org/jira/browse/SPARK-45454 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45463) Allow ShuffleDriverComponent to support reliable store with specified executorId
[ https://issues.apache.org/jira/browse/SPARK-45463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45463: --- Labels: pull-request-available (was: ) > Allow ShuffleDriverComponent to support reliable store with specified > executorId > > > Key: SPARK-45463 > URL: https://issues.apache.org/jira/browse/SPARK-45463 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1 >Reporter: zhoubin >Priority: Major > Labels: pull-request-available > > After SPARK-42689, ShuffleDriverComponent.supportsReliableStorage is > determined globally. > Downstream projects may have different shuffle policies (caused by cluster > loads or columnar support) for different stages, for example Apache Uniffle > with Gluten, or Apache Celeborn. > In this situation, ShuffleDriverComponent should use the mapTrackerMaster to > decide whether reliable storage is supported for the specified executorId -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
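The per-executor decision the ticket proposes can be sketched in plain Scala (class and method names below are hypothetical, not Spark's actual API): rather than one global `supportsReliableStorage` flag, the driver consults a per-executor table, so stages backed by a remote shuffle service (e.g. Celeborn or Uniffle) and stages using local executor disk can coexist in one application.

```scala
import scala.collection.mutable

// Hypothetical sketch: track shuffle-storage reliability per executor
// instead of one application-wide boolean.
class ReliableStorageTracker(defaultReliable: Boolean) {
  private val byExecutor = mutable.Map.empty[String, Boolean]

  def register(executorId: String, reliable: Boolean): Unit =
    byExecutor(executorId) = reliable

  // Unknown executors fall back to the global default, preserving the
  // existing all-or-nothing behavior.
  def isReliablyStored(executorId: String): Boolean =
    byExecutor.getOrElse(executorId, defaultReliable)
}
```

With this shape, loss of an executor whose shuffle output lives in reliable storage would not need to trigger recomputation, while outputs on lost local-disk executors still would.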
[jira] [Updated] (SPARK-45463) Allow ShuffleDriverComponent to support reliable store with specified executorId
[ https://issues.apache.org/jira/browse/SPARK-45463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoubin updated SPARK-45463: Summary: Allow ShuffleDriverComponent to support reliable store with specified executorId (was: Allow ShuffleDriverComponent to decide whether shuffle data is reliably stored when different stages have different policies) > Allow ShuffleDriverComponent to support reliable store with specified > executorId > > > Key: SPARK-45463 > URL: https://issues.apache.org/jira/browse/SPARK-45463 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1 >Reporter: zhoubin >Priority: Major > > After SPARK-42689, ShuffleDriverComponent.supportsReliableStorage is > determined globally. > Downstream projects may have different shuffle policies (caused by cluster > loads or columnar support) for different stages, for example Apache Uniffle > with Gluten, or Apache Celeborn. > In this situation, ShuffleDriverComponent should use the mapTrackerMaster to > decide whether reliable storage is supported for the specified executorId -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45463) Allow ShuffleDriverComponent to decide whether shuffle data is reliably stored when different stages have different policies
[ https://issues.apache.org/jira/browse/SPARK-45463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoubin updated SPARK-45463: Description: After SPARK-42689, ShuffleDriverComponent.supportsReliableStorage is determined globally. Downstream projects may have different shuffle policies (caused by cluster loads or columnar support) for different stages, for example Apache Uniffle with Gluten, or Apache Celeborn. In this situation, ShuffleDriverComponent should use the mapTrackerMaster to decide whether reliable storage is supported for the specified executorId was: After SPARK-42689, ShuffleDriverComponent.supportsReliableStorage is determined globally. Downstream projects may have different shuffle policies to adapt to cluster loads for different stages, for example Apache Uniffle with Gluten, or Apache Celeborn. In this situation, ShuffleDriverComponent should use the mapTrackerMaster to decide whether reliable storage is supported for the specified executorId > Allow ShuffleDriverComponent to decide whether shuffle data is reliably > stored when different stages have different policies > > > Key: SPARK-45463 > URL: https://issues.apache.org/jira/browse/SPARK-45463 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1 >Reporter: zhoubin >Priority: Major > > After SPARK-42689, ShuffleDriverComponent.supportsReliableStorage is > determined globally. > Downstream projects may have different shuffle policies (caused by cluster > loads or columnar support) for different stages, for example Apache Uniffle > with Gluten, or Apache Celeborn. > In this situation, ShuffleDriverComponent should use the mapTrackerMaster to > decide whether reliable storage is supported for the specified executorId -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45463) Allow ShuffleDriverComponent to decide whether shuffle data is reliably stored when different stages have different policies
zhoubin created SPARK-45463: --- Summary: Allow ShuffleDriverComponent to decide whether shuffle data is reliably stored when different stages have different policies Key: SPARK-45463 URL: https://issues.apache.org/jira/browse/SPARK-45463 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0, 3.5.1 Reporter: zhoubin After SPARK-42689, ShuffleDriverComponent.supportsReliableStorage is determined globally. Downstream projects may have different shuffle policies to adapt to cluster loads for different stages, for example Apache Uniffle with Gluten, or Apache Celeborn. In this situation, ShuffleDriverComponent should use the mapTrackerMaster to decide whether reliable storage is supported for the specified executorId -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45462) Show `Duration` in `ApplicationPage`
[ https://issues.apache.org/jira/browse/SPARK-45462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45462: -- Summary: Show `Duration` in `ApplicationPage` (was: Show `Duration` in ApplicationPage) > Show `Duration` in `ApplicationPage` > > > Key: SPARK-45462 > URL: https://issues.apache.org/jira/browse/SPARK-45462 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45462) Show `Duration` in `ApplicationPage`
[ https://issues.apache.org/jira/browse/SPARK-45462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45462: --- Labels: pull-request-available (was: ) > Show `Duration` in `ApplicationPage` > > > Key: SPARK-45462 > URL: https://issues.apache.org/jira/browse/SPARK-45462 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45462) Show `Duration` in ApplicationPage
Dongjoon Hyun created SPARK-45462: - Summary: Show `Duration` in ApplicationPage Key: SPARK-45462 URL: https://issues.apache.org/jira/browse/SPARK-45462 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45352) Eliminate foldable window partitions
[ https://issues.apache.org/jira/browse/SPARK-45352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuml updated SPARK-45352: -- Summary: Eliminate foldable window partitions (was: Remove window partition if partition expression are foldable) > Eliminate foldable window partitions > > > Key: SPARK-45352 > URL: https://issues.apache.org/jira/browse/SPARK-45352 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: zhuml >Priority: Major > Labels: pull-request-available > > A foldable partition expression is redundant; removing it not only simplifies > the plan, but also lets other rules, such as `LimitPushDownThroughWindow`, > take effect when all partition expressions are foldable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
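The rewrite can be illustrated with a toy expression ADT in plain Scala (these are illustrative types, not Catalyst's classes): a partition expression that folds to a constant puts every row in the same partition, so dropping it from the PARTITION BY list cannot change results.

```scala
// Minimal sketch: drop foldable expressions from a window partition spec.
sealed trait Expr { def foldable: Boolean }
final case class Literal(value: Any) extends Expr { val foldable = true }
final case class Column(name: String) extends Expr { val foldable = false }

// PARTITION BY (1, dept) partitions identically to PARTITION BY (dept),
// so the literal can be eliminated.
def eliminateFoldablePartitions(partitionSpec: Seq[Expr]): Seq[Expr] =
  partitionSpec.filterNot(_.foldable)
```

When the spec becomes empty after elimination, the window is effectively unpartitioned, which is exactly the shape rules like `LimitPushDownThroughWindow` can act on.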
[jira] [Assigned] (SPARK-45461) Introduce a mapper for StorageLevel
[ https://issues.apache.org/jira/browse/SPARK-45461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng reassigned SPARK-45461: -- Assignee: Jiaan Geng > Introduce a mapper for StorageLevel > --- > > Key: SPARK-45461 > URL: https://issues.apache.org/jira/browse/SPARK-45461 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > > Currently, StorageLevel provides fromString to obtain a StorageLevel instance > from the level's name, so developers and users have to copy the string > literal of the StorageLevel's name to set or get an instance. This forces > developers to maintain consistency by hand, which is error-prone and reduces > development efficiency. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45461) Introduce a mapper for StorageLevel
[ https://issues.apache.org/jira/browse/SPARK-45461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45461: --- Labels: pull-request-available (was: ) > Introduce a mapper for StorageLevel > --- > > Key: SPARK-45461 > URL: https://issues.apache.org/jira/browse/SPARK-45461 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Priority: Major > Labels: pull-request-available > > Currently, StorageLevel provides fromString to obtain a StorageLevel instance > from the level's name, so developers and users have to copy the string > literal of the StorageLevel's name to set or get an instance. This forces > developers to maintain consistency by hand, which is error-prone and reduces > development efficiency. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45461) Introduce a mapper for StorageLevel
Jiaan Geng created SPARK-45461: -- Summary: Introduce a mapper for StorageLevel Key: SPARK-45461 URL: https://issues.apache.org/jira/browse/SPARK-45461 Project: Spark Issue Type: Improvement Components: MLlib, Spark Core, SQL Affects Versions: 4.0.0 Reporter: Jiaan Geng Currently, StorageLevel provides fromString to obtain a StorageLevel instance from the level's name, so developers and users have to copy the string literal of the StorageLevel's name to set or get an instance. This forces developers to maintain consistency by hand, which is error-prone and reduces development efficiency. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45459) Remove the last 2 extra spaces in the automatically generated `sql-error-conditions.md` file
[ https://issues.apache.org/jira/browse/SPARK-45459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45459: -- Assignee: Apache Spark > Remove the last 2 extra spaces in the automatically generated > `sql-error-conditions.md` file > > > Key: SPARK-45459 > URL: https://issues.apache.org/jira/browse/SPARK-45459 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45459) Remove the last 2 extra spaces in the automatically generated `sql-error-conditions.md` file
[ https://issues.apache.org/jira/browse/SPARK-45459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45459: -- Assignee: (was: Apache Spark) > Remove the last 2 extra spaces in the automatically generated > `sql-error-conditions.md` file > > > Key: SPARK-45459 > URL: https://issues.apache.org/jira/browse/SPARK-45459 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45460) Replace `scala.collection.convert.ImplicitConversions` to `scala.jdk.CollectionConverters`
[ https://issues.apache.org/jira/browse/SPARK-45460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45460: --- Labels: pull-request-available (was: ) > Replace `scala.collection.convert.ImplicitConversions` to > `scala.jdk.CollectionConverters` > -- > > Key: SPARK-45460 > URL: https://issues.apache.org/jira/browse/SPARK-45460 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
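The migration this sub-task describes looks like the following: the explicit `.asScala` / `.asJava` extension methods from `scala.jdk.CollectionConverters` replace the implicit conversions in `scala.collection.convert.ImplicitConversions`, making the Java-to-Scala boundary visible at each call site.

```scala
import java.util.{ArrayList => JArrayList}
// Explicit converters; the standard replacement for the deprecated
// implicit conversions in scala.collection.convert.ImplicitConversions.
import scala.jdk.CollectionConverters._

val javaList = new JArrayList[String]()
javaList.add("a")
javaList.add("b")

// The conversion is now spelled out rather than applied implicitly.
val scalaSeq: Seq[String] = javaList.asScala.toSeq
```

Explicit conversion avoids surprising implicit wrapping (and the subtle bugs it can hide, e.g. repeated wrapping in hot paths), which is why the implicit variant was deprecated.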
[jira] [Created] (SPARK-45460) Replace `scala.collection.convert.ImplicitConversions` to `scala.jdk.CollectionConverters`
BingKun Pan created SPARK-45460: --- Summary: Replace `scala.collection.convert.ImplicitConversions` to `scala.jdk.CollectionConverters` Key: SPARK-45460 URL: https://issues.apache.org/jira/browse/SPARK-45460 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45459) Remove the last 2 extra spaces in the automatically generated `sql-error-conditions.md` file
[ https://issues.apache.org/jira/browse/SPARK-45459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45459: --- Labels: pull-request-available (was: ) > Remove the last 2 extra spaces in the automatically generated > `sql-error-conditions.md` file > > > Key: SPARK-45459 > URL: https://issues.apache.org/jira/browse/SPARK-45459 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39600) Enhance pushdown limit through window
[ https://issues.apache.org/jira/browse/SPARK-39600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-39600: --- Labels: pull-request-available (was: ) > Enhance pushdown limit through window > - > > Key: SPARK-39600 > URL: https://issues.apache.org/jira/browse/SPARK-39600 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > Labels: pull-request-available > > Improve TPC-DS q67 performance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45459) Remove the last 2 extra spaces in the automatically generated `sql-error-conditions.md` file
BingKun Pan created SPARK-45459: --- Summary: Remove the last 2 extra spaces in the automatically generated `sql-error-conditions.md` file Key: SPARK-45459 URL: https://issues.apache.org/jira/browse/SPARK-45459 Project: Spark Issue Type: Improvement Components: Documentation, SQL, Tests Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45455) [SQL][JDBC] Improve the rename interface of Postgres Dialect
[ https://issues.apache.org/jira/browse/SPARK-45455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 蔡灿材 updated SPARK-45455: Description: Improve the rename interface of pgdialect (was: Improve the rename interface of pgdialect and mysqldialec [MySQL :: MySQL 8.0 Reference Manual :: 13.1.36 RENAME TABLE Statement|https://dev.mysql.com/doc/refman/8.0/en/rename-table.html]) > [SQL][JDBC] Improve the rename interface of Postgres Dialect > > > Key: SPARK-45455 > URL: https://issues.apache.org/jira/browse/SPARK-45455 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: 蔡灿材 >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.0 > > > Improve the rename interface of pgdialect -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45458) Convert IllegalArgumentException to SparkIllegalArgumentException in bitwiseExpressions and add some UT
[ https://issues.apache.org/jira/browse/SPARK-45458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45458: --- Labels: pull-request-available (was: ) > Convert IllegalArgumentException to SparkIllegalArgumentException in > bitwiseExpressions and add some UT > --- > > Key: SPARK-45458 > URL: https://issues.apache.org/jira/browse/SPARK-45458 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45458) Convert IllegalArgumentException to SparkIllegalArgumentException in bitwiseExpressions and add some UT
BingKun Pan created SPARK-45458: --- Summary: Convert IllegalArgumentException to SparkIllegalArgumentException in bitwiseExpressions and add some UT Key: SPARK-45458 URL: https://issues.apache.org/jira/browse/SPARK-45458 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org