[jira] [Updated] (SPARK-49300) Fix Hadoop delegation token leak when tokenRenewalInterval is not set.
[ https://issues.apache.org/jira/browse/SPARK-49300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-49300: - Fix Version/s: 3.5.3 > Fix Hadoop delegation token leak when tokenRenewalInterval is not set. > -- > > Key: SPARK-49300 > URL: https://issues.apache.org/jira/browse/SPARK-49300 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.3 >Reporter: Shuyan Zhang >Assignee: Shuyan Zhang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.3 > > > If tokenRenewalInterval is not set, > HadoopFSDelegationTokenProvider#getTokenRenewalInterval will fetch some > tokens and renew them to get a interval value. These tokens do not call > cancel(), resulting in a large number of existing tokens on HDFS not being > cleared in a timely manner, causing additional pressure on the HDFS server. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49300) Fix Hadoop delegation token leak when tokenRenewalInterval is not set.
[ https://issues.apache.org/jira/browse/SPARK-49300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-49300: Assignee: Shuyan Zhang > Fix Hadoop delegation token leak when tokenRenewalInterval is not set. > -- > > Key: SPARK-49300 > URL: https://issues.apache.org/jira/browse/SPARK-49300 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.3 >Reporter: Shuyan Zhang >Assignee: Shuyan Zhang >Priority: Major > Labels: pull-request-available > > If tokenRenewalInterval is not set, > HadoopFSDelegationTokenProvider#getTokenRenewalInterval will fetch some > tokens and renew them to get a interval value. These tokens do not call > cancel(), resulting in a large number of existing tokens on HDFS not being > cleared in a timely manner, causing additional pressure on the HDFS server. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49300) Fix Hadoop delegation token leak when tokenRenewalInterval is not set.
[ https://issues.apache.org/jira/browse/SPARK-49300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-49300. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47800 [https://github.com/apache/spark/pull/47800] > Fix Hadoop delegation token leak when tokenRenewalInterval is not set. > -- > > Key: SPARK-49300 > URL: https://issues.apache.org/jira/browse/SPARK-49300 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.3 >Reporter: Shuyan Zhang >Assignee: Shuyan Zhang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > If tokenRenewalInterval is not set, > HadoopFSDelegationTokenProvider#getTokenRenewalInterval will fetch some > tokens and renew them to get a interval value. These tokens do not call > cancel(), resulting in a large number of existing tokens on HDFS not being > cleared in a timely manner, causing additional pressure on the HDFS server. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49068) "Externally Managed Environment" error when building PySpark Docker image
Chao Sun created SPARK-49068: Summary: "Externally Managed Environment" error when building PySpark Docker image Key: SPARK-49068 URL: https://issues.apache.org/jira/browse/SPARK-49068 Project: Spark Issue Type: Bug Components: Spark Docker Affects Versions: 3.5.1 Reporter: Chao Sun When trying to build Docker image based on PySpark Dockerfile in Ubuntu 20.04, I got the following error: {code} #7 19.13 error: externally-managed-environment #7 19.13 #7 19.13 × This environment is externally managed #7 19.13 ╰─> To install Python packages system-wide, try apt install #7 19.13 python3-xyz, where xyz is the package you are trying to #7 19.13 install. #7 19.13 #7 19.13 If you wish to install a non-Debian-packaged Python package, #7 19.13 create a virtual environment using python3 -m venv path/to/venv. #7 19.13 Then use path/to/venv/bin/python and path/to/venv/bin/pip. Make #7 19.13 sure you have python3-full installed. #7 19.13 #7 19.13 If you wish to install a non-Debian packaged Python application, #7 19.13 it may be easiest to use pipx install xyz, which will manage a #7 19.13 virtual environment for you. Make sure you have pipx installed. #7 19.13 #7 19.13 See /usr/share/doc/python3.12/README.venv for more information. #7 19.13 #7 19.13 note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages. #7 19.13 hint: See PEP 668 for the detailed specification. 
#7 ERROR: process "/bin/sh -c apt-get update && apt install -y python3 python3-pip && rm -rf /usr/lib/python3.11/EXTERNALLY-MANAGED && pip3 install --upgrade pip setuptools && rm -rf /root/.cache && rm -rf /var/cache/apt/* && rm -rf /var/lib/apt/lists/*" did not complete successfully: exit code: 1 {code} Looking at the [Dockerfile|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile], it does the following: {code} RUN apt-get update && \ apt install -y python3 python3-pip && \ pip3 install --upgrade pip setuptools && \ # Removed the .cache to save space rm -rf /root/.cache && rm -rf /var/cache/apt/* && rm -rf /var/lib/apt/lists/* {code} If {{pip}} was installed by the system package manager, and then we are trying to overwrite it via {{pip3 install}}, the error could happen. A simple solution would be to create a virtual environment first, install the latest pip there, and then update {{PATH}} to use that instead. Wonder if anyone else has encountered the same issue and whether it is a good idea to fix it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48613) Umbrella: Storage Partition Join Improvements
[ https://issues.apache.org/jira/browse/SPARK-48613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-48613. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47064 [https://github.com/apache/spark/pull/47064] > Umbrella: Storage Partition Join Improvements > - > > Key: SPARK-48613 > URL: https://issues.apache.org/jira/browse/SPARK-48613 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Szehon Ho >Assignee: Szehon Ho >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48655) SPJ: Add tests for shuffle skipping for aggregate queries
[ https://issues.apache.org/jira/browse/SPARK-48655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-48655: Assignee: Szehon Ho > SPJ: Add tests for shuffle skipping for aggregate queries > - > > Key: SPARK-48655 > URL: https://issues.apache.org/jira/browse/SPARK-48655 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Szehon Ho >Assignee: Szehon Ho >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48012) SPJ: Support Transfrom Expressions for One Side Shuffle
[ https://issues.apache.org/jira/browse/SPARK-48012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-48012: Assignee: Szehon Ho > SPJ: Support Transfrom Expressions for One Side Shuffle > --- > > Key: SPARK-48012 > URL: https://issues.apache.org/jira/browse/SPARK-48012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.3 >Reporter: Szehon Ho >Assignee: Szehon Ho >Priority: Major > Labels: pull-request-available > > SPARK-41471 allowed Spark to shuffle just one side and still conduct SPJ, if > the other side is KeyGroupedPartitioning. However, the support was just for > a KeyGroupedPartition without any partition transform (day, year, bucket). > It will be useful to add support for partition transform as well, as there > are many tables partitioned by those transforms. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48012) SPJ: Support Transfrom Expressions for One Side Shuffle
[ https://issues.apache.org/jira/browse/SPARK-48012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-48012. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46255 [https://github.com/apache/spark/pull/46255] > SPJ: Support Transfrom Expressions for One Side Shuffle > --- > > Key: SPARK-48012 > URL: https://issues.apache.org/jira/browse/SPARK-48012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.3 >Reporter: Szehon Ho >Assignee: Szehon Ho >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > SPARK-41471 allowed Spark to shuffle just one side and still conduct SPJ, if > the other side is KeyGroupedPartitioning. However, the support was just for > a KeyGroupedPartition without any partition transform (day, year, bucket). > It will be useful to add support for partition transform as well, as there > are many tables partitioned by those transforms. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48392) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`
[ https://issues.apache.org/jira/browse/SPARK-48392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-48392: - Summary: Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file` (was: (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`) > Load `spark-defaults.conf` when passing configurations to `spark-submit` > through `--properties-file` > > > Key: SPARK-48392 > URL: https://issues.apache.org/jira/browse/SPARK-48392 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Currently, when user pass configurations to {{spark-submit.sh}} via > {{--properties-file}}, the {{spark-defaults.conf}} will be completely > ignored. This poses issues for some people, for instance, those using [Spark > on K8S operator from kubeflow|https://github.com/kubeflow/spark-operator]. > See related issues: > - https://github.com/kubeflow/spark-operator/issues/1183 > - https://github.com/kubeflow/spark-operator/issues/1321 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48392) (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`
[ https://issues.apache.org/jira/browse/SPARK-48392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-48392. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46709 [https://github.com/apache/spark/pull/46709] > (Optionally) Load `spark-defaults.conf` when passing configurations to > `spark-submit` through `--properties-file` > - > > Key: SPARK-48392 > URL: https://issues.apache.org/jira/browse/SPARK-48392 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Currently, when user pass configurations to {{spark-submit.sh}} via > {{--properties-file}}, the {{spark-defaults.conf}} will be completely > ignored. This poses issues for some people, for instance, those using [Spark > on K8S operator from kubeflow|https://github.com/kubeflow/spark-operator]. > See related issues: > - https://github.com/kubeflow/spark-operator/issues/1183 > - https://github.com/kubeflow/spark-operator/issues/1321 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48392) (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`
[ https://issues.apache.org/jira/browse/SPARK-48392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-48392: Assignee: Chao Sun > (Optionally) Load `spark-defaults.conf` when passing configurations to > `spark-submit` through `--properties-file` > - > > Key: SPARK-48392 > URL: https://issues.apache.org/jira/browse/SPARK-48392 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: pull-request-available > > Currently, when user pass configurations to {{spark-submit.sh}} via > {{--properties-file}}, the {{spark-defaults.conf}} will be completely > ignored. This poses issues for some people, for instance, those using [Spark > on K8S operator from kubeflow|https://github.com/kubeflow/spark-operator]. > See related issues: > - https://github.com/kubeflow/spark-operator/issues/1183 > - https://github.com/kubeflow/spark-operator/issues/1321 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48392) (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`
Chao Sun created SPARK-48392: Summary: (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file` Key: SPARK-48392 URL: https://issues.apache.org/jira/browse/SPARK-48392 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.1 Reporter: Chao Sun Currently, when user pass configurations to `spark-submit.sh` via `--properties-file`, the `spark-defaults.conf` will be completely ignored. This poses issues for some people, for instance, those using [Spark on K8S operator from kubeflow|https://github.com/kubeflow/spark-operator]. See related issues: - https://github.com/kubeflow/spark-operator/issues/1183 - https://github.com/kubeflow/spark-operator/issues/1321 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48392) (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`
[ https://issues.apache.org/jira/browse/SPARK-48392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-48392: - Description: Currently, when user pass configurations to {{spark-submit.sh}} via {{--properties-file}}, the {{spark-defaults.conf}} will be completely ignored. This poses issues for some people, for instance, those using [Spark on K8S operator from kubeflow|https://github.com/kubeflow/spark-operator]. See related issues: - https://github.com/kubeflow/spark-operator/issues/1183 - https://github.com/kubeflow/spark-operator/issues/1321 was: Currently, when user pass configurations to `spark-submit.sh` via `--properties-file`, the `spark-defaults.conf` will be completely ignored. This poses issues for some people, for instance, those using [Spark on K8S operator from kubeflow|https://github.com/kubeflow/spark-operator]. See related issues: - https://github.com/kubeflow/spark-operator/issues/1183 - https://github.com/kubeflow/spark-operator/issues/1321 > (Optionally) Load `spark-defaults.conf` when passing configurations to > `spark-submit` through `--properties-file` > - > > Key: SPARK-48392 > URL: https://issues.apache.org/jira/browse/SPARK-48392 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.1 >Reporter: Chao Sun >Priority: Major > > Currently, when user pass configurations to {{spark-submit.sh}} via > {{--properties-file}}, the {{spark-defaults.conf}} will be completely > ignored. This poses issues for some people, for instance, those using [Spark > on K8S operator from kubeflow|https://github.com/kubeflow/spark-operator]. > See related issues: > - https://github.com/kubeflow/spark-operator/issues/1183 > - https://github.com/kubeflow/spark-operator/issues/1321 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48065) SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict
[ https://issues.apache.org/jira/browse/SPARK-48065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-48065. -- Resolution: Fixed Issue resolved by pull request 46325 [https://github.com/apache/spark/pull/46325] > SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict > - > > Key: SPARK-48065 > URL: https://issues.apache.org/jira/browse/SPARK-48065 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.3 >Reporter: Szehon Ho >Assignee: Szehon Ho >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > If spark.sql.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled is true, > then SPJ no longer triggers if there are more join keys than partition keys. > It is triggered only if join keys is equal to , or less than, partition keys. > > We can relax this constraint, as this case was supported if the flag is not > enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48065) SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict
[ https://issues.apache.org/jira/browse/SPARK-48065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-48065: Assignee: Szehon Ho > SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict > - > > Key: SPARK-48065 > URL: https://issues.apache.org/jira/browse/SPARK-48065 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.3 >Reporter: Szehon Ho >Assignee: Szehon Ho >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > If spark.sql.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled is true, > then SPJ no longer triggers if there are more join keys than partition keys. > It is triggered only if join keys is equal to , or less than, partition keys. > > We can relax this constraint, as this case was supported if the flag is not > enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48030) InternalRowComparableWrapper should cache rowOrdering to improve performace
[ https://issues.apache.org/jira/browse/SPARK-48030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-48030. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46265 [https://github.com/apache/spark/pull/46265] > InternalRowComparableWrapper should cache rowOrdering to improve performace > --- > > Key: SPARK-48030 > URL: https://issues.apache.org/jira/browse/SPARK-48030 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1, 3.4.3 >Reporter: YE >Assignee: YE >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: screenshot-1.png > > > InternalRowComparableWrapper recreates row ordering for each output partition > when SPJ is enabled. The row ordering is generated via codegen which is quite > expensive and the output partitions might be quite large for production table > such as hundreds of thousands partitions. We encountered this issue when > applying SPJ with multiple large Iceberg tables and the plan phase took tens > of minutes to complete. > Attaching a screenshot to provide related stack trace: > !screenshot-1.png! > A simple fix for this would be caching the rowOrdering for > InternalRowComparableWrapper as the datatype of the InternalRow is immutable -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48030) InternalRowComparableWrapper should cache rowOrdering to improve performace
[ https://issues.apache.org/jira/browse/SPARK-48030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-48030: Assignee: YE > InternalRowComparableWrapper should cache rowOrdering to improve performace > --- > > Key: SPARK-48030 > URL: https://issues.apache.org/jira/browse/SPARK-48030 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1, 3.4.3 >Reporter: YE >Assignee: YE >Priority: Major > Labels: pull-request-available > Attachments: screenshot-1.png > > > InternalRowComparableWrapper recreates row ordering for each output partition > when SPJ is enabled. The row ordering is generated via codegen which is quite > expensive and the output partitions might be quite large for production table > such as hundreds of thousands partitions. We encountered this issue when > applying SPJ with multiple large Iceberg tables and the plan phase took tens > of minutes to complete. > Attaching a screenshot to provide related stack trace: > !screenshot-1.png! > A simple fix for this would be caching the rowOrdering for > InternalRowComparableWrapper as the datatype of the InternalRow is immutable -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47094) SPJ : Dynamically rebalance number of buckets when they are not equal
[ https://issues.apache.org/jira/browse/SPARK-47094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-47094. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45267 [https://github.com/apache/spark/pull/45267] > SPJ : Dynamically rebalance number of buckets when they are not equal > - > > Key: SPARK-47094 > URL: https://issues.apache.org/jira/browse/SPARK-47094 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0, 3.4.0 >Reporter: Himadri Pal >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > SPJ: Storage Partition Join works with Iceberg tables when both the tables > have same number of buckets. As part of this feature request, we would like > spark to gather the number of buckets information from both the tables and > dynamically rebalance the number of buckets by coalesce or repartition so > that SPJ will work fine. In this case, we would still have to shuffle but > would be better than no SPJ. > Use Case : > Many times we do not have control of the input tables, hence it's not > possible to change partitioning scheme on those tables. As a consumer, we > would still like them to be used with SPJ when used with other tables and > output tables which has different number of buckets. > In these scenario, we would need to read those tables rewrite them with > matching number of buckets for the SPJ to work, this extra step could > outweigh the benefits of less shuffle via SPJ. Also when there are multiple > different tables being joined, each tables need to be rewritten with matching > number of buckets. > If this feature is implemented, SPJ functionality will be more powerful. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42040) SPJ: Introduce a new API for V2 input partition to report partition size
[ https://issues.apache.org/jira/browse/SPARK-42040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-42040. -- Fix Version/s: 4.0.0 Resolution: Fixed > SPJ: Introduce a new API for V2 input partition to report partition size > > > Key: SPARK-42040 > URL: https://issues.apache.org/jira/browse/SPARK-42040 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Assignee: Qi Zhu >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > It's useful for a {{InputPartition}} to also report its size (in bytes), so > that Spark can use the info to decide whether partition grouping should be > applied or not. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42040) SPJ: Introduce a new API for V2 input partition to report partition size
[ https://issues.apache.org/jira/browse/SPARK-42040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-42040: Assignee: Qi Zhu (was: zhuqi) > SPJ: Introduce a new API for V2 input partition to report partition size > > > Key: SPARK-42040 > URL: https://issues.apache.org/jira/browse/SPARK-42040 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Assignee: Qi Zhu >Priority: Major > Labels: pull-request-available > > It's useful for a {{InputPartition}} to also report its size (in bytes), so > that Spark can use the info to decide whether partition grouping should be > applied or not. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42040) SPJ: Introduce a new API for V2 input partition to report partition size
[ https://issues.apache.org/jira/browse/SPARK-42040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-42040: Assignee: zhuqi > SPJ: Introduce a new API for V2 input partition to report partition size > > > Key: SPARK-42040 > URL: https://issues.apache.org/jira/browse/SPARK-42040 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Assignee: zhuqi >Priority: Major > Labels: pull-request-available > > It's useful for a {{InputPartition}} to also report its size (in bytes), so > that Spark can use the info to decide whether partition grouping should be > applied or not. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45731) Update partition statistics with ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-45731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-45731. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43629 [https://github.com/apache/spark/pull/43629] > Update partition statistics with ANALYZE TABLE command > -- > > Key: SPARK-45731 > URL: https://issues.apache.org/jira/browse/SPARK-45731 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Currently {{ANALYZE TABLE}} command only updates table-level stats but not > partition stats, even though it can be applied to both non-partitioned and > partitioned tables. It seems make sense for it to update partition stats as > well. > Note users can use {{ANALYZE TABLE PARTITION(..)}} to get the same effect, > but the syntax is more verbose as they need to specify all the partition > columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45731) Update partition statistics with ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-45731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-45731: Assignee: Chao Sun > Update partition statistics with ANALYZE TABLE command > -- > > Key: SPARK-45731 > URL: https://issues.apache.org/jira/browse/SPARK-45731 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: pull-request-available > > Currently {{ANALYZE TABLE}} command only updates table-level stats but not > partition stats, even though it can be applied to both non-partitioned and > partitioned tables. It seems make sense for it to update partition stats as > well. > Note users can use {{ANALYZE TABLE PARTITION(..)}} to get the same effect, > but the syntax is more verbose as they need to specify all the partition > columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45846) spark.sql.optimizeNullAwareAntiJoin should respect spark.sql.autoBroadcastJoinThreshold
Chao Sun created SPARK-45846: Summary: spark.sql.optimizeNullAwareAntiJoin should respect spark.sql.autoBroadcastJoinThreshold Key: SPARK-45846 URL: https://issues.apache.org/jira/browse/SPARK-45846 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Chao Sun Normally broadcast join can be disabled when users set {{spark.sql.autoBroadcastJoinThreshold}} to -1. However this doesn't apply to {{spark.sql.optimizeNullAwareAntiJoin}}: {code} case j @ ExtractSingleColumnNullAwareAntiJoin(leftKeys, rightKeys) => Seq(joins.BroadcastHashJoinExec(leftKeys, rightKeys, LeftAnti, BuildRight, None, planLater(j.left), planLater(j.right), isNullAwareAntiJoin = true)) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45731) Update partition statistics with ANALYZE TABLE command
Chao Sun created SPARK-45731: Summary: Update partition statistics with ANALYZE TABLE command Key: SPARK-45731 URL: https://issues.apache.org/jira/browse/SPARK-45731 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Chao Sun Currently {{ANALYZE TABLE}} command only updates table-level stats but not partition stats, even though it can be applied to both non-partitioned and partitioned tables. It seems make sense for it to update partition stats as well. Note users can use {{ANALYZE TABLE PARTITION(..)}} to get the same effect, but the syntax is more verbose as they need to specify all the partition columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45652) SPJ: Handle empty input partitions after dynamic filtering
[ https://issues.apache.org/jira/browse/SPARK-45652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-45652: - Description: When the number of input partitions become 0 after dynamic filtering, in {{BatchScanExec}}, currently SPJ will fail with error: {code} java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135) {code} This is because {{groupPartitions}} will return {{None}} for this case. was: When the number of input partitions become 0 after dynamic filtering, in {{BatchScanExec}}, currently SPJ will fail with error: {code} java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135) at org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD$lzycompute(BosonBatchScanExec.scala:28) at org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD(BosonBatchScanExec.scala:28) at org.apache.spark.sql.boson.BosonBatchScanExec.doExecuteColumnar(BosonBatchScanExec.scala:33) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222) at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:218) at org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:521) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) {code} This is because {{groupPartitions}} will return {{None}} for this case. > SPJ: Handle empty input partitions after dynamic filtering > -- > > Key: SPARK-45652 > URL: https://issues.apache.org/jira/browse/SPARK-45652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: pull-request-available > Fix For: 3.4.2, 4.0.0, 3.5.1 > > > When the number of input partitions become 0 after dynamic filtering, in > {{BatchScanExec}}, currently SPJ will fail with error: > {code} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:529) > at scala.None$.get(Option.scala:527) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135) > {code} > This is because {{groupPartitions}} will return {{None}} for this case. 
[jira] [Assigned] (SPARK-45678) Cover BufferReleasingInputStream.available under tryOrFetchFailedException
[ https://issues.apache.org/jira/browse/SPARK-45678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-45678: Assignee: L. C. Hsieh > Cover BufferReleasingInputStream.available under tryOrFetchFailedException > -- > > Key: SPARK-45678 > URL: https://issues.apache.org/jira/browse/SPARK-45678 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > Labels: pull-request-available > > We have encountered a shuffle data corruption issue: > ``` > Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:112) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:504) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:543) > at > org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:450) > at > org.xerial.snappy.SnappyInputStream.available(SnappyInputStream.java:497) > at > org.apache.spark.storage.BufferReleasingInputStream.available(ShuffleBlockFetcherIterator.scala:1356) > ``` > Spark shuffle can detect corruption for a few stream ops such as `read` and > `skip`: an `IOException` like the one in the stack trace is rethrown as a > `FetchFailedException`, which re-tries the failed shuffle task. But in the stack > trace above it is `available` that failed, and `available` is not covered by > that mechanism. So no retry happened and the Spark application simply failed. > Since the `available` op also involves data decompression, we should be able > to check it the way `read` and `skip` do.
[jira] [Resolved] (SPARK-45678) Cover BufferReleasingInputStream.available under tryOrFetchFailedException
[ https://issues.apache.org/jira/browse/SPARK-45678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-45678. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43543 [https://github.com/apache/spark/pull/43543] > Cover BufferReleasingInputStream.available under tryOrFetchFailedException > -- > > Key: SPARK-45678 > URL: https://issues.apache.org/jira/browse/SPARK-45678 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > We have encountered a shuffle data corruption issue: > ``` > Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:112) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:504) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:543) > at > org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:450) > at > org.xerial.snappy.SnappyInputStream.available(SnappyInputStream.java:497) > at > org.apache.spark.storage.BufferReleasingInputStream.available(ShuffleBlockFetcherIterator.scala:1356) > ``` > Spark shuffle can detect corruption for a few stream ops such as `read` and > `skip`: an `IOException` like the one in the stack trace is rethrown as a > `FetchFailedException`, which re-tries the failed shuffle task. But in the stack > trace above it is `available` that failed, and `available` is not covered by > that mechanism. So no retry happened and the Spark application simply failed. > Since the `available` op also involves data decompression, we should be able > to check it the way `read` and `skip` do.
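The shape of the fix is to route `available` through the same exception-converting helper that `read` and `skip` already use. A sketch (the helper name comes from the issue title; the delegate field name and the surrounding class are abbreviated assumptions):

```scala
// Sketch: convert corruption-induced IOExceptions raised by available()
// into FetchFailedException so the shuffle task is retried, mirroring what
// read() and skip() already do inside BufferReleasingInputStream.
override def available(): Int = tryOrFetchFailedException {
  delegate.available()  // may decompress data and hit a corrupt block
}
```

Any op that can trigger decompression should sit behind this wrapper so corruption surfaces as a retryable fetch failure rather than an application-level crash.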
[jira] [Resolved] (SPARK-45652) SPJ: Handle empty input partitions after dynamic filtering
[ https://issues.apache.org/jira/browse/SPARK-45652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-45652. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43531 [https://github.com/apache/spark/pull/43531] > SPJ: Handle empty input partitions after dynamic filtering > -- > > Key: SPARK-45652 > URL: https://issues.apache.org/jira/browse/SPARK-45652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > When the number of input partitions become 0 after dynamic filtering, in > {{BatchScanExec}}, currently SPJ will fail with error: > {code} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:529) > at scala.None$.get(Option.scala:527) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135) > at > org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD$lzycompute(BosonBatchScanExec.scala:28) > at > org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD(BosonBatchScanExec.scala:28) > at > org.apache.spark.sql.boson.BosonBatchScanExec.doExecuteColumnar(BosonBatchScanExec.scala:33) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) > at > 
org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:218) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:521) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > {code} > This is because {{groupPartitions}} will return {{None}} for this case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45652) SPJ: Handle empty input partitions after dynamic filtering
[ https://issues.apache.org/jira/browse/SPARK-45652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-45652: Assignee: Chao Sun > SPJ: Handle empty input partitions after dynamic filtering > -- > > Key: SPARK-45652 > URL: https://issues.apache.org/jira/browse/SPARK-45652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: pull-request-available > > When the number of input partitions become 0 after dynamic filtering, in > {{BatchScanExec}}, currently SPJ will fail with error: > {code} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:529) > at scala.None$.get(Option.scala:527) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135) > at > org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD$lzycompute(BosonBatchScanExec.scala:28) > at > org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD(BosonBatchScanExec.scala:28) > at > org.apache.spark.sql.boson.BosonBatchScanExec.doExecuteColumnar(BosonBatchScanExec.scala:33) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) > at > org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:218) > at > 
org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:521) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > {code} > This is because {{groupPartitions}} will return {{None}} for this case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45652) SPJ: Handle empty input partitions after dynamic filtering
Chao Sun created SPARK-45652: Summary: SPJ: Handle empty input partitions after dynamic filtering Key: SPARK-45652 URL: https://issues.apache.org/jira/browse/SPARK-45652 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.1 Reporter: Chao Sun When the number of input partitions becomes 0 after dynamic filtering in {{BatchScanExec}}, SPJ currently fails with the following error: {code} java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135) at org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD$lzycompute(BosonBatchScanExec.scala:28) at org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD(BosonBatchScanExec.scala:28) at org.apache.spark.sql.boson.BosonBatchScanExec.doExecuteColumnar(BosonBatchScanExec.scala:33) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:218) at org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:521) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) {code} This is because {{groupPartitions}} will return {{None}} for this case.
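A minimal, self-contained sketch of the failure mode and one way to handle it. Here `groupPartitions` is a stand-in that returns `None` for an undefined grouping, like the behavior described above; it is not Spark's actual implementation:

```scala
// Stand-in for the grouping logic: None when there is nothing to group.
def groupPartitions(parts: Seq[Int]): Option[Seq[Seq[Int]]] =
  if (parts.isEmpty) None else Some(parts.map(Seq(_)))

val filtered: Seq[Int] = Seq.empty   // all partitions pruned by dynamic filtering

// Unconditional .get reproduces the bug:
//   groupPartitions(filtered).get  // throws java.util.NoSuchElementException: None.get

// Handling the empty case yields an empty scan instead of a crash:
val grouped: Seq[Seq[Int]] = groupPartitions(filtered).getOrElse(Seq.empty)
assert(grouped.isEmpty)
```

The actual fix needs the equivalent of the `getOrElse` branch wherever `filteredPartitions` consumes the grouping result.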
[jira] [Assigned] (SPARK-44913) DS V2 supports push down V2 UDF that has magic method
[ https://issues.apache.org/jira/browse/SPARK-44913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-44913: Assignee: Xianyang Liu > DS V2 supports push down V2 UDF that has magic method > - > > Key: SPARK-44913 > URL: https://issues.apache.org/jira/browse/SPARK-44913 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: Xianyang Liu >Assignee: Xianyang Liu >Priority: Major > Labels: pull-request-available > > Right now we only support pushing down a V2 UDF that does not have a magic > method, because such a UDF is analyzed into an `ApplyFunctionExpression`, > which can be translated and pushed down. However, a V2 UDF that has a magic > method is analyzed into `StaticInvoke` or `Invoke`, which cannot be translated > into a V2 expression and therefore cannot be pushed down to the data source. > Since the magic method is the recommended approach, this PR adds support for > pushing down a V2 UDF that has a magic method.
[jira] [Resolved] (SPARK-44913) DS V2 supports push down V2 UDF that has magic method
[ https://issues.apache.org/jira/browse/SPARK-44913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-44913. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42612 [https://github.com/apache/spark/pull/42612] > DS V2 supports push down V2 UDF that has magic method > - > > Key: SPARK-44913 > URL: https://issues.apache.org/jira/browse/SPARK-44913 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: Xianyang Liu >Assignee: Xianyang Liu >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Right now we only support pushing down a V2 UDF that does not have a magic > method, because such a UDF is analyzed into an `ApplyFunctionExpression`, > which can be translated and pushed down. However, a V2 UDF that has a magic > method is analyzed into `StaticInvoke` or `Invoke`, which cannot be translated > into a V2 expression and therefore cannot be pushed down to the data source. > Since the magic method is the recommended approach, this PR adds support for > pushing down a V2 UDF that has a magic method.
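For context, a "magic method" V2 UDF is a `ScalarFunction` that exposes a statically-typed method named `invoke`, which Spark calls directly instead of the row-based `produceResult`. A hedged example (the class and behavior here are illustrative, not taken from the PR):

```scala
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Illustrative V2 UDF with a magic method. Because `invoke` exists, the
// analyzer plans this as StaticInvoke/Invoke rather than
// ApplyFunctionExpression -- which, before this change, blocked pushdown.
class StrLen extends ScalarFunction[Int] {
  override def inputTypes(): Array[DataType] = Array(StringType)
  override def resultType(): DataType = IntegerType
  override def name(): String = "strlen"

  // The magic method: statically-typed, avoids per-row boxing through
  // produceResult(InternalRow).
  def invoke(s: UTF8String): Int = s.numChars()
}
```

The change makes functions like this eligible for translation into V2 expressions and pushdown, the same as their non-magic counterparts.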
[jira] [Assigned] (SPARK-45365) Allow the daily tests of branch-3.4 to use the new test group tags
[ https://issues.apache.org/jira/browse/SPARK-45365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-45365: Assignee: Yang Jie > Allow the daily tests of branch-3.4 to use the new test group tags > -- > > Key: SPARK-45365 > URL: https://issues.apache.org/jira/browse/SPARK-45365 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45365) Allow the daily tests of branch-3.4 to use the new test group tags
[ https://issues.apache.org/jira/browse/SPARK-45365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-45365. -- Fix Version/s: 4.0.0 Resolution: Fixed > Allow the daily tests of branch-3.4 to use the new test group tags > -- > > Key: SPARK-45365 > URL: https://issues.apache.org/jira/browse/SPARK-45365 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36695) Allow passing V2 functions to data sources via V2 filters
[ https://issues.apache.org/jira/browse/SPARK-36695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-36695. -- Resolution: Duplicate > Allow passing V2 functions to data sources via V2 filters > - > > Key: SPARK-36695 > URL: https://issues.apache.org/jira/browse/SPARK-36695 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > The V2 filter API currently only allows {{NamedReference}} in predicates that > are pushed down to data sources. It may be beneficial to allow V2 functions > in predicates as well so that we can implement function pushdown. This > feature is also supported by Trino (Presto). > One use case: we can push down predicates such as {{bucket(col, 32) = 10}}, > which allows data sources such as Iceberg to only scan a single partition.
[jira] [Updated] (SPARK-44647) SPJ: Support SPJ when join key is subset of partition keys
[ https://issues.apache.org/jira/browse/SPARK-44647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-44647: - Summary: SPJ: Support SPJ when join key is subset of partition keys (was: Support SPJ when join key is subset of partition keys) > SPJ: Support SPJ when join key is subset of partition keys > -- > > Key: SPARK-44647 > URL: https://issues.apache.org/jira/browse/SPARK-44647 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Szehon Ho >Assignee: Szehon Ho >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45054) HiveExternalCatalog.listPartitions should restore Spark SQL stats
[ https://issues.apache.org/jira/browse/SPARK-45054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-45054: - Affects Version/s: 3.4.1 3.3.2 3.2.4 (was: 3.5.0) > HiveExternalCatalog.listPartitions should restore Spark SQL stats > - > > Key: SPARK-45054 > URL: https://issues.apache.org/jira/browse/SPARK-45054 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.4, 3.3.2, 3.4.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.4.2, 3.5.0, 4.0.0 > > > If partitions are stored in HMS with Spark populated stats such as > {{spark.sql.statistics.totalSize}}, currently > {{HiveExternalCatalog.listPartitions}} doesn't call > {{restorePartitionMetadata}} to restore those stats. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45054) HiveExternalCatalog.listPartitions should restore Spark SQL stats
[ https://issues.apache.org/jira/browse/SPARK-45054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-45054: - Fix Version/s: 3.4.2 3.5.0 > HiveExternalCatalog.listPartitions should restore Spark SQL stats > - > > Key: SPARK-45054 > URL: https://issues.apache.org/jira/browse/SPARK-45054 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.4.2, 3.5.0, 4.0.0 > > > If partitions are stored in HMS with Spark populated stats such as > {{spark.sql.statistics.totalSize}}, currently > {{HiveExternalCatalog.listPartitions}} doesn't call > {{restorePartitionMetadata}} to restore those stats. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45054) HiveExternalCatalog.listPartitions should restore Spark SQL stats
[ https://issues.apache.org/jira/browse/SPARK-45054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-45054. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42777 [https://github.com/apache/spark/pull/42777] > HiveExternalCatalog.listPartitions should restore Spark SQL stats > - > > Key: SPARK-45054 > URL: https://issues.apache.org/jira/browse/SPARK-45054 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 4.0.0 > > > If partitions are stored in HMS with Spark populated stats such as > {{spark.sql.statistics.totalSize}}, currently > {{HiveExternalCatalog.listPartitions}} doesn't call > {{restorePartitionMetadata}} to restore those stats. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45054) HiveExternalCatalog.listPartitions should restore Spark SQL stats
[ https://issues.apache.org/jira/browse/SPARK-45054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-45054: Assignee: Chao Sun > HiveExternalCatalog.listPartitions should restore Spark SQL stats > - > > Key: SPARK-45054 > URL: https://issues.apache.org/jira/browse/SPARK-45054 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > If partitions are stored in HMS with Spark populated stats such as > {{spark.sql.statistics.totalSize}}, currently > {{HiveExternalCatalog.listPartitions}} doesn't call > {{restorePartitionMetadata}} to restore those stats. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45054) HiveExternalCatalog.listPartitions should restore Spark SQL stats
Chao Sun created SPARK-45054: Summary: HiveExternalCatalog.listPartitions should restore Spark SQL stats Key: SPARK-45054 URL: https://issues.apache.org/jira/browse/SPARK-45054 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Chao Sun If partitions are stored in HMS with Spark-populated stats such as {{spark.sql.statistics.totalSize}}, currently {{HiveExternalCatalog.listPartitions}} doesn't call {{restorePartitionMetadata}} to restore those stats.
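The shape of the fix is small: map the restore helper over each partition before returning. A sketch (signatures and helper calls are simplified assumptions, not the real `HiveExternalCatalog` code):

```scala
// Sketch: apply restorePartitionMetadata to every partition returned by
// listPartitions, so Spark-populated table properties (e.g.
// spark.sql.statistics.totalSize) are converted back into CatalogStatistics,
// the same way the single-partition getPartition path already does.
def listPartitions(
    db: String,
    table: String,
    partialSpec: Option[Map[String, String]]): Seq[CatalogTablePartition] = {
  val catalogTable = getTable(db, table)
  rawListPartitions(db, table, partialSpec).map { part =>
    restorePartitionMetadata(part, catalogTable)  // restore Spark SQL stats
  }
}
```

Without this, callers of `listPartitions` see partitions whose stats exist only as raw HMS properties, so Spark's optimizer treats them as having no statistics.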
[jira] [Created] (SPARK-45036) SPJ: Refactor logic to handle partially clustered distribution
Chao Sun created SPARK-45036: Summary: SPJ: Refactor logic to handle partially clustered distribution Key: SPARK-45036 URL: https://issues.apache.org/jira/browse/SPARK-45036 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.5.0 Reporter: Chao Sun The current logic in handling partially clustered distribution is a bit complicated. This JIRA proposes to simplify it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41471) SPJ: Reduce Spark shuffle when only one side of a join is KeyGroupedPartitioning
[ https://issues.apache.org/jira/browse/SPARK-41471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-41471: Assignee: Jia Fan > SPJ: Reduce Spark shuffle when only one side of a join is > KeyGroupedPartitioning > > > Key: SPARK-41471 > URL: https://issues.apache.org/jira/browse/SPARK-41471 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Assignee: Jia Fan >Priority: Major > Fix For: 4.0.0 > > > When only one side of a SPJ (Storage-Partitioned Join) is > {{{}KeyGroupedPartitioning{}}}, Spark currently needs to shuffle both sides > using {{{}HashPartitioning{}}}. However, we may just need to shuffle the > other side according to the partition transforms defined in > {{{}KeyGroupedPartitioning{}}}. This is especially useful when the other side > is relatively small. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41471) SPJ: Reduce Spark shuffle when only one side of a join is KeyGroupedPartitioning
[ https://issues.apache.org/jira/browse/SPARK-41471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-41471. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42194 [https://github.com/apache/spark/pull/42194] > SPJ: Reduce Spark shuffle when only one side of a join is > KeyGroupedPartitioning > > > Key: SPARK-41471 > URL: https://issues.apache.org/jira/browse/SPARK-41471 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Priority: Major > Fix For: 4.0.0 > > > When only one side of a SPJ (Storage-Partitioned Join) is > {{{}KeyGroupedPartitioning{}}}, Spark currently needs to shuffle both sides > using {{{}HashPartitioning{}}}. However, we may just need to shuffle the > other side according to the partition transforms defined in > {{{}KeyGroupedPartitioning{}}}. This is especially useful when the other side > is relatively small. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
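After this change, shuffling only one side of the join is gated by a configuration flag. The config name below is taken from the Spark source for this feature but should be verified against your Spark version:

```scala
// Sketch: opt in to one-side-shuffle SPJ, so the non-KeyGroupedPartitioning
// side is re-shuffled to match the other side's partition transforms instead
// of hash-shuffling both sides.
spark.conf.set("spark.sql.sources.v2.bucketing.shuffle.enabled", "true")
```

This is most beneficial when the non-key-grouped side is small relative to the storage-partitioned side, since only the small side pays the shuffle cost.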
[jira] [Resolved] (SPARK-44641) SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but conditions unmet
[ https://issues.apache.org/jira/browse/SPARK-44641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-44641. -- Fix Version/s: 3.4.2 3.5.0 Assignee: Chao Sun Resolution: Fixed > SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but > conditions unmet > -- > > Key: SPARK-44641 > URL: https://issues.apache.org/jira/browse/SPARK-44641 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: Szehon Ho >Assignee: Chao Sun >Priority: Blocker > Fix For: 3.4.2, 3.5.0 > > > Adding the following test case in KeyGroupedPartitionSuite demonstrates the > problem. > > {code:java} > test("test join key is the second partition key and a transform") { > val items_partitions = Array(bucket(8, "id"), days("arrive_time")) > createTable(items, items_schema, items_partitions) > sql(s"INSERT INTO testcat.ns.$items VALUES " + > s"(1, 'aa', 40.0, cast('2020-01-01' as timestamp)), " + > s"(1, 'aa', 41.0, cast('2020-01-15' as timestamp)), " + > s"(2, 'bb', 10.0, cast('2020-01-01' as timestamp)), " + > s"(2, 'bb', 10.5, cast('2020-01-01' as timestamp)), " + > s"(3, 'cc', 15.5, cast('2020-02-01' as timestamp))") > val purchases_partitions = Array(bucket(8, "item_id"), days("time")) > createTable(purchases, purchases_schema, purchases_partitions) > sql(s"INSERT INTO testcat.ns.$purchases VALUES " + > s"(1, 42.0, cast('2020-01-01' as timestamp)), " + > s"(1, 44.0, cast('2020-01-15' as timestamp)), " + > s"(1, 45.0, cast('2020-01-15' as timestamp)), " + > s"(2, 11.0, cast('2020-01-01' as timestamp)), " + > s"(3, 19.5, cast('2020-02-01' as timestamp))") > withSQLConf( > SQLConf.REQUIRE_ALL_CLUSTER_KEYS_FOR_CO_PARTITION.key -> "false", > SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true", > SQLConf.V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED.key -> > "true") { > val df = sql("SELECT id, name, i.price as purchase_price, " + > "p.item_id, p.price as sale_price " + > s"FROM 
testcat.ns.$items i JOIN testcat.ns.$purchases p " + > "ON i.arrive_time = p.time " + > "ORDER BY id, purchase_price, p.item_id, sale_price") > val shuffles = collectShuffles(df.queryExecution.executedPlan) > assert(!shuffles.isEmpty, "should not perform SPJ as not all join keys > are partition keys") > checkAnswer(df, > Seq( > Row(1, "aa", 40.0, 1, 42.0), > Row(1, "aa", 40.0, 2, 11.0), > Row(1, "aa", 41.0, 1, 44.0), > Row(1, "aa", 41.0, 1, 45.0), > Row(2, "bb", 10.0, 1, 42.0), > Row(2, "bb", 10.0, 2, 11.0), > Row(2, "bb", 10.5, 1, 42.0), > Row(2, "bb", 10.5, 2, 11.0), > Row(3, "cc", 15.5, 3, 19.5) > ) > ) > } > }{code} > > Note: this test has set up the DataSource V2 to return multiple splits for > the same partition. > In this case, SPJ is not triggered (because the join key does not match the partition > key), but the following code in DSV2Scan: > [https://github.com/apache/spark/blob/v3.4.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L194] > intended to fill the empty partitions for the 'pushdown-value' case will still iterate > through the non-grouped partitions and look up the grouped partitions to fill the > map, resulting in some duplicate input data being fed into the join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
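The duplication described in this report can be modeled outside of Spark. Below is a minimal Python sketch (hypothetical names and shapes, not the actual BatchScanExec code) of why driving the fill loop from the non-grouped split list duplicates rows once a partition contributes more than one split:

```python
from collections import defaultdict

# Hypothetical model: each "split" is (partition_value, rows). In
# partially-clustered mode, splits are first grouped by partition value.
def group_splits(splits):
    grouped = defaultdict(list)
    for part_value, rows in splits:
        grouped[part_value].extend(rows)
    return grouped

def buggy_fill(splits, grouped):
    # Bug sketch: iterate the *non-grouped* split list and look up the
    # grouped map per split -- a partition with N splits is emitted N times.
    out = []
    for part_value, _rows in splits:
        out.extend(grouped[part_value])
    return out

def fixed_fill(grouped):
    # Fix sketch: iterate each grouped partition exactly once.
    out = []
    for rows in grouped.values():
        out.extend(rows)
    return out

# Two of the three splits belong to the same partition value.
splits = [("2020-01-01", [1]), ("2020-01-15", [2]), ("2020-01-15", [3])]
grouped = group_splits(splits)
assert buggy_fill(splits, grouped) == [1, 2, 3, 2, 3]  # rows 2, 3 duplicated
assert fixed_fill(grouped) == [1, 2, 3]                # each row exactly once
```

The duplicated rows then feed straight into the join, which matches the extra matched rows seen in the test's `checkAnswer` failure.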
[jira] [Commented] (SPARK-44660) Relax constraint for columnar shuffle check in AQE
[ https://issues.apache.org/jira/browse/SPARK-44660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750881#comment-17750881 ] Chao Sun commented on SPARK-44660: -- In fact, the check is necessary, but it seems {code} postStageCreationRules(outputsColumnar = plan.supportsColumnar) {code} can be relaxed: if the new shuffle operator supports columnar, then maybe we shouldn't insert {{ColumnarToRow}} into this stage. This assumes the following stage knows the shuffle output is columnar and has a corresponding {{ColumnarToRow}} if necessary. > Relax constraint for columnar shuffle check in AQE > -- > > Key: SPARK-44660 > URL: https://issues.apache.org/jira/browse/SPARK-44660 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: Chao Sun >Priority: Major > > Currently in AQE, after evaluating the columnar rules, Spark will check if > the top operator of the stage is still a shuffle operator, and throw an > exception if it isn't. > {code} > val optimized = e.withNewChildren(Seq(optimizeQueryStage(e.child, > isFinalStage = false))) > val newPlan = applyPhysicalRules( > optimized, > postStageCreationRules(outputsColumnar = plan.supportsColumnar), > Some((planChangeLogger, "AQE Post Stage Creation"))) > if (e.isInstanceOf[ShuffleExchangeLike]) { > if (!newPlan.isInstanceOf[ShuffleExchangeLike]) { > throw SparkException.internalError( > "Custom columnar rules cannot transform shuffle node to > something else.") > } > {code} > However, once a shuffle operator is transformed into a custom columnar > shuffle operator, the {{supportsColumnar}} of the new shuffle operator will > return true, and therefore the columnar rules will insert {{ColumnarToRow}} > on top of it. This means the {{newPlan}} is likely no longer a > {{ShuffleExchangeLike}} but a {{ColumnarToRow}}, and an exception will be > thrown, even though the use case is valid. > This JIRA proposes to relax the check by allowing the above case. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
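The proposed relaxation above amounts to also accepting a columnar-to-row conversion sitting directly on top of the shuffle. A small Python sketch of the check's shape (illustrative class names mirroring Spark's `ShuffleExchangeLike` / `ColumnarToRowExec`, not the real planner API):

```python
# Toy plan nodes -- names are stand-ins for Spark's physical operators.
class PlanNode:
    def __init__(self, child=None):
        self.child = child

class ShuffleExchangeLike(PlanNode):
    pass

class ColumnarToRow(PlanNode):
    pass

def check_post_stage_plan(new_plan):
    # Original check: the stage root must still be a shuffle node.
    if isinstance(new_plan, ShuffleExchangeLike):
        return True
    # Relaxed check sketch: a ColumnarToRow directly wrapping a columnar
    # shuffle is also accepted -- the shuffle is still there underneath.
    if (isinstance(new_plan, ColumnarToRow)
            and isinstance(new_plan.child, ShuffleExchangeLike)):
        return True
    raise RuntimeError(
        "Custom columnar rules cannot transform shuffle node to something else.")

assert check_post_stage_plan(ShuffleExchangeLike())
assert check_post_stage_plan(ColumnarToRow(ShuffleExchangeLike()))
```

Anything else at the stage root (e.g. a conversion over a non-shuffle node) would still fail, preserving the original safety net.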
[jira] [Updated] (SPARK-44641) SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but conditions unmet
[ https://issues.apache.org/jira/browse/SPARK-44641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-44641: - Priority: Blocker (was: Major) > SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but > conditions unmet > -- > > Key: SPARK-44641 > URL: https://issues.apache.org/jira/browse/SPARK-44641 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.1 >Reporter: Szehon Ho >Priority: Blocker > > Adding the following test case in KeyGroupedPartitionSuite demonstrates the > problem. > > {code:java} > test("test join key is the second partition key and a transform") { > val items_partitions = Array(bucket(8, "id"), days("arrive_time")) > createTable(items, items_schema, items_partitions) > sql(s"INSERT INTO testcat.ns.$items VALUES " + > s"(1, 'aa', 40.0, cast('2020-01-01' as timestamp)), " + > s"(1, 'aa', 41.0, cast('2020-01-15' as timestamp)), " + > s"(2, 'bb', 10.0, cast('2020-01-01' as timestamp)), " + > s"(2, 'bb', 10.5, cast('2020-01-01' as timestamp)), " + > s"(3, 'cc', 15.5, cast('2020-02-01' as timestamp))") > val purchases_partitions = Array(bucket(8, "item_id"), days("time")) > createTable(purchases, purchases_schema, purchases_partitions) > sql(s"INSERT INTO testcat.ns.$purchases VALUES " + > s"(1, 42.0, cast('2020-01-01' as timestamp)), " + > s"(1, 44.0, cast('2020-01-15' as timestamp)), " + > s"(1, 45.0, cast('2020-01-15' as timestamp)), " + > s"(2, 11.0, cast('2020-01-01' as timestamp)), " + > s"(3, 19.5, cast('2020-02-01' as timestamp))") > withSQLConf( > SQLConf.REQUIRE_ALL_CLUSTER_KEYS_FOR_CO_PARTITION.key -> "false", > SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true", > SQLConf.V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED.key -> > "true") { > val df = sql("SELECT id, name, i.price as purchase_price, " + > "p.item_id, p.price as sale_price " + > s"FROM testcat.ns.$items i JOIN testcat.ns.$purchases p " + > "ON i.arrive_time = p.time " + > 
"ORDER BY id, purchase_price, p.item_id, sale_price") > val shuffles = collectShuffles(df.queryExecution.executedPlan) > assert(!shuffles.isEmpty, "should not perform SPJ as not all join keys > are partition keys") > checkAnswer(df, > Seq( > Row(1, "aa", 40.0, 1, 42.0), > Row(1, "aa", 40.0, 2, 11.0), > Row(1, "aa", 41.0, 1, 44.0), > Row(1, "aa", 41.0, 1, 45.0), > Row(2, "bb", 10.0, 1, 42.0), > Row(2, "bb", 10.0, 2, 11.0), > Row(2, "bb", 10.5, 1, 42.0), > Row(2, "bb", 10.5, 2, 11.0), > Row(3, "cc", 15.5, 3, 19.5) > ) > ) > } > }{code} > > Note: this test has set up the DataSource V2 to return multiple splits for > the same partition. > In this case, SPJ is not triggered (because the join key does not match the partition > key), but the following code in DSV2Scan: > [https://github.com/apache/spark/blob/v3.4.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L194] > intended to fill the empty partitions for the 'pushdown-value' case will still iterate > through the non-grouped partitions and look up the grouped partitions to fill the > map, resulting in some duplicate input data being fed into the join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44660) Relax constraint for columnar shuffle check in AQE
Chao Sun created SPARK-44660: Summary: Relax constraint for columnar shuffle check in AQE Key: SPARK-44660 URL: https://issues.apache.org/jira/browse/SPARK-44660 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.1 Reporter: Chao Sun Currently in AQE, after evaluating the columnar rules, Spark will check if the top operator of the stage is still a shuffle operator, and throw an exception if it isn't. {code} val optimized = e.withNewChildren(Seq(optimizeQueryStage(e.child, isFinalStage = false))) val newPlan = applyPhysicalRules( optimized, postStageCreationRules(outputsColumnar = plan.supportsColumnar), Some((planChangeLogger, "AQE Post Stage Creation"))) if (e.isInstanceOf[ShuffleExchangeLike]) { if (!newPlan.isInstanceOf[ShuffleExchangeLike]) { throw SparkException.internalError( "Custom columnar rules cannot transform shuffle node to something else.") } {code} However, once a shuffle operator is transformed into a custom columnar shuffle operator, the {{supportsColumnar}} of the new shuffle operator will return true, and therefore the columnar rules will insert {{ColumnarToRow}} on top of it. This means the {{newPlan}} is likely no longer a {{ShuffleExchangeLike}} but a {{ColumnarToRow}}, and an exception will be thrown, even though the use case is valid. This JIRA proposes to relax the check by allowing the above case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44659) SPJ: Include keyGroupedPartitioning in StoragePartitionJoinParams equality check
[ https://issues.apache.org/jira/browse/SPARK-44659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-44659: - Summary: SPJ: Include keyGroupedPartitioning in StoragePartitionJoinParams equality check (was: Include keyGroupedPartitioning in StoragePartitionJoinParams equality check) > SPJ: Include keyGroupedPartitioning in StoragePartitionJoinParams equality > check > > > Key: SPARK-44659 > URL: https://issues.apache.org/jira/browse/SPARK-44659 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Chao Sun >Priority: Minor > > Currently {{StoragePartitionJoinParams}} doesn't include > {{keyGroupedPartitioning}} in its {{equals}} and {{hashCode}} computation. > For completeness, we should include it as well since it is a member of the > class. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
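The completeness argument above (every member should participate in `equals`/`hashCode`) can be illustrated generically. The sketch below uses Python dataclasses as a stand-in for the Scala case-class-style params object; the field names are illustrative, not the exact `StoragePartitionJoinParams` signature:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical stand-in for StoragePartitionJoinParams. A frozen dataclass
# derives __eq__ and __hash__ from *all* fields, which is the behavior the
# ticket asks for: omit a field and two params objects that differ only in
# their key-grouped partitioning would wrongly compare equal (and collide
# in hash-based caches keyed by the params).
@dataclass(frozen=True)
class SPJParams:
    key_grouped_partitioning: Optional[Tuple[str, ...]]
    common_partition_values: Optional[Tuple[str, ...]]

a = SPJParams(key_grouped_partitioning=("id",), common_partition_values=("p1",))
b = SPJParams(key_grouped_partitioning=("id", "time"), common_partition_values=("p1",))

# Same common partition values, different partitioning: must not be equal.
assert a != b
assert len({a, b}) == 2  # both survive in a hash set
```

In Scala, the analogous fix is simply making `equals` and `hashCode` cover the missing member (or letting a case class derive them).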
[jira] [Updated] (SPARK-44641) SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but conditions unmet
[ https://issues.apache.org/jira/browse/SPARK-44641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-44641: - Summary: SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but conditions unmet (was: Results duplicated when SPJ partial-cluster and pushdown enabled but conditions unmet) > SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but > conditions unmet > -- > > Key: SPARK-44641 > URL: https://issues.apache.org/jira/browse/SPARK-44641 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.1 >Reporter: Szehon Ho >Priority: Major > > Adding the following test case in KeyGroupedPartitionSuite demonstrates the > problem. > > {code:java} > test("test join key is the second partition key and a transform") { > val items_partitions = Array(bucket(8, "id"), days("arrive_time")) > createTable(items, items_schema, items_partitions) > sql(s"INSERT INTO testcat.ns.$items VALUES " + > s"(1, 'aa', 40.0, cast('2020-01-01' as timestamp)), " + > s"(1, 'aa', 41.0, cast('2020-01-15' as timestamp)), " + > s"(2, 'bb', 10.0, cast('2020-01-01' as timestamp)), " + > s"(2, 'bb', 10.5, cast('2020-01-01' as timestamp)), " + > s"(3, 'cc', 15.5, cast('2020-02-01' as timestamp))") > val purchases_partitions = Array(bucket(8, "item_id"), days("time")) > createTable(purchases, purchases_schema, purchases_partitions) > sql(s"INSERT INTO testcat.ns.$purchases VALUES " + > s"(1, 42.0, cast('2020-01-01' as timestamp)), " + > s"(1, 44.0, cast('2020-01-15' as timestamp)), " + > s"(1, 45.0, cast('2020-01-15' as timestamp)), " + > s"(2, 11.0, cast('2020-01-01' as timestamp)), " + > s"(3, 19.5, cast('2020-02-01' as timestamp))") > withSQLConf( > SQLConf.REQUIRE_ALL_CLUSTER_KEYS_FOR_CO_PARTITION.key -> "false", > SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true", > SQLConf.V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED.key -> > "true") { > val df = sql("SELECT id, name, i.price as 
purchase_price, " + > "p.item_id, p.price as sale_price " + > s"FROM testcat.ns.$items i JOIN testcat.ns.$purchases p " + > "ON i.arrive_time = p.time " + > "ORDER BY id, purchase_price, p.item_id, sale_price") > val shuffles = collectShuffles(df.queryExecution.executedPlan) > assert(!shuffles.isEmpty, "should not perform SPJ as not all join keys > are partition keys") > checkAnswer(df, > Seq( > Row(1, "aa", 40.0, 1, 42.0), > Row(1, "aa", 40.0, 2, 11.0), > Row(1, "aa", 41.0, 1, 44.0), > Row(1, "aa", 41.0, 1, 45.0), > Row(2, "bb", 10.0, 1, 42.0), > Row(2, "bb", 10.0, 2, 11.0), > Row(2, "bb", 10.5, 1, 42.0), > Row(2, "bb", 10.5, 2, 11.0), > Row(3, "cc", 15.5, 3, 19.5) > ) > ) > } > }{code} > > Note: this test has set up the DataSource V2 to return multiple splits for > the same partition. > In this case, SPJ is not triggered (because the join key does not match the partition > key), but the following code in DSV2Scan: > [https://github.com/apache/spark/blob/v3.4.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L194] > intended to fill the empty partitions for the 'pushdown-value' case will still iterate > through the non-grouped partitions and look up the grouped partitions to fill the > map, resulting in some duplicate input data being fed into the join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44659) Include keyGroupedPartitioning in StoragePartitionJoinParams equality check
Chao Sun created SPARK-44659: Summary: Include keyGroupedPartitioning in StoragePartitionJoinParams equality check Key: SPARK-44659 URL: https://issues.apache.org/jira/browse/SPARK-44659 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.5.0 Reporter: Chao Sun Currently {{StoragePartitionJoinParams}} doesn't include {{keyGroupedPartitioning}} in its {{equals}} and {{hashCode}} computation. For completeness, we should include it as well since it is a member of the class. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44641) Results duplicated when SPJ partial-cluster and pushdown enabled but conditions unmet
[ https://issues.apache.org/jira/browse/SPARK-44641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-44641: - Parent: SPARK-37375 Issue Type: Sub-task (was: Bug) > Results duplicated when SPJ partial-cluster and pushdown enabled but > conditions unmet > - > > Key: SPARK-44641 > URL: https://issues.apache.org/jira/browse/SPARK-44641 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.1 >Reporter: Szehon Ho >Priority: Major > > Adding the following test case in KeyGroupedPartitionSuite demonstrates the > problem. > > {code:java} > test("test join key is the second partition key and a transform") { > val items_partitions = Array(bucket(8, "id"), days("arrive_time")) > createTable(items, items_schema, items_partitions) > sql(s"INSERT INTO testcat.ns.$items VALUES " + > s"(1, 'aa', 40.0, cast('2020-01-01' as timestamp)), " + > s"(1, 'aa', 41.0, cast('2020-01-15' as timestamp)), " + > s"(2, 'bb', 10.0, cast('2020-01-01' as timestamp)), " + > s"(2, 'bb', 10.5, cast('2020-01-01' as timestamp)), " + > s"(3, 'cc', 15.5, cast('2020-02-01' as timestamp))") > val purchases_partitions = Array(bucket(8, "item_id"), days("time")) > createTable(purchases, purchases_schema, purchases_partitions) > sql(s"INSERT INTO testcat.ns.$purchases VALUES " + > s"(1, 42.0, cast('2020-01-01' as timestamp)), " + > s"(1, 44.0, cast('2020-01-15' as timestamp)), " + > s"(1, 45.0, cast('2020-01-15' as timestamp)), " + > s"(2, 11.0, cast('2020-01-01' as timestamp)), " + > s"(3, 19.5, cast('2020-02-01' as timestamp))") > withSQLConf( > SQLConf.REQUIRE_ALL_CLUSTER_KEYS_FOR_CO_PARTITION.key -> "false", > SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true", > SQLConf.V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED.key -> > "true") { > val df = sql("SELECT id, name, i.price as purchase_price, " + > "p.item_id, p.price as sale_price " + > s"FROM testcat.ns.$items i JOIN testcat.ns.$purchases p " + > "ON i.arrive_time = 
p.time " + > "ORDER BY id, purchase_price, p.item_id, sale_price") > val shuffles = collectShuffles(df.queryExecution.executedPlan) > assert(!shuffles.isEmpty, "should not perform SPJ as not all join keys > are partition keys") > checkAnswer(df, > Seq( > Row(1, "aa", 40.0, 1, 42.0), > Row(1, "aa", 40.0, 2, 11.0), > Row(1, "aa", 41.0, 1, 44.0), > Row(1, "aa", 41.0, 1, 45.0), > Row(2, "bb", 10.0, 1, 42.0), > Row(2, "bb", 10.0, 2, 11.0), > Row(2, "bb", 10.5, 1, 42.0), > Row(2, "bb", 10.5, 2, 11.0), > Row(3, "cc", 15.5, 3, 19.5) > ) > ) > } > }{code} > > Note: this test has set up the DataSource V2 to return multiple splits for > the same partition. > In this case, SPJ is not triggered (because the join key does not match the partition > key), but the following code in DSV2Scan: > [https://github.com/apache/spark/blob/v3.4.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L194] > intended to fill the empty partitions for the 'pushdown-value' case will still iterate > through the non-grouped partitions and look up the grouped partitions to fill the > map, resulting in some duplicate input data being fed into the join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42454) SPJ: encapsulate all SPJ related parameters in BatchScanExec
[ https://issues.apache.org/jira/browse/SPARK-42454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-42454: - Fix Version/s: 3.5.0 (was: 4.0.0) > SPJ: encapsulate all SPJ related parameters in BatchScanExec > > > Key: SPARK-42454 > URL: https://issues.apache.org/jira/browse/SPARK-42454 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Assignee: Szehon Ho >Priority: Minor > Fix For: 3.5.0 > > > The list of SPJ parameters in {{BatchScanExec}} keeps growing, which is > annoying since there are many places which do pattern-matching on > {{BatchScanExec}} and they have to change accordingly. > To make this less disruptive, we can introduce a struct for all the SPJ > classes and use that as the parameter for {{BatchScanExec}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
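The refactor this ticket describes is the classic parameter-object pattern: fold the ever-growing list of SPJ flags into one struct so that code pattern-matching on the operator doesn't churn every time a flag is added. A hedged Python sketch (field names are illustrative, not the real `BatchScanExec` constructor):

```python
from dataclasses import dataclass, field
from typing import Optional, List

# Parameter object absorbing all SPJ-related knobs. Adding a new knob
# later only touches this class, not every BatchScan call site.
@dataclass
class StoragePartitionJoinParams:
    key_grouped_partitioning: Optional[List[str]] = None
    common_partition_values: Optional[List[str]] = None
    apply_partial_clustering: bool = False
    replicate_partitions: bool = False

@dataclass
class BatchScan:
    output: List[str]
    scan: str
    # One field instead of four-plus positional parameters.
    spj_params: StoragePartitionJoinParams = field(
        default_factory=StoragePartitionJoinParams)

# Call sites that don't care about SPJ stay short and stable:
scan = BatchScan(output=["id", "price"], scan="items")
assert scan.spj_params.key_grouped_partitioning is None
assert not scan.spj_params.apply_partial_clustering
```

In Scala the same idea is a case class member on `BatchScanExec`, so existing `case BatchScanExec(output, scan, ...)` matches keep working when a new SPJ parameter appears.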
[jira] [Resolved] (SPARK-42454) SPJ: encapsulate all SPJ related parameters in BatchScanExec
[ https://issues.apache.org/jira/browse/SPARK-42454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-42454. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 41990 [https://github.com/apache/spark/pull/41990] > SPJ: encapsulate all SPJ related parameters in BatchScanExec > > > Key: SPARK-42454 > URL: https://issues.apache.org/jira/browse/SPARK-42454 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Priority: Minor > Fix For: 4.0.0 > > > The list of SPJ parameters in {{BatchScanExec}} keeps growing, which is > annoying since there are many places which do pattern-matching on > {{BatchScanExec}} and they have to change accordingly. > To make this less disruptive, we can introduce a struct for all the SPJ > classes and use that as the parameter for {{BatchScanExec}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42454) SPJ: encapsulate all SPJ related parameters in BatchScanExec
[ https://issues.apache.org/jira/browse/SPARK-42454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-42454: Assignee: Szehon Ho > SPJ: encapsulate all SPJ related parameters in BatchScanExec > > > Key: SPARK-42454 > URL: https://issues.apache.org/jira/browse/SPARK-42454 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Assignee: Szehon Ho >Priority: Minor > Fix For: 4.0.0 > > > The list of SPJ parameters in {{BatchScanExec}} keeps growing, which is > annoying since there are many places which do pattern-matching on > {{BatchScanExec}} and they have to change accordingly. > To make this less disruptive, we can introduce a struct for all the SPJ > classes and use that as the parameter for {{BatchScanExec}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36612) Support left outer join build left or right outer join build right in shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-36612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-36612. -- Fix Version/s: 3.5.0 Resolution: Fixed > Support left outer join build left or right outer join build right in > shuffled hash join > > > Key: SPARK-36612 > URL: https://issues.apache.org/jira/browse/SPARK-36612 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Assignee: Szehon Ho >Priority: Major > Fix For: 3.5.0 > > > Currently Spark SQL does not support building the left side for a left outer join (or > building the right side for a right outer join). > However, in our production environment, there are a large number of scenarios > where small tables are left-joined with large tables, and these large tables often > have data skew (which AQE currently can't handle). > Inspired by SPARK-32399, we can use similar ideas to support building the left side > for a left outer join. > I think this improvement is very meaningful; I'd like to hear how the community > views it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
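The subtlety in building the left side for a LEFT OUTER join is that the build side now carries the outer rows, so the join must remember which build rows ever matched in order to emit the unmatched ones with NULLs afterwards. A conceptual Python sketch (not Spark's `ShuffledHashJoinExec`; join keys are assumed to be the first tuple element):

```python
from collections import defaultdict

def left_outer_build_left(left, right):
    # Build a hash table over the SMALL left side: join key -> left rows.
    table = defaultdict(list)
    for lrow in left:
        table[lrow[0]].append(lrow)

    matched = set()  # identities of build rows that found a match
    out = []
    # Stream (probe with) the LARGE right side.
    for rrow in right:
        for lrow in table.get(rrow[0], []):
            matched.add(id(lrow))
            out.append(lrow + rrow[1:])
    # Second pass over the build table: emit never-matched left rows
    # padded with NULL, preserving left-outer semantics.
    for rows in table.values():
        for lrow in rows:
            if id(lrow) not in matched:
                out.append(lrow + (None,))
    return out

left = [(1, "a"), (2, "b")]    # small side, also the outer side
right = [(1, "x")]             # large side, streamed
assert sorted(left_outer_build_left(left, right), key=str) == \
       [(1, "a", "x"), (2, "b", None)]
```

Without that match-tracking pass, build-left would silently drop unmatched outer rows, which is why plain shuffled hash join historically only built the non-outer side.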
[jira] [Assigned] (SPARK-36612) Support left outer join build left or right outer join build right in shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-36612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-36612: Assignee: Szehon Ho > Support left outer join build left or right outer join build right in > shuffled hash join > > > Key: SPARK-36612 > URL: https://issues.apache.org/jira/browse/SPARK-36612 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Assignee: Szehon Ho >Priority: Major > > Currently Spark SQL does not support building the left side for a left outer join (or > building the right side for a right outer join). > However, in our production environment, there are a large number of scenarios > where small tables are left-joined with large tables, and these large tables often > have data skew (which AQE currently can't handle). > Inspired by SPARK-32399, we can use similar ideas to support building the left side > for a left outer join. > I think this improvement is very meaningful; I'd like to hear how the community > views it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43758) Upgrade snappy-java to 1.1.10.0
[ https://issues.apache.org/jira/browse/SPARK-43758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-43758: - Issue Type: Bug (was: Improvement) > Upgrade snappy-java to 1.1.10.0 > --- > > Key: SPARK-43758 > URL: https://issues.apache.org/jira/browse/SPARK-43758 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.5.0 >Reporter: Chao Sun >Priority: Major > > Update {{snappy-java}} to 1.1.10.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43758) Upgrade snappy-java to 1.1.10.0
[ https://issues.apache.org/jira/browse/SPARK-43758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-43758: - Affects Version/s: 3.4.0 (was: 3.5.0) > Upgrade snappy-java to 1.1.10.0 > --- > > Key: SPARK-43758 > URL: https://issues.apache.org/jira/browse/SPARK-43758 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Chao Sun >Priority: Major > > Update {{snappy-java}} to 1.1.10.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43758) Upgrade snappy-java to 1.1.10.0
Chao Sun created SPARK-43758: Summary: Upgrade snappy-java to 1.1.10.0 Key: SPARK-43758 URL: https://issues.apache.org/jira/browse/SPARK-43758 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: Chao Sun Update {{snappy-java}} to 1.1.10.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43494) Directly call `replicate()` instead of reflection in `SparkHadoopUtil#createFile`
[ https://issues.apache.org/jira/browse/SPARK-43494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43494: Assignee: Yang Jie > Directly call `replicate()` instead of reflection in > `SparkHadoopUtil#createFile` > - > > Key: SPARK-43494 > URL: https://issues.apache.org/jira/browse/SPARK-43494 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43494) Directly call `replicate()` instead of reflection in `SparkHadoopUtil#createFile`
[ https://issues.apache.org/jira/browse/SPARK-43494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43494. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41164 [https://github.com/apache/spark/pull/41164] > Directly call `replicate()` instead of reflection in > `SparkHadoopUtil#createFile` > - > > Key: SPARK-43494 > URL: https://issues.apache.org/jira/browse/SPARK-43494 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43272) Replace reflection w/ direct calling for `SparkHadoopUtil#createFile`
[ https://issues.apache.org/jira/browse/SPARK-43272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43272: Assignee: Yang Jie > Replace reflection w/ direct calling for `SparkHadoopUtil#createFile` > -- > > Key: SPARK-43272 > URL: https://issues.apache.org/jira/browse/SPARK-43272 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43272) Replace reflection w/ direct calling for `SparkHadoopUtil#createFile`
[ https://issues.apache.org/jira/browse/SPARK-43272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43272. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40945 [https://github.com/apache/spark/pull/40945] > Replace reflection w/ direct calling for `SparkHadoopUtil#createFile` > -- > > Key: SPARK-43272 > URL: https://issues.apache.org/jira/browse/SPARK-43272 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43484) Kafka/Kinesis Assembly should not package hadoop-client-runtime
[ https://issues.apache.org/jira/browse/SPARK-43484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43484. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41152 [https://github.com/apache/spark/pull/41152] > Kafka/Kinesis Assembly should not package hadoop-client-runtime > --- > > Key: SPARK-43484 > URL: https://issues.apache.org/jira/browse/SPARK-43484 > Project: Spark > Issue Type: Bug > Components: Build, Structured Streaming >Affects Versions: 3.2.4, 3.3.2, 3.4.0, 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43484) Kafka/Kinesis Assembly should not package hadoop-client-runtime
[ https://issues.apache.org/jira/browse/SPARK-43484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43484: Assignee: Cheng Pan > Kafka/Kinesis Assembly should not package hadoop-client-runtime > --- > > Key: SPARK-43484 > URL: https://issues.apache.org/jira/browse/SPARK-43484 > Project: Spark > Issue Type: Bug > Components: Build, Structured Streaming >Affects Versions: 3.2.4, 3.3.2, 3.4.0, 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43410) Improve vectorized loop for Packed skipValues
[ https://issues.apache.org/jira/browse/SPARK-43410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43410: Assignee: xiaochen zhou > Improve vectorized loop for Packed skipValues > - > > Key: SPARK-43410 > URL: https://issues.apache.org/jira/browse/SPARK-43410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: xiaochen zhou >Assignee: xiaochen zhou >Priority: Minor > Fix For: 3.5.0 > > > Improve vectorized loop for Packed skipValues -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43410) Improve vectorized loop for Packed skipValues
[ https://issues.apache.org/jira/browse/SPARK-43410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43410. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41092 [https://github.com/apache/spark/pull/41092] > Improve vectorized loop for Packed skipValues > - > > Key: SPARK-43410 > URL: https://issues.apache.org/jira/browse/SPARK-43410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: xiaochen zhou >Priority: Minor > Fix For: 3.5.0 > > > Improve vectorized loop for Packed skipValues -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43248) Unnecessary serialize/deserialize of Path on parallel gather partition stats
[ https://issues.apache.org/jira/browse/SPARK-43248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43248. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40920 [https://github.com/apache/spark/pull/40920] > Unnecessary serialize/deserialize of Path on parallel gather partition stats > > > Key: SPARK-43248 > URL: https://issues.apache.org/jira/browse/SPARK-43248 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.5.0
[jira] [Assigned] (SPARK-43248) Unnecessary serialize/deserialize of Path on parallel gather partition stats
[ https://issues.apache.org/jira/browse/SPARK-43248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43248: Assignee: Cheng Pan > Unnecessary serialize/deserialize of Path on parallel gather partition stats > > > Key: SPARK-43248 > URL: https://issues.apache.org/jira/browse/SPARK-43248 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major
[jira] [Assigned] (SPARK-43268) Use proper error classes when exceptions are constructed with a message
[ https://issues.apache.org/jira/browse/SPARK-43268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43268: Assignee: Anton Okolnychyi > Use proper error classes when exceptions are constructed with a message > --- > > Key: SPARK-43268 > URL: https://issues.apache.org/jira/browse/SPARK-43268 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > > As discussed > [here|https://github.com/apache/spark/pull/40679/files#r1159264585].
[jira] [Resolved] (SPARK-43268) Use proper error classes when exceptions are constructed with a message
[ https://issues.apache.org/jira/browse/SPARK-43268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43268. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40934 [https://github.com/apache/spark/pull/40934] > Use proper error classes when exceptions are constructed with a message > --- > > Key: SPARK-43268 > URL: https://issues.apache.org/jira/browse/SPARK-43268 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Fix For: 3.5.0 > > > As discussed > [here|https://github.com/apache/spark/pull/40679/files#r1159264585].
[jira] [Resolved] (SPARK-43211) Remove Hadoop2 support in IsolatedClientLoader
[ https://issues.apache.org/jira/browse/SPARK-43211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43211. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40870 [https://github.com/apache/spark/pull/40870] > Remove Hadoop2 support in IsolatedClientLoader > -- > > Key: SPARK-43211 > URL: https://issues.apache.org/jira/browse/SPARK-43211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.5.0
[jira] [Assigned] (SPARK-43211) Remove Hadoop2 support in IsolatedClientLoader
[ https://issues.apache.org/jira/browse/SPARK-43211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43211: Assignee: Cheng Pan > Remove Hadoop2 support in IsolatedClientLoader > -- > > Key: SPARK-43211 > URL: https://issues.apache.org/jira/browse/SPARK-43211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major
[jira] [Resolved] (SPARK-43202) Replace reflection w/ direct calling for YARN Resource API
[ https://issues.apache.org/jira/browse/SPARK-43202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43202. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40860 [https://github.com/apache/spark/pull/40860] > Replace reflection w/ direct calling for YARN Resource API > -- > > Key: SPARK-43202 > URL: https://issues.apache.org/jira/browse/SPARK-43202 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.5.0
[jira] [Assigned] (SPARK-43202) Replace reflection w/ direct calling for YARN Resource API
[ https://issues.apache.org/jira/browse/SPARK-43202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43202: Assignee: Cheng Pan > Replace reflection w/ direct calling for YARN Resource API > -- > > Key: SPARK-43202 > URL: https://issues.apache.org/jira/browse/SPARK-43202 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major
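The two SPARK-43202 entries above only record the assignment and resolution. As a generic illustration of the pattern being removed (not the actual Spark patch, and using `String#length` purely as a stand-in target), replacing a reflective call with a direct call looks like this:

```java
import java.lang.reflect.Method;

// Illustrative only: the reflection-vs-direct-call pattern that tickets like
// SPARK-43202 eliminate once the target API is guaranteed to exist at compile
// time (here, after dropping support for the Hadoop/YARN version that lacked it).
public class ReflectionVsDirect {
    // The old style: look the method up at runtime, pay boxing and lookup cost,
    // and lose compile-time checking of the method name and signature.
    static int lengthViaReflection(String s) {
        try {
            Method m = String.class.getMethod("length");
            return (Integer) m.invoke(s);   // invoke() returns a boxed Integer
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    // The new style: a direct call, resolved and type-checked at compile time.
    static int lengthDirect(String s) {
        return s.length();
    }
}
```

Both paths return the same value; the direct call is simply checked by the compiler and avoids the per-call reflection overhead.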
[jira] [Assigned] (SPARK-43208) IsolatedClassLoader should close barrier class InputStream after reading
[ https://issues.apache.org/jira/browse/SPARK-43208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43208: Assignee: Cheng Pan > IsolatedClassLoader should close barrier class InputStream after reading > > > Key: SPARK-43208 > URL: https://issues.apache.org/jira/browse/SPARK-43208 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major
[jira] [Resolved] (SPARK-43208) IsolatedClassLoader should close barrier class InputStream after reading
[ https://issues.apache.org/jira/browse/SPARK-43208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43208. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40867 [https://github.com/apache/spark/pull/40867] > IsolatedClassLoader should close barrier class InputStream after reading > > > Key: SPARK-43208 > URL: https://issues.apache.org/jira/browse/SPARK-43208 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.5.0
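The SPARK-43208 entries give only the title; the underlying fix pattern is the standard one of guaranteeing that an `InputStream` is closed once its bytes have been read. A minimal sketch of that pattern (a hypothetical helper, not the actual Spark code) using try-with-resources:

```java
import java.io.IOException;
import java.io.InputStream;

// General pattern behind fixes like SPARK-43208: consume a stream (e.g. one
// returned by getResourceAsStream for a barrier class) and guarantee close()
// runs afterwards, instead of leaking the open InputStream.
public class ReadAndClose {
    static byte[] readAllAndClose(InputStream in) throws IOException {
        try (InputStream stream = in) {      // close() runs even if reading throws
            return stream.readAllBytes();    // available since Java 9
        }
    }
}
```

Without the try-with-resources block, every call would leave a stream open; on some filesystems that means leaked file handles or connections that are only reclaimed at GC time.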
[jira] [Assigned] (SPARK-43196) Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`
[ https://issues.apache.org/jira/browse/SPARK-43196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43196: Assignee: Yang Jie > Replace reflection w/ direct calling for > `ContainerLaunchContext#setTokensConf` > --- > > Key: SPARK-43196 > URL: https://issues.apache.org/jira/browse/SPARK-43196 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor
[jira] [Resolved] (SPARK-43196) Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`
[ https://issues.apache.org/jira/browse/SPARK-43196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43196. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40855 [https://github.com/apache/spark/pull/40855] > Replace reflection w/ direct calling for > `ContainerLaunchContext#setTokensConf` > --- > > Key: SPARK-43196 > URL: https://issues.apache.org/jira/browse/SPARK-43196 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0
[jira] [Resolved] (SPARK-43191) Replace reflection w/ direct calling for Hadoop CallerContext
[ https://issues.apache.org/jira/browse/SPARK-43191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43191. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40850 [https://github.com/apache/spark/pull/40850] > Replace reflection w/ direct calling for Hadoop CallerContext > -- > > Key: SPARK-43191 > URL: https://issues.apache.org/jira/browse/SPARK-43191 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.5.0
[jira] [Assigned] (SPARK-43191) Replace reflection w/ direct calling for Hadoop CallerContext
[ https://issues.apache.org/jira/browse/SPARK-43191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43191: Assignee: Cheng Pan > Replace reflection w/ direct calling for Hadoop CallerContext > -- > > Key: SPARK-43191 > URL: https://issues.apache.org/jira/browse/SPARK-43191 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major
[jira] [Resolved] (SPARK-43200) Remove Hadoop 2 reference in docs
[ https://issues.apache.org/jira/browse/SPARK-43200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43200. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40857 [https://github.com/apache/spark/pull/40857] > Remove Hadoop 2 reference in docs > - > > Key: SPARK-43200 > URL: https://issues.apache.org/jira/browse/SPARK-43200 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.5.0
[jira] [Assigned] (SPARK-43200) Remove Hadoop 2 reference in docs
[ https://issues.apache.org/jira/browse/SPARK-43200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43200: Assignee: Cheng Pan > Remove Hadoop 2 reference in docs > - > > Key: SPARK-43200 > URL: https://issues.apache.org/jira/browse/SPARK-43200 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major
[jira] [Resolved] (SPARK-43195) Remove unnecessary serializable wrapper in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-43195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43195. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40854 [https://github.com/apache/spark/pull/40854] > Remove unnecessary serializable wrapper in HadoopFSUtils > > > Key: SPARK-43195 > URL: https://issues.apache.org/jira/browse/SPARK-43195 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.5.0
[jira] [Assigned] (SPARK-43195) Remove unnecessary serializable wrapper in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-43195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43195: Assignee: Cheng Pan > Remove unnecessary serializable wrapper in HadoopFSUtils > > > Key: SPARK-43195 > URL: https://issues.apache.org/jira/browse/SPARK-43195 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major
[jira] [Resolved] (SPARK-43187) Remove workaround for MiniKdc's BindException
[ https://issues.apache.org/jira/browse/SPARK-43187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43187. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40849 [https://github.com/apache/spark/pull/40849] > Remove workaround for MiniKdc's BindException > - > > Key: SPARK-43187 > URL: https://issues.apache.org/jira/browse/SPARK-43187 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.5.0
[jira] [Assigned] (SPARK-43187) Remove workaround for MiniKdc's BindException
[ https://issues.apache.org/jira/browse/SPARK-43187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43187: Assignee: Cheng Pan > Remove workaround for MiniKdc's BindException > - > > Key: SPARK-43187 > URL: https://issues.apache.org/jira/browse/SPARK-43187 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major
[jira] [Resolved] (SPARK-43186) Remove workaround for FileSinkDesc
[ https://issues.apache.org/jira/browse/SPARK-43186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43186. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40848 [https://github.com/apache/spark/pull/40848] > Remove workaround for FileSinkDesc > -- > > Key: SPARK-43186 > URL: https://issues.apache.org/jira/browse/SPARK-43186 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.5.0
[jira] [Assigned] (SPARK-43186) Remove workaround for FileSinkDesc
[ https://issues.apache.org/jira/browse/SPARK-43186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43186: Assignee: Cheng Pan > Remove workaround for FileSinkDesc > -- > > Key: SPARK-43186 > URL: https://issues.apache.org/jira/browse/SPARK-43186 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major
[jira] [Resolved] (SPARK-42452) Remove hadoop-2 profile from Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-42452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-42452. -- Fix Version/s: 3.5.0 Assignee: Yang Jie Resolution: Fixed > Remove hadoop-2 profile from Apache Spark > - > > Key: SPARK-42452 > URL: https://issues.apache.org/jira/browse/SPARK-42452 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > SPARK-40651 Drop Hadoop2 binary distribution from release process and > SPARK-42447 Remove Hadoop 2 GitHub Action job
[jira] [Resolved] (SPARK-42388) Avoid unnecessary parquet footer reads when no filters in vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-42388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-42388. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 39950 [https://github.com/apache/spark/pull/39950] > Avoid unnecessary parquet footer reads when no filters in vectorized reader > --- > > Key: SPARK-42388 > URL: https://issues.apache.org/jira/browse/SPARK-42388 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Mars >Assignee: Mars >Priority: Major > Fix For: 3.5.0 > > > The Parquet footer is currently read twice in the vectorized Parquet reader, even when > no filters require pushdown. > When the NameNode is under high pressure, the second read costs extra time. > We can avoid these unnecessary footer reads by reusing the footer > metadata in {{VectorizedParquetRecordReader}}.
[jira] [Assigned] (SPARK-42388) Avoid unnecessary parquet footer reads when no filters in vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-42388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-42388: Assignee: Mars > Avoid unnecessary parquet footer reads when no filters in vectorized reader > --- > > Key: SPARK-42388 > URL: https://issues.apache.org/jira/browse/SPARK-42388 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Mars >Assignee: Mars >Priority: Major > > The Parquet footer is currently read twice in the vectorized Parquet reader, even when > no filters require pushdown. > When the NameNode is under high pressure, the second read costs extra time. > We can avoid these unnecessary footer reads by reusing the footer > metadata in {{VectorizedParquetRecordReader}}.
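The SPARK-42388 description above explains the fix: read the footer once and hand it to the record reader, instead of letting the reader fetch it from the NameNode a second time. A toy sketch of that caching idea (all names here are hypothetical stand-ins, not Spark's actual API):

```java
import java.util.Optional;

// Sketch of the footer-reuse idea behind SPARK-42388 (hypothetical names):
// the planner reads the Parquet footer once; the record reader accepts that
// cached footer and only reads it itself when no cached copy is available.
public class FooterCacheSketch {
    static class Footer {
        final long rowCount;
        Footer(long rowCount) { this.rowCount = rowCount; }
    }

    static class FooterSource {
        int reads = 0;                       // counts (simulated) remote footer fetches
        Footer readFooter() { reads++; return new Footer(42L); }
    }

    static class VectorizedReaderSketch {
        final Footer footer;
        VectorizedReaderSketch(FooterSource source, Optional<Footer> cached) {
            // Reuse the already-read footer when present; otherwise fetch it.
            this.footer = cached.orElseGet(source::readFooter);
        }
    }
}
```

With the cached footer passed in, constructing the reader triggers no extra fetch; the old behavior corresponds to always passing `Optional.empty()`.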
[jira] [Assigned] (SPARK-43150) Remove workaround for PARQUET-2160
[ https://issues.apache.org/jira/browse/SPARK-43150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43150: Assignee: Cheng Pan > Remove workaround for PARQUET-2160 > -- > > Key: SPARK-43150 > URL: https://issues.apache.org/jira/browse/SPARK-43150 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major
[jira] [Resolved] (SPARK-43150) Remove workaround for PARQUET-2160
[ https://issues.apache.org/jira/browse/SPARK-43150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43150. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40802 [https://github.com/apache/spark/pull/40802] > Remove workaround for PARQUET-2160 > -- > > Key: SPARK-43150 > URL: https://issues.apache.org/jira/browse/SPARK-43150 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.5.0
[jira] [Resolved] (SPARK-43064) Spark SQL CLI SQL tab should only show one statement once
[ https://issues.apache.org/jira/browse/SPARK-43064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43064. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40701 [https://github.com/apache/spark/pull/40701] > Spark SQL CLI SQL tab should only show one statement once > -- > > Key: SPARK-43064 > URL: https://issues.apache.org/jira/browse/SPARK-43064 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.5.0 > > Attachments: screenshot-1.png > > > !screenshot-1.png|width=996,height=554!
[jira] [Assigned] (SPARK-43064) Spark SQL CLI SQL tab should only show one statement once
[ https://issues.apache.org/jira/browse/SPARK-43064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43064: Assignee: angerszhu > Spark SQL CLI SQL tab should only show one statement once > -- > > Key: SPARK-43064 > URL: https://issues.apache.org/jira/browse/SPARK-43064 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Attachments: screenshot-1.png > > > !screenshot-1.png|width=996,height=554!
[jira] [Assigned] (SPARK-43104) Set `shadeTestJar` of protobuf module to false
[ https://issues.apache.org/jira/browse/SPARK-43104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43104: Assignee: Yang Jie > Set `shadeTestJar` of protobuf module to false > -- > > Key: SPARK-43104 > URL: https://issues.apache.org/jira/browse/SPARK-43104 > Project: Spark > Issue Type: Improvement > Components: Protobuf >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor
[jira] [Resolved] (SPARK-43104) Set `shadeTestJar` of protobuf module to false
[ https://issues.apache.org/jira/browse/SPARK-43104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43104. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40753 [https://github.com/apache/spark/pull/40753] > Set `shadeTestJar` of protobuf module to false > -- > > Key: SPARK-43104 > URL: https://issues.apache.org/jira/browse/SPARK-43104 > Project: Spark > Issue Type: Improvement > Components: Protobuf >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0