[jira] [Updated] (SPARK-49300) Fix Hadoop delegation token leak when tokenRenewalInterval is not set.

2024-08-22 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-49300:
-
Fix Version/s: 3.5.3

> Fix Hadoop delegation token leak when tokenRenewalInterval is not set.
> --
>
> Key: SPARK-49300
> URL: https://issues.apache.org/jira/browse/SPARK-49300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.3
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>
> If tokenRenewalInterval is not set,
> HadoopFSDelegationTokenProvider#getTokenRenewalInterval will fetch some
> tokens and renew them to obtain an interval value. These tokens are never
> cancelled via cancel(), so a large number of them accumulate on HDFS instead
> of being cleared promptly, putting additional pressure on the HDFS server.
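
A minimal sketch of the idea behind the fix (the helper name and structure are
illustrative, not the verbatim patch): probe tokens fetched only to measure the
renewal interval should be cancelled once the interval is known.
{code}
import scala.jdk.CollectionConverters._
import scala.util.Try

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials
import org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier

// Probe tokens are renewed once to measure the interval, then cancelled so
// they do not linger on the NameNode until they expire on their own.
def probeRenewalInterval(creds: Credentials, conf: Configuration): Option[Long] = {
  val intervals = creds.getAllTokens.asScala.toSeq.flatMap { token =>
    val interval = Try {
      val newExpiration = token.renew(conf)
      token.decodeIdentifier() match {
        case id: AbstractDelegationTokenIdentifier => newExpiration - id.getIssueDate
      }
    }.toOption
    Try(token.cancel(conf)) // the missing step: release the probe token
    interval
  }
  intervals.reduceOption(_ min _)
}
{code}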






[jira] [Assigned] (SPARK-49300) Fix Hadoop delegation token leak when tokenRenewalInterval is not set.

2024-08-20 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-49300:


Assignee: Shuyan Zhang

> Fix Hadoop delegation token leak when tokenRenewalInterval is not set.
> --
>
> Key: SPARK-49300
> URL: https://issues.apache.org/jira/browse/SPARK-49300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.3
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> If tokenRenewalInterval is not set,
> HadoopFSDelegationTokenProvider#getTokenRenewalInterval will fetch some
> tokens and renew them to obtain an interval value. These tokens are never
> cancelled via cancel(), so a large number of them accumulate on HDFS instead
> of being cleared promptly, putting additional pressure on the HDFS server.






[jira] [Resolved] (SPARK-49300) Fix Hadoop delegation token leak when tokenRenewalInterval is not set.

2024-08-20 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-49300.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47800
[https://github.com/apache/spark/pull/47800]

> Fix Hadoop delegation token leak when tokenRenewalInterval is not set.
> --
>
> Key: SPARK-49300
> URL: https://issues.apache.org/jira/browse/SPARK-49300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.3
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> If tokenRenewalInterval is not set,
> HadoopFSDelegationTokenProvider#getTokenRenewalInterval will fetch some
> tokens and renew them to obtain an interval value. These tokens are never
> cancelled via cancel(), so a large number of them accumulate on HDFS instead
> of being cleared promptly, putting additional pressure on the HDFS server.






[jira] [Created] (SPARK-49068) "Externally Managed Environment" error when building PySpark Docker image

2024-07-30 Thread Chao Sun (Jira)
Chao Sun created SPARK-49068:


 Summary: "Externally Managed Environment" error when building 
PySpark Docker image 
 Key: SPARK-49068
 URL: https://issues.apache.org/jira/browse/SPARK-49068
 Project: Spark
  Issue Type: Bug
  Components: Spark Docker
Affects Versions: 3.5.1
Reporter: Chao Sun


When trying to build a Docker image based on the PySpark Dockerfile on Ubuntu
20.04, I got the following error:
{code}
#7 19.13 error: externally-managed-environment
#7 19.13
#7 19.13 × This environment is externally managed
#7 19.13 ╰─> To install Python packages system-wide, try apt install
#7 19.13 python3-xyz, where xyz is the package you are trying to
#7 19.13 install.
#7 19.13
#7 19.13 If you wish to install a non-Debian-packaged Python package,
#7 19.13 create a virtual environment using python3 -m venv path/to/venv.
#7 19.13 Then use path/to/venv/bin/python and path/to/venv/bin/pip. Make
#7 19.13 sure you have python3-full installed.
#7 19.13
#7 19.13 If you wish to install a non-Debian packaged Python application,
#7 19.13 it may be easiest to use pipx install xyz, which will manage a
#7 19.13 virtual environment for you. Make sure you have pipx installed.
#7 19.13
#7 19.13 See /usr/share/doc/python3.12/README.venv for more information.
#7 19.13
#7 19.13 note: If you believe this is a mistake, please contact your Python 
installation or OS distribution provider. You can override this, at the risk of 
breaking your Python installation or OS, by passing --break-system-packages.
#7 19.13 hint: See PEP 668 for the detailed specification.
#7 ERROR: process "/bin/sh -c apt-get update && apt install -y python3 
python3-pip && rm -rf /usr/lib/python3.11/EXTERNALLY-MANAGED && pip3 
install --upgrade pip setuptools && rm -rf /root/.cache && rm -rf 
/var/cache/apt/* && rm -rf /var/lib/apt/lists/*" did not complete successfully: 
exit code: 1
{code}

Looking at the 
[Dockerfile|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile],
 it does the following:
{code}
RUN apt-get update && \
    apt install -y python3 python3-pip && \
    pip3 install --upgrade pip setuptools && \
    # Removed the .cache to save space
    rm -rf /root/.cache && rm -rf /var/cache/apt/* && rm -rf /var/lib/apt/lists/*
{code}

If {{pip}} was installed by the system package manager and we then try to
overwrite it via {{pip3 install}}, this error can occur.

A simple solution would be to create a virtual environment first, install the
latest pip there, and then update {{PATH}} to use it instead.
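
A hedged sketch of that workaround (the venv location /opt/venv is an
assumption, not a committed fix):
{code}
# Install pip inside a virtual environment instead of overwriting the
# Debian-managed one, then put the venv first on PATH.
RUN apt-get update && \
    apt install -y python3 python3-pip python3-venv && \
    python3 -m venv /opt/venv && \
    /opt/venv/bin/pip install --upgrade pip setuptools && \
    rm -rf /root/.cache /var/cache/apt/* /var/lib/apt/lists/*
ENV PATH="/opt/venv/bin:${PATH}"
{code}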

I wonder if anyone else has encountered the same issue and whether it is a good
idea to fix it.






[jira] [Resolved] (SPARK-48613) Umbrella: Storage Partition Join Improvements

2024-07-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-48613.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47064
[https://github.com/apache/spark/pull/47064]

> Umbrella: Storage Partition Join Improvements
> -
>
> Key: SPARK-48613
> URL: https://issues.apache.org/jira/browse/SPARK-48613
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Assignee: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48655) SPJ: Add tests for shuffle skipping for aggregate queries

2024-06-21 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-48655:


Assignee: Szehon Ho

> SPJ: Add tests for shuffle skipping for aggregate queries
> -
>
> Key: SPARK-48655
> URL: https://issues.apache.org/jira/browse/SPARK-48655
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Assignee: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-48012) SPJ: Support Transform Expressions for One Side Shuffle

2024-06-09 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-48012:


Assignee: Szehon Ho

> SPJ: Support Transform Expressions for One Side Shuffle
> ---
>
> Key: SPARK-48012
> URL: https://issues.apache.org/jira/browse/SPARK-48012
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.3
>Reporter: Szehon Ho
>Assignee: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
>
> SPARK-41471 allowed Spark to shuffle just one side and still conduct SPJ when
> the other side uses KeyGroupedPartitioning. However, that support covered only
> a KeyGroupedPartitioning without any partition transform (day, year, bucket).
> It would be useful to support partition transforms as well, since many tables
> are partitioned by those transforms.






[jira] [Resolved] (SPARK-48012) SPJ: Support Transform Expressions for One Side Shuffle

2024-06-09 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-48012.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46255
[https://github.com/apache/spark/pull/46255]

> SPJ: Support Transform Expressions for One Side Shuffle
> ---
>
> Key: SPARK-48012
> URL: https://issues.apache.org/jira/browse/SPARK-48012
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.3
>Reporter: Szehon Ho
>Assignee: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> SPARK-41471 allowed Spark to shuffle just one side and still conduct SPJ when
> the other side uses KeyGroupedPartitioning. However, that support covered only
> a KeyGroupedPartitioning without any partition transform (day, year, bucket).
> It would be useful to support partition transforms as well, since many tables
> are partitioned by those transforms.






[jira] [Updated] (SPARK-48392) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`

2024-05-28 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-48392:
-
Summary: Load `spark-defaults.conf` when passing configurations to 
`spark-submit` through `--properties-file`  (was: (Optionally) Load 
`spark-defaults.conf` when passing configurations to `spark-submit` through 
`--properties-file`)

> Load `spark-defaults.conf` when passing configurations to `spark-submit` 
> through `--properties-file`
> 
>
> Key: SPARK-48392
> URL: https://issues.apache.org/jira/browse/SPARK-48392
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, when users pass configurations to {{spark-submit.sh}} via
> {{--properties-file}}, {{spark-defaults.conf}} is completely ignored. This
> poses issues for some people, for instance, those using the [Spark on K8S
> operator from kubeflow|https://github.com/kubeflow/spark-operator]. See
> related issues:
> - https://github.com/kubeflow/spark-operator/issues/1183
> - https://github.com/kubeflow/spark-operator/issues/1321
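
For illustration (paths and property names below are hypothetical):
{code}
# conf/spark-defaults.conf:  spark.eventLog.enabled true   <- silently dropped before
# /opt/overrides.conf:       spark.executor.memory 4g      <- used
spark-submit --properties-file /opt/overrides.conf --class org.example.Main app.jar
# After this change, spark-defaults.conf is loaded as well, with entries from
# --properties-file taking precedence.
{code}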






[jira] [Resolved] (SPARK-48392) (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`

2024-05-28 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-48392.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46709
[https://github.com/apache/spark/pull/46709]

> (Optionally) Load `spark-defaults.conf` when passing configurations to 
> `spark-submit` through `--properties-file`
> -
>
> Key: SPARK-48392
> URL: https://issues.apache.org/jira/browse/SPARK-48392
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, when users pass configurations to {{spark-submit.sh}} via
> {{--properties-file}}, {{spark-defaults.conf}} is completely ignored. This
> poses issues for some people, for instance, those using the [Spark on K8S
> operator from kubeflow|https://github.com/kubeflow/spark-operator]. See
> related issues:
> - https://github.com/kubeflow/spark-operator/issues/1183
> - https://github.com/kubeflow/spark-operator/issues/1321






[jira] [Assigned] (SPARK-48392) (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`

2024-05-28 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-48392:


Assignee: Chao Sun

> (Optionally) Load `spark-defaults.conf` when passing configurations to 
> `spark-submit` through `--properties-file`
> -
>
> Key: SPARK-48392
> URL: https://issues.apache.org/jira/browse/SPARK-48392
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
>
> Currently, when users pass configurations to {{spark-submit.sh}} via
> {{--properties-file}}, {{spark-defaults.conf}} is completely ignored. This
> poses issues for some people, for instance, those using the [Spark on K8S
> operator from kubeflow|https://github.com/kubeflow/spark-operator]. See
> related issues:
> - https://github.com/kubeflow/spark-operator/issues/1183
> - https://github.com/kubeflow/spark-operator/issues/1321






[jira] [Created] (SPARK-48392) (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`

2024-05-22 Thread Chao Sun (Jira)
Chao Sun created SPARK-48392:


 Summary: (Optionally) Load `spark-defaults.conf` when passing 
configurations to `spark-submit` through `--properties-file`
 Key: SPARK-48392
 URL: https://issues.apache.org/jira/browse/SPARK-48392
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.1
Reporter: Chao Sun


Currently, when users pass configurations to `spark-submit.sh` via
`--properties-file`, `spark-defaults.conf` is completely ignored. This
poses issues for some people, for instance, those using the [Spark on K8S
operator from kubeflow|https://github.com/kubeflow/spark-operator]. See related
issues:
- https://github.com/kubeflow/spark-operator/issues/1183
- https://github.com/kubeflow/spark-operator/issues/1321






[jira] [Updated] (SPARK-48392) (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`

2024-05-22 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-48392:
-
Description: 
Currently, when users pass configurations to {{spark-submit.sh}} via
{{--properties-file}}, {{spark-defaults.conf}} is completely ignored.
This poses issues for some people, for instance, those using the [Spark on K8S
operator from kubeflow|https://github.com/kubeflow/spark-operator]. See related
issues:
- https://github.com/kubeflow/spark-operator/issues/1183
- https://github.com/kubeflow/spark-operator/issues/1321

  was:
Currently, when users pass configurations to `spark-submit.sh` via
`--properties-file`, `spark-defaults.conf` is completely ignored. This
poses issues for some people, for instance, those using the [Spark on K8S
operator from kubeflow|https://github.com/kubeflow/spark-operator]. See related issues:
- https://github.com/kubeflow/spark-operator/issues/1183
- https://github.com/kubeflow/spark-operator/issues/1321


> (Optionally) Load `spark-defaults.conf` when passing configurations to 
> `spark-submit` through `--properties-file`
> -
>
> Key: SPARK-48392
> URL: https://issues.apache.org/jira/browse/SPARK-48392
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Chao Sun
>Priority: Major
>
> Currently, when users pass configurations to {{spark-submit.sh}} via
> {{--properties-file}}, {{spark-defaults.conf}} is completely ignored. This
> poses issues for some people, for instance, those using the [Spark on K8S
> operator from kubeflow|https://github.com/kubeflow/spark-operator]. See
> related issues:
> - https://github.com/kubeflow/spark-operator/issues/1183
> - https://github.com/kubeflow/spark-operator/issues/1321






[jira] [Resolved] (SPARK-48065) SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict

2024-05-02 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-48065.
--
Resolution: Fixed

Issue resolved by pull request 46325
[https://github.com/apache/spark/pull/46325]

> SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict
> -
>
> Key: SPARK-48065
> URL: https://issues.apache.org/jira/browse/SPARK-48065
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.3
>Reporter: Szehon Ho
>Assignee: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> If spark.sql.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled is true,
> then SPJ no longer triggers when there are more join keys than partition keys.
> It is triggered only when the join keys are equal to, or a subset of, the
> partition keys.
>
> We can relax this constraint, as this case is supported when the flag is not
> enabled.
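
For illustration, assuming a running SparkSession {{spark}} and two tables both
partitioned by {{bucket(8, id)}} (table and column names are hypothetical):
{code}
// Join keys (id, name) are a superset of the partition key (id). Before the
// fix, enabling the flag below made SPJ stop triggering for this join, even
// though the case is handled when the flag is off.
spark.conf.set(
  "spark.sql.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled", "true")
val joined = spark.table("t1").join(spark.table("t2"), Seq("id", "name"))
{code}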






[jira] [Assigned] (SPARK-48065) SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict

2024-05-02 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-48065:


Assignee: Szehon Ho

> SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict
> -
>
> Key: SPARK-48065
> URL: https://issues.apache.org/jira/browse/SPARK-48065
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.3
>Reporter: Szehon Ho
>Assignee: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> If spark.sql.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled is true,
> then SPJ no longer triggers when there are more join keys than partition keys.
> It is triggered only when the join keys are equal to, or a subset of, the
> partition keys.
>
> We can relax this constraint, as this case is supported when the flag is not
> enabled.






[jira] [Resolved] (SPARK-48030) InternalRowComparableWrapper should cache rowOrdering to improve performance

2024-04-29 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-48030.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46265
[https://github.com/apache/spark/pull/46265]

> InternalRowComparableWrapper should cache rowOrdering to improve performance
> ---
>
> Key: SPARK-48030
> URL: https://issues.apache.org/jira/browse/SPARK-48030
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1, 3.4.3
>Reporter: YE
>Assignee: YE
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: screenshot-1.png
>
>
> InternalRowComparableWrapper recreates the row ordering for each output
> partition when SPJ is enabled. The row ordering is generated via codegen,
> which is quite expensive, and the number of output partitions can be very
> large for production tables, e.g. hundreds of thousands of partitions. We
> encountered this issue when applying SPJ to multiple large Iceberg tables:
> the planning phase took tens of minutes to complete.
> Attaching a screenshot with the related stack trace:
>   !screenshot-1.png! 
> A simple fix would be to cache the rowOrdering in
> InternalRowComparableWrapper, as the data types of the InternalRow are
> immutable.
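
A minimal sketch of that caching idea (the builder function stands in for the
codegen'd ordering; this is not the verbatim patch):
{code}
import java.util.concurrent.ConcurrentHashMap

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.DataType

// The ordering depends only on the row's data types, which never change,
// so it can be built once per schema instead of once per output partition.
object RowOrderingCache {
  private val cache =
    new ConcurrentHashMap[Seq[DataType], Ordering[InternalRow]]()

  def getOrCreate(
      dataTypes: Seq[DataType],
      buildRowOrdering: Seq[DataType] => Ordering[InternalRow]): Ordering[InternalRow] =
    cache.computeIfAbsent(dataTypes, dts => buildRowOrdering(dts))
}
{code}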






[jira] [Assigned] (SPARK-48030) InternalRowComparableWrapper should cache rowOrdering to improve performance

2024-04-29 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-48030:


Assignee: YE

> InternalRowComparableWrapper should cache rowOrdering to improve performance
> ---
>
> Key: SPARK-48030
> URL: https://issues.apache.org/jira/browse/SPARK-48030
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1, 3.4.3
>Reporter: YE
>Assignee: YE
>Priority: Major
>  Labels: pull-request-available
> Attachments: screenshot-1.png
>
>
> InternalRowComparableWrapper recreates the row ordering for each output
> partition when SPJ is enabled. The row ordering is generated via codegen,
> which is quite expensive, and the number of output partitions can be very
> large for production tables, e.g. hundreds of thousands of partitions. We
> encountered this issue when applying SPJ to multiple large Iceberg tables:
> the planning phase took tens of minutes to complete.
> Attaching a screenshot with the related stack trace:
>   !screenshot-1.png! 
> A simple fix would be to cache the rowOrdering in
> InternalRowComparableWrapper, as the data types of the InternalRow are
> immutable.






[jira] [Resolved] (SPARK-47094) SPJ : Dynamically rebalance number of buckets when they are not equal

2024-04-05 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-47094.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45267
[https://github.com/apache/spark/pull/45267]

> SPJ : Dynamically rebalance number of buckets when they are not equal
> -
>
> Key: SPARK-47094
> URL: https://issues.apache.org/jira/browse/SPARK-47094
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Himadri Pal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> SPJ: Storage Partition Join works with Iceberg tables when both tables have
> the same number of buckets. As part of this feature request, we would like
> Spark to gather the bucket counts from both tables and dynamically rebalance
> them via coalesce or repartition so that SPJ still applies. In this case we
> would still have to shuffle, but it would be better than no SPJ at all.
> Use Case: 
> Many times we do not have control over the input tables, hence it's not
> possible to change the partitioning scheme on those tables. As consumers, we
> would still like them to be usable with SPJ when joined with other tables and
> output tables that have a different number of buckets.
> In these scenarios, we would need to read those tables and rewrite them with
> a matching number of buckets for SPJ to work; this extra step could outweigh
> the benefits of less shuffle via SPJ. Also, when multiple different tables
> are joined, each table needs to be rewritten with a matching number of
> buckets.
> If this feature is implemented, SPJ functionality will be more powerful.






[jira] [Resolved] (SPARK-42040) SPJ: Introduce a new API for V2 input partition to report partition size

2024-03-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-42040.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

> SPJ: Introduce a new API for V2 input partition to report partition size
> 
>
> Key: SPARK-42040
> URL: https://issues.apache.org/jira/browse/SPARK-42040
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> It's useful for an {{InputPartition}} to also report its size (in bytes), so 
> that Spark can use the info to decide whether partition grouping should be 
> applied or not.
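
A sketch of the idea only; the member name below is illustrative and not
necessarily the API shape that was merged:
{code}
import org.apache.spark.sql.connector.read.InputPartition

// A partition that knows its size lets Spark weigh whether grouping
// partitions for SPJ is worthwhile or would create overly large tasks.
class SizedFilePartition(val path: String, val sizeInBytes: Long)
    extends InputPartition {
  override def preferredLocations(): Array[String] = Array.empty
}
{code}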






[jira] [Assigned] (SPARK-42040) SPJ: Introduce a new API for V2 input partition to report partition size

2024-03-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-42040:


Assignee: Qi Zhu  (was: zhuqi)

> SPJ: Introduce a new API for V2 input partition to report partition size
> 
>
> Key: SPARK-42040
> URL: https://issues.apache.org/jira/browse/SPARK-42040
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
>
> It's useful for an {{InputPartition}} to also report its size (in bytes), so 
> that Spark can use the info to decide whether partition grouping should be 
> applied or not.






[jira] [Assigned] (SPARK-42040) SPJ: Introduce a new API for V2 input partition to report partition size

2024-03-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-42040:


Assignee: zhuqi

> SPJ: Introduce a new API for V2 input partition to report partition size
> 
>
> Key: SPARK-42040
> URL: https://issues.apache.org/jira/browse/SPARK-42040
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Assignee: zhuqi
>Priority: Major
>  Labels: pull-request-available
>
> It's useful for an {{InputPartition}} to also report its size (in bytes), so 
> that Spark can use the info to decide whether partition grouping should be 
> applied or not.






[jira] [Resolved] (SPARK-45731) Update partition statistics with ANALYZE TABLE command

2023-11-09 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-45731.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43629
[https://github.com/apache/spark/pull/43629]

> Update partition statistics with ANALYZE TABLE command
> --
>
> Key: SPARK-45731
> URL: https://issues.apache.org/jira/browse/SPARK-45731
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently {{ANALYZE TABLE}} command only updates table-level stats but not 
> partition stats, even though it can be applied to both non-partitioned and 
> partitioned tables. It seems to make sense for it to update partition stats 
> well.
> Note users can use {{ANALYZE TABLE PARTITION(..)}} to get the same effect, 
> but the syntax is more verbose as they need to specify all the partition 
> columns. 
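
Illustration (table and column names are hypothetical):
{code}
// After the change, the first command also refreshes per-partition stats,
// matching the second, more verbose form that lists the partition columns.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales PARTITION (dt) COMPUTE STATISTICS")
{code}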






[jira] [Assigned] (SPARK-45731) Update partition statistics with ANALYZE TABLE command

2023-11-09 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-45731:


Assignee: Chao Sun

> Update partition statistics with ANALYZE TABLE command
> --
>
> Key: SPARK-45731
> URL: https://issues.apache.org/jira/browse/SPARK-45731
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
>
> Currently {{ANALYZE TABLE}} command only updates table-level stats but not 
> partition stats, even though it can be applied to both non-partitioned and 
> partitioned tables. It seems to make sense for it to update partition stats 
> well.
> Note users can use {{ANALYZE TABLE PARTITION(..)}} to get the same effect, 
> but the syntax is more verbose as they need to specify all the partition 
> columns. 






[jira] [Created] (SPARK-45846) spark.sql.optimizeNullAwareAntiJoin should respect spark.sql.autoBroadcastJoinThreshold

2023-11-08 Thread Chao Sun (Jira)
Chao Sun created SPARK-45846:


 Summary: spark.sql.optimizeNullAwareAntiJoin should respect 
spark.sql.autoBroadcastJoinThreshold
 Key: SPARK-45846
 URL: https://issues.apache.org/jira/browse/SPARK-45846
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Chao Sun


Normally, broadcast joins can be disabled by setting
{{spark.sql.autoBroadcastJoinThreshold}} to -1. However, this doesn't apply to
{{spark.sql.optimizeNullAwareAntiJoin}}:

{code}
case j @ ExtractSingleColumnNullAwareAntiJoin(leftKeys, rightKeys) =>
  // Always plans a broadcast hash join, without consulting
  // spark.sql.autoBroadcastJoinThreshold.
  Seq(joins.BroadcastHashJoinExec(leftKeys, rightKeys, LeftAnti, BuildRight,
    None, planLater(j.left), planLater(j.right), isNullAwareAntiJoin = true))
{code}








[jira] [Created] (SPARK-45731) Update partition statistics with ANALYZE TABLE command

2023-10-30 Thread Chao Sun (Jira)
Chao Sun created SPARK-45731:


 Summary: Update partition statistics with ANALYZE TABLE command
 Key: SPARK-45731
 URL: https://issues.apache.org/jira/browse/SPARK-45731
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Chao Sun


Currently {{ANALYZE TABLE}} command only updates table-level stats but not 
partition stats, even though it can be applied to both non-partitioned and 
partitioned tables. It seems to make sense for it to update partition stats as 
well.

Note users can use {{ANALYZE TABLE PARTITION(..)}} to get the same effect, but 
the syntax is more verbose as they need to specify all the partition columns. 






[jira] [Updated] (SPARK-45652) SPJ: Handle empty input partitions after dynamic filtering

2023-10-30 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-45652:
-
Description: 
When the number of input partitions becomes 0 after dynamic filtering in 
{{BatchScanExec}}, SPJ currently fails with:
{code}
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135)
{code}

This is because {{groupPartitions}} will return {{None}} for this case.

  was:
When the number of input partitions becomes 0 after dynamic filtering in 
{{BatchScanExec}}, SPJ currently fails with:
{code}
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135)
at 
org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD$lzycompute(BosonBatchScanExec.scala:28)
at 
org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD(BosonBatchScanExec.scala:28)
at 
org.apache.spark.sql.boson.BosonBatchScanExec.doExecuteColumnar(BosonBatchScanExec.scala:33)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243)
at 
org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:218)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:521)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
{code}

This is because {{groupPartitions}} will return {{None}} for this case.


> SPJ: Handle empty input partitions after dynamic filtering
> --
>
> Key: SPARK-45652
> URL: https://issues.apache.org/jira/browse/SPARK-45652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
>
> When the number of input partitions becomes 0 after dynamic filtering in 
> {{BatchScanExec}}, SPJ currently fails with:
> {code}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:529)
>   at scala.None$.get(Option.scala:527)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135)
> {code}
> This is because {{groupPartitions}} will return {{None}} for this case.
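
A self-contained sketch of the failure mode and the spirit of the fix, with
simplified types (not the actual BatchScanExec code):
{code}
// groupPartitions yields None for an empty input; the caller effectively
// called .get on it, hence the None.get above.
def groupPartitions(parts: Seq[Int]): Option[Seq[Int]] =
  if (parts.isEmpty) None else Some(parts.sorted)

def filteredPartitionsBefore(parts: Seq[Int]): Seq[Int] =
  groupPartitions(parts).get                  // throws None.get when empty

def filteredPartitionsAfter(parts: Seq[Int]): Seq[Int] =
  groupPartitions(parts).getOrElse(Seq.empty) // empty scan instead of a crash
{code}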






[jira] [Assigned] (SPARK-45678) Cover BufferReleasingInputStream.available under tryOrFetchFailedException

2023-10-27 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-45678:


Assignee: L. C. Hsieh

> Cover BufferReleasingInputStream.available under tryOrFetchFailedException
> --
>
> Key: SPARK-45678
> URL: https://issues.apache.org/jira/browse/SPARK-45678
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>  Labels: pull-request-available
>
> We have encountered a shuffle data corruption issue:
> ```
> Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:112)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:504)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:543)
>   at 
> org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:450)
>   at 
> org.xerial.snappy.SnappyInputStream.available(SnappyInputStream.java:497)
>   at 
> org.apache.spark.storage.BufferReleasingInputStream.available(ShuffleBlockFetcherIterator.scala:1356)
>  ```
> Spark shuffle has the capacity to detect corruption for a few stream ops like
> `read` and `skip`: such an `IOException` in the stack trace is rethrown as a
> `FetchFailedException`, which retries the failed shuffle task. But in this
> stack trace it is `available` that is not covered by the mechanism, so no
> retry happened and the Spark application simply failed.
> As the `available` op also involves data decompression, we should be able
> to check it like `read` and `skip` do.
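
A minimal sketch of the idea, assuming a simplified wrapper (this is not the
actual BufferReleasingInputStream):
{code}
import java.io.{IOException, InputStream}

// Route available() through the same corruption-detecting wrapper already
// used by read()/skip(), so an IOException thrown while decompressing is
// surfaced as a fetch failure (and retried) instead of failing the task.
class CheckedStream(in: InputStream, onCorruption: IOException => Nothing)
    extends InputStream {

  private def tryOrFetchFailed[T](body: => T): T =
    try body catch { case e: IOException => onCorruption(e) }

  override def read(): Int = tryOrFetchFailed(in.read())
  override def skip(n: Long): Long = tryOrFetchFailed(in.skip(n))
  override def available(): Int = tryOrFetchFailed(in.available()) // the new coverage
}
{code}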






[jira] [Resolved] (SPARK-45678) Cover BufferReleasingInputStream.available under tryOrFetchFailedException

2023-10-27 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-45678.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43543
[https://github.com/apache/spark/pull/43543]

> Cover BufferReleasingInputStream.available under tryOrFetchFailedException
> --
>
> Key: SPARK-45678
> URL: https://issues.apache.org/jira/browse/SPARK-45678
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We have encountered a shuffle data corruption issue:
> ```
> Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:112)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:504)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:543)
>   at 
> org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:450)
>   at 
> org.xerial.snappy.SnappyInputStream.available(SnappyInputStream.java:497)
>   at 
> org.apache.spark.storage.BufferReleasingInputStream.available(ShuffleBlockFetcherIterator.scala:1356)
>  ```
> Spark shuffle has the capacity to detect corruption for a few stream ops like
> `read` and `skip`: such an `IOException` in the stack trace is rethrown as a
> `FetchFailedException`, which retries the failed shuffle task. But in this
> stack trace it is `available` that is not covered by the mechanism, so no
> retry happened and the Spark application simply failed.
> As the `available` op also involves data decompression, we should be able
> to check it like `read` and `skip` do.






[jira] [Resolved] (SPARK-45652) SPJ: Handle empty input partitions after dynamic filtering

2023-10-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-45652.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43531
[https://github.com/apache/spark/pull/43531]

> SPJ: Handle empty input partitions after dynamic filtering
> --
>
> Key: SPARK-45652
> URL: https://issues.apache.org/jira/browse/SPARK-45652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When the number of input partitions becomes 0 after dynamic filtering in 
> {{BatchScanExec}}, SPJ currently fails with:
> {code}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:529)
>   at scala.None$.get(Option.scala:527)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135)
>   at 
> org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD$lzycompute(BosonBatchScanExec.scala:28)
>   at 
> org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD(BosonBatchScanExec.scala:28)
>   at 
> org.apache.spark.sql.boson.BosonBatchScanExec.doExecuteColumnar(BosonBatchScanExec.scala:33)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:218)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:521)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> {code}
> This is because {{groupPartitions}} will return {{None}} for this case.






[jira] [Assigned] (SPARK-45652) SPJ: Handle empty input partitions after dynamic filtering

2023-10-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-45652:


Assignee: Chao Sun

> SPJ: Handle empty input partitions after dynamic filtering
> --
>
> Key: SPARK-45652
> URL: https://issues.apache.org/jira/browse/SPARK-45652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
>
> When the number of input partitions becomes 0 after dynamic filtering in 
> {{BatchScanExec}}, SPJ currently fails with:
> {code}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:529)
>   at scala.None$.get(Option.scala:527)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135)
>   at 
> org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD$lzycompute(BosonBatchScanExec.scala:28)
>   at 
> org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD(BosonBatchScanExec.scala:28)
>   at 
> org.apache.spark.sql.boson.BosonBatchScanExec.doExecuteColumnar(BosonBatchScanExec.scala:33)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:218)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:521)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> {code}
> This is because {{groupPartitions}} will return {{None}} for this case.






[jira] [Created] (SPARK-45652) SPJ: Handle empty input partitions after dynamic filtering

2023-10-24 Thread Chao Sun (Jira)
Chao Sun created SPARK-45652:


 Summary: SPJ: Handle empty input partitions after dynamic filtering
 Key: SPARK-45652
 URL: https://issues.apache.org/jira/browse/SPARK-45652
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.1
Reporter: Chao Sun


When the number of input partitions becomes 0 after dynamic filtering in 
{{BatchScanExec}}, SPJ currently fails with:
{code}
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135)
at 
org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD$lzycompute(BosonBatchScanExec.scala:28)
at 
org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD(BosonBatchScanExec.scala:28)
at 
org.apache.spark.sql.boson.BosonBatchScanExec.doExecuteColumnar(BosonBatchScanExec.scala:33)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243)
at 
org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:218)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:521)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
{code}

This is because {{groupPartitions}} will return {{None}} for this case.






[jira] [Assigned] (SPARK-44913) DS V2 supports push down V2 UDF that has magic method

2023-09-29 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-44913:


Assignee: Xianyang Liu

> DS V2 supports push down V2 UDF that has magic method
> -
>
> Key: SPARK-44913
> URL: https://issues.apache.org/jira/browse/SPARK-44913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Xianyang Liu
>Assignee: Xianyang Liu
>Priority: Major
>  Labels: pull-request-available
>
> Right now we only support pushing down a V2 UDF that does not have a magic
> method, because such a V2 UDF is analyzed into an `ApplyFunctionExpression`,
> which can be translated and pushed down. However, a V2 UDF that has a magic
> method is analyzed into `StaticInvoke` or `Invoke`, which cannot be
> translated into a V2 expression and therefore cannot be pushed down to the
> data source. Since the magic method is the recommended approach, this PR
> adds support for pushing down V2 UDFs that have a magic method.
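
A sketch of what such a UDF looks like (the function itself is hypothetical;
the pushdown translation added by this issue lives inside Spark, not in user
code):
{code}
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
import org.apache.spark.sql.types.{DataType, DataTypes}
import org.apache.spark.unsafe.types.UTF8String

class StrLen extends ScalarFunction[Integer] {
  override def inputTypes(): Array[DataType] = Array(DataTypes.StringType)
  override def resultType(): DataType = DataTypes.IntegerType
  override def name(): String = "strlen"

  // The magic method: the analyzer turns calls to this into
  // StaticInvoke/Invoke rather than ApplyFunctionExpression, which is why
  // such UDFs previously escaped DS V2 pushdown.
  def invoke(s: UTF8String): Int = s.numChars()
}
{code}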






[jira] [Resolved] (SPARK-44913) DS V2 supports push down V2 UDF that has magic method

2023-09-29 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-44913.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42612
[https://github.com/apache/spark/pull/42612]

> DS V2 supports push down V2 UDF that has magic method
> -
>
> Key: SPARK-44913
> URL: https://issues.apache.org/jira/browse/SPARK-44913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Xianyang Liu
>Assignee: Xianyang Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Right now we only support pushing down a V2 UDF that does not have a magic
> method, because such a V2 UDF is analyzed into an `ApplyFunctionExpression`,
> which can be translated and pushed down. However, a V2 UDF that has a magic
> method is analyzed into `StaticInvoke` or `Invoke`, which cannot be
> translated into a V2 expression and therefore cannot be pushed down to the
> data source. Since the magic method is the recommended approach, this PR
> adds support for pushing down V2 UDFs that have a magic method.






[jira] [Assigned] (SPARK-45365) Allow the daily tests of branch-3.4 to use the new test group tags

2023-09-28 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-45365:


Assignee: Yang Jie

> Allow the daily tests of branch-3.4 to use the new test group tags
> --
>
> Key: SPARK-45365
> URL: https://issues.apache.org/jira/browse/SPARK-45365
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-45365) Allow the daily tests of branch-3.4 to use the new test group tags

2023-09-28 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-45365.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

> Allow the daily tests of branch-3.4 to use the new test group tags
> --
>
> Key: SPARK-45365
> URL: https://issues.apache.org/jira/browse/SPARK-45365
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-36695) Allow passing V2 functions to data sources via V2 filters

2023-09-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-36695.
--
Resolution: Duplicate

> Allow passing V2 functions to data sources via V2 filters
> -
>
> Key: SPARK-36695
> URL: https://issues.apache.org/jira/browse/SPARK-36695
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> The V2 filter API currently only allows {{NamedReference}} in predicates that
> are pushed down to data sources. It may be beneficial to allow V2 functions
> in predicates as well so that we can implement function pushdown. This
> feature is also supported by Trino (Presto).
> One use case: we can push down predicates such as {{bucket(col, 32) = 10}},
> which allows data sources such as Iceberg to scan only a single partition.






[jira] [Updated] (SPARK-44647) SPJ: Support SPJ when join key is subset of partition keys

2023-09-13 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-44647:
-
Summary: SPJ: Support SPJ when join key is subset of partition keys  (was: 
Support SPJ when join key is subset of partition keys)

> SPJ: Support SPJ when join key is subset of partition keys
> --
>
> Key: SPARK-44647
> URL: https://issues.apache.org/jira/browse/SPARK-44647
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Szehon Ho
>Assignee: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45054) HiveExternalCatalog.listPartitions should restore Spark SQL stats

2023-09-01 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-45054:
-
Affects Version/s: 3.4.1
   3.3.2
   3.2.4
   (was: 3.5.0)

> HiveExternalCatalog.listPartitions should restore Spark SQL stats
> -
>
> Key: SPARK-45054
> URL: https://issues.apache.org/jira/browse/SPARK-45054
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.4, 3.3.2, 3.4.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.4.2, 3.5.0, 4.0.0
>
>
> If partitions are stored in HMS with Spark-populated stats such as 
> {{spark.sql.statistics.totalSize}}, currently 
> {{HiveExternalCatalog.listPartitions}} doesn't call 
> {{restorePartitionMetadata}} to restore those stats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45054) HiveExternalCatalog.listPartitions should restore Spark SQL stats

2023-09-01 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-45054:
-
Fix Version/s: 3.4.2
   3.5.0

> HiveExternalCatalog.listPartitions should restore Spark SQL stats
> -
>
> Key: SPARK-45054
> URL: https://issues.apache.org/jira/browse/SPARK-45054
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.4.2, 3.5.0, 4.0.0
>
>
> If partitions are stored in HMS with Spark-populated stats such as 
> {{spark.sql.statistics.totalSize}}, currently 
> {{HiveExternalCatalog.listPartitions}} doesn't call 
> {{restorePartitionMetadata}} to restore those stats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45054) HiveExternalCatalog.listPartitions should restore Spark SQL stats

2023-09-01 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-45054.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42777
[https://github.com/apache/spark/pull/42777]

> HiveExternalCatalog.listPartitions should restore Spark SQL stats
> -
>
> Key: SPARK-45054
> URL: https://issues.apache.org/jira/browse/SPARK-45054
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 4.0.0
>
>
> If partitions are stored in HMS with Spark-populated stats such as 
> {{spark.sql.statistics.totalSize}}, currently 
> {{HiveExternalCatalog.listPartitions}} doesn't call 
> {{restorePartitionMetadata}} to restore those stats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45054) HiveExternalCatalog.listPartitions should restore Spark SQL stats

2023-09-01 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-45054:


Assignee: Chao Sun

> HiveExternalCatalog.listPartitions should restore Spark SQL stats
> -
>
> Key: SPARK-45054
> URL: https://issues.apache.org/jira/browse/SPARK-45054
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> If partitions are stored in HMS with Spark-populated stats such as 
> {{spark.sql.statistics.totalSize}}, currently 
> {{HiveExternalCatalog.listPartitions}} doesn't call 
> {{restorePartitionMetadata}} to restore those stats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45054) HiveExternalCatalog.listPartitions should restore Spark SQL stats

2023-09-01 Thread Chao Sun (Jira)
Chao Sun created SPARK-45054:


 Summary: HiveExternalCatalog.listPartitions should restore Spark 
SQL stats
 Key: SPARK-45054
 URL: https://issues.apache.org/jira/browse/SPARK-45054
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Chao Sun


If partitions are stored in HMS with Spark-populated stats such as 
{{spark.sql.statistics.totalSize}}, currently 
{{HiveExternalCatalog.listPartitions}} doesn't call 
{{restorePartitionMetadata}} to restore those stats.
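A minimal sketch of the likely shape of the fix (the surrounding signatures are simplified assumptions; the point is mapping each partition through the restore step):

{code:java}
override def listPartitions(
    db: String,
    table: String,
    partialSpec: Option[TablePartitionSpec] = None): Seq[CatalogTablePartition] = withClient {
  val catalogTable = getTable(db, table)
  client.getPartitions(db, table, partialSpec).map { part =>
    // Convert Spark-populated parameters (e.g. spark.sql.statistics.totalSize)
    // back into catalog statistics instead of returning them raw.
    restorePartitionMetadata(part, catalogTable)
  }
}
{code}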



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45036) SPJ: Refactor logic to handle partially clustered distribution

2023-08-31 Thread Chao Sun (Jira)
Chao Sun created SPARK-45036:


 Summary: SPJ: Refactor logic to handle partially clustered 
distribution
 Key: SPARK-45036
 URL: https://issues.apache.org/jira/browse/SPARK-45036
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Chao Sun


The current logic for handling partially clustered distribution is a bit 
complicated. This JIRA proposes to simplify it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41471) SPJ: Reduce Spark shuffle when only one side of a join is KeyGroupedPartitioning

2023-08-24 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-41471:


Assignee: Jia Fan

> SPJ: Reduce Spark shuffle when only one side of a join is 
> KeyGroupedPartitioning
> 
>
> Key: SPARK-41471
> URL: https://issues.apache.org/jira/browse/SPARK-41471
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Assignee: Jia Fan
>Priority: Major
> Fix For: 4.0.0
>
>
> When only one side of a SPJ (Storage-Partitioned Join) is 
> {{KeyGroupedPartitioning}}, Spark currently needs to shuffle both sides 
> using {{HashPartitioning}}. However, we may just need to shuffle the 
> other side according to the partition transforms defined in 
> {{KeyGroupedPartitioning}}. This is especially useful when the other side 
> is relatively small.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41471) SPJ: Reduce Spark shuffle when only one side of a join is KeyGroupedPartitioning

2023-08-24 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-41471.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42194
[https://github.com/apache/spark/pull/42194]

> SPJ: Reduce Spark shuffle when only one side of a join is 
> KeyGroupedPartitioning
> 
>
> Key: SPARK-41471
> URL: https://issues.apache.org/jira/browse/SPARK-41471
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Priority: Major
> Fix For: 4.0.0
>
>
> When only one side of a SPJ (Storage-Partitioned Join) is 
> {{KeyGroupedPartitioning}}, Spark currently needs to shuffle both sides 
> using {{HashPartitioning}}. However, we may just need to shuffle the 
> other side according to the partition transforms defined in 
> {{KeyGroupedPartitioning}}. This is especially useful when the other side 
> is relatively small.
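At a usage level, the behavior could be exercised roughly as below (the second config key is an assumption based on this feature's PR; verify against the merged change):

{code:java}
// Enable SPJ and (assumed key) one-side shuffle for key-grouped joins.
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
spark.conf.set("spark.sql.sources.v2.bucketing.shuffle.enabled", "true")

// `items` reports KeyGroupedPartitioning(bucket(8, id)); `deltas` is a plain
// table. Only `deltas` needs a shuffle, re-arranged by bucket(8, id).
val joined = spark.table("testcat.ns.items")
  .join(spark.table("deltas"), "id")
{code}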



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44641) SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but conditions unmet

2023-08-07 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-44641.
--
Fix Version/s: 3.4.2
   3.5.0
 Assignee: Chao Sun
   Resolution: Fixed

> SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but 
> conditions unmet
> --
>
> Key: SPARK-44641
> URL: https://issues.apache.org/jira/browse/SPARK-44641
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Szehon Ho
>Assignee: Chao Sun
>Priority: Blocker
> Fix For: 3.4.2, 3.5.0
>
>
> Adding the following test case in KeyGroupedPartitionSuite demonstrates the 
> problem.
>  
> {code:java}
> test("test join key is the second partition key and a transform") {
>   val items_partitions = Array(bucket(8, "id"), days("arrive_time"))
>   createTable(items, items_schema, items_partitions)
>   sql(s"INSERT INTO testcat.ns.$items VALUES " +
>     s"(1, 'aa', 40.0, cast('2020-01-01' as timestamp)), " +
>     s"(1, 'aa', 41.0, cast('2020-01-15' as timestamp)), " +
>     s"(2, 'bb', 10.0, cast('2020-01-01' as timestamp)), " +
>     s"(2, 'bb', 10.5, cast('2020-01-01' as timestamp)), " +
>     s"(3, 'cc', 15.5, cast('2020-02-01' as timestamp))")
>   val purchases_partitions = Array(bucket(8, "item_id"), days("time"))
>   createTable(purchases, purchases_schema, purchases_partitions)
>   sql(s"INSERT INTO testcat.ns.$purchases VALUES " +
>     s"(1, 42.0, cast('2020-01-01' as timestamp)), " +
>     s"(1, 44.0, cast('2020-01-15' as timestamp)), " +
>     s"(1, 45.0, cast('2020-01-15' as timestamp)), " +
>     s"(2, 11.0, cast('2020-01-01' as timestamp)), " +
>     s"(3, 19.5, cast('2020-02-01' as timestamp))")
>   withSQLConf(
>     SQLConf.REQUIRE_ALL_CLUSTER_KEYS_FOR_CO_PARTITION.key -> "false",
>     SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true",
>     SQLConf.V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED.key ->
>       "true") {
>     val df = sql("SELECT id, name, i.price as purchase_price, " +
>       "p.item_id, p.price as sale_price " +
>       s"FROM testcat.ns.$items i JOIN testcat.ns.$purchases p " +
>       "ON i.arrive_time = p.time " +
>       "ORDER BY id, purchase_price, p.item_id, sale_price")
>     val shuffles = collectShuffles(df.queryExecution.executedPlan)
>     assert(!shuffles.isEmpty, "should not perform SPJ as not all join keys " +
>       "are partition keys")
>     checkAnswer(df,
>       Seq(
>         Row(1, "aa", 40.0, 1, 42.0),
>         Row(1, "aa", 40.0, 2, 11.0),
>         Row(1, "aa", 41.0, 1, 44.0),
>         Row(1, "aa", 41.0, 1, 45.0),
>         Row(2, "bb", 10.0, 1, 42.0),
>         Row(2, "bb", 10.0, 2, 11.0),
>         Row(2, "bb", 10.5, 1, 42.0),
>         Row(2, "bb", 10.5, 2, 11.0),
>         Row(3, "cc", 15.5, 3, 19.5)
>       )
>     )
>   }
> }{code}
>  
> Note: this test has set up the DataSource V2 to return multiple splits for 
> the same partition.
> In this case, SPJ is not triggered (because the join key does not match the 
> partition key), but the following code in DSV2Scan:
> [https://github.com/apache/spark/blob/v3.4.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L194]
> intended to fill the empty partitions for 'pushdown-value' will still iterate 
> through the non-grouped partitions and look up the grouped partitions to fill 
> the map, resulting in some duplicate input data being fed into the join.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44660) Relax constraint for columnar shuffle check in AQE

2023-08-03 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750881#comment-17750881
 ] 

Chao Sun commented on SPARK-44660:
--

In fact the check is necessary, but it seems 
{code}
postStageCreationRules(outputsColumnar = plan.supportsColumnar)
{code}

can be relaxed: if the new shuffle operator supports columnar, then maybe we 
shouldn't insert {{ColumnarToRow}} into this stage. This assumes the 
following stage knows the shuffle output is columnar and inserts a 
corresponding {{ColumnarToRow}} itself if necessary.
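A minimal sketch of what the relaxed check could look like (one possible shape, not the committed fix):

{code}
// Accept either a shuffle, or a columnar shuffle that the post-stage rules
// wrapped in ColumnarToRowExec; anything else remains an internal error.
newPlan match {
  case _: ShuffleExchangeLike => // still a shuffle: OK
  case ColumnarToRowExec(_: ShuffleExchangeLike) => // wrapped columnar shuffle: OK
  case _ =>
    throw SparkException.internalError(
      "Custom columnar rules cannot transform shuffle node to something else.")
}
{code}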

> Relax constraint for columnar shuffle check in AQE
> --
>
> Key: SPARK-44660
> URL: https://issues.apache.org/jira/browse/SPARK-44660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Chao Sun
>Priority: Major
>
> Currently in AQE, after evaluating the columnar rules, Spark will check if 
> the top operator of the stage is still a shuffle operator, and throw an 
> exception if it is not.
> {code}
> val optimized = e.withNewChildren(Seq(optimizeQueryStage(e.child,
>   isFinalStage = false)))
> val newPlan = applyPhysicalRules(
>   optimized,
>   postStageCreationRules(outputsColumnar = plan.supportsColumnar),
>   Some((planChangeLogger, "AQE Post Stage Creation")))
> if (e.isInstanceOf[ShuffleExchangeLike]) {
>   if (!newPlan.isInstanceOf[ShuffleExchangeLike]) {
>     throw SparkException.internalError(
>       "Custom columnar rules cannot transform shuffle node to something else.")
>   }
> {code}
> However, once a shuffle operator is transformed into a custom columnar 
> shuffle operator, the {{supportsColumnar}} of the new shuffle operator will 
> return true, and therefore the columnar rules will insert {{ColumnarToRow}} 
> on top of it. This means the {{newPlan}} is likely no longer a 
> {{ShuffleExchangeLike}} but a {{ColumnarToRow}}, and an exception will be 
> thrown, even though the use case is valid.
> This JIRA proposes to relax the check by allowing the above case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44641) SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but conditions unmet

2023-08-03 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-44641:
-
Priority: Blocker  (was: Major)

> SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but 
> conditions unmet
> --
>
> Key: SPARK-44641
> URL: https://issues.apache.org/jira/browse/SPARK-44641
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Szehon Ho
>Priority: Blocker
>
> Adding the following test case in KeyGroupedPartitionSuite demonstrates the 
> problem.
>  
> {code:java}
> test("test join key is the second partition key and a transform") {
>   val items_partitions = Array(bucket(8, "id"), days("arrive_time"))
>   createTable(items, items_schema, items_partitions)
>   sql(s"INSERT INTO testcat.ns.$items VALUES " +
>     s"(1, 'aa', 40.0, cast('2020-01-01' as timestamp)), " +
>     s"(1, 'aa', 41.0, cast('2020-01-15' as timestamp)), " +
>     s"(2, 'bb', 10.0, cast('2020-01-01' as timestamp)), " +
>     s"(2, 'bb', 10.5, cast('2020-01-01' as timestamp)), " +
>     s"(3, 'cc', 15.5, cast('2020-02-01' as timestamp))")
>   val purchases_partitions = Array(bucket(8, "item_id"), days("time"))
>   createTable(purchases, purchases_schema, purchases_partitions)
>   sql(s"INSERT INTO testcat.ns.$purchases VALUES " +
>     s"(1, 42.0, cast('2020-01-01' as timestamp)), " +
>     s"(1, 44.0, cast('2020-01-15' as timestamp)), " +
>     s"(1, 45.0, cast('2020-01-15' as timestamp)), " +
>     s"(2, 11.0, cast('2020-01-01' as timestamp)), " +
>     s"(3, 19.5, cast('2020-02-01' as timestamp))")
>   withSQLConf(
>     SQLConf.REQUIRE_ALL_CLUSTER_KEYS_FOR_CO_PARTITION.key -> "false",
>     SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true",
>     SQLConf.V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED.key ->
>       "true") {
>     val df = sql("SELECT id, name, i.price as purchase_price, " +
>       "p.item_id, p.price as sale_price " +
>       s"FROM testcat.ns.$items i JOIN testcat.ns.$purchases p " +
>       "ON i.arrive_time = p.time " +
>       "ORDER BY id, purchase_price, p.item_id, sale_price")
>     val shuffles = collectShuffles(df.queryExecution.executedPlan)
>     assert(!shuffles.isEmpty, "should not perform SPJ as not all join keys " +
>       "are partition keys")
>     checkAnswer(df,
>       Seq(
>         Row(1, "aa", 40.0, 1, 42.0),
>         Row(1, "aa", 40.0, 2, 11.0),
>         Row(1, "aa", 41.0, 1, 44.0),
>         Row(1, "aa", 41.0, 1, 45.0),
>         Row(2, "bb", 10.0, 1, 42.0),
>         Row(2, "bb", 10.0, 2, 11.0),
>         Row(2, "bb", 10.5, 1, 42.0),
>         Row(2, "bb", 10.5, 2, 11.0),
>         Row(3, "cc", 15.5, 3, 19.5)
>       )
>     )
>   }
> }{code}
>  
> Note: this test has set up the DataSource V2 to return multiple splits for 
> the same partition.
> In this case, SPJ is not triggered (because the join key does not match the 
> partition key), but the following code in DSV2Scan:
> [https://github.com/apache/spark/blob/v3.4.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L194]
> intended to fill the empty partitions for 'pushdown-value' will still iterate 
> through the non-grouped partitions and look up the grouped partitions to fill 
> the map, resulting in some duplicate input data being fed into the join.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44660) Relax constraint for columnar shuffle check in AQE

2023-08-03 Thread Chao Sun (Jira)
Chao Sun created SPARK-44660:


 Summary: Relax constraint for columnar shuffle check in AQE
 Key: SPARK-44660
 URL: https://issues.apache.org/jira/browse/SPARK-44660
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.1
Reporter: Chao Sun


Currently in AQE, after evaluating the columnar rules, Spark will check if the 
top operator of the stage is still a shuffle operator, and throw an exception 
if it is not.

{code}
val optimized = e.withNewChildren(Seq(optimizeQueryStage(e.child,
  isFinalStage = false)))
val newPlan = applyPhysicalRules(
  optimized,
  postStageCreationRules(outputsColumnar = plan.supportsColumnar),
  Some((planChangeLogger, "AQE Post Stage Creation")))
if (e.isInstanceOf[ShuffleExchangeLike]) {
  if (!newPlan.isInstanceOf[ShuffleExchangeLike]) {
    throw SparkException.internalError(
      "Custom columnar rules cannot transform shuffle node to something else.")
  }
{code}

However, once a shuffle operator is transformed into a custom columnar shuffle 
operator, the {{supportsColumnar}} of the new shuffle operator will return 
true, and therefore the columnar rules will insert {{ColumnarToRow}} on top of 
it. This means the {{newPlan}} is likely no longer a {{ShuffleExchangeLike}} 
but a {{ColumnarToRow}}, and an exception will be thrown, even though the use 
case is valid.

This JIRA proposes to relax the check by allowing the above case.







--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44659) SPJ: Include keyGroupedPartitioning in StoragePartitionJoinParams equality check

2023-08-03 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-44659:
-
Summary: SPJ: Include keyGroupedPartitioning in StoragePartitionJoinParams 
equality check  (was: Include keyGroupedPartitioning in 
StoragePartitionJoinParams equality check)

> SPJ: Include keyGroupedPartitioning in StoragePartitionJoinParams equality 
> check
> 
>
> Key: SPARK-44659
> URL: https://issues.apache.org/jira/browse/SPARK-44659
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Chao Sun
>Priority: Minor
>
> Currently {{StoragePartitionJoinParams}} doesn't include 
> {{keyGroupedPartitioning}} in its {{equals}} and {{hashCode}} computation. 
> For completeness, we should include it as well since it is a member of the 
> class.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44641) SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but conditions unmet

2023-08-03 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-44641:
-
Summary: SPJ: Results duplicated when SPJ partial-cluster and pushdown 
enabled but conditions unmet  (was: Results duplicated when SPJ partial-cluster 
and pushdown enabled but conditions unmet)

> SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but 
> conditions unmet
> --
>
> Key: SPARK-44641
> URL: https://issues.apache.org/jira/browse/SPARK-44641
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Szehon Ho
>Priority: Major
>
> Adding the following test case in KeyGroupedPartitionSuite demonstrates the 
> problem.
>  
> {code:java}
> test("test join key is the second partition key and a transform") {
>   val items_partitions = Array(bucket(8, "id"), days("arrive_time"))
>   createTable(items, items_schema, items_partitions)
>   sql(s"INSERT INTO testcat.ns.$items VALUES " +
>     s"(1, 'aa', 40.0, cast('2020-01-01' as timestamp)), " +
>     s"(1, 'aa', 41.0, cast('2020-01-15' as timestamp)), " +
>     s"(2, 'bb', 10.0, cast('2020-01-01' as timestamp)), " +
>     s"(2, 'bb', 10.5, cast('2020-01-01' as timestamp)), " +
>     s"(3, 'cc', 15.5, cast('2020-02-01' as timestamp))")
>   val purchases_partitions = Array(bucket(8, "item_id"), days("time"))
>   createTable(purchases, purchases_schema, purchases_partitions)
>   sql(s"INSERT INTO testcat.ns.$purchases VALUES " +
>     s"(1, 42.0, cast('2020-01-01' as timestamp)), " +
>     s"(1, 44.0, cast('2020-01-15' as timestamp)), " +
>     s"(1, 45.0, cast('2020-01-15' as timestamp)), " +
>     s"(2, 11.0, cast('2020-01-01' as timestamp)), " +
>     s"(3, 19.5, cast('2020-02-01' as timestamp))")
>   withSQLConf(
>     SQLConf.REQUIRE_ALL_CLUSTER_KEYS_FOR_CO_PARTITION.key -> "false",
>     SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true",
>     SQLConf.V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED.key ->
>       "true") {
>     val df = sql("SELECT id, name, i.price as purchase_price, " +
>       "p.item_id, p.price as sale_price " +
>       s"FROM testcat.ns.$items i JOIN testcat.ns.$purchases p " +
>       "ON i.arrive_time = p.time " +
>       "ORDER BY id, purchase_price, p.item_id, sale_price")
>     val shuffles = collectShuffles(df.queryExecution.executedPlan)
>     assert(!shuffles.isEmpty, "should not perform SPJ as not all join keys " +
>       "are partition keys")
>     checkAnswer(df,
>       Seq(
>         Row(1, "aa", 40.0, 1, 42.0),
>         Row(1, "aa", 40.0, 2, 11.0),
>         Row(1, "aa", 41.0, 1, 44.0),
>         Row(1, "aa", 41.0, 1, 45.0),
>         Row(2, "bb", 10.0, 1, 42.0),
>         Row(2, "bb", 10.0, 2, 11.0),
>         Row(2, "bb", 10.5, 1, 42.0),
>         Row(2, "bb", 10.5, 2, 11.0),
>         Row(3, "cc", 15.5, 3, 19.5)
>       )
>     )
>   }
> }{code}
>  
> Note: this test has set up the DataSource V2 to return multiple splits for 
> the same partition.
> In this case, SPJ is not triggered (because the join key does not match the 
> partition key), but the following code in DSV2Scan:
> [https://github.com/apache/spark/blob/v3.4.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L194]
> intended to fill the empty partitions for 'pushdown-value' will still iterate 
> through the non-grouped partitions and look up the grouped partitions to fill 
> the map, resulting in some duplicate input data being fed into the join.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44659) Include keyGroupedPartitioning in StoragePartitionJoinParams equality check

2023-08-03 Thread Chao Sun (Jira)
Chao Sun created SPARK-44659:


 Summary: Include keyGroupedPartitioning in 
StoragePartitionJoinParams equality check
 Key: SPARK-44659
 URL: https://issues.apache.org/jira/browse/SPARK-44659
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Chao Sun


Currently {{StoragePartitionJoinParams}} doesn't include 
{{keyGroupedPartitioning}} in its {{equals}} and {{hashCode}} computation. For 
completeness, we should include it as well since it is a member of the class.
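A minimal sketch of the intended change (member names other than {{keyGroupedPartitioning}} are assumed for illustration):

{code:java}
case class StoragePartitionJoinParams(
    keyGroupedPartitioning: Option[Seq[Expression]] = None,
    commonPartitionValues: Option[Seq[(InternalRow, Int)]] = None,
    applyPartialClustering: Boolean = false,
    replicatePartitions: Boolean = false) {

  override def equals(other: Any): Boolean = other match {
    case o: StoragePartitionJoinParams =>
      this.keyGroupedPartitioning == o.keyGroupedPartitioning && // newly included
        this.commonPartitionValues == o.commonPartitionValues &&
        this.applyPartialClustering == o.applyPartialClustering &&
        this.replicatePartitions == o.replicatePartitions
    case _ => false
  }

  override def hashCode(): Int = java.util.Objects.hash(
    keyGroupedPartitioning, commonPartitionValues,
    Boolean.box(applyPartialClustering), Boolean.box(replicatePartitions))
}
{code}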



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44641) Results duplicated when SPJ partial-cluster and pushdown enabled but conditions unmet

2023-08-03 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-44641:
-
Parent: SPARK-37375
Issue Type: Sub-task  (was: Bug)

> Results duplicated when SPJ partial-cluster and pushdown enabled but 
> conditions unmet
> -
>
> Key: SPARK-44641
> URL: https://issues.apache.org/jira/browse/SPARK-44641
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Szehon Ho
>Priority: Major
>
> Adding the following test case in KeyGroupedPartitionSuite demonstrates the 
> problem.
>  
> {code:java}
> test("test join key is the second partition key and a transform") {
>   val items_partitions = Array(bucket(8, "id"), days("arrive_time"))
>   createTable(items, items_schema, items_partitions)
>   sql(s"INSERT INTO testcat.ns.$items VALUES " +
>     s"(1, 'aa', 40.0, cast('2020-01-01' as timestamp)), " +
>     s"(1, 'aa', 41.0, cast('2020-01-15' as timestamp)), " +
>     s"(2, 'bb', 10.0, cast('2020-01-01' as timestamp)), " +
>     s"(2, 'bb', 10.5, cast('2020-01-01' as timestamp)), " +
>     s"(3, 'cc', 15.5, cast('2020-02-01' as timestamp))")
>   val purchases_partitions = Array(bucket(8, "item_id"), days("time"))
>   createTable(purchases, purchases_schema, purchases_partitions)
>   sql(s"INSERT INTO testcat.ns.$purchases VALUES " +
>     s"(1, 42.0, cast('2020-01-01' as timestamp)), " +
>     s"(1, 44.0, cast('2020-01-15' as timestamp)), " +
>     s"(1, 45.0, cast('2020-01-15' as timestamp)), " +
>     s"(2, 11.0, cast('2020-01-01' as timestamp)), " +
>     s"(3, 19.5, cast('2020-02-01' as timestamp))")
>   withSQLConf(
>     SQLConf.REQUIRE_ALL_CLUSTER_KEYS_FOR_CO_PARTITION.key -> "false",
>     SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true",
>     SQLConf.V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED.key ->
>       "true") {
>     val df = sql("SELECT id, name, i.price as purchase_price, " +
>       "p.item_id, p.price as sale_price " +
>       s"FROM testcat.ns.$items i JOIN testcat.ns.$purchases p " +
>       "ON i.arrive_time = p.time " +
>       "ORDER BY id, purchase_price, p.item_id, sale_price")
>     val shuffles = collectShuffles(df.queryExecution.executedPlan)
>     assert(!shuffles.isEmpty, "should not perform SPJ as not all join keys " +
>       "are partition keys")
>     checkAnswer(df,
>       Seq(
>         Row(1, "aa", 40.0, 1, 42.0),
>         Row(1, "aa", 40.0, 2, 11.0),
>         Row(1, "aa", 41.0, 1, 44.0),
>         Row(1, "aa", 41.0, 1, 45.0),
>         Row(2, "bb", 10.0, 1, 42.0),
>         Row(2, "bb", 10.0, 2, 11.0),
>         Row(2, "bb", 10.5, 1, 42.0),
>         Row(2, "bb", 10.5, 2, 11.0),
>         Row(3, "cc", 15.5, 3, 19.5)
>       )
>     )
>   }
> }{code}
>  
> Note: this test has set up the DataSource V2 to return multiple splits for 
> the same partition.
> In this case, SPJ is not triggered (because the join key does not match the 
> partition key), but the following code in DSV2Scan:
> [https://github.com/apache/spark/blob/v3.4.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L194]
> intended to fill the empty partitions for 'pushdown-value' will still iterate 
> through the non-grouped partitions and look up the grouped partitions to fill 
> the map, resulting in some duplicate input data being fed into the join.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42454) SPJ: encapsulate all SPJ related parameters in BatchScanExec

2023-07-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-42454:
-
Fix Version/s: 3.5.0
   (was: 4.0.0)

> SPJ: encapsulate all SPJ related parameters in BatchScanExec
> 
>
> Key: SPARK-42454
> URL: https://issues.apache.org/jira/browse/SPARK-42454
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Assignee: Szehon Ho
>Priority: Minor
> Fix For: 3.5.0
>
>
> The list of SPJ parameters in {{BatchScanExec}} keeps growing, which is 
> annoying since there are many places that pattern-match on 
> {{BatchScanExec}} and have to change accordingly. 
> To make this less disruptive, we can introduce a struct holding all the SPJ 
> parameters and use that as the parameter for {{BatchScanExec}}.
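Roughly the idea, with illustrative field names rather than the exact committed signature:

{code:java}
// Before: every new SPJ knob widens the constructor and breaks existing
// pattern matches on BatchScanExec, e.g.
//   case class BatchScanExec(output, scan, runtimeFilters,
//     keyGroupedPartitioning, ordering, applyPartialClustering, ...)

// After: a single struct carries all SPJ-related parameters.
case class BatchScanExec(
    output: Seq[AttributeReference],
    @transient scan: Scan,
    runtimeFilters: Seq[Expression],
    spjParams: StoragePartitionJoinParams = StoragePartitionJoinParams())
{code}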



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42454) SPJ: encapsulate all SPJ related parameters in BatchScanExec

2023-07-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-42454.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 41990
[https://github.com/apache/spark/pull/41990]

> SPJ: encapsulate all SPJ related parameters in BatchScanExec
> 
>
> Key: SPARK-42454
> URL: https://issues.apache.org/jira/browse/SPARK-42454
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Priority: Minor
> Fix For: 4.0.0
>
>
> The list of SPJ parameters in {{BatchScanExec}} keeps growing, which is 
> annoying since there are many places that pattern-match on 
> {{BatchScanExec}} and have to change accordingly. 
> To make this less disruptive, we can introduce a struct holding all the SPJ 
> parameters and use that as the parameter for {{BatchScanExec}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42454) SPJ: encapsulate all SPJ related parameters in BatchScanExec

2023-07-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-42454:


Assignee: Szehon Ho

> SPJ: encapsulate all SPJ related parameters in BatchScanExec
> 
>
> Key: SPARK-42454
> URL: https://issues.apache.org/jira/browse/SPARK-42454
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Assignee: Szehon Ho
>Priority: Minor
> Fix For: 4.0.0
>
>
> The list of SPJ parameters in {{BatchScanExec}} keeps growing, which is 
> annoying since there are many places that pattern-match on 
> {{BatchScanExec}} and have to change accordingly. 
> To make this less disruptive, we can introduce a struct holding all the SPJ 
> parameters and use that as the parameter for {{BatchScanExec}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36612) Support left outer join build left or right outer join build right in shuffled hash join

2023-06-02 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-36612.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

> Support left outer join build left or right outer join build right in 
> shuffled hash join
> 
>
> Key: SPARK-36612
> URL: https://issues.apache.org/jira/browse/SPARK-36612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: mcdull_zhang
>Assignee: Szehon Ho
>Priority: Major
> Fix For: 3.5.0
>
>
> Currently Spark SQL does not support building the left side for a left outer 
> join (or building the right side for a right outer join) in shuffled hash 
> join.
> However, in our production environment there are many scenarios where small 
> tables are left-joined with large tables, and the large tables often have 
> data skew (which AQE currently cannot handle).
> Inspired by SPARK-32399, we can use a similar idea to support building the 
> left side for a left outer join.
> I think this improvement would be very meaningful; how do members feel about 
> it?
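For example (table names are hypothetical), with build-left support a query like the following could build the hash map on the small left side rather than the large right side:

{code:java}
// SHUFFLE_HASH(s) requests a shuffled hash join built on `s`; with
// LEFT OUTER JOIN build-left support the small table can be the build side.
val df = spark.sql(
  """SELECT /*+ SHUFFLE_HASH(s) */ s.id, l.payload
    |FROM small_table s
    |LEFT JOIN large_table l ON s.id = l.id""".stripMargin)
{code}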



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36612) Support left outer join build left or right outer join build right in shuffled hash join

2023-06-02 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-36612:


Assignee: Szehon Ho

> Support left outer join build left or right outer join build right in 
> shuffled hash join
> 
>
> Key: SPARK-36612
> URL: https://issues.apache.org/jira/browse/SPARK-36612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: mcdull_zhang
>Assignee: Szehon Ho
>Priority: Major
>
> Currently Spark SQL does not support building the left side for a left outer 
> join (or building the right side for a right outer join) in shuffled hash 
> join.
> However, in our production environment there are many scenarios where small 
> tables are left-joined with large tables, and the large tables often have 
> data skew (which AQE currently cannot handle).
> Inspired by SPARK-32399, we can use a similar idea to support building the 
> left side for a left outer join.
> I think this improvement would be very meaningful; how do members feel about 
> it?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43758) Upgrade snappy-java to 1.1.10.0

2023-05-23 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-43758:
-
Issue Type: Bug  (was: Improvement)

> Upgrade snappy-java to 1.1.10.0
> ---
>
> Key: SPARK-43758
> URL: https://issues.apache.org/jira/browse/SPARK-43758
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Chao Sun
>Priority: Major
>
> Update {{snappy-java}} to 1.1.10.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43758) Upgrade snappy-java to 1.1.10.0

2023-05-23 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-43758:
-
Affects Version/s: 3.4.0
   (was: 3.5.0)

> Upgrade snappy-java to 1.1.10.0
> ---
>
> Key: SPARK-43758
> URL: https://issues.apache.org/jira/browse/SPARK-43758
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Chao Sun
>Priority: Major
>
> Update {{snappy-java}} to 1.1.10.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43758) Upgrade snappy-java to 1.1.10.0

2023-05-23 Thread Chao Sun (Jira)
Chao Sun created SPARK-43758:


 Summary: Upgrade snappy-java to 1.1.10.0
 Key: SPARK-43758
 URL: https://issues.apache.org/jira/browse/SPARK-43758
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Chao Sun


Update {{snappy-java}} to 1.1.10.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43494) Directly call `replicate()` instead of reflection in `SparkHadoopUtil#createFile`

2023-05-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43494:


Assignee: Yang Jie

> Directly call `replicate()` instead of reflection in 
> `SparkHadoopUtil#createFile`
> -
>
> Key: SPARK-43494
> URL: https://issues.apache.org/jira/browse/SPARK-43494
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43494) Directly call `replicate()` instead of reflection in `SparkHadoopUtil#createFile`

2023-05-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43494.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41164
[https://github.com/apache/spark/pull/41164]

> Directly call `replicate()` instead of reflection in 
> `SparkHadoopUtil#createFile`
> -
>
> Key: SPARK-43494
> URL: https://issues.apache.org/jira/browse/SPARK-43494
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
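For context, a hedged sketch of the direction (the method shape is simplified; the point is the direct Hadoop 3 builder calls replacing reflection):

{code:java}
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}

// With Hadoop 3 as the minimum supported version, the builder API can be
// called directly; previously replicate() had to be invoked via reflection
// so the code would still compile against Hadoop 2.
def createFile(fs: FileSystem, path: Path, allowEC: Boolean): FSDataOutputStream =
  if (allowEC) fs.create(path)
  else fs.createFile(path).replicate().build()
{code}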




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43272) Replace reflection w/ direct calling for `SparkHadoopUtil#createFile`

2023-05-12 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43272:


Assignee: Yang Jie

> Replace reflection w/ direct calling for  `SparkHadoopUtil#createFile`
> --
>
> Key: SPARK-43272
> URL: https://issues.apache.org/jira/browse/SPARK-43272
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43272) Replace reflection w/ direct calling for `SparkHadoopUtil#createFile`

2023-05-12 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43272.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40945
[https://github.com/apache/spark/pull/40945]

> Replace reflection w/ direct calling for  `SparkHadoopUtil#createFile`
> --
>
> Key: SPARK-43272
> URL: https://issues.apache.org/jira/browse/SPARK-43272
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43484) Kafka/Kinesis Assembly should not package hadoop-client-runtime

2023-05-12 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43484.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41152
[https://github.com/apache/spark/pull/41152]

> Kafka/Kinesis Assembly should not package hadoop-client-runtime
> ---
>
> Key: SPARK-43484
> URL: https://issues.apache.org/jira/browse/SPARK-43484
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Structured Streaming
>Affects Versions: 3.2.4, 3.3.2, 3.4.0, 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43484) Kafka/Kinesis Assembly should not package hadoop-client-runtime

2023-05-12 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43484:


Assignee: Cheng Pan

> Kafka/Kinesis Assembly should not package hadoop-client-runtime
> ---
>
> Key: SPARK-43484
> URL: https://issues.apache.org/jira/browse/SPARK-43484
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Structured Streaming
>Affects Versions: 3.2.4, 3.3.2, 3.4.0, 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43410) Improve vectorized loop for Packed skipValues

2023-05-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43410:


Assignee: xiaochen zhou

> Improve vectorized loop for Packed skipValues
> -
>
> Key: SPARK-43410
> URL: https://issues.apache.org/jira/browse/SPARK-43410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: xiaochen zhou
>Assignee: xiaochen zhou
>Priority: Minor
> Fix For: 3.5.0
>
>
> Improve vectorized loop for Packed skipValues



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43410) Improve vectorized loop for Packed skipValues

2023-05-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43410.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41092
[https://github.com/apache/spark/pull/41092]

> Improve vectorized loop for Packed skipValues
> -
>
> Key: SPARK-43410
> URL: https://issues.apache.org/jira/browse/SPARK-43410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: xiaochen zhou
>Priority: Minor
> Fix For: 3.5.0
>
>
> Improve vectorized loop for Packed skipValues



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43248) Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43248.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40920
[https://github.com/apache/spark/pull/40920]

> Unnecessary serialize/deserialize of Path on parallel gather partition stats
> 
>
> Key: SPARK-43248
> URL: https://issues.apache.org/jira/browse/SPARK-43248
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43248) Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-26 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43248:


Assignee: Cheng Pan

> Unnecessary serialize/deserialize of Path on parallel gather partition stats
> 
>
> Key: SPARK-43248
> URL: https://issues.apache.org/jira/browse/SPARK-43248
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43268) Use proper error classes when exceptions are constructed with a message

2023-04-24 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43268:


Assignee: Anton Okolnychyi

> Use proper error classes when exceptions are constructed with a message
> ---
>
> Key: SPARK-43268
> URL: https://issues.apache.org/jira/browse/SPARK-43268
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
>
> As discussed 
> [here|https://github.com/apache/spark/pull/40679/files#r1159264585].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43268) Use proper error classes when exceptions are constructed with a message

2023-04-24 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43268.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40934
[https://github.com/apache/spark/pull/40934]

> Use proper error classes when exceptions are constructed with a message
> ---
>
> Key: SPARK-43268
> URL: https://issues.apache.org/jira/browse/SPARK-43268
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.5.0
>
>
> As discussed 
> [here|https://github.com/apache/spark/pull/40679/files#r1159264585].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43211) Remove Hadoop2 support in IsolatedClientLoader

2023-04-20 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43211.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40870
[https://github.com/apache/spark/pull/40870]

> Remove Hadoop2 support in IsolatedClientLoader
> --
>
> Key: SPARK-43211
> URL: https://issues.apache.org/jira/browse/SPARK-43211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43211) Remove Hadoop2 support in IsolatedClientLoader

2023-04-20 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43211:


Assignee: Cheng Pan

> Remove Hadoop2 support in IsolatedClientLoader
> --
>
> Key: SPARK-43211
> URL: https://issues.apache.org/jira/browse/SPARK-43211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43202) Replace reflection w/ direct calling for YARN Resource API

2023-04-20 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43202.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40860
[https://github.com/apache/spark/pull/40860]

> Replace reflection w/ direct calling for YARN Resource API
> --
>
> Key: SPARK-43202
> URL: https://issues.apache.org/jira/browse/SPARK-43202
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43202) Replace reflection w/ direct calling for YARN Resource API

2023-04-20 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43202:


Assignee: Cheng Pan

> Replace reflection w/ direct calling for YARN Resource API
> --
>
> Key: SPARK-43202
> URL: https://issues.apache.org/jira/browse/SPARK-43202
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43208) IsolatedClassLoader should close barrier class InputStream after reading

2023-04-20 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43208:


Assignee: Cheng Pan

> IsolatedClassLoader should close barrier class InputStream after reading
> 
>
> Key: SPARK-43208
> URL: https://issues.apache.org/jira/browse/SPARK-43208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43208) IsolatedClassLoader should close barrier class InputStream after reading

2023-04-20 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43208.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40867
[https://github.com/apache/spark/pull/40867]

> IsolatedClassLoader should close barrier class InputStream after reading
> 
>
> Key: SPARK-43208
> URL: https://issues.apache.org/jira/browse/SPARK-43208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>
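A minimal sketch of the fix the title describes (surrounding names such as {{classToPath}} are assumptions for illustration):

{code:java}
// Read the barrier class bytes and close the stream afterwards; previously
// the InputStream returned by getResourceAsStream was never closed.
val in = baseClassLoader.getResourceAsStream(classToPath(name))
try {
  val bytes = org.apache.commons.io.IOUtils.toByteArray(in)
  defineClass(name, bytes, 0, bytes.length)
} finally {
  in.close()
}
{code}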




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43196) Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`

2023-04-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43196:


Assignee: Yang Jie

> Replace reflection w/ direct calling for 
> `ContainerLaunchContext#setTokensConf`
> ---
>
> Key: SPARK-43196
> URL: https://issues.apache.org/jira/browse/SPARK-43196
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43196) Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`

2023-04-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43196.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40855
[https://github.com/apache/spark/pull/40855]

> Replace reflection w/ direct calling for 
> `ContainerLaunchContext#setTokensConf`
> ---
>
> Key: SPARK-43196
> URL: https://issues.apache.org/jira/browse/SPARK-43196
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43191) Replace reflection w/ direct calling for Hadoop CallerContext

2023-04-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43191.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40850
[https://github.com/apache/spark/pull/40850]

> Replace reflection w/ direct calling for Hadoop CallerContext 
> --
>
> Key: SPARK-43191
> URL: https://issues.apache.org/jira/browse/SPARK-43191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43191) Replace reflection w/ direct calling for Hadoop CallerContext

2023-04-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43191:


Assignee: Cheng Pan

> Replace reflection w/ direct calling for Hadoop CallerContext 
> --
>
> Key: SPARK-43191
> URL: https://issues.apache.org/jira/browse/SPARK-43191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43200) Remove Hadoop 2 reference in docs

2023-04-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43200.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40857
[https://github.com/apache/spark/pull/40857]

> Remove Hadoop 2 reference in docs
> -
>
> Key: SPARK-43200
> URL: https://issues.apache.org/jira/browse/SPARK-43200
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43200) Remove Hadoop 2 reference in docs

2023-04-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43200:


Assignee: Cheng Pan

> Remove Hadoop 2 reference in docs
> -
>
> Key: SPARK-43200
> URL: https://issues.apache.org/jira/browse/SPARK-43200
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43195) Remove unnecessary serializable wrapper in HadoopFSUtils

2023-04-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43195.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40854
[https://github.com/apache/spark/pull/40854]

> Remove unnecessary serializable wrapper in HadoopFSUtils
> 
>
> Key: SPARK-43195
> URL: https://issues.apache.org/jira/browse/SPARK-43195
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43195) Remove unnecessary serializable wrapper in HadoopFSUtils

2023-04-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43195:


Assignee: Cheng Pan

> Remove unnecessary serializable wrapper in HadoopFSUtils
> 
>
> Key: SPARK-43195
> URL: https://issues.apache.org/jira/browse/SPARK-43195
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43187) Remove workaround for MiniKdc's BindException

2023-04-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43187.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40849
[https://github.com/apache/spark/pull/40849]

> Remove workaround for MiniKdc's BindException
> -
>
> Key: SPARK-43187
> URL: https://issues.apache.org/jira/browse/SPARK-43187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43187) Remove workaround for MiniKdc's BindException

2023-04-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43187:


Assignee: Cheng Pan

> Remove workaround for MiniKdc's BindException
> -
>
> Key: SPARK-43187
> URL: https://issues.apache.org/jira/browse/SPARK-43187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43186) Remove workaround for FileSinkDesc

2023-04-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43186.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40848
[https://github.com/apache/spark/pull/40848]

> Remove workaround for FileSinkDesc
> --
>
> Key: SPARK-43186
> URL: https://issues.apache.org/jira/browse/SPARK-43186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43186) Remove workaround for FileSinkDesc

2023-04-19 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43186:


Assignee: Cheng Pan

> Remove workaround for FileSinkDesc
> --
>
> Key: SPARK-43186
> URL: https://issues.apache.org/jira/browse/SPARK-43186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42452) Remove hadoop-2 profile from Apache Spark

2023-04-18 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-42452.
--
Fix Version/s: 3.5.0
 Assignee: Yang Jie
   Resolution: Fixed

> Remove hadoop-2 profile from Apache Spark
> -
>
> Key: SPARK-42452
> URL: https://issues.apache.org/jira/browse/SPARK-42452
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> SPARK-40651 dropped the Hadoop 2 binary distribution from the release
> process, and SPARK-42447 removed the Hadoop 2 GitHub Action job.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42388) Avoid unnecessary parquet footer reads when no filters in vectorized reader

2023-04-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-42388.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 39950
[https://github.com/apache/spark/pull/39950]

> Avoid unnecessary parquet footer reads when no filters in vectorized reader
> ---
>
> Key: SPARK-42388
> URL: https://issues.apache.org/jira/browse/SPARK-42388
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Mars
>Assignee: Mars
>Priority: Major
> Fix For: 3.5.0
>
>
> The Parquet footer is currently read twice in the vectorized Parquet reader,
> even when there are no filters requiring pushdown. When the NameNode is under
> high pressure, the second read costs extra time. These unnecessary footer
> reads can be avoided by reusing the footer metadata in
> {{VectorizedParquetRecordReader}}.
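A rough sketch of the intended pattern, reading the footer a single time and
reusing it, shown here with the public parquet-hadoop API rather than Spark's
internal reader (the file path is a placeholder):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Open the file and fetch the footer once, then reuse the metadata for both
// planning (schema, row count) and the actual scan, instead of issuing a
// second footer read against the NameNode.
val conf = new Configuration()
val inputFile = HadoopInputFile.fromPath(new Path("/tmp/part-00000.parquet"), conf)
val reader = ParquetFileReader.open(inputFile)
try {
  val footer = reader.getFooter // ParquetMetadata, read once
  val schema = footer.getFileMetaData.getSchema
  println(s"schema=$schema rows=${reader.getRecordCount}")
} finally {
  reader.close()
}
{code}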



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42388) Avoid unnecessary parquet footer reads when no filters in vectorized reader

2023-04-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-42388:


Assignee: Mars

> Avoid unnecessary parquet footer reads when no filters in vectorized reader
> ---
>
> Key: SPARK-42388
> URL: https://issues.apache.org/jira/browse/SPARK-42388
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> The Parquet footer is currently read twice in the vectorized Parquet reader,
> even when there are no filters requiring pushdown. When the NameNode is under
> high pressure, the second read costs extra time. These unnecessary footer
> reads can be avoided by reusing the footer metadata in
> {{VectorizedParquetRecordReader}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43150) Remove workaround for PARQUET-2160

2023-04-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43150:


Assignee: Cheng Pan

> Remove workaround for PARQUET-2160
> --
>
> Key: SPARK-43150
> URL: https://issues.apache.org/jira/browse/SPARK-43150
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43150) Remove workaround for PARQUET-2160

2023-04-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43150.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40802
[https://github.com/apache/spark/pull/40802]

> Remove workaround for PARQUET-2160
> --
>
> Key: SPARK-43150
> URL: https://issues.apache.org/jira/browse/SPARK-43150
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43064) Spark SQL CLI SQL tab should only show each statement once

2023-04-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43064.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40701
[https://github.com/apache/spark/pull/40701]

> Spark SQL CLI SQL tab should only show each statement once
> --
>
> Key: SPARK-43064
> URL: https://issues.apache.org/jira/browse/SPARK-43064
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.5.0
>
> Attachments: screenshot-1.png
>
>
> !screenshot-1.png|width=996,height=554!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43064) Spark SQL CLI SQL tab should only show each statement once

2023-04-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43064:


Assignee: angerszhu

> Spark SQL CLI SQL tab should only show each statement once
> --
>
> Key: SPARK-43064
> URL: https://issues.apache.org/jira/browse/SPARK-43064
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> !screenshot-1.png|width=996,height=554!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43104) Set `shadeTestJar` of protobuf module to false

2023-04-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43104:


Assignee: Yang Jie

> Set `shadeTestJar` of protobuf module to false
> --
>
> Key: SPARK-43104
> URL: https://issues.apache.org/jira/browse/SPARK-43104
> Project: Spark
>  Issue Type: Improvement
>  Components: Protobuf
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43104) Set `shadeTestJar` of protobuf module to false

2023-04-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43104.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40753
[https://github.com/apache/spark/pull/40753]

> Set `shadeTestJar` of protobuf module to false
> --
>
> Key: SPARK-43104
> URL: https://issues.apache.org/jira/browse/SPARK-43104
> Project: Spark
>  Issue Type: Improvement
>  Components: Protobuf
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


