[jira] [Updated] (SPARK-43589) Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43589:
--
Affects Version/s: 3.3.2

> Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`
> ---
>
> Key: SPARK-43589
> URL: https://issues.apache.org/jira/browse/SPARK-43589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43589) Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43589:
--
Priority: Minor  (was: Trivial)

> Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`
> ---
>
> Key: SPARK-43589
> URL: https://issues.apache.org/jira/browse/SPARK-43589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43589) Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43589:
--
Issue Type: Bug  (was: Improvement)

> Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`
> ---
>
> Key: SPARK-43589
> URL: https://issues.apache.org/jira/browse/SPARK-43589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43589) Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`

2023-05-18 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-43589:
-

 Summary: Fix `cannotBroadcastTableOverMaxTableBytesError` to use 
`bytesToString`
 Key: SPARK-43589
 URL: https://issues.apache.org/jira/browse/SPARK-43589
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun
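
A minimal sketch of the intended change (assuming the error builder receives the broadcast limit and the actual table size as raw byte counts; the parameter names and message text below are illustrative, not the exact signature in QueryExecutionErrors): format both values with `Utils.bytesToString` so the message reads, e.g., "8.0 GiB" instead of a raw byte count.

{code:scala}
import org.apache.spark.SparkException
import org.apache.spark.util.Utils

// Hedged sketch only: parameter names and wording are assumptions, not the actual method.
def cannotBroadcastTableOverMaxTableBytesError(
    maxBroadcastTableBytes: Long,
    dataSize: Long): Throwable = {
  new SparkException(
    "Cannot broadcast the table that is larger than " +
      s"${Utils.bytesToString(maxBroadcastTableBytes)}: " +
      s"actual size ${Utils.bytesToString(dataSize)}")
}
{code}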






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43587) Run HealthTrackerIntegrationSuite in a dedicated JVM

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43587:
-

Assignee: Dongjoon Hyun

> Run HealthTrackerIntegrationSuite in a dedicated JVM
> ---
>
> Key: SPARK-43587
> URL: https://issues.apache.org/jira/browse/SPARK-43587
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43587) Run HealthTrackerIntegrationSuite in a dedicated JVM

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43587.
---
Fix Version/s: 3.3.3
   3.5.0
   3.4.1
   Resolution: Fixed

Issue resolved by pull request 41229
[https://github.com/apache/spark/pull/41229]

> Run HealthTrackerIntegrationSuite in a dedicated JVM
> ---
>
> Key: SPARK-43587
> URL: https://issues.apache.org/jira/browse/SPARK-43587
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.3.3, 3.5.0, 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43588) Upgrade ASM to 9.5

2023-05-18 Thread Yang Jie (Jira)
Yang Jie created SPARK-43588:


 Summary: Upgrade ASM to 9.5
 Key: SPARK-43588
 URL: https://issues.apache.org/jira/browse/SPARK-43588
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie


ASM 9.5 is for Java 21



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43588) Upgrade ASM to 9.5

2023-05-18 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-43588:
-
Description: 
ASM 9.5 is for Java 21

 

https://asm.ow2.io/versions.html

  was:ASM 9.5 is for Java 21


> Upgrade ASM to 9.5
> --
>
> Key: SPARK-43588
> URL: https://issues.apache.org/jira/browse/SPARK-43588
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> ASM 9.5 is for Java 21
>  
> https://asm.ow2.io/versions.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43582) Upgrade `sbt-pom-reader` to 2.4.0

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43582.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41224
[https://github.com/apache/spark/pull/41224]

> Upgrade `sbt-pom-reader` to 2.4.0
> -
>
> Key: SPARK-43582
> URL: https://issues.apache.org/jira/browse/SPARK-43582
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43582) Upgrade `sbt-pom-reader` to 2.4.0

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43582:
-

Assignee: BingKun Pan

> Upgrade `sbt-pom-reader` to 2.4.0
> -
>
> Key: SPARK-43582
> URL: https://issues.apache.org/jira/browse/SPARK-43582
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43587) Run HealthTrackerIntegrationSuite in a dedicated JVM

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43587:
--
Affects Version/s: 3.4.0
   3.3.2
   (was: 3.5.0)

> Run HealthTrackerIntegrationSuite in a dedicated JVM
> ---
>
> Key: SPARK-43587
> URL: https://issues.apache.org/jira/browse/SPARK-43587
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43586) There will be many invalid tasks when `Range.numSlices` > `Range.numElements`

2023-05-18 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-43586:
-
Priority: Minor  (was: Major)

> There will be many invalid tasks when `Range.numSlices` > `Range.numElements`
> -
>
> Key: SPARK-43586
> URL: https://issues.apache.org/jira/browse/SPARK-43586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: image-2023-05-19-13-01-19-589.png
>
>
> For example, start a Spark shell with `--master "local[100]"` and run 
> `spark.range(10).map(_ + 1).reduce(_ + _)`; there will be 100 tasks in the 
> job even though the Range has only 10 elements:
> !image-2023-05-19-13-01-19-589.png|width=733,height=203!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43586) There will be many invalid tasks when `Range.numSlices` > `Range.numElements`

2023-05-18 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-43586:
-
Description: 
For example, start a Spark shell with `--master "local[100]"` and run 
`spark.range(10).map(_ + 1).reduce(_ + _)`; there will be 100 tasks in the job 
even though the Range has only 10 elements:

!image-2023-05-19-13-01-19-589.png|width=733,height=203!
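
A minimal reproduction sketch of the empty-partition behaviour (assumes a Spark shell started with `--master "local[100]"`, so the default parallelism is 100; not code from this ticket):

{code:scala}
// spark.range(10) defaults to numSlices = defaultParallelism = 100 here,
// so 90 of the 100 partitions are empty yet still schedule one task each.
val df = spark.range(10)
println(df.rdd.getNumPartitions)                            // 100
println(df.rdd.glom().map(_.length).filter(_ == 0).count()) // 90 empty partitions
{code}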

 

> There will be many invalid tasks when `Range.numSlices` > `Range.numElements`
> -
>
> Key: SPARK-43586
> URL: https://issues.apache.org/jira/browse/SPARK-43586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
> Attachments: image-2023-05-19-13-01-19-589.png
>
>
> For example, start a Spark shell with `--master "local[100]"` and run 
> `spark.range(10).map(_ + 1).reduce(_ + _)`; there will be 100 tasks in the 
> job even though the Range has only 10 elements:
> !image-2023-05-19-13-01-19-589.png|width=733,height=203!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43586) There will be many invalid tasks when `Range.numSlices` > `Range.numElements`

2023-05-18 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-43586:
-
Attachment: image-2023-05-19-13-01-19-589.png

> There will be many invalid tasks when `Range.numSlices` > `Range.numElements`
> -
>
> Key: SPARK-43586
> URL: https://issues.apache.org/jira/browse/SPARK-43586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
> Attachments: image-2023-05-19-13-01-19-589.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43572) Add a test for scrollable result set through thrift server

2023-05-18 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-43572.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41213
[https://github.com/apache/spark/pull/41213]

> Add a test for scrollable result set through thrift server
> --
>
> Key: SPARK-43572
> URL: https://issues.apache.org/jira/browse/SPARK-43572
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Priority: Major
> Fix For: 3.5.0
>
>
> improve jdbc server test coverage
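
A rough sketch of what a scrollable-result-set test can exercise through the thrift server's JDBC interface (the URL, port, and query below are placeholders and assumptions, not the test added by the PR):

{code:scala}
import java.sql.{DriverManager, ResultSet}

// Hypothetical connection details; a real test would target the embedded thrift server.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
val stmt = conn.createStatement(ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY)
val rs = stmt.executeQuery("SELECT id FROM range(5)")
rs.afterLast()
while (rs.previous()) println(rs.getLong("id")) // scroll backwards: 4, 3, 2, 1, 0
rs.close(); stmt.close(); conn.close()
{code}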



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43572) Add a test for scrollable result set through thrift server

2023-05-18 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-43572:
---

Assignee: Kent Yao

> Add a test for scrollable result set through thrift server
> --
>
> Key: SPARK-43572
> URL: https://issues.apache.org/jira/browse/SPARK-43572
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.5.0
>
>
> improve jdbc server test coverage



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43586) There will be many invalid tasks when `Range.numSlices` > `Range.numElements`

2023-05-18 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-43586:
-
Summary: There will be many invalid tasks when `Range.numSlices` > 
`Range.numElements`  (was: there will be many invalid tasks when 
`Range.numSlices` > `Range.numElements`)

> There will be many invalid tasks when `Range.numSlices` > `Range.numElements`
> -
>
> Key: SPARK-43586
> URL: https://issues.apache.org/jira/browse/SPARK-43586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43586) there will be many invalid tasks when `Range.numSlices` > `Range.numElements`

2023-05-18 Thread Yang Jie (Jira)
Yang Jie created SPARK-43586:


 Summary: there will be many invalid tasks when `Range.numSlices` > 
`Range.numElements`
 Key: SPARK-43586
 URL: https://issues.apache.org/jira/browse/SPARK-43586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43587) Run HealthTrackerIntegrationSuite in a dedicated JVM

2023-05-18 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-43587:
-

 Summary: Run HealthTrackerIntegrationSuite in a dedicated JVM
 Key: SPARK-43587
 URL: https://issues.apache.org/jira/browse/SPARK-43587
 Project: Spark
  Issue Type: Test
  Components: Spark Core, Tests
Affects Versions: 3.5.0
Reporter: Dongjoon Hyun
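
A hedged sbt sketch of the general mechanism for forking one suite into its own JVM (Spark's build has its own per-suite grouping logic in project/SparkBuild.scala; the snippet below is a generic illustration, not the actual change):

{code:scala}
// build.sbt fragment: run HealthTrackerIntegrationSuite in a dedicated forked JVM,
// keep every other suite in-process.
Test / testGrouping := (Test / definedTests).value.map { test =>
  if (test.name.endsWith("HealthTrackerIntegrationSuite")) {
    Tests.Group(test.name, Seq(test), Tests.SubProcess(ForkOptions()))
  } else {
    Tests.Group(test.name, Seq(test), Tests.InProcess)
  }
}
{code}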






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43530) Protobuf: Read descriptor file only once at the compile time

2023-05-18 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724104#comment-17724104
 ] 

Snoot.io commented on SPARK-43530:
--

User 'rangadi' has created a pull request for this issue:
https://github.com/apache/spark/pull/41192

> Protobuf: Read descriptor file only once at the compile time
> 
>
> Key: SPARK-43530
> URL: https://issues.apache.org/jira/browse/SPARK-43530
> Project: Spark
>  Issue Type: Task
>  Components: Protobuf
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Priority: Major
> Fix For: 3.5.0
>
>
> Protobuf functions read from the descriptor file many times (e.g. on each 
> executor). This is unnecessary and error prone (e.g. what if the contents 
> change a couple of days after the streaming query starts?).
>  
> It only needs to be read once. 
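
A hedged sketch of the intent (the path and wiring below are assumptions, not the actual change): read the descriptor file once on the driver and carry the bytes with the plan, so executors never touch the path again.

{code:scala}
import java.nio.file.{Files, Paths}

// Read the descriptor exactly once, on the driver; the path is a placeholder.
val descriptorBytes: Array[Byte] =
  Files.readAllBytes(Paths.get("/path/to/events.desc"))

// The bytes (rather than the file path) can then be captured by the protobuf
// expression and serialized with it, so later changes to the file cannot
// affect a running streaming query.
{code}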



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43510) Spark application hangs when YarnAllocator adds running executors after processing completed containers

2023-05-18 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724103#comment-17724103
 ] 

Snoot.io commented on SPARK-43510:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/41173

> Spark application hangs when YarnAllocator adds running executors after 
> processing completed containers
> ---
>
> Key: SPARK-43510
> URL: https://issues.apache.org/jira/browse/SPARK-43510
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.4.0
>Reporter: Manu Zhang
>Priority: Major
>
> I see the application hang when containers are preempted immediately after 
> allocation, as in the following log.
> {code:java}
> 23/05/14 09:11:33 INFO YarnAllocator: Launching container 
> container_e3812_1684033797982_57865_01_000382 on host 
> hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with 
> ID 277 for ResourceProfile Id 0 
> 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
> container_e3812_1684033797982_57865_01_000382
> 23/05/14 09:11:33 INFO YarnAllocator: Completed container 
> container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
> -102)
> 23/05/14 09:11:33 INFO YarnAllocator: Container 
> container_e3812_1684033797982_57865_01_000382 was preempted.{code}
> Note the warning log: YarnAllocator cannot find the executorId for the 
> container while processing the completed containers. The only plausible cause 
> is that YarnAllocator registered the running executor only after it had 
> processed the completed containers; that registration happens in a separate 
> thread after the executor launch. YarnAllocator therefore still believes the 
> executors are running, although they were already lost to preemption, and the 
> application hangs without any running executors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-18 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724099#comment-17724099
 ] 

Yuming Wang commented on SPARK-43526:
-

[~caican] Thank you for the investigation.

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.2
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, 
> image-2023-05-17-16-54-59-053.png, image-2023-05-19-10-43-51-747.png, 
> shuffle1.png, sort1.png, sort2.png
>
>
> Testing with a 5TB dataset, the performance of q95 in TPC-DS deteriorates when 
> shuffle hash join is enabled; performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9 min (sortMergeJoin) to 
> 8.1 min (shuffledHashJoin)
>  
> 1. With shuffledHashJoin enabled, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. With shuffledHashJoin disabled, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, GC is very severe,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it? Thanks!
>  
>  
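
For reference, a small sketch of the configuration typically used to flip between the two join strategies in such comparisons (illustrative; the exact settings used in this benchmark are not stated in the ticket):

{code:scala}
// Prefer shuffled hash join (when the build side passes the planner's size checks)...
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
// ...and switch back to sort merge join for the baseline run.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")
{code}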



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43516) Basic estimator / transformer / model / evaluator interfaces

2023-05-18 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724098#comment-17724098
 ] 

Snoot.io commented on SPARK-43516:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/41176

> Basic estimator / transformer / model / evaluator interfaces
> 
>
> Key: SPARK-43516
> URL: https://issues.apache.org/jira/browse/SPARK-43516
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43585) Spark Connect client cannot read from Hive metastore

2023-05-18 Thread roland (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

roland updated SPARK-43585:
---
Description: 
I created a Spark Connect shell in a pod using the following yaml.
{code:java}
apiVersion: v1
kind: Service
metadata:
  name: spark-connect-svc
  namespace: MY_NAMESPACE
spec:
  clusterIP: None
  selector:
    app: spark-connect-pod
    podType: spark-connect-driver


apiVersion: v1
kind: Pod
metadata:
  name: spark-connect-pod
  namespace: realtime-streaming
  labels:
    app: spark-connect-pod
    podType: spark-connect-driver
spec:
  restartPolicy: Never
  containers:
  - command:
    - sh
    - -c
    - /opt/spark/sbin/start-connect-server.sh --master 
k8s://https://MY_API_SERVER:443 --packages 
org.apache.spark:spark-connect_2.12:3.4.0 --conf 
spark.kubernetes.executor.limit.cores=1.0 --conf 
spark.kubernetes.executor.request.cores=1.0 --conf spark.executor.cores=1 
--conf spark.executor.memory=6G --conf 
spark.kubernetes.container.image=MY_ECR_REPO/spark:3.4-prd  --conf 
spark.kubernetes.executor.podNamePrefix=spark-connect --num-executors=10 --conf 
spark.kubernetes.driver.pod.name=spark-connect-pod --conf 
spark.kubernetes.namespace=MY_NAMESPACE && tail -100f 
/opt/spark/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-spark-connect-pod.out
    image: MY_ECR_REPO/spark-py:3.4-prd
    name: spark-connect-pod
 {code}
The Spark Connect server was successfully launched and I can connect to it 
using pyspark.

 

But when I want to add a Hive metastore config, it won't work.

 
{code:java}
>>> spark = 
>>> SparkSession.builder.remote("sc://spark-connect-svc").config("spark.hive.metastore.uris",
>>>  "thrift://hive-metastore:9083").getOrCreate()
>>> spark.sql("show databases").show()
+-+
|namespace|
+-+
|  default|
+-+{code}
 

 

  was:
I created a Spark Connect shell in a pod using the following yaml.
{code:java}
apiVersion: v1
kind: Service
metadata:
  name: spark-connect-svc
  namespace: MY_NAMESPACE
spec:
  clusterIP: None
  selector:
    app: spark-connect-pod
    podType: spark-connect-driver
apiVersion: v1
kind: Pod
metadata:
  name: spark-connect-pod
  namespace: realtime-streaming
  labels:
    app: spark-connect-pod
    podType: spark-connect-driver
spec:
  restartPolicy: Never
  containers:
  - command:
    - sh
    - -c
    - /opt/spark/sbin/start-connect-server.sh --master 
k8s://https://MY_API_SERVER:443 --packages 
org.apache.spark:spark-connect_2.12:3.4.0 --conf 
spark.kubernetes.executor.limit.cores=1.0 --conf 
spark.kubernetes.executor.request.cores=1.0 --conf spark.executor.cores=1 
--conf spark.executor.memory=6G --conf 
spark.kubernetes.container.image=MY_ECR_REPO/spark:3.4-prd  --conf 
spark.kubernetes.executor.podNamePrefix=spark-connect --num-executors=10 --conf 
spark.kubernetes.driver.pod.name=spark-connect-pod --conf 
spark.kubernetes.namespace=MY_NAMESPACE && tail -100f 
/opt/spark/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-spark-connect-pod.out
    image: MY_ECR_REPO/spark-py:3.4-prd
    name: spark-connect-pod
 {code}
The Spark Connect server was successfully launched and I can connect to it 
using pyspark.

 

But when I want to add a Hive metastore config , it won't work.

 
{code:java}
>>> spark = 
>>> SparkSession.builder.remote("sc://spark-connect-svc").config("spark.hive.metastore.uris",
>>>  "thrift://hive-metastore:9083").getOrCreate()
>>> spark.sql("show databases").show()
+-+
|namespace|
+-+
|  default|
+-+{code}
 

 


> Spark Connect client cannot read from Hive metastore
> 
>
> Key: SPARK-43585
> URL: https://issues.apache.org/jira/browse/SPARK-43585
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: roland
>Priority: Major
>
> I created a Spark Connect shell in a pod using the following yaml.
> {code:java}
> apiVersion: v1
> kind: Service
> metadata:
>   name: spark-connect-svc
>   namespace: MY_NAMESPACE
> spec:
>   clusterIP: None
>   selector:
>     app: spark-connect-pod
>     podType: spark-connect-driver
> apiVersion: v1
> kind: Pod
> metadata:
>   name: spark-connect-pod
>   namespace: realtime-streaming
>   labels:
>     app: spark-connect-pod
>     podType: spark-connect-driver
> spec:
>   restartPolicy: Never
>   containers:
>   - command:
>     - sh
>     - -c
>     - /opt/spark/sbin/start-connect-server.sh --master 
> k8s://https://MY_API_SERVER:443 --packages 
> org.apache.spark:spark-connect_2.12:3.4.0 --conf 
> spark.kubernetes.executor.limit.cores=1.0 --conf 
> spark.kubernetes.executor.request.cores=1.0 --conf spark.executor.cores=1 
> --conf spark.executor.memory=6G --conf 
> spark.kubernetes.container.image=MY_ECR_REPO/spark:3.4-prd  --conf 
> spark.kubernetes.ex

[jira] [Updated] (SPARK-43585) Spark Connect client cannot read from Hive metastore

2023-05-18 Thread roland (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

roland updated SPARK-43585:
---
Description: 
I created a Spark Connect shell in a pod using the following yaml.
{code:java}
apiVersion: v1
kind: Service
metadata:
  name: spark-connect-svc
  namespace: MY_NAMESPACE
spec:
  clusterIP: None
  selector:
    app: spark-connect-pod
    podType: spark-connect-driver


apiVersion: v1
kind: Pod
metadata:
  name: spark-connect-pod
  namespace: realtime-streaming
  labels:
    app: spark-connect-pod
    podType: spark-connect-driver
spec:
  restartPolicy: Never
  containers:
  - command:
    - sh
    - -c
    - /opt/spark/sbin/start-connect-server.sh --master 
k8s://https://MY_API_SERVER:443 --packages 
org.apache.spark:spark-connect_2.12:3.4.0 --conf 
spark.kubernetes.executor.limit.cores=1.0 --conf 
spark.kubernetes.executor.request.cores=1.0 --conf spark.executor.cores=1 
--conf spark.executor.memory=6G --conf 
spark.kubernetes.container.image=MY_ECR_REPO/spark:3.4-prd  --conf 
spark.kubernetes.executor.podNamePrefix=spark-connect --num-executors=10 --conf 
spark.kubernetes.driver.pod.name=spark-connect-pod --conf 
spark.kubernetes.namespace=MY_NAMESPACE && tail -100f 
/opt/spark/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-spark-connect-pod.out
    image: MY_ECR_REPO/spark-py:3.4-prd
    name: spark-connect-pod
 {code}
The Spark Connect server was successfully launched and I can connect to it 
using pyspark.

 

But when I want to add a Hive metastore config, it won't work.

 
{code:java}
>>> spark = 
>>> SparkSession.builder.remote("sc://spark-connect-svc").config("spark.hive.metastore.uris",
>>>  "thrift://hive-metastore:9083").getOrCreate()
>>> spark.sql("show databases").show()
+-+
|namespace|
+-+
|  default|
+-+{code}
There are many databases under the Hive metastore, and I've tested with the local 
env; everything works fine. 

 

  was:
I created a Spark Connect shell in a pod using the following yaml.
{code:java}
apiVersion: v1
kind: Service
metadata:
  name: spark-connect-svc
  namespace: MY_NAMESPACE
spec:
  clusterIP: None
  selector:
    app: spark-connect-pod
    podType: spark-connect-driver


apiVersion: v1
kind: Pod
metadata:
  name: spark-connect-pod
  namespace: realtime-streaming
  labels:
    app: spark-connect-pod
    podType: spark-connect-driver
spec:
  restartPolicy: Never
  containers:
  - command:
    - sh
    - -c
    - /opt/spark/sbin/start-connect-server.sh --master 
k8s://https://MY_API_SERVER:443 --packages 
org.apache.spark:spark-connect_2.12:3.4.0 --conf 
spark.kubernetes.executor.limit.cores=1.0 --conf 
spark.kubernetes.executor.request.cores=1.0 --conf spark.executor.cores=1 
--conf spark.executor.memory=6G --conf 
spark.kubernetes.container.image=MY_ECR_REPO/spark:3.4-prd  --conf 
spark.kubernetes.executor.podNamePrefix=spark-connect --num-executors=10 --conf 
spark.kubernetes.driver.pod.name=spark-connect-pod --conf 
spark.kubernetes.namespace=MY_NAMESPACE && tail -100f 
/opt/spark/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-spark-connect-pod.out
    image: MY_ECR_REPO/spark-py:3.4-prd
    name: spark-connect-pod
 {code}
The Spark Connect server was successfully launched and I can connect to it 
using pyspark.

 

But when I want to add a Hive metastore config , it won't work.

 
{code:java}
>>> spark = 
>>> SparkSession.builder.remote("sc://spark-connect-svc").config("spark.hive.metastore.uris",
>>>  "thrift://hive-metastore:9083").getOrCreate()
>>> spark.sql("show databases").show()
+-+
|namespace|
+-+
|  default|
+-+{code}
 

 


> Spark Connect client cannot read from Hive metastore
> 
>
> Key: SPARK-43585
> URL: https://issues.apache.org/jira/browse/SPARK-43585
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: roland
>Priority: Major
>
> I created a Spark Connect shell in a pod using the following yaml.
> {code:java}
> apiVersion: v1
> kind: Service
> metadata:
>   name: spark-connect-svc
>   namespace: MY_NAMESPACE
> spec:
>   clusterIP: None
>   selector:
>     app: spark-connect-pod
>     podType: spark-connect-driver
> apiVersion: v1
> kind: Pod
> metadata:
>   name: spark-connect-pod
>   namespace: realtime-streaming
>   labels:
>     app: spark-connect-pod
>     podType: spark-connect-driver
> spec:
>   restartPolicy: Never
>   containers:
>   - command:
>     - sh
>     - -c
>     - /opt/spark/sbin/start-connect-server.sh --master 
> k8s://https://MY_API_SERVER:443 --packages 
> org.apache.spark:spark-connect_2.12:3.4.0 --conf 
> spark.kubernetes.executor.limit.cores=1.0 --conf 
> spark.kubernetes.executor.request.cores=1.0 --conf spark.executor.cores=1 
> --conf spark.executor.memory

[jira] [Created] (SPARK-43585) Spark Connect client cannot read from Hive metastore

2023-05-18 Thread roland (Jira)
roland created SPARK-43585:
--

 Summary: Spark Connect client cannot read from Hive metastore
 Key: SPARK-43585
 URL: https://issues.apache.org/jira/browse/SPARK-43585
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: roland


I created a Spark Connect shell in a pod using the following yaml.
{code:java}
apiVersion: v1
kind: Service
metadata:
  name: spark-connect-svc
  namespace: MY_NAMESPACE
spec:
  clusterIP: None
  selector:
    app: spark-connect-pod
    podType: spark-connect-driver-
apiVersion: v1
kind: Pod
metadata:
  name: spark-connect-pod
  namespace: realtime-streaming
  labels:
    app: spark-connect-pod
    podType: spark-connect-driver
spec:
  restartPolicy: Never
  containers:
  - command:
    - sh
    - -c
    - /opt/spark/sbin/start-connect-server.sh --master 
k8s://https://MY_API_SERVER:443 --packages 
org.apache.spark:spark-connect_2.12:3.4.0 --conf 
spark.kubernetes.executor.limit.cores=1.0 --conf 
spark.kubernetes.executor.request.cores=1.0 --conf spark.executor.cores=1 
--conf spark.executor.memory=6G --conf 
spark.kubernetes.container.image=MY_ECR_REPO/spark:3.4-prd  --conf 
spark.kubernetes.executor.podNamePrefix=spark-connect --num-executors=10 --conf 
spark.kubernetes.driver.pod.name=spark-connect-pod --conf 
spark.kubernetes.namespace=MY_NAMESPACE && tail -100f 
/opt/spark/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-spark-connect-pod.out
    image: MY_ECR_REPO/spark-py:3.4-prd
    name: spark-connect-pod
 {code}
The Spark Connect server was successfully launched and I can connect to it 
using pyspark.

 

But when I want to add a Hive metastore config, it won't work.

 
{code:java}
>>> spark = 
>>> SparkSession.builder.remote("sc://spark-connect-svc").config("spark.hive.metastore.uris",
>>>  "thrift://hive-metastore:9083").getOrCreate()
>>> spark.sql("show databases").show()
+-+
|namespace|
+-+
|  default|
+-+{code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43585) Spark Connect client cannot read from Hive metastore

2023-05-18 Thread roland (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

roland updated SPARK-43585:
---
Description: 
I created a Spark Connect shell in a pod using the following yaml.
{code:java}
apiVersion: v1
kind: Service
metadata:
  name: spark-connect-svc
  namespace: MY_NAMESPACE
spec:
  clusterIP: None
  selector:
    app: spark-connect-pod
    podType: spark-connect-driver
apiVersion: v1
kind: Pod
metadata:
  name: spark-connect-pod
  namespace: realtime-streaming
  labels:
    app: spark-connect-pod
    podType: spark-connect-driver
spec:
  restartPolicy: Never
  containers:
  - command:
    - sh
    - -c
    - /opt/spark/sbin/start-connect-server.sh --master 
k8s://https://MY_API_SERVER:443 --packages 
org.apache.spark:spark-connect_2.12:3.4.0 --conf 
spark.kubernetes.executor.limit.cores=1.0 --conf 
spark.kubernetes.executor.request.cores=1.0 --conf spark.executor.cores=1 
--conf spark.executor.memory=6G --conf 
spark.kubernetes.container.image=MY_ECR_REPO/spark:3.4-prd  --conf 
spark.kubernetes.executor.podNamePrefix=spark-connect --num-executors=10 --conf 
spark.kubernetes.driver.pod.name=spark-connect-pod --conf 
spark.kubernetes.namespace=MY_NAMESPACE && tail -100f 
/opt/spark/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-spark-connect-pod.out
    image: MY_ECR_REPO/spark-py:3.4-prd
    name: spark-connect-pod
 {code}
The Spark Connect server was successfully launched and I can connect to it 
using pyspark.

 

But when I want to add a Hive metastore config, it won't work.

 
{code:java}
>>> spark = 
>>> SparkSession.builder.remote("sc://spark-connect-svc").config("spark.hive.metastore.uris",
>>>  "thrift://hive-metastore:9083").getOrCreate()
>>> spark.sql("show databases").show()
+-+
|namespace|
+-+
|  default|
+-+{code}
 

 

  was:
I created a Spark Connect shell in a pod using the following yaml.
{code:java}
apiVersion: v1
kind: Service
metadata:
  name: spark-connect-svc
  namespace: MY_NAMESPACE
spec:
  clusterIP: None
  selector:
    app: spark-connect-pod
    podType: spark-connect-driver-
apiVersion: v1
kind: Pod
metadata:
  name: spark-connect-pod
  namespace: realtime-streaming
  labels:
    app: spark-connect-pod
    podType: spark-connect-driver
spec:
  restartPolicy: Never
  containers:
  - command:
    - sh
    - -c
    - /opt/spark/sbin/start-connect-server.sh --master 
k8s://https://MY_API_SERVER:443 --packages 
org.apache.spark:spark-connect_2.12:3.4.0 --conf 
spark.kubernetes.executor.limit.cores=1.0 --conf 
spark.kubernetes.executor.request.cores=1.0 --conf spark.executor.cores=1 
--conf spark.executor.memory=6G --conf 
spark.kubernetes.container.image=MY_ECR_REPO/spark:3.4-prd  --conf 
spark.kubernetes.executor.podNamePrefix=spark-connect --num-executors=10 --conf 
spark.kubernetes.driver.pod.name=spark-connect-pod --conf 
spark.kubernetes.namespace=MY_NAMESPACE && tail -100f 
/opt/spark/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-spark-connect-pod.out
    image: MY_ECR_REPO/spark-py:3.4-prd
    name: spark-connect-pod
 {code}
The Spark Connect server was successfully launched and I can connect to it 
using pyspark.

 

But when I want to add a Hive metastore config , it won't work.

 
{code:java}
>>> spark = 
>>> SparkSession.builder.remote("sc://spark-connect-svc").config("spark.hive.metastore.uris",
>>>  "thrift://hive-metastore:9083").getOrCreate()
>>> spark.sql("show databases").show()
+-+
|namespace|
+-+
|  default|
+-+{code}
 

 


> Spark Connect client cannot read from Hive metastore
> 
>
> Key: SPARK-43585
> URL: https://issues.apache.org/jira/browse/SPARK-43585
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: roland
>Priority: Major
>
> I created a Spark Connect shell in a pod using the following yaml.
> {code:java}
> apiVersion: v1
> kind: Service
> metadata:
>   name: spark-connect-svc
>   namespace: MY_NAMESPACE
> spec:
>   clusterIP: None
>   selector:
>     app: spark-connect-pod
>     podType: spark-connect-driver
> apiVersion: v1
> kind: Pod
> metadata:
>   name: spark-connect-pod
>   namespace: realtime-streaming
>   labels:
>     app: spark-connect-pod
>     podType: spark-connect-driver
> spec:
>   restartPolicy: Never
>   containers:
>   - command:
>     - sh
>     - -c
>     - /opt/spark/sbin/start-connect-server.sh --master 
> k8s://https://MY_API_SERVER:443 --packages 
> org.apache.spark:spark-connect_2.12:3.4.0 --conf 
> spark.kubernetes.executor.limit.cores=1.0 --conf 
> spark.kubernetes.executor.request.cores=1.0 --conf spark.executor.cores=1 
> --conf spark.executor.memory=6G --conf 
> spark.kubernetes.container.image=MY_ECR_REPO/spark:3.4-prd  --conf 
> 

[jira] [Commented] (SPARK-43583) When encryption is enabled on the External Shuffle Service, then processing of push meta requests throws NPE

2023-05-18 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724095#comment-17724095
 ] 

Snoot.io commented on SPARK-43583:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/41225

> When encryption is enabled on the External Shuffle Service, then processing 
> of push meta requests throws NPE
> 
>
> Key: SPARK-43583
> URL: https://issues.apache.org/jira/browse/SPARK-43583
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Chandni Singh
>Priority: Major
>
> After enabling support for over-the-wire encryption for spark shuffle 
> services, the meta requests for push-merged blocks fail with this error:
> {code:java}
> java.lang.RuntimeException: java.lang.NullPointerException
>   at 
> org.apache.spark.network.server.AbstractAuthRpcHandler.getMergedBlockMetaReqHandler(AbstractAuthRpcHandler.java:110)
>   at 
> org.apache.spark.network.crypto.AuthRpcHandler.getMergedBlockMetaReqHandler(AuthRpcHandler.java:144)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processMergedBlockMetaRequest(TransportRequestHandler.java:275)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:117)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
>   at 
> org.sparkproject.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> org.sparkproject.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
>  
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43572) Add a test for scrollable result set through thrift server

2023-05-18 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724094#comment-17724094
 ] 

Snoot.io commented on SPARK-43572:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/41213

> Add a test for scrollable result set through thrift server
> --
>
> Key: SPARK-43572
> URL: https://issues.apache.org/jira/browse/SPARK-43572
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Priority: Major
>
> improve jdbc server test coverage



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43572) Add a test for scrollable result set through thrift server

2023-05-18 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724092#comment-17724092
 ] 

Snoot.io commented on SPARK-43572:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/41213

> Add a test for scrollable result set through thrift server
> --
>
> Key: SPARK-43572
> URL: https://issues.apache.org/jira/browse/SPARK-43572
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Priority: Major
>
> improve jdbc server test coverage



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43519) Bump Parquet to 1.13.1

2023-05-18 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724091#comment-17724091
 ] 

Snoot.io commented on SPARK-43519:
--

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/41178

> Bump Parquet to 1.13.1
> --
>
> Key: SPARK-43519
> URL: https://issues.apache.org/jira/browse/SPARK-43519
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43577) Upgrade cyclonedx-maven-plugin to 2.7.9

2023-05-18 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-43577.
-
Resolution: Won't Fix

> Upgrade cyclonedx-maven-plugin to 2.7.9
> ---
>
> Key: SPARK-43577
> URL: https://issues.apache.org/jira/browse/SPARK-43577
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>
> https://github.com/CycloneDX/cyclonedx-maven-plugin/issues/326
> {noformat}
> Error:  Failed to execute goal 
> org.cyclonedx:cyclonedx-maven-plugin:2.7.6:makeBom (default) on project 
> spark-tags_2.12: Execution default of goal 
> org.cyclonedx:cyclonedx-maven-plugin:2.7.6:makeBom failed: Unsupported class 
> file major version 64 -> [Help 1]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43519) Bump Parquet to 1.13.1

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43519:
--
Summary: Bump Parquet to 1.13.1  (was: Bump Parquet 1.13.1)

> Bump Parquet to 1.13.1
> --
>
> Key: SPARK-43519
> URL: https://issues.apache.org/jira/browse/SPARK-43519
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43519) Bump Parquet 1.13.1

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43519.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41178
[https://github.com/apache/spark/pull/41178]

> Bump Parquet 1.13.1
> ---
>
> Key: SPARK-43519
> URL: https://issues.apache.org/jira/browse/SPARK-43519
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43519) Bump Parquet 1.13.1

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43519:
-

Assignee: Cheng Pan

> Bump Parquet 1.13.1
> ---
>
> Key: SPARK-43519
> URL: https://issues.apache.org/jira/browse/SPARK-43519
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43584) Update some sbt plugins

2023-05-18 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43584:
---

 Summary: Update some sbt plugins
 Key: SPARK-43584
 URL: https://issues.apache.org/jira/browse/SPARK-43584
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-18 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Affects Version/s: 3.3.2

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.2
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, 
> image-2023-05-17-16-54-59-053.png, image-2023-05-19-10-43-51-747.png, 
> shuffle1.png, sort1.png, sort2.png
>
>
> Testing with a 5TB dataset, the performance of q95 in TPC-DS deteriorates when 
> shuffle hash join is enabled; performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9 min (sortMergeJoin) to 
> 8.1 min (shuffledHashJoin)
>  
> 1. With shuffledHashJoin enabled, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. With shuffledHashJoin disabled, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, GC is very severe,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it? Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-18 Thread caican (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724080#comment-17724080
 ] 

caican edited comment on SPARK-43526 at 5/19/23 2:51 AM:
-

gently ping [~yumwang] 

I find that the shuffle hash join is slower than the sort merge join because 
a sort node is added after the two shuffle hash joins, and the number of rows 
produced by the two shuffle hash joins expands a lot.

 

I rewrote q95: after disabling shuffle hash join and adding a sort operation 
after the corresponding join nodes, q95 execution also became slow.

 

1. The execution plan before I rewrite q95 sql is as follows:

*Sort merge join*

!sort1.png|width=926,height=473!

*shuffle hash join*

!shuffle1.png|width=921,height=441!

 

2. The execution plan after I rewrite q95 sql is as follows:

*sort merge join*

!sort2.png|width=936,height=496!

 

The sort operation was added after the corresponding join nodes, and the 
execution became slower than with shuffle hash join.

So it can be confirmed that the performance deteriorates once shuffle 
hash join is enabled because a large amount of data has to be sorted.

!image-2023-05-19-10-43-51-747.png|width=708,height=38!

 

*q95 sql with sort operation added*

 
{code:java}
 
set 
spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";

set spark.sql.execution.removeRedundantSorts=false;

WITH
ws_wh AS (
SELECT
ws1.ws_order_number,
ws1.ws_warehouse_sk wh1,
ws2.ws_warehouse_sk wh2
FROM
web_sales ws1,
web_sales ws2
WHERE
ws1.ws_order_number=ws2.ws_order_number
AND ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk
SORT BY
ws1.ws_order_number
),
tmp1 as (
SELECT
ws_order_number
FROM
ws_wh
),
tmp2 as (
SELECT
wr_order_number
FROM
web_returns,
ws_wh
WHERE
wr_order_number=ws_wh.ws_order_number
SORT BY
wr_order_number
)
SELECT
count(DISTINCT ws_order_number) AS `order count `,
sum(ws_ext_ship_cost) AS `total shipping cost `,
sum(ws_net_profit) AS `total net profit `
FROM
web_sales ws1
left semi join tmp1 on ws1.ws_order_number=tmp1.ws_order_number
left semi join tmp2 on ws1.ws_order_number=tmp2.wr_order_number
join date_dim on ws1.ws_ship_date_sk=date_dim.d_date_sk
join customer_address on ws1.ws_ship_addr_sk=customer_address.ca_address_sk
join web_site on ws1.ws_web_site_sk=web_site.web_site_sk
WHERE
d_date BETWEEN '1999-02-01' AND (CAST('1999-02-01' AS DATE)+INTERVAL 60 DAY)
AND ws1.ws_ship_date_sk=d_date_sk
AND ws1.ws_ship_addr_sk=ca_address_sk
AND ca_state='IL'
AND ws1.ws_web_site_sk=web_site_sk
AND web_company_name='pri'
ORDER BY
count(DISTINCT ws_order_number)
LIMIT
100{code}
 


was (Author: JIRAUSER280464):
I find that the shuffle hash join is slower than the sort merge join because 
the sort node is added after two shuffle hash joins, and the number of data 
bars of the two shuffle hash joins expands a lot.

I overwrote q95, after closing shuffle hash join and adding sort operation 
after corresponding join nodes, q95 execution also became slow.

 

1. The execution plan before I rewrite q95 sql is as follows:

*Sort merge join*

!sort1.png|width=926,height=473!

*shuffle hash join*

!shuffle1.png|width=921,height=441!

 

2. The execution plan after I rewrite q95 sql is as follows:

*sort merge join*

!sort2.png|width=936,height=496!

 

The sort operation was added after the corresponding join nodes, and the 
execution was slower than shuffle hash join.

And it can be confirmed that the performance deteriorates after the shuffle 
hash join function is enabled because a large amount of data is sorted.

!image-2023-05-19-10-43-51-747.png|width=708,height=38!

 

*q95 sql with sort operation added*

 
{code:java}
 
set 
spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";

set spark.sql.execution.removeRedundantSorts=false;

WITH
ws_wh AS (
SELECT
ws1.ws_order_number,
ws1.ws_warehouse_sk wh1,
ws2.ws_warehouse_sk wh2
FROM
web_sales ws1,
web_sales ws2
WHERE
ws1.ws_order_number=ws2.ws_order_number
AND ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk
SORT BY
ws1.ws_order_number
),
tmp1 as (
SELECT
ws_order_number
FROM
ws_wh
),
tmp2 as (
SELECT
wr_order_number
FROM
web_returns,
ws_wh
WHERE
wr_order_number=ws_wh.ws_order_number
SORT BY
wr_order_number
)
SELECT
count(DISTINCT w

[jira] [Comment Edited] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-18 Thread caican (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724080#comment-17724080
 ] 

caican edited comment on SPARK-43526 at 5/19/23 2:49 AM:
-

I find that the shuffle hash join is slower than the sort merge join because 
a sort node is added after the two shuffle hash joins, and the number of rows 
produced by the two shuffle hash joins expands a lot.

I rewrote q95: after disabling shuffle hash join and adding a sort operation 
after the corresponding join nodes, q95 execution also became slow.

 

1. The execution plan before I rewrite q95 sql is as follows:

*Sort merge join*

!sort1.png|width=926,height=473!

*shuffle hash join*

!shuffle1.png|width=921,height=441!

 

2. The execution plan after I rewrite q95 sql is as follows:

*sort merge join*

!sort2.png|width=936,height=496!

 

The sort operation was added after the corresponding join nodes, and the 
execution became slower than with shuffle hash join.

So it can be confirmed that the performance deteriorates once shuffle 
hash join is enabled because a large amount of data has to be sorted.

!image-2023-05-19-10-43-51-747.png|width=708,height=38!

 

*q95 sql with sort operation added*

 
{code:java}
 
set 
spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";

set spark.sql.execution.removeRedundantSorts=false;

WITH
ws_wh AS (
SELECT
ws1.ws_order_number,
ws1.ws_warehouse_sk wh1,
ws2.ws_warehouse_sk wh2
FROM
web_sales ws1,
web_sales ws2
WHERE
ws1.ws_order_number=ws2.ws_order_number
AND ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk
SORT BY
ws1.ws_order_number
),
tmp1 as (
SELECT
ws_order_number
FROM
ws_wh
),
tmp2 as (
SELECT
wr_order_number
FROM
web_returns,
ws_wh
WHERE
wr_order_number=ws_wh.ws_order_number
SORT BY
wr_order_number
)
SELECT
count(DISTINCT ws_order_number) AS `order count `,
sum(ws_ext_ship_cost) AS `total shipping cost `,
sum(ws_net_profit) AS `total net profit `
FROM
web_sales ws1
left semi join tmp1 on ws1.ws_order_number=tmp1.ws_order_number
left semi join tmp2 on ws1.ws_order_number=tmp2.wr_order_number
join date_dim on ws1.ws_ship_date_sk=date_dim.d_date_sk
join customer_address on ws1.ws_ship_addr_sk=customer_address.ca_address_sk
join web_site on ws1.ws_web_site_sk=web_site.web_site_sk
WHERE
d_date BETWEEN '1999-02-01' AND (CAST('1999-02-01' AS DATE)+INTERVAL 60 DAY)
AND ws1.ws_ship_date_sk=d_date_sk
AND ws1.ws_ship_addr_sk=ca_address_sk
AND ca_state='IL'
AND ws1.ws_web_site_sk=web_site_sk
AND web_company_name='pri'
ORDER BY
count(DISTINCT ws_order_number)
LIMIT
100{code}
 


was (Author: JIRAUSER280464):
I found that the shuffle hash join is slower than the sort merge join because a
sort node is added after the two shuffle hash joins, and the number of records
produced by those two joins expands a lot.

I rewrote q95: after disabling shuffle hash join and adding a sort operation
after the corresponding join nodes, q95 execution also became slow.

 

1. The execution plan before I rewrite q95 sql is as follows:



*Sort merge join*

!sort1.png|width=926,height=473!

*shuffle hash join*

!shuffle1.png|width=921,height=441!

 

2. The execution plan after I rewrite q95 sql is as follows:

*sort merge join*

!sort2.png|width=936,height=496!

 

A sort operation was added after the corresponding join nodes, and the
execution was slower than with shuffle hash join.

This confirms that the performance deterioration after enabling shuffle hash
join is caused by sorting a large amount of data.

!image-2023-05-19-10-43-51-747.png|width=708,height=38!

 

*q95 sql with sort operation added*

 
{code:java}
 
set 
spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";

set spark.sql.execution.removeRedundantSorts=false;

WITH
ws_wh AS (
SELECT
ws1.ws_order_number,
ws1.ws_warehouse_sk wh1,
ws2.ws_warehouse_sk wh2
FROM
web_sales ws1,
web_sales ws2
WHERE
ws1.ws_order_number=ws2.ws_order_number
AND ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk
SORT BY
ws1.ws_order_number
),
tmp1 as (
SELECT
ws_order_number
FROM
ws_wh
),
tmp2 as (
SELECT
wr_order_number
FROM
web_returns,
ws_wh
WHERE
wr_order_number=ws_wh.ws_order_number
SORT BY
wr_order_number
)
SELECT
count(DISTINCT ws_order_number) AS `order 

[jira] [Comment Edited] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-18 Thread caican (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724080#comment-17724080
 ] 

caican edited comment on SPARK-43526 at 5/19/23 2:48 AM:
-

I found that the shuffle hash join is slower than the sort merge join because a
sort node is added after the two shuffle hash joins, and the number of records
produced by those two joins expands a lot.

I rewrote q95: after disabling shuffle hash join and adding a sort operation
after the corresponding join nodes, q95 execution also became slow.

 

1. The execution plan before I rewrite q95 sql is as follows:



*Sort merge join*

!sort1.png|width=926,height=473!

*shuffle hash join*

!shuffle1.png|width=921,height=441!

 

2. The execution plan after I rewrite q95 sql is as follows:

*sort merge join*

!sort2.png|width=936,height=496!

 

A sort operation was added after the corresponding join nodes, and the
execution was slower than with shuffle hash join.

This confirms that the performance deterioration after enabling shuffle hash
join is caused by sorting a large amount of data.

!image-2023-05-19-10-43-51-747.png|width=708,height=38!

 

*q95 sql with sort operation added*

 
{code:java}
 
set 
spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";

set spark.sql.execution.removeRedundantSorts=false;

WITH
ws_wh AS (
SELECT
ws1.ws_order_number,
ws1.ws_warehouse_sk wh1,
ws2.ws_warehouse_sk wh2
FROM
web_sales ws1,
web_sales ws2
WHERE
ws1.ws_order_number=ws2.ws_order_number
AND ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk
SORT BY
ws1.ws_order_number
),
tmp1 as (
SELECT
ws_order_number
FROM
ws_wh
),
tmp2 as (
SELECT
wr_order_number
FROM
web_returns,
ws_wh
WHERE
wr_order_number=ws_wh.ws_order_number
SORT BY
wr_order_number
)
SELECT
count(DISTINCT ws_order_number) AS `order count `,
sum(ws_ext_ship_cost) AS `total shipping cost `,
sum(ws_net_profit) AS `total net profit `
FROM
web_sales ws1
left semi join tmp1 on ws1.ws_order_number=tmp1.ws_order_number
left semi join tmp2 on ws1.ws_order_number=tmp2.wr_order_number
join date_dim on ws1.ws_ship_date_sk=date_dim.d_date_sk
join customer_address on ws1.ws_ship_addr_sk=customer_address.ca_address_sk
join web_site on ws1.ws_web_site_sk=web_site.web_site_sk
WHERE
d_date BETWEEN '1999-02-01' AND (CAST('1999-02-01' AS DATE)+INTERVAL 60 DAY)
AND ws1.ws_ship_date_sk=d_date_sk
AND ws1.ws_ship_addr_sk=ca_address_sk
AND ca_state='IL'
AND ws1.ws_web_site_sk=web_site_sk
AND web_company_name='pri'
ORDER BY
count(DISTINCT ws_order_number)
LIMIT
100{code}
 


was (Author: JIRAUSER280464):
I found that the shuffle hash join is slower than the sort merge join because a
sort node is added after the two shuffle hash joins, and the number of records
produced by those two joins expands a lot.
I rewrote q95: after disabling shuffle hash join and adding a sort operation
after the corresponding join nodes, q95 execution also became slow.

1. The execution plan before I rewrite q95 sql is as follows:
**

*Sort merge join*

!sort1.png|width=926,height=473!

*shuffle hash join*

!shuffle1.png|width=921,height=441!

 

2. The execution plan after I rewrite q95 sql is as follows:

*sort merge join*

!sort2.png|width=936,height=496!

 

A sort operation was added after the corresponding join nodes, and the
execution was slower than with shuffle hash join.

This confirms that the performance deterioration after enabling shuffle hash
join is caused by sorting a large amount of data.

!image-2023-05-19-10-43-51-747.png|width=932,height=50!

 

 

 

*q95 sql with sort operation added*

 
{code:java}
 
set 
spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";

set spark.sql.execution.removeRedundantSorts=false;

WITH
ws_wh AS (
SELECT
ws1.ws_order_number,
ws1.ws_warehouse_sk wh1,
ws2.ws_warehouse_sk wh2
FROM
web_sales ws1,
web_sales ws2
WHERE
ws1.ws_order_number=ws2.ws_order_number
AND ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk
SORT BY
ws1.ws_order_number
),
tmp1 as (
SELECT
ws_order_number
FROM
ws_wh
),
tmp2 as (
SELECT
wr_order_number
FROM
web_returns,
ws_wh
WHERE
wr_order_number=ws_wh.ws_order_number
SORT BY
wr_order_number
)
SELECT
count(DISTINCT ws_order_number) AS `o

[jira] [Created] (SPARK-43583) When encryption is enabled on the External Shuffle Service, then processing of push meta requests throws NPE

2023-05-18 Thread Chandni Singh (Jira)
Chandni Singh created SPARK-43583:
-

 Summary: When encryption is enabled on the External Shuffle 
Service, then processing of push meta requests throws NPE
 Key: SPARK-43583
 URL: https://issues.apache.org/jira/browse/SPARK-43583
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.2.0
Reporter: Chandni Singh


After enabling support for over-the-wire encryption for spark shuffle services, 
the meta requests for push-merged blocks fail with this error:
{code:java}
java.lang.RuntimeException: java.lang.NullPointerException
at 
org.apache.spark.network.server.AbstractAuthRpcHandler.getMergedBlockMetaReqHandler(AbstractAuthRpcHandler.java:110)
at 
org.apache.spark.network.crypto.AuthRpcHandler.getMergedBlockMetaReqHandler(AuthRpcHandler.java:144)
at 
org.apache.spark.network.server.TransportRequestHandler.processMergedBlockMetaRequest(TransportRequestHandler.java:275)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:117)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at 
org.sparkproject.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.sparkproject.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
 
{code}
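
For context, here is a minimal PySpark sketch of the client-side settings that exercise this path (push-based shuffle against the external shuffle service with over-the-wire encryption). The application name and the job at the end are illustrative assumptions, not a verified reproduction recipe:
{code:python}
from pyspark.sql import SparkSession

# Illustrative assumption: an application combining push-based shuffle with
# over-the-wire encryption against an external shuffle service. The shuffle
# service side must also be configured for authentication/encryption.
spark = (
    SparkSession.builder
    .appName("push-merged-shuffle-with-encryption")
    .config("spark.shuffle.service.enabled", "true")  # rely on the external shuffle service
    .config("spark.shuffle.push.enabled", "true")     # push-based shuffle issues merge/meta requests
    .config("spark.authenticate", "true")             # SASL auth, required for network crypto
    .config("spark.network.crypto.enabled", "true")   # AES-based over-the-wire encryption
    .getOrCreate()
)

# Any shuffle-heavy job then routes MergedBlockMetaRequest messages through the
# encrypted channel handled by the auth RPC handler shown in the stack trace.
spark.range(0, 10_000_000).repartition(200).count()
{code}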



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-18 Thread caican (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724080#comment-17724080
 ] 

caican edited comment on SPARK-43526 at 5/19/23 2:47 AM:
-

I found that the shuffle hash join is slower than the sort merge join because a
sort node is added after the two shuffle hash joins, and the number of records
produced by those two joins expands a lot.
I rewrote q95: after disabling shuffle hash join and adding a sort operation
after the corresponding join nodes, q95 execution also became slow.

1. The execution plan before I rewrite q95 sql is as follows:
**

*Sort merge join*

!sort1.png|width=926,height=473!

*shuffle hash join*

!shuffle1.png|width=921,height=441!

 

2. The execution plan after I rewrite q95 sql is as follows:

*sort merge join*

!sort2.png|width=936,height=496!

 

A sort operation was added after the corresponding join nodes, and the
execution was slower than with shuffle hash join.

This confirms that the performance deterioration after enabling shuffle hash
join is caused by sorting a large amount of data.

!image-2023-05-19-10-43-51-747.png|width=932,height=50!

 

 

 

*q95 sql with sort operation added*

 
{code:java}
 
set 
spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";

set spark.sql.execution.removeRedundantSorts=false;

WITH
ws_wh AS (
SELECT
ws1.ws_order_number,
ws1.ws_warehouse_sk wh1,
ws2.ws_warehouse_sk wh2
FROM
web_sales ws1,
web_sales ws2
WHERE
ws1.ws_order_number=ws2.ws_order_number
AND ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk
SORT BY
ws1.ws_order_number
),
tmp1 as (
SELECT
ws_order_number
FROM
ws_wh
),
tmp2 as (
SELECT
wr_order_number
FROM
web_returns,
ws_wh
WHERE
wr_order_number=ws_wh.ws_order_number
SORT BY
wr_order_number
)
SELECT
count(DISTINCT ws_order_number) AS `order count `,
sum(ws_ext_ship_cost) AS `total shipping cost `,
sum(ws_net_profit) AS `total net profit `
FROM
web_sales ws1
left semi join tmp1 on ws1.ws_order_number=tmp1.ws_order_number
left semi join tmp2 on ws1.ws_order_number=tmp2.wr_order_number
join date_dim on ws1.ws_ship_date_sk=date_dim.d_date_sk
join customer_address on ws1.ws_ship_addr_sk=customer_address.ca_address_sk
join web_site on ws1.ws_web_site_sk=web_site.web_site_sk
WHERE
d_date BETWEEN '1999-02-01' AND (CAST('1999-02-01' AS DATE)+INTERVAL 60 DAY)
AND ws1.ws_ship_date_sk=d_date_sk
AND ws1.ws_ship_addr_sk=ca_address_sk
AND ca_state='IL'
AND ws1.ws_web_site_sk=web_site_sk
AND web_company_name='pri'
ORDER BY
count(DISTINCT ws_order_number)
LIMIT
100{code}
 


was (Author: JIRAUSER280464):
I found that the shuffle hash join is slower than the sort merge join because a
sort node is added after the two shuffle hash joins, and the number of records
produced by those two joins expands a lot.
I rewrote q95: after disabling shuffle hash join and adding a sort operation
after the corresponding join nodes, q95 execution also became slow.

The execution plan before I rewrite q95 sql is as follows:
```Sort merge join```

!sort1.png|width=926,height=473!

```shuffle hash join```

!shuffle1.png|width=921,height=441!

The execution plan after I rewrite q95 sql is as follows:

!sort2.png|width=936,height=496!

A sort operation was added after the corresponding join nodes, and the
execution was slower than with shuffle hash join.

This confirms that the performance deterioration after enabling shuffle hash
join is caused by sorting a large amount of data.

!image-2023-05-19-10-43-51-747.png|width=932,height=50!


q95 sql with sort operation added```

set 
spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";


set spark.sql.execution.removeRedundantSorts=false;


WITH ws_wh AS ( SELECT ws1.ws_order_number, ws1.ws_warehouse_sk wh1, 
ws2.ws_warehouse_sk wh2 FROM web_sales ws1, web_sales ws2 WHERE 
ws1.ws_order_number=ws2.ws_order_number AND 
ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk SORT BY ws1.ws_order_number ), tmp1 as 
( SELECT ws_order_number FROM ws_wh ), tmp2 as ( SELECT wr_order_number FROM 
web_returns, ws_wh WHERE wr_order_number=ws_wh.ws_order_number SORT BY 
wr_order_number ) SELECT count(DISTINCT ws_order_number) AS `order count `, 
sum(ws_ext_ship_cost) AS `total shipping cost `, sum(ws_net_profit) AS `total 
net profit ` FROM web_sales ws1 left semi join tmp1 on 
ws1.ws_order_number=tmp1.ws_order_number left semi join tmp2 on 
ws1.ws_order_number=tmp2.wr_order_number join date_dim on 
ws1.ws_ship_date_sk=date_dim.d_date_sk join customer_

[jira] [Resolved] (SPARK-43581) Upgrade `kubernetes-client` to 6.6.2

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43581.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41223
[https://github.com/apache/spark/pull/41223]

> Upgrade `kubernetes-client` to 6.6.2
> 
>
> Key: SPARK-43581
> URL: https://issues.apache.org/jira/browse/SPARK-43581
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Kubernetes
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43581) Upgrade `kubernetes-client` to 6.6.2

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43581:
-

Assignee: Dongjoon Hyun

> Upgrade `kubernetes-client` to 6.6.2
> 
>
> Key: SPARK-43581
> URL: https://issues.apache.org/jira/browse/SPARK-43581
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Kubernetes
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-18 Thread caican (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724080#comment-17724080
 ] 

caican commented on SPARK-43526:


I found that the shuffle hash join is slower than the sort merge join because a
sort node is added after the two shuffle hash joins, and the number of records
produced by those two joins expands a lot.
I rewrote q95: after disabling shuffle hash join and adding a sort operation
after the corresponding join nodes, q95 execution also became slow.

The execution plan before I rewrite q95 sql is as follows:
```Sort merge join```

!sort1.png|width=926,height=473!

```shuffle hash join```

!shuffle1.png|width=921,height=441!

The execution plan after I rewrite q95 sql is as follows:

!sort2.png|width=936,height=496!

A sort operation was added after the corresponding join nodes, and the
execution was slower than with shuffle hash join.

This confirms that the performance deterioration after enabling shuffle hash
join is caused by sorting a large amount of data.

!image-2023-05-19-10-43-51-747.png|width=932,height=50!


q95 sql with sort operation added```

set 
spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";


set spark.sql.execution.removeRedundantSorts=false;


WITH ws_wh AS ( SELECT ws1.ws_order_number, ws1.ws_warehouse_sk wh1, 
ws2.ws_warehouse_sk wh2 FROM web_sales ws1, web_sales ws2 WHERE 
ws1.ws_order_number=ws2.ws_order_number AND 
ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk SORT BY ws1.ws_order_number ), tmp1 as 
( SELECT ws_order_number FROM ws_wh ), tmp2 as ( SELECT wr_order_number FROM 
web_returns, ws_wh WHERE wr_order_number=ws_wh.ws_order_number SORT BY 
wr_order_number ) SELECT count(DISTINCT ws_order_number) AS `order count `, 
sum(ws_ext_ship_cost) AS `total shipping cost `, sum(ws_net_profit) AS `total 
net profit ` FROM web_sales ws1 left semi join tmp1 on 
ws1.ws_order_number=tmp1.ws_order_number left semi join tmp2 on 
ws1.ws_order_number=tmp2.wr_order_number join date_dim on 
ws1.ws_ship_date_sk=date_dim.d_date_sk join customer_address on 
ws1.ws_ship_addr_sk=customer_address.ca_address_sk join web_site on 
ws1.ws_web_site_sk=web_site.web_site_sk WHERE d_date BETWEEN '1999-02-01' AND 
(CAST('1999-02-01' AS DATE)+INTERVAL 60 DAY) AND ws1.ws_ship_date_sk=d_date_sk 
AND ws1.ws_ship_addr_sk=ca_address_sk AND ca_state='IL' AND 
ws1.ws_web_site_sk=web_site_sk AND web_company_name='pri' ORDER BY 
count(DISTINCT ws_order_number) LIMIT 100

```
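
For completeness, a minimal PySpark sketch of the session settings used to steer the planner between the two join strategies when re-running the rewritten q95. The file name is a placeholder, and the join-related settings are the usual knobs for this comparison rather than the exact values from this benchmark:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("q95-join-strategy").getOrCreate()

# Steer the planner toward shuffled hash join instead of sort merge join.
# preferSortMergeJoin is an internal setting; SHJ is still only picked when one
# side is small enough to build a per-partition hash map.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
# Disable broadcast joins so the SHJ-vs-SMJ choice stays visible in the plan.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

# Settings quoted in the comment above: keep the manually added SORT BY nodes.
spark.sql("SET spark.sql.optimizer.excludedRules="
          "org.apache.spark.sql.catalyst.optimizer.EliminateSorts")
spark.sql("SET spark.sql.execution.removeRedundantSorts=false")

# 'q95_with_sorts.sql' is a placeholder file holding just the rewritten query
# (the WITH ... SELECT ... LIMIT 100 statement, without the SET statements).
with open("q95_with_sorts.sql") as f:
    q95 = f.read()
spark.sql(q95).explain(mode="formatted")  # check which join operator was picked
{code}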

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, 
> image-2023-05-17-16-54-59-053.png, image-2023-05-19-10-43-51-747.png, 
> shuffle1.png, sort1.png, sort2.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it? Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-18 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: sort2.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, 
> image-2023-05-17-16-54-59-053.png, image-2023-05-19-10-43-51-747.png, 
> shuffle1.png, sort1.png, sort2.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it? Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-18 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: image-2023-05-19-10-43-51-747.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, 
> image-2023-05-17-16-54-59-053.png, image-2023-05-19-10-43-51-747.png, 
> shuffle1.png, sort1.png, sort2.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it? Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-18 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: shuffle1.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, 
> image-2023-05-17-16-54-59-053.png, shuffle1.png, sort1.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it? Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-18 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: sort1.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, 
> image-2023-05-17-16-54-59-053.png, shuffle1.png, sort1.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it? Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43582) Upgrade `sbt-pom-reader` to 2.4.0

2023-05-18 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43582:
---

 Summary: Upgrade `sbt-pom-reader` to 2.4.0
 Key: SPARK-43582
 URL: https://issues.apache.org/jira/browse/SPARK-43582
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43521) Support CREATE TABLE LIKE FILE for PARQUET

2023-05-18 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723900#comment-17723900
 ] 

Jia Fan edited comment on SPARK-43521 at 5/19/23 1:31 AM:
--

I'm working on this!


was (Author: fanjia):
I'm working for this!

> Support CREATE TABLE LIKE FILE for PARQUET
> --
>
> Key: SPARK-43521
> URL: https://issues.apache.org/jira/browse/SPARK-43521
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
>
> ref: https://issues.apache.org/jira/browse/HIVE-26395
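
For illustration only, a sketch of how the feature might be invoked from PySpark if Spark follows the HIVE-26395 syntax; both the statement form and the file path are assumptions until the Spark SQL grammar is settled:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create-table-like-file").getOrCreate()

# Hypothetical syntax modeled on HIVE-26395: derive the table schema from an
# existing Parquet file instead of listing the columns by hand.
spark.sql("""
  CREATE TABLE sales_copy
  LIKE FILE PARQUET '/tmp/warehouse/sales/part-00000.parquet'
""")
{code}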



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43581) Upgrade `kubernetes-client` to 6.6.2

2023-05-18 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-43581:
-

 Summary: Upgrade `kubernetes-client` to 6.6.2
 Key: SPARK-43581
 URL: https://issues.apache.org/jira/browse/SPARK-43581
 Project: Spark
  Issue Type: Bug
  Components: Build, Kubernetes
Affects Versions: 3.5.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43541) Incorrect column resolution on FULL OUTER JOIN with USING

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43541:
--
Fix Version/s: 3.3.3

> Incorrect column resolution on FULL OUTER JOIN with USING
> -
>
> Key: SPARK-43541
> URL: https://issues.apache.org/jira/browse/SPARK-43541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.3, 3.4.1, 3.5.0
>
>
> This was tested on Spark 3.3.2 and Spark 3.4.0.
> {code}
> Causes [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter 
> with name `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the 
> following? [`key`].; line 4, pos 7
> {code}
> FULL OUTER JOIN with USING and/or the WHERE seems relevant since I can get 
> the query to work with any of these modifications. 
> {code}
> # -- FULL OUTER JOIN
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b USING (key)
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? 
> [`key`].; line 4 pos 7
> # -- INNER JOIN
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>JOIN gcp_pro_b USING (key)
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 0.507s
> # -- NO Filter
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b USING (key);
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 1.021s
> # -- ON instead of USING
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b ON aws_dbr_a.key = gcp_pro_b.key
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 0.514s
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43541) Incorrect column resolution on FULL OUTER JOIN with USING

2023-05-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724062#comment-17724062
 ] 

Dongjoon Hyun commented on SPARK-43541:
---

This is backported to branch-3.3 via https://github.com/apache/spark/pull/41221

> Incorrect column resolution on FULL OUTER JOIN with USING
> -
>
> Key: SPARK-43541
> URL: https://issues.apache.org/jira/browse/SPARK-43541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.3, 3.4.1, 3.5.0
>
>
> This was tested on Spark 3.3.2 and Spark 3.4.0.
> {code}
> Causes [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter 
> with name `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the 
> following? [`key`].; line 4, pos 7
> {code}
> FULL OUTER JOIN with USING and/or the WHERE seems relevant since I can get 
> the query to work with any of these modifications. 
> {code}
> # -- FULL OUTER JOIN
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b USING (key)
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? 
> [`key`].; line 4 pos 7
> # -- INNER JOIN
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>JOIN gcp_pro_b USING (key)
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 0.507s
> # -- NO Filter
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b USING (key);
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 1.021s
> # -- ON instead of USING
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b ON aws_dbr_a.key = gcp_pro_b.key
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 0.514s
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43580) Add https://dlcdn.apache.org/ to default_sites of get_preferred_mirrors

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43580.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41222
[https://github.com/apache/spark/pull/41222]

> Add https://dlcdn.apache.org/ to default_sites of get_preferred_mirrors
> ---
>
> Key: SPARK-43580
> URL: https://issues.apache.org/jira/browse/SPARK-43580
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43580) Add https://dlcdn.apache.org/ to default_sites of get_preferred_mirrors

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43580:
-

Assignee: Dongjoon Hyun

> Add https://dlcdn.apache.org/ to default_sites of get_preferred_mirrors
> ---
>
> Key: SPARK-43580
> URL: https://issues.apache.org/jira/browse/SPARK-43580
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43541) Incorrect column resolution on FULL OUTER JOIN with USING

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43541:
--
Fix Version/s: 3.4.1

> Incorrect column resolution on FULL OUTER JOIN with USING
> -
>
> Key: SPARK-43541
> URL: https://issues.apache.org/jira/browse/SPARK-43541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.1, 3.5.0
>
>
> This was tested on Spark 3.3.2 and Spark 3.4.0.
> {code}
> Causes [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter 
> with name `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the 
> following? [`key`].; line 4, pos 7
> {code}
> FULL OUTER JOIN with USING and/or the WHERE seems relevant since I can get 
> the query to work with any of these modifications. 
> {code}
> # -- FULL OUTER JOIN
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b USING (key)
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? 
> [`key`].; line 4 pos 7
> # -- INNER JOIN
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>JOIN gcp_pro_b USING (key)
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 0.507s
> # -- NO Filter
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b USING (key);
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 1.021s
> # -- ON instead of USING
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b ON aws_dbr_a.key = gcp_pro_b.key
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 0.514s
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43541) Incorrect column resolution on FULL OUTER JOIN with USING

2023-05-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724047#comment-17724047
 ] 

Dongjoon Hyun commented on SPARK-43541:
---

This is backported to branch-3.4 via https://github.com/apache/spark/pull/41220

> Incorrect column resolution on FULL OUTER JOIN with USING
> -
>
> Key: SPARK-43541
> URL: https://issues.apache.org/jira/browse/SPARK-43541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.1, 3.5.0
>
>
> This was tested on Spark 3.3.2 and Spark 3.4.0.
> {code}
> Causes [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter 
> with name `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the 
> following? [`key`].; line 4, pos 7
> {code}
> FULL OUTER JOIN with USING and/or the WHERE seems relevant since I can get 
> the query to work with any of these modifications. 
> {code}
> # -- FULL OUTER JOIN
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b USING (key)
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? 
> [`key`].; line 4 pos 7
> # -- INNER JOIN
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>JOIN gcp_pro_b USING (key)
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 0.507s
> # -- NO Filter
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b USING (key);
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 1.021s
> # -- ON instead of USING
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b ON aws_dbr_a.key = gcp_pro_b.key
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 0.514s
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43580) Add https://dlcdn.apache.org/ to default_sites of get_preferred_mirrors

2023-05-18 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-43580:
-

 Summary: Add https://dlcdn.apache.org/ to default_sites of 
get_preferred_mirrors
 Key: SPARK-43580
 URL: https://issues.apache.org/jira/browse/SPARK-43580
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Tests
Affects Versions: 3.5.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43574) Support to set Python executable in workers during runtime

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43574.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41215
[https://github.com/apache/spark/pull/41215]

> Support to set  Python executable in workers during runtime
> ---
>
> Key: SPARK-43574
> URL: https://issues.apache.org/jira/browse/SPARK-43574
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.5.0
>
>
> For Python UDF, we can support runtime configuration to set Python workers. 
> It will create a new Python worker if the configuration is changed.
> This is especially useful when you want to run Spark with different 
> dependencies during runtime. See also 
> https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
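
A minimal usage sketch, assuming the feature is exposed as a runtime SQL configuration; the configuration key and interpreter path below are assumptions for illustration, not confirmed names:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.appName("runtime-python-executable").getOrCreate()

# Assumed configuration key: point Python UDF workers at another interpreter
# (for example, an environment with different dependencies) at runtime.
spark.conf.set("spark.sql.execution.pyspark.python", "/envs/pandas2/bin/python")

@udf("string")
def which_python(x):
    import sys
    return sys.executable  # reports the interpreter the worker actually runs

spark.range(1).select(which_python("id")).show(truncate=False)
{code}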



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43574) Support to set Python executable in workers during runtime

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43574:
-

Assignee: Hyukjin Kwon

> Support to set  Python executable in workers during runtime
> ---
>
> Key: SPARK-43574
> URL: https://issues.apache.org/jira/browse/SPARK-43574
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> For Python UDF, we can support runtime configuration to set Python workers. 
> It will create a new Python worker if the configuration is changed.
> This is especially useful when you want to run Spark with different 
> dependencies during runtime. See also 
> https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43575) Exclude duplicated classes from kafka assembly jar

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43575:
-

Assignee: Cheng Pan

> Exclude duplicated classes from kafka assembly jar
> --
>
> Key: SPARK-43575
> URL: https://issues.apache.org/jira/browse/SPARK-43575
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43575) Exclude duplicated classes from kafka assembly jar

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43575.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41217
[https://github.com/apache/spark/pull/41217]

> Exclude duplicated classes from kafka assembly jar
> --
>
> Key: SPARK-43575
> URL: https://issues.apache.org/jira/browse/SPARK-43575
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43358) Cover the resolution of insertion with the SupportsCustomSchemaWrite interface

2023-05-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723999#comment-17723999
 ] 

Dongjoon Hyun commented on SPARK-43358:
---

I converted this from a subtask of SPARK-38334 to a standalone issue because 
SPARK-38334 was already resolved with Fix Version 3.4.0. 

> Cover the resolution of insertion with the SupportsCustomSchemaWrite interface
> --
>
> Key: SPARK-43358
> URL: https://issues.apache.org/jira/browse/SPARK-43358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43358) Cover the resolution of insertion with the SupportsCustomSchemaWrite interface

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43358:
--
Issue Type: Improvement  (was: Bug)

> Cover the resolution of insertion with the SupportsCustomSchemaWrite interface
> --
>
> Key: SPARK-43358
> URL: https://issues.apache.org/jira/browse/SPARK-43358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43313) Adding missing default values for MERGE INSERT actions

2023-05-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723998#comment-17723998
 ] 

Dongjoon Hyun commented on SPARK-43313:
---

To [~dtenedor], SPARK-38334 was already resolved with Fix Version 3.4.0.
I converted this issue from a subtask to an independent issue and added a link 
to SPARK-38334 instead.
Please file new Jira issues (both bugs and improvements) independently or 
under a new umbrella JIRA.

 

> Adding missing default values for MERGE INSERT actions
> --
>
> Key: SPARK-43313
> URL: https://issues.apache.org/jira/browse/SPARK-43313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43358) Cover the resolution of insertion with the SupportsCustomSchemaWrite interface

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43358:
--
Parent: (was: SPARK-38334)
Issue Type: Bug  (was: Sub-task)

> Cover the resolution of insertion with the SupportsCustomSchemaWrite interface
> --
>
> Key: SPARK-43358
> URL: https://issues.apache.org/jira/browse/SPARK-43358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43313) Adding missing default values for MERGE INSERT actions

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43313:
--
Parent: (was: SPARK-38334)
Issue Type: Bug  (was: Sub-task)

> Adding missing default values for MERGE INSERT actions
> --
>
> Key: SPARK-43313
> URL: https://issues.apache.org/jira/browse/SPARK-43313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43313) Adding missing default values for MERGE INSERT actions

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43313:
--
Affects Version/s: 3.4.0
   (was: 3.5.0)

> Adding missing default values for MERGE INSERT actions
> --
>
> Key: SPARK-43313
> URL: https://issues.apache.org/jira/browse/SPARK-43313
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.4.1, 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43313) Adding missing default values for MERGE INSERT actions

2023-05-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723997#comment-17723997
 ] 

Dongjoon Hyun commented on SPARK-43313:
---

This is reverted from branch-3.4 via 
[https://github.com/apache/spark/commit/079594ae976b377459ad09d864106734ef65c32d]

> Adding missing default values for MERGE INSERT actions
> --
>
> Key: SPARK-43313
> URL: https://issues.apache.org/jira/browse/SPARK-43313
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43313) Adding missing default values for MERGE INSERT actions

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43313:
--
Fix Version/s: (was: 3.4.1)

> Adding missing default values for MERGE INSERT actions
> --
>
> Key: SPARK-43313
> URL: https://issues.apache.org/jira/browse/SPARK-43313
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43313) Adding missing default values for MERGE INSERT actions

2023-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43313:
--
Affects Version/s: 3.5.0
   (was: 3.4.0)

> Adding missing default values for MERGE INSERT actions
> --
>
> Key: SPARK-43313
> URL: https://issues.apache.org/jira/browse/SPARK-43313
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.4.1, 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43541) Incorrect column resolution on FULL OUTER JOIN with USING

2023-05-18 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43541.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41204
[https://github.com/apache/spark/pull/41204]

> Incorrect column resolution on FULL OUTER JOIN with USING
> -
>
> Key: SPARK-43541
> URL: https://issues.apache.org/jira/browse/SPARK-43541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.5.0
>
>
> This was tested on Spark 3.3.2 and Spark 3.4.0.
> {code}
> Causes [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter 
> with name `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the 
> following? [`key`].; line 4, pos 7
> {code}
> The FULL OUTER JOIN with USING and/or the WHERE clause seems relevant, since I can get 
> the query to work with any of the modifications below. 
> {code}
> # -- FULL OUTER JOIN
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b USING (key)
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `aws_dbr_a`.`key` cannot be resolved. Did you mean one of the following? 
> [`key`].; line 4 pos 7
> # -- INNER JOIN
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>JOIN gcp_pro_b USING (key)
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 0.507s
> # -- NO Filter
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b USING (key);
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 1.021s
> # -- ON instead of USING
>WITH
>aws_dbr_a AS (select key from values ('a') t(key)),
>gcp_pro_b AS (select key from values ('a') t(key))
>SELECT aws_dbr_a.key
>FROM aws_dbr_a
>FULL OUTER JOIN gcp_pro_b ON aws_dbr_a.key = gcp_pro_b.key
>WHERE aws_dbr_a.key NOT LIKE 'spark.clusterUsageTags.%';
> +-+
> | key |
> |-|
> | a   |
> +-+
> 1 row in set
> Time: 0.514s
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43579) Cache the converter between Arrow and pandas for reuse

2023-05-18 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-43579:


 Summary: Cache the converter between Arrow and pandas for reuse
 Key: SPARK-43579
 URL: https://issues.apache.org/jira/browse/SPARK-43579
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43136) Scala mapGroup, coGroup

2023-05-18 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723976#comment-17723976
 ] 

Hudson commented on SPARK-43136:


User 'zhenlineo' has created a pull request for this issue:
https://github.com/apache/spark/pull/40980

> Scala mapGroup, coGroup
> ---
>
> Key: SPARK-43136
> URL: https://issues.apache.org/jira/browse/SPARK-43136
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Zhen Li
>Assignee: Zhen Li
>Priority: Major
> Fix For: 3.5.0
>
>
> Adding the basics of Dataset#groupByKey -> KeyValueGroupedDataset support
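
For reference, a minimal sketch of the existing (non-Connect) Scala API surface this covers; the Connect client is expected to mirror it, but the exact surface is defined by the linked PR.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val left  = Seq(("a", 1), ("a", 2), ("b", 3)).toDS().groupByKey(_._1)
val right = Seq(("a", 10), ("b", 20)).toDS().groupByKey(_._1)

// mapGroups: reduce each key's rows to a single record.
val sums = left.mapGroups { (key, rows) => (key, rows.map(_._2).sum) }

// cogroup: combine the two grouped datasets key by key.
val joined = left.cogroup(right) { (key, l, r) =>
  Iterator((key, l.map(_._2).sum + r.map(_._2).sum))
}

sums.show()
joined.show()
{code}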



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43511) Implemented State APIs for Spark Connect Scala

2023-05-18 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723974#comment-17723974
 ] 

Hudson commented on SPARK-43511:


User 'bogao007' has created a pull request for this issue:
https://github.com/apache/spark/pull/40959

> Implemented State APIs for Spark Connect Scala
> --
>
> Key: SPARK-43511
> URL: https://issues.apache.org/jira/browse/SPARK-43511
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Bo Gao
>Priority: Major
>
> Implemented MapGroupsWithState and FlatMapGroupsWithState APIs for Spark 
> Connect Scala
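
For reference, a minimal sketch of what mapGroupsWithState looks like in the existing Dataset API; the stream source, state type, and names here are only illustrative, and the Connect surface itself is defined by the linked PR.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.GroupState

case class Event(user: String, count: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Keep a running per-user total in GroupState[Int].
def updateTotal(user: String, events: Iterator[Event], state: GroupState[Int]): (String, Int) = {
  val total = state.getOption.getOrElse(0) + events.map(_.count).sum
  state.update(total)
  (user, total)
}

val events = spark.readStream.format("rate").load()
  .selectExpr("concat('user', cast(value % 3 as string)) AS user", "1 AS count")
  .as[Event]

val totals = events.groupByKey(_.user).mapGroupsWithState(updateTotal _)
{code}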



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43317) Support combine adjacent aggregation

2023-05-18 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723977#comment-17723977
 ] 

Hudson commented on SPARK-43317:


User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/40990

> Support combine adjacent aggregation
> 
>
> Key: SPARK-43317
> URL: https://issues.apache.org/jira/browse/SPARK-43317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> If there are adjacent aggregations with Partial and Final modes, we can 
> combine them into a single aggregation with Complete mode.
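
For illustration only (this is not the optimizer rule itself): a grouped aggregate is normally planned as a Partial HashAggregate, an Exchange, and a Final HashAggregate; when the child already satisfies the required distribution, the two aggregates end up adjacent, which is the case this ticket proposes to collapse. A small sketch to see the adjacent pair, assuming an existing SparkSession named spark:

{code:scala}
import org.apache.spark.sql.functions.col

// The child is repartitioned on the grouping key, so no shuffle is needed between
// the Partial and Final HashAggregate nodes and they sit adjacent in the plan.
val df = spark.range(0, 1000)
  .selectExpr("id % 10 AS k", "id AS v")
  .repartition(col("k"))
  .groupBy("k")
  .sum("v")

df.explain()  // look for two adjacent HashAggregate nodes with Partial and Final modes
{code}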



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43267) Support creating data frame from a Postgres table that contains user-defined array column

2023-05-18 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723972#comment-17723972
 ] 

Hudson commented on SPARK-43267:


User 'Hisoka-X' has created a pull request for this issue:
https://github.com/apache/spark/pull/40953

> Support creating data frame from a Postgres table that contains user-defined 
> array column
> -
>
> Key: SPARK-43267
> URL: https://issues.apache.org/jira/browse/SPARK-43267
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0, 3.3.2
>Reporter: Sifan Huang
>Priority: Blocker
>
> Spark SQL currently doesn't support creating a DataFrame from a Postgres table that 
> contains a user-defined array column. However, it used to allow such a type 
> before the Postgres JDBC commit 
> (https://github.com/pgjdbc/pgjdbc/commit/375cb3795c3330f9434cee9353f0791b86125914).
>  The previous behavior was to handle a user-defined array column as a String.
> Given:
>  * Postgres table with user-defined array column
>  * Function: DataFrameReader.jdbc - 
> https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/DataFrameReader.html#jdbc-java.lang.String-java.lang.String-java.util.Properties-
> Results:
>  * Exception “java.sql.SQLException: Unsupported type ARRAY” is thrown
> Expectation after the change:
>  * Function call succeeds
>  * User-defined array is converted as a string in Spark DataFrame
> Suggested fix:
>  * Update “getCatalystType” function in “PostgresDialect” as
>  ** 
> {code:java}
> val catalystType = toCatalystType(typeName.drop(1), size, 
> scale).map(ArrayType(_))
> if (catalystType.isEmpty) Some(StringType) else catalystType{code}
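
For illustration, here is a minimal, self-contained sketch of the fallback behavior the suggested fix describes; it is not the actual PostgresDialect code, and the element-type mapping is passed in as an assumed helper.

{code:scala}
import org.apache.spark.sql.types._

// Hedged sketch, not the real PostgresDialect implementation: map a Postgres
// array type name (prefixed with '_', e.g. "_int4" or "_myenum") to a Catalyst
// type, falling back to StringType when the element type has no known mapping.
def postgresArrayToCatalyst(
    typeName: String,
    toCatalystType: String => Option[DataType]  // assumed element-type mapping helper
  ): Option[DataType] = {
  toCatalystType(typeName.drop(1))
    .map(ArrayType(_))
    .orElse(Some(StringType))   // user-defined element type: fall back to String
}

// e.g. postgresArrayToCatalyst("_myenum", _ => None) == Some(StringType)
{code}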



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39420) Support ANALYZE TABLE on v2 tables

2023-05-18 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723975#comment-17723975
 ] 

Hudson commented on SPARK-39420:


User 'Hisoka-X' has created a pull request for this issue:
https://github.com/apache/spark/pull/4

> Support ANALYZE TABLE on v2 tables
> --
>
> Key: SPARK-39420
> URL: https://issues.apache.org/jira/browse/SPARK-39420
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Felipe
>Priority: Major
>
> According to https://github.com/delta-io/delta/pull/840, to implement ANALYZE 
> TABLE in Delta we need to add the missing APIs in Spark that allow a data 
> source to report the file set used to calculate the stats.
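
For reference, this is the existing statement the ticket wants to support for v2 tables; the catalog and table names below are placeholders, assuming an existing SparkSession named spark.

{code:scala}
// Existing ANALYZE TABLE syntax (supported for v1 tables today); the ticket asks
// for the APIs that let a v2 data source report its file set for these stats.
spark.sql("ANALYZE TABLE testcat.db.events COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE testcat.db.events COMPUTE STATISTICS FOR COLUMNS id, name")
spark.sql("DESCRIBE EXTENDED testcat.db.events").show(truncate = false)
{code}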



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43266) Move MergeScalarSubqueries to spark-sql

2023-05-18 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723973#comment-17723973
 ] 

Hudson commented on SPARK-43266:


User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/40932

> Move MergeScalarSubqueries to spark-sql
> ---
>
> Key: SPARK-43266
> URL: https://issues.apache.org/jira/browse/SPARK-43266
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Peter Toth
>Priority: Minor
>
> This is a step to make SPARK-40193 easier.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43180) Upgrade mypy package in python pyspark dependencies

2023-05-18 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723971#comment-17723971
 ] 

Hudson commented on SPARK-43180:


User 'aimtsou' has created a pull request for this issue:
https://github.com/apache/spark/pull/40960

> Upgrade mypy package in python pyspark dependencies
> ---
>
> Key: SPARK-43180
> URL: https://issues.apache.org/jira/browse/SPARK-43180
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, python
>Affects Versions: 3.2.4, 3.3.2, 3.4.0
>Reporter: Aimilios Tsouvelekakis
>Priority: Minor
>
> While working on this issue, I noticed that mypy is pinned to a fairly old version. 
> Running it at version 1.2.0 does not show any problems with the current tests.
> Upgrading will also let me get better support from the mypy authors on why mypy fails 
> on the typing.io issue linked above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43288) DataSourceV2: CREATE TABLE LIKE

2023-05-18 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723970#comment-17723970
 ] 

Hudson commented on SPARK-43288:


User 'jzhuge' has created a pull request for this issue:
https://github.com/apache/spark/pull/40963

> DataSourceV2: CREATE TABLE LIKE
> ---
>
> Key: SPARK-43288
> URL: https://issues.apache.org/jira/browse/SPARK-43288
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: John Zhuge
>Priority: Major
>
> Support CREATE TABLE LIKE in DSv2.
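
For context, a minimal sketch of the statement in question; it works for v1 tables today, and the ticket is about supporting it when the tables live in a v2 catalog. The names below are placeholders, assuming an existing SparkSession named spark.

{code:scala}
// CREATE TABLE ... LIKE creates a new, empty table with the schema of an
// existing one; USING optionally overrides the provider.
spark.sql("CREATE TABLE testcat.db.events_copy LIKE testcat.db.events")
spark.sql("CREATE TABLE testcat.db.events_pq LIKE testcat.db.events USING parquet")
{code}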



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43578) Protobuf : Use broadcast to distribute descriptor file content

2023-05-18 Thread Raghu Angadi (Jira)
Raghu Angadi created SPARK-43578:


 Summary: Protobuf : Use broadcast to distribute descriptor file 
content
 Key: SPARK-43578
 URL: https://issues.apache.org/jira/browse/SPARK-43578
 Project: Spark
  Issue Type: Task
  Components: Protobuf
Affects Versions: 3.5.0
Reporter: Raghu Angadi






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43577) Upgrade cyclonedx-maven-plugin to 2.7.9

2023-05-18 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723957#comment-17723957
 ] 

Yuming Wang commented on SPARK-43577:
-

https://github.com/apache/spark/pull/41219

> Upgrade cyclonedx-maven-plugin to 2.7.9
> ---
>
> Key: SPARK-43577
> URL: https://issues.apache.org/jira/browse/SPARK-43577
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>
> https://github.com/CycloneDX/cyclonedx-maven-plugin/issues/326
> {noformat}
> Error:  Failed to execute goal 
> org.cyclonedx:cyclonedx-maven-plugin:2.7.6:makeBom (default) on project 
> spark-tags_2.12: Execution default of goal 
> org.cyclonedx:cyclonedx-maven-plugin:2.7.6:makeBom failed: Unsupported class 
> file major version 64 -> [Help 1]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43577) Upgrade cyclonedx-maven-plugin to 2.7.9

2023-05-18 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-43577:
---

 Summary: Upgrade cyclonedx-maven-plugin to 2.7.9
 Key: SPARK-43577
 URL: https://issues.apache.org/jira/browse/SPARK-43577
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Yuming Wang



https://github.com/CycloneDX/cyclonedx-maven-plugin/issues/326

{noformat}
Error:  Failed to execute goal 
org.cyclonedx:cyclonedx-maven-plugin:2.7.6:makeBom (default) on project 
spark-tags_2.12: Execution default of goal 
org.cyclonedx:cyclonedx-maven-plugin:2.7.6:makeBom failed: Unsupported class 
file major version 64 -> [Help 1]
{noformat}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43576) Remove unused declarations from Core module

2023-05-18 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43576:
---

 Summary: Remove unused declarations from Core module
 Key: SPARK-43576
 URL: https://issues.apache.org/jira/browse/SPARK-43576
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43521) Support CREATE TABLE LIKE FILE for PARQUET

2023-05-18 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723900#comment-17723900
 ] 

Jia Fan commented on SPARK-43521:
-

I'm working on this!

> Support CREATE TABLE LIKE FILE for PARQUET
> --
>
> Key: SPARK-43521
> URL: https://issues.apache.org/jira/browse/SPARK-43521
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
>
> ref: https://issues.apache.org/jira/browse/HIVE-26395
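
A sketch of what the proposed syntax might look like, modeled on the HIVE-26395 reference above; this syntax does not exist in Spark yet, so both the statement and the path below are purely hypothetical.

{code:scala}
// Hypothetical, modeled on HIVE-26395: derive the table schema from an existing
// Parquet file instead of spelling out the columns. Not yet supported by Spark.
spark.sql(
  "CREATE TABLE events LIKE FILE PARQUET '/data/warehouse/events/part-00000.parquet'")
{code}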



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43575) Exclude duplicated classes from kafka assembly jar

2023-05-18 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-43575:
-

 Summary: Exclude duplicated classes from kafka assembly jar
 Key: SPARK-43575
 URL: https://issues.apache.org/jira/browse/SPARK-43575
 Project: Spark
  Issue Type: Improvement
  Components: Build, Structured Streaming
Affects Versions: 3.5.0
Reporter: Cheng Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43549) Convert `_LEGACY_ERROR_TEMP_0036` to INVALID_SQL_SYNTAX

2023-05-18 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43549:

Summary: Convert `_LEGACY_ERROR_TEMP_0036` to INVALID_SQL_SYNTAX  (was: 
Assign a name to the error class _LEGACY_ERROR_TEMP_0036)

> Convert `_LEGACY_ERROR_TEMP_0036` to INVALID_SQL_SYNTAX
> ---
>
> Key: SPARK-43549
> URL: https://issues.apache.org/jira/browse/SPARK-43549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43574) Support to set Python executable in workers during runtime

2023-05-18 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-43574:


 Summary: Support to set  Python executable in workers during 
runtime
 Key: SPARK-43574
 URL: https://issues.apache.org/jira/browse/SPARK-43574
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Hyukjin Kwon


For Python UDFs, we can support a runtime configuration that sets the Python executable 
used by workers. A new Python worker will be created if the configuration is changed.

This is especially useful when you want to run Spark with different 
dependencies at runtime. See also 
https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
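
A minimal sketch of how such a runtime switch might be used; the configuration key below is an assumption for illustration only, not necessarily the one the feature ends up with.

{code:scala}
// Assumed, hypothetical conf key; the real key is defined by the implementation.
// Changing the executable at runtime should make PySpark start new Python workers
// with the other interpreter (and its dependencies) for subsequent UDF execution.
spark.conf.set("spark.sql.execution.pyspark.python", "/opt/envs/team-a/bin/python")
// ... run Python UDFs against team A's environment ...
spark.conf.set("spark.sql.execution.pyspark.python", "/opt/envs/team-b/bin/python")
// ... later UDF executions pick up fresh workers using team B's environment ...
{code}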



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43573) Make SparkBuilder could config the heap size of test JVM.

2023-05-18 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-43573:
---
Description: 
build/sbt "sql/Test/runMain  --dsdgenDir  --location  
--scaleFactor 1" causes OOM, if the scaleFactor big enough.

{code:java}
[info] 16:43:41.618 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.627 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.646 WARN 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete 
file:/home/ubuntu/tpcdsdata/test/catalog_sales/_tempo
rary/0/_temporary/attempt_202305181633205732292221634890857_0006_m_10_610
[info] 16:43:41.647 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.647 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.656 WARN 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete 
file:/home/ubuntu/tpcdsdata/test/catalog_sales/_tempo
rary/0/_temporary/attempt_202305181633205732292221634890857_0006_m_14_614
[info] 16:43:41.656 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.668 WARN 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete 
file:/home/ubuntu/tpcdsdata/test/catalog_sales/_tempo
rary/0/_temporary/attempt_202305181633205732292221634890857_0006_m_02_602
[info] 16:43:41.668 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[error] Exception in thread "main" org.apache.spark.SparkException: Job aborted 
due to stage failure: Task 13 in stage 6.0 failed 1 times, most recent fail
ure: Lost task 13.0 in stage 6.0 (TID 613) 
(ip-172-31-27-53.cn-northwest-1.compute.internal executor driver): 
org.apache.spark.SparkException: [TASK_WRITE_
FAILED] Task failed while writing rows to 
file:/home/ubuntu/tpcdsdata/test/catalog_sales.
[error] at 
org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)
[error] at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
[error] at 
org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
[error] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
[error] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
[error] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[error] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
[error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
[error] at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
[error] at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
[error] at org.apache.spark.scheduler.Task.run(Task.scala:139)
[error] at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
[error] at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1487)
[error] at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
[error] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] at java.lang.Thread.run(Thread.java:750)
[error] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
[error] Driver stacktrace:
[error] at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2815)
[error] at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2751)
[error] at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2750)
[error] at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[error] at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[error] at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[error] at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2750)
[error] at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1218)
[error] at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1218)
[error] at scala
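
Setting the log aside, a minimal sketch of one way to raise the heap of a forked test JVM in sbt, assuming that is the knob this ticket wants to expose; the actual change is in the linked PR and may differ.

{code:scala}
// sbt setting sketch (assumption): fork the test JVM and give it a larger heap.
// In Spark's own build this kind of setting lives in project/SparkBuild.scala.
Test / fork := true
Test / javaOptions ++= Seq("-Xmx8g", "-XX:+UseG1GC")
{code}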

[jira] [Updated] (SPARK-43573) Make SparkBuilder could config the heap size of test JVM.

2023-05-18 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-43573:
---
Description: 

{code:java}
build/sbt "sql/Test/runMain  --dsdgenDir  --location  
--scaleFactor 1"
{code}
 causes an OOM if the scaleFactor is big enough.

{code:java}
[info] 16:43:41.618 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.627 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.646 WARN 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete 
file:/home/ubuntu/tpcdsdata/test/catalog_sales/_tempo
rary/0/_temporary/attempt_202305181633205732292221634890857_0006_m_10_610
[info] 16:43:41.647 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.647 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.656 WARN 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete 
file:/home/ubuntu/tpcdsdata/test/catalog_sales/_tempo
rary/0/_temporary/attempt_202305181633205732292221634890857_0006_m_14_614
[info] 16:43:41.656 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.668 WARN 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete 
file:/home/ubuntu/tpcdsdata/test/catalog_sales/_tempo
rary/0/_temporary/attempt_202305181633205732292221634890857_0006_m_02_602
[info] 16:43:41.668 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[error] Exception in thread "main" org.apache.spark.SparkException: Job aborted 
due to stage failure: Task 13 in stage 6.0 failed 1 times, most recent fail
ure: Lost task 13.0 in stage 6.0 (TID 613) 
(ip-172-31-27-53.cn-northwest-1.compute.internal executor driver): 
org.apache.spark.SparkException: [TASK_WRITE_
FAILED] Task failed while writing rows to 
file:/home/ubuntu/tpcdsdata/test/catalog_sales.
[error] at 
org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)
[error] at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
[error] at 
org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
[error] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
[error] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
[error] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[error] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
[error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
[error] at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
[error] at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
[error] at org.apache.spark.scheduler.Task.run(Task.scala:139)
[error] at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
[error] at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1487)
[error] at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
[error] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] at java.lang.Thread.run(Thread.java:750)
[error] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
[error] Driver stacktrace:
[error] at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2815)
[error] at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2751)
[error] at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2750)
[error] at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[error] at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[error] at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[error] at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2750)
[error] at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1218)
[error] at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1218)
[er

[jira] [Commented] (SPARK-43573) Make SparkBuilder could config the heap size of test JVM.

2023-05-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723861#comment-17723861
 ] 

ASF GitHub Bot commented on SPARK-43573:


User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/41212

> Make SparkBuilder could config the heap size of test JVM.
> -
>
> Key: SPARK-43573
> URL: https://issues.apache.org/jira/browse/SPARK-43573
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>
> {code:java}
> [info] 16:43:41.618 ERROR 
> org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
> job_202305181633205732292221634890857_0006 aborted.
> [info] 16:43:41.627 ERROR 
> org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
> job_202305181633205732292221634890857_0006 aborted.
> [info] 16:43:41.646 WARN 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete 
> file:/home/ubuntu/tpcdsdata/test/catalog_sales/_tempo
> rary/0/_temporary/attempt_202305181633205732292221634890857_0006_m_10_610
> [info] 16:43:41.647 ERROR 
> org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
> job_202305181633205732292221634890857_0006 aborted.
> [info] 16:43:41.647 ERROR 
> org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
> job_202305181633205732292221634890857_0006 aborted.
> [info] 16:43:41.656 WARN 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete 
> file:/home/ubuntu/tpcdsdata/test/catalog_sales/_tempo
> rary/0/_temporary/attempt_202305181633205732292221634890857_0006_m_14_614
> [info] 16:43:41.656 ERROR 
> org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
> job_202305181633205732292221634890857_0006 aborted.
> [info] 16:43:41.668 WARN 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete 
> file:/home/ubuntu/tpcdsdata/test/catalog_sales/_tempo
> rary/0/_temporary/attempt_202305181633205732292221634890857_0006_m_02_602
> [info] 16:43:41.668 ERROR 
> org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
> job_202305181633205732292221634890857_0006 aborted.
> [error] Exception in thread "main" org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 13 in stage 6.0 failed 1 times, most 
> recent fail
> ure: Lost task 13.0 in stage 6.0 (TID 613) 
> (ip-172-31-27-53.cn-northwest-1.compute.internal executor driver): 
> org.apache.spark.SparkException: [TASK_WRITE_
> FAILED] Task failed while writing rows to 
> file:/home/ubuntu/tpcdsdata/test/catalog_sales.
> [error] at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)
> [error] at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
> [error] at 
> org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
> [error] at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
> [error] at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
> [error] at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> [error] at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
> [error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
> [error] at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
> [error] at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
> [error] at org.apache.spark.scheduler.Task.run(Task.scala:139)
> [error] at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
> [error] at 
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1487)
> [error] at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
> [error] at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [error] at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [error] at java.lang.Thread.run(Thread.java:750)
> [error] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> [error] Driver stacktrace:
> [error] at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2815)
> [error] at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2751)
> [error] at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2750)
> [error] at 
> s

[jira] [Updated] (SPARK-43573) Make SparkBuilder could config the heap size of test JVM.

2023-05-18 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-43573:
---
Description: 


{code:java}
[info] 16:43:41.618 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.627 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.646 WARN 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete 
file:/home/ubuntu/tpcdsdata/test/catalog_sales/_tempo
rary/0/_temporary/attempt_202305181633205732292221634890857_0006_m_10_610
[info] 16:43:41.647 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.647 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.656 WARN 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete 
file:/home/ubuntu/tpcdsdata/test/catalog_sales/_tempo
rary/0/_temporary/attempt_202305181633205732292221634890857_0006_m_14_614
[info] 16:43:41.656 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[info] 16:43:41.668 WARN 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete 
file:/home/ubuntu/tpcdsdata/test/catalog_sales/_tempo
rary/0/_temporary/attempt_202305181633205732292221634890857_0006_m_02_602
[info] 16:43:41.668 ERROR 
org.apache.spark.sql.execution.datasources.FileFormatWriter: Job 
job_202305181633205732292221634890857_0006 aborted.
[error] Exception in thread "main" org.apache.spark.SparkException: Job aborted 
due to stage failure: Task 13 in stage 6.0 failed 1 times, most recent fail
ure: Lost task 13.0 in stage 6.0 (TID 613) 
(ip-172-31-27-53.cn-northwest-1.compute.internal executor driver): 
org.apache.spark.SparkException: [TASK_WRITE_
FAILED] Task failed while writing rows to 
file:/home/ubuntu/tpcdsdata/test/catalog_sales.
[error] at 
org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)
[error] at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420)
[error] at 
org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
[error] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
[error] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
[error] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[error] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
[error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
[error] at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
[error] at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
[error] at org.apache.spark.scheduler.Task.run(Task.scala:139)
[error] at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
[error] at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1487)
[error] at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
[error] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] at java.lang.Thread.run(Thread.java:750)
[error] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
[error] Driver stacktrace:
[error] at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2815)
[error] at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2751)
[error] at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2750)
[error] at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[error] at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[error] at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[error] at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2750)
[error] at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1218)
[error] at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1218)
[error] at scala.Option.foreach(Option.scala:407)
[error] at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(

[jira] [Updated] (SPARK-43549) Assign a name to the error class _LEGACY_ERROR_TEMP_0036

2023-05-18 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43549:

Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_0036  (was: 
Assign a name to the error class _LEGACY_ERROR_TEMP_0035)

> Assign a name to the error class _LEGACY_ERROR_TEMP_0036
> 
>
> Key: SPARK-43549
> URL: https://issues.apache.org/jira/browse/SPARK-43549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-43549) Assign a name to the error class _LEGACY_ERROR_TEMP_0035

2023-05-18 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan reopened SPARK-43549:
-

> Assign a name to the error class _LEGACY_ERROR_TEMP_0035
> 
>
> Key: SPARK-43549
> URL: https://issues.apache.org/jira/browse/SPARK-43549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43573) Make SparkBuilder could config the heap size of test JVM.

2023-05-18 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-43573:
--

 Summary: Make SparkBuilder could config the heap size of test JVM.
 Key: SPARK-43573
 URL: https://issues.apache.org/jira/browse/SPARK-43573
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: jiaan.geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40964) Cannot run spark history server with shaded hadoop jar

2023-05-18 Thread Shuaipeng Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723777#comment-17723777
 ] 

Shuaipeng Lee edited comment on SPARK-40964 at 5/18/23 8:06 AM:


Thanks for your commits. I rebuilt the hadoop-client-api and can start the history 
server successfully.

I changed the pom.xml of 
hadoop-3.3.1-src/hadoop-client-modules/hadoop-client-api and deleted the following 
relocation config:

{code:xml}
<relocation>
  <pattern>javax/servlet/</pattern>
  <shadedPattern>${shaded.dependency.prefix}.javax.servlet.</shadedPattern>
  <excludes>
    <exclude>**/pom.xml</exclude>
  </excludes>
</relocation>
{code}

Then I rebuilt hadoop-client-api:

{code}
mvn package -DskipTests
{code}


was (Author: bigboy001):
Thanks for your commits. I rebuilt the hadoop-client-api and can start the history 
server successfully.

I changed the pom.xml of 
hadoop-3.3.1-src/hadoop-client-modules/hadoop-client-api and deleted the following 
relocation config:

{code:xml}
<relocation>
  <pattern>javax/servlet/</pattern>
  <shadedPattern>${shaded.dependency.prefix}.javax.servlet.</shadedPattern>
  <excludes>
    <exclude>**/pom.xml</exclude>
  </excludes>
</relocation>
{code}

Then I rebuilt hadoop-client-api:

{code}
mvn package -DskipTests
{code}

> Cannot run spark history server with shaded hadoop jar
> --
>
> Key: SPARK-40964
> URL: https://issues.apache.org/jira/browse/SPARK-40964
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.2.2
>Reporter: YUBI LEE
>Priority: Major
>
> Since SPARK-33212, Spark uses shaded client jars from Hadoop 3.x+.
> If you try to start Spark History Server with shaded client jars and enable 
> security using 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter, you 
> will meet following exception.
> {code}
> # spark-env.sh
> export 
> SPARK_HISTORY_OPTS='-Dspark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter
>  
> -Dspark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/some.example@example.com,kerberos.keytab=/etc/security/keytabs/spnego.service.keytab"'
> {code}
> {code}
> # spark history server's out file
> 22/10/27 15:29:48 INFO AbstractConnector: Started 
> ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
> 22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' 
> on port 18081.
> 22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter
> 22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
> java.lang.IllegalStateException: class 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter is not 
> a javax.servlet.Filter
> at 
> org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
> at 
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
> at 
> org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
> at 
> java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
> at 
> java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
> at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
> at 
> org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
> at 
> org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
> at 
> org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
> at 
> org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
> at 
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
> at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
> at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
> at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
> at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
> at 
> org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
> at 
> org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
> at 
> org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
> {code}
> I think "AuthenticationFilter" in the shaded jar imports 
> "org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter".
> {code}
> ❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter *
> Binary file hadoop-client-runtime-3.3.1.jar matches
> {code}
> It causes the exception I mentioned.
> I'm not sure what is the best answer.
> Workaround is not to us

[jira] [Created] (SPARK-43572) Add a test for scrollable result set through thrift server

2023-05-18 Thread Kent Yao (Jira)
Kent Yao created SPARK-43572:


 Summary: Add a test for scrollable result set through thrift server
 Key: SPARK-43572
 URL: https://issues.apache.org/jira/browse/SPARK-43572
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.5.0
Reporter: Kent Yao


Improve JDBC server test coverage.
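
For context, a minimal sketch of what such a test exercises on the client side over plain JDBC; the URL and query are placeholders, the Hive JDBC driver is assumed to be on the classpath, and whether the driver and server actually honor scrollable result sets is exactly what the new test should verify.

{code:scala}
import java.sql.{DriverManager, ResultSet}

// Placeholder URL; a real test would point at the suite's embedded HiveThriftServer2.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
try {
  val stmt = conn.createStatement(
    ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY)
  val rs = stmt.executeQuery("SELECT id FROM range(10)")
  while (rs.next()) { /* forward pass over all rows */ }
  rs.beforeFirst()                              // scroll back to the start
  assert(rs.next() && rs.getLong(1) == 0L)      // re-read the first row
} finally {
  conn.close()
}
{code}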



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43549) Assign a name to the error class _LEGACY_ERROR_TEMP_0035

2023-05-18 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan resolved SPARK-43549.
-
Resolution: Duplicate

> Assign a name to the error class _LEGACY_ERROR_TEMP_0035
> 
>
> Key: SPARK-43549
> URL: https://issues.apache.org/jira/browse/SPARK-43549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


