[jira] [Resolved] (SPARK-40522) Upgrade Apache Kafka from 3.2.1 to 3.2.3

2022-09-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40522.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37958
[https://github.com/apache/spark/pull/37958]

> Upgrade Apache Kafka from 3.2.1 to 3.2.3
> 
>
> Key: SPARK-40522
> URL: https://issues.apache.org/jira/browse/SPARK-40522
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.4.0
>
>
> [Memory Allocation with Excessive Size Value SNYK-JAVA-ORGAPACHEKAFKA-3027430 
> |https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40522) Upgrade Apache Kafka from 3.2.1 to 3.2.3

2022-09-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-40522:
-

Assignee: Bjørn Jørgensen

> Upgrade Apache Kafka from 3.2.1 to 3.2.3
> 
>
> Key: SPARK-40522
> URL: https://issues.apache.org/jira/browse/SPARK-40522
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>
> [Memory Allocation with Excessive Size Value SNYK-JAVA-ORGAPACHEKAFKA-3027430 
> |https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-21 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-40327.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37948
[https://github.com/apache/spark/pull/37948]

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.
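
For readers unfamiliar with the component, below is a small illustrative example of what "pandas API on Spark" usage looks like; increasing coverage means making more of the regular pandas surface work on these objects. The data and method choice here are ours, not taken from the ticket.
{code:python}
# Minimal pandas-on-Spark usage. "API coverage" refers to how much of the
# regular pandas API works on ps.DataFrame / ps.Series objects like these.
import pyspark.pandas as ps

psdf = ps.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})
# groupby/sum/sort_index mirror their pandas counterparts
print(psdf.groupby("group")["value"].sum().sort_index())
{code}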



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-21 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40327:
-

Assignee: Ruifeng Zheng

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Ruifeng Zheng
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40434) Implement applyInPandasWithState in PySpark

2022-09-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40434:


Assignee: Jungtaek Lim

> Implement applyInPandasWithState in PySpark
> ---
>
> Key: SPARK-40434
> URL: https://issues.apache.org/jira/browse/SPARK-40434
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> Provide the full implementation of the flatMapGroupsWithState-equivalent API in 
> PySpark. We could optionally introduce test suites in a follow-up JIRA ticket 
> if the PR is too huge.
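
For context, here is a rough sketch of how the new API is expected to be used from PySpark, modeled on the existing Scala flatMapGroupsWithState API. The method and parameter names (applyInPandasWithState, outputStructType, stateStructType, outputMode, timeoutConf) follow the proposal and may differ in detail from the merged implementation; `events` is an assumed streaming DataFrame with an `id` column.
{code:python}
# Hedged sketch of the proposed PySpark API, mirroring Scala's
# flatMapGroupsWithState; names and signatures may differ from the final code.
import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

def count_events(key, pdf_iter, state: GroupState):
    # key: grouping key tuple; pdf_iter: iterator of pandas DataFrames for
    # this key in the current micro-batch; state: mutable per-group state.
    running = state.get[0] if state.exists else 0
    for pdf in pdf_iter:
        running += len(pdf)
    state.update((running,))
    yield pd.DataFrame({"id": [key[0]], "count": [running]})

result = (events.groupBy("id")
    .applyInPandasWithState(
        count_events,
        outputStructType="id long, count long",
        stateStructType="count long",
        outputMode="Update",
        timeoutConf=GroupStateTimeout.NoTimeout))
{code}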



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40434) Implement applyInPandasWithState in PySpark

2022-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17608067#comment-17608067
 ] 

Apache Spark commented on SPARK-40434:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/37964

> Implement applyInPandasWithState in PySpark
> ---
>
> Key: SPARK-40434
> URL: https://issues.apache.org/jira/browse/SPARK-40434
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.4.0
>
>
> Provide the full implementation of the flatMapGroupsWithState-equivalent API in 
> PySpark. We could optionally introduce test suites in a follow-up JIRA ticket 
> if the PR is too huge.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40434) Implement applyInPandasWithState in PySpark

2022-09-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40434.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37893
[https://github.com/apache/spark/pull/37893]

> Implement applyInPandasWithState in PySpark
> ---
>
> Key: SPARK-40434
> URL: https://issues.apache.org/jira/browse/SPARK-40434
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.4.0
>
>
> Provide the full implementation of the flatMapGroupsWithState-equivalent API in 
> PySpark. We could optionally introduce test suites in a follow-up JIRA ticket 
> if the PR is too huge.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40487) Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel

2022-09-21 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-40487.
-
Fix Version/s: 3.4.0
 Assignee: Xingchao, Zhang
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37930

> Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
> ---
>
> Key: SPARK-40487
> URL: https://issues.apache.org/jira/browse/SPARK-40487
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xingchao, Zhang
>Assignee: Xingchao, Zhang
>Priority: Major
> Fix For: 3.4.0
>
>
> The 'Part 1' and 'Part 2' could run in parallel
> {code:java}
>   /**
>    * The implementation for these joins:
>    *
>    *   LeftOuter with BuildLeft
>    *   RightOuter with BuildRight
>    *   FullOuter
>    */
>   private def defaultJoin(relation: Broadcast[Array[InternalRow]]): RDD[InternalRow] = {
>     val streamRdd = streamed.execute()
>     // Part 1
>     val matchedBroadcastRows = getMatchedBroadcastRowsBitSet(streamRdd, relation)
>     val notMatchedBroadcastRows: Seq[InternalRow] = {
>       val nulls = new GenericInternalRow(streamed.output.size)
>       val buf: CompactBuffer[InternalRow] = new CompactBuffer()
>       val joinedRow = new JoinedRow
>       joinedRow.withLeft(nulls)
>       var i = 0
>       val buildRows = relation.value
>       while (i < buildRows.length) {
>         if (!matchedBroadcastRows.get(i)) {
>           buf += joinedRow.withRight(buildRows(i)).copy()
>         }
>         i += 1
>       }
>       buf
>     }
>     // Part 2
>     val matchedStreamRows = streamRdd.mapPartitionsInternal { streamedIter =>
>       val buildRows = relation.value
>       val joinedRow = new JoinedRow
>       val nulls = new GenericInternalRow(broadcast.output.size)
>       streamedIter.flatMap { streamedRow =>
>         var i = 0
>         var foundMatch = false
>         val matchedRows = new CompactBuffer[InternalRow]
>         while (i < buildRows.length) {
>           if (boundCondition(joinedRow(streamedRow, buildRows(i)))) {
>             matchedRows += joinedRow.copy()
>             foundMatch = true
>           }
>           i += 1
>         }
>         if (!foundMatch && joinType == FullOuter) {
>           matchedRows += joinedRow(streamedRow, nulls).copy()
>         }
>         matchedRows.iterator
>       }
>     }
>     // Union
>     sparkContext.union(
>       matchedStreamRows,
>       sparkContext.makeRDD(notMatchedBroadcastRows)
>     )
>   }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40490) `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321

2022-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17608042#comment-17608042
 ] 

Apache Spark commented on SPARK-40490:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37963

> `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile`  reload 
> after  SPARK-17321
> 
>
> Key: SPARK-40490
> URL: https://issues.apache.org/jira/browse/SPARK-40490
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.0
>
>
> After SPARK-17321, YarnShuffleService persists data to the local shuffle 
> state db, and reloads data from it, only when the Yarn NodeManager starts 
> with `YarnConfiguration#NM_RECOVERY_ENABLED = true`. `YarnShuffleIntegrationSuite` 
> does not set this config, and its default value is false, so the suite 
> neither triggers data persistence to the db nor verifies that the data is 
> reloaded.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40490) `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321

2022-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17608043#comment-17608043
 ] 

Apache Spark commented on SPARK-40490:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37962

> `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile`  reload 
> after  SPARK-17321
> 
>
> Key: SPARK-40490
> URL: https://issues.apache.org/jira/browse/SPARK-40490
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.0
>
>
> After SPARK-17321, YarnShuffleService persists data to the local shuffle 
> state db, and reloads data from it, only when the Yarn NodeManager starts 
> with `YarnConfiguration#NM_RECOVERY_ENABLED = true`. `YarnShuffleIntegrationSuite` 
> does not set this config, and its default value is false, so the suite 
> neither triggers data persistence to the db nor verifies that the data is 
> reloaded.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40526) Upgrade Scala to 2.13.9

2022-09-21 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40526:

Description: 
release notes:

[https://github.com/scala/scala/releases/tag/v2.13.9]

!image-2022-09-22-10-53-10-579.png!

  was:
release notes:

[https://github.com/scala/scala/releases/tag/v2.13.9]

!image-2022-09-22-10-51-33-638.png!


> Upgrade Scala to 2.13.9
> ---
>
> Key: SPARK-40526
> URL: https://issues.apache.org/jira/browse/SPARK-40526
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Attachments: image-2022-09-22-10-53-10-579.png
>
>
> release notes:
> [https://github.com/scala/scala/releases/tag/v2.13.9]
> !image-2022-09-22-10-53-10-579.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40526) Upgrade Scala to 2.13.9

2022-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40526:


Assignee: (was: Apache Spark)

> Upgrade Scala to 2.13.9
> ---
>
> Key: SPARK-40526
> URL: https://issues.apache.org/jira/browse/SPARK-40526
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Attachments: image-2022-09-22-10-53-10-579.png
>
>
> release notes:
> [https://github.com/scala/scala/releases/tag/v2.13.9]
> !image-2022-09-22-10-53-10-579.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40526) Upgrade Scala to 2.13.9

2022-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17608041#comment-17608041
 ] 

Apache Spark commented on SPARK-40526:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37961

> Upgrade Scala to 2.13.9
> ---
>
> Key: SPARK-40526
> URL: https://issues.apache.org/jira/browse/SPARK-40526
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Attachments: image-2022-09-22-10-53-10-579.png
>
>
> release notes:
> [https://github.com/scala/scala/releases/tag/v2.13.9]
> !image-2022-09-22-10-53-10-579.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40526) Upgrade Scala to 2.13.9

2022-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40526:


Assignee: Apache Spark

> Upgrade Scala to 2.13.9
> ---
>
> Key: SPARK-40526
> URL: https://issues.apache.org/jira/browse/SPARK-40526
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
> Attachments: image-2022-09-22-10-53-10-579.png
>
>
> release notes:
> [https://github.com/scala/scala/releases/tag/v2.13.9]
> !image-2022-09-22-10-53-10-579.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40526) Upgrade Scala to 2.13.9

2022-09-21 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40526:

Attachment: image-2022-09-22-10-53-10-579.png

> Upgrade Scala to 2.13.9
> ---
>
> Key: SPARK-40526
> URL: https://issues.apache.org/jira/browse/SPARK-40526
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Attachments: image-2022-09-22-10-53-10-579.png
>
>
> release notes:
> [https://github.com/scala/scala/releases/tag/v2.13.9]
> !image-2022-09-22-10-53-10-579.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40526) Upgrade Scala to 2.13.9

2022-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17608040#comment-17608040
 ] 

Apache Spark commented on SPARK-40526:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37961

> Upgrade Scala to 2.13.9
> ---
>
> Key: SPARK-40526
> URL: https://issues.apache.org/jira/browse/SPARK-40526
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Attachments: image-2022-09-22-10-53-10-579.png
>
>
> release notes:
> [https://github.com/scala/scala/releases/tag/v2.13.9]
> !image-2022-09-22-10-53-10-579.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40526) Upgrade Scala to 2.13.9

2022-09-21 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-40526:
---

 Summary: Upgrade Scala to 2.13.9
 Key: SPARK-40526
 URL: https://issues.apache.org/jira/browse/SPARK-40526
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: BingKun Pan


release notes:

[https://github.com/scala/scala/releases/tag/v2.13.9]

!image-2022-09-22-10-51-33-638.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40385) Classes with companion object constructor fails interpreted path

2022-09-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40385:
-
Fix Version/s: 3.3.2
   (was: 3.3.1)

> Classes with companion object constructor fails interpreted path
> 
>
> Key: SPARK-40385
> URL: https://issues.apache.org/jira/browse/SPARK-40385
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Emil Ejbyfeldt
>Assignee: Emil Ejbyfeldt
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
>
> The Encoder implemented in SPARK-8288 for classes with only a companion 
> object constructor fails when using the interpreted path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40385) Classes with companion object constructor fails interpreted path

2022-09-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40385.
--
Fix Version/s: 3.3.1
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37837
[https://github.com/apache/spark/pull/37837]

> Classes with companion object constructor fails interpreted path
> 
>
> Key: SPARK-40385
> URL: https://issues.apache.org/jira/browse/SPARK-40385
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Emil Ejbyfeldt
>Assignee: Emil Ejbyfeldt
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
>
> The Encoder implemented in SPARK-8288 for classes with only a companion 
> object constructor fails when using the interpreted path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40385) Classes with companion object constructor fails interpreted path

2022-09-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40385:


Assignee: Emil Ejbyfeldt

> Classes with companion object constructor fails interpreted path
> 
>
> Key: SPARK-40385
> URL: https://issues.apache.org/jira/browse/SPARK-40385
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.3.0, 3.2.2
>Reporter: Emil Ejbyfeldt
>Assignee: Emil Ejbyfeldt
>Priority: Major
>
> The Encoder implemented in SPARK-8288 for classes with only a companion 
> object constructor fails when using the interpreted path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL

2022-09-21 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40525:
-
Description: 
h3. Describe the bug

Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
expectedly errors out. However, it is evaluated to a rounded value {{1}} if the 
value is inserted into the table via {{{}spark-sql{}}}.
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
{code:java}
$SPARK_HOME/bin/spark-sql {code}
Execute the following:
{code:java}
spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
Time taken: 0.216 seconds
spark-sql> insert into int_floating_point_vals select 1.1;
Time taken: 1.747 seconds
spark-sql> select * from int_floating_point_vals;
1
Time taken: 0.518 seconds, Fetched 1 row(s){code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination ({{{}INT{}}} and 
{{{}1.1{}}}).
h4. Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned value correctly raises an exception:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(1.1)))
val schema = new StructType().add(StructField("c1", IntegerType, true))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals") 
{code}
The following exception is raised:
{code:java}
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of int{code}

  was:
h3. Describe the bug

Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
expectedly errors out. However, it is evaluated to a rounded value {{1}} if the 
value is inserted into the table via {{{}spark-sql{}}}.
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:}}{}}}
{code:java}
$SPARK_HOME/bin/spark-sql {code}
Execute the following:
{code:java}
spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
Time taken: 0.216 seconds
spark-sql> insert into int_floating_point_vals select 1.1;
Time taken: 1.747 seconds
spark-sql> select * from int_floating_point_vals;
1
Time taken: 0.518 seconds, Fetched 1 row(s){code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination ({{{}INT{}}} and 
{{{}1.1{}}}).
h4. Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned value correctly raises an exception:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(1.1)))
val schema = new StructType().add(StructField("c1", IntegerType, true))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals") 
{code}
The following exception is raised:
{code:java}
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of int{code}


> Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame 
> but evaluates to a rounded value in SparkSQL
> --
>
> Key: SPARK-40525
> URL: https://issues.apache.org/jira/browse/SPARK-40525
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
> expectedly errors out. However, it is evaluated to a rounded value {{1}} if 
> the value is inserted into the table via {{{}spark-sql{}}}.
> h3. Steps to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql {code}
> Execute the following:
> {code:java}
> spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
> 22/09/19 

[jira] [Updated] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL

2022-09-21 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40525:
-
Description: 
h3. Describe the bug

Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
expectedly errors out. However, it is evaluated to a rounded value {{1}} if the 
value is inserted into the table via {{{}spark-sql{}}}.
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:}}{}}}
{code:java}
$SPARK_HOME/bin/spark-sql {code}
Execute the following:
{code:java}
spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
Time taken: 0.216 seconds
spark-sql> insert into int_floating_point_vals select 1.1;
Time taken: 1.747 seconds
spark-sql> select * from int_floating_point_vals;
1
Time taken: 0.518 seconds, Fetched 1 row(s){code}
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination ({{{}INT{}}} and 
{{{}1.1{}}}).
h4. Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned value correctly raises an exception:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(1.1)))
val schema = new StructType().add(StructField("c1", IntegerType, true))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals") 
{code}
The following exception is raised:
{code:java}
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of int{code}

  was:
h3. Describe the bug

Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
expectedly errors out. However, it is evaluated to a rounded value {{1}} if the 
value is inserted into the table via {{{}spark-sql{}}}.
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:{{{}{}}}
{code:java}
$SPARK_HOME/bin/spark-sql {code}
Execute the following:

 
{code:java}
spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
Time taken: 0.216 seconds
spark-sql> insert into int_floating_point_vals select 1.1;
Time taken: 1.747 seconds
spark-sql> select * from int_floating_point_vals;
1
Time taken: 0.518 seconds, Fetched 1 row(s){code}
 
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination ({{{}INT{}}} and 
{{{}1.1{}}}).
h4. Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned value correctly raises an exception:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(1.1)))
val schema = new StructType().add(StructField("c1", IntegerType, true))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals") 
{code}
The following exception is raised:
{code:java}
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of int{code}


> Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame 
> but evaluates to a rounded value in SparkSQL
> --
>
> Key: SPARK-40525
> URL: https://issues.apache.org/jira/browse/SPARK-40525
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
> expectedly errors out. However, it is evaluated to a rounded value {{1}} if 
> the value is inserted into the table via {{{}spark-sql{}}}.
> h3. Steps to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:}}{}}}
> {code:java}
> $SPARK_HOME/bin/spark-sql {code}
> Execute the following:
> {code:java}
> spark-sql> create table int_floating_point_vals(c1 INT) 

[jira] [Created] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL

2022-09-21 Thread xsys (Jira)
xsys created SPARK-40525:


 Summary: Floating-point value with an INT/BYTE/SHORT/LONG type 
errors out in DataFrame but evaluates to a rounded value in SparkSQL
 Key: SPARK-40525
 URL: https://issues.apache.org/jira/browse/SPARK-40525
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: xsys


h3. Describe the bug

Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
expectedly errors out. However, it is evaluated to a rounded value {{1}} if the 
value is inserted into the table via {{{}spark-sql{}}}.
h3. Steps to reproduce:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:{{{}{}}}
{code:java}
$SPARK_HOME/bin/spark-sql {code}
Execute the following:

 
{code:java}
spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
Time taken: 0.216 seconds
spark-sql> insert into int_floating_point_vals select 1.1;
Time taken: 1.747 seconds
spark-sql> select * from int_floating_point_vals;
1
Time taken: 0.518 seconds, Fetched 1 row(s){code}
 
h3. Expected behavior

We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) to 
behave consistently for the same data type & input combination ({{{}INT{}}} and 
{{{}1.1{}}}).
h4. Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
aforementioned value correctly raises an exception:

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
Execute the following:
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(Seq(Row(1.1)))
val schema = new StructType().add(StructField("c1", IntegerType, true))
val df = spark.createDataFrame(rdd, schema)
df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals") 
{code}
The following exception is raised:
{code:java}
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of int{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38819) Run Pandas on Spark with Pandas 1.4.x

2022-09-21 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17543450#comment-17543450
 ] 

Yikun Jiang edited comment on SPARK-38819 at 9/22/22 1:37 AM:
--

All UT / doctest failures have been covered here and PRs submitted. It is really 
hard to keep fully compatible with pandas, and there is no good way to do it 
other than fixing them one by one.

Note that this only means the current pandas-on-Spark test failures have been 
fixed. There are quite a lot of new pandas features and bugfixes that we haven't 
synced yet.

What should we do with the rest? I think the options are:
 * Priority 0: Fix all existing ut/doctest failures (what we do in this umbrella)
 * Priority 1: Follow the main features/breaking changes:
 ** [https://pandas.pydata.org/pandas-docs/version/1.4.2/whatsnew/v1.4.2.html#enhancements|https://pandas.pydata.org/pandas-docs/version/1.4.2/whatsnew/v1.3.0.html#enhancements]
 ** [https://pandas.pydata.org/pandas-docs/version/1.4.2/whatsnew/v1.4.2.html#notable-bug-fixes|https://pandas.pydata.org/pandas-docs/version/1.4.2/whatsnew/v1.3.0.html#notable-bug-fixes]
 * Priority 2: Trigger on demand, only when somebody raises the new feature/bug in 
Jira
 * Priority 3: Follow all main features and bugfixes (impossible to some 
degree)

cc [~hyukjin.kwon]  [~XinrongM] [~itholic] [~podongfeng]  Any ideas?


was (Author: yikunkero):
All UT / doctest had been shown in here and submited the PR. It's really hard 
way to keep pandas compatible completely, and there's no good way to do it 
besides done it one by one.

Note that this is only mean current PS test failures had been fixed. There are 
quite a lot pandas new features or bugfixes, that we haven't synced.

What should we do with left? I think it might be:
 * Priority 0: Fix all existing ut/doctest (what we do in this umbrella)
 * Priority 1: Follow the main features/breaking changes:
 ** 
[https://pandas.pydata.org/pandas-docs/version/1.4.2/whatsnew/v1.3.0.html#enhancements]
 ** 
[https://pandas.pydata.org/pandas-docs/version/1.4.2/whatsnew/v1.3.0.html#notable-bug-fixes]
 * Priority 2: Demand trigger only when somebody rasie the new feature/bug in 
jira
 * Priority 3: Follow the all main features and bugfix (impossible in some 
level)

cc [~hyukjin.kwon]  [~XinrongM] [~itholic] [~podongfeng]  Any idea?

> Run Pandas on Spark with Pandas 1.4.x
> -
>
> Key: SPARK-38819
> URL: https://issues.apache.org/jira/browse/SPARK-38819
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> This is an umbrella to track issues when pandas is upgraded to 1.4.x.
>  
> I disabled fast-fail in the tests; 19 tests failed:
> [https://github.com/Yikun/spark/pull/88/checks?check_run_id=5873627048]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39200) Stream is corrupted Exception while fetching the blocks from fallback storage system

2022-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17608021#comment-17608021
 ] 

Apache Spark commented on SPARK-39200:
--

User 'ukby1234' has created a pull request for this issue:
https://github.com/apache/spark/pull/37960

> Stream is corrupted Exception while fetching the blocks from fallback storage 
> system
> 
>
> Key: SPARK-39200
> URL: https://issues.apache.org/jira/browse/SPARK-39200
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Rajendra Gujja
>Priority: Major
>
> When executor decommissioning and fallback storage are enabled, shuffle 
> reads fail with `FetchFailedException: Stream is corrupted`.
> ref: https://issues.apache.org/jira/browse/SPARK-18105 (search for 
> decommission)
>  
> This happens when the shuffle block is bigger than what `inputstream.read` 
> can read in one attempt. The code path does not read the block fully 
> (`readFully`), and the partial read causes the exception.
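
For illustration only, here is a minimal Python sketch of the short-read problem described above: a single read call may return fewer bytes than requested, while a read-fully loop keeps reading until the whole block is in hand. The actual fix belongs in Spark's JVM-side fallback-storage reader; the helper name and stream type below are our own.
{code:python}
# Hypothetical illustration of the partial-read issue: read(n) may legally
# return fewer than n bytes, so callers must loop until the block is complete.
import io

def read_fully(stream: io.BufferedIOBase, n: int) -> bytes:
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # EOF before n bytes: the block really is truncated
            raise EOFError(f"expected {n} bytes, missing {remaining}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# buggy:  data = stream.read(block_length)         # may be a partial block
# fixed:  data = read_fully(stream, block_length)  # always the whole block
{code}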



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39200) Stream is corrupted Exception while fetching the blocks from fallback storage system

2022-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39200:


Assignee: Apache Spark

> Stream is corrupted Exception while fetching the blocks from fallback storage 
> system
> 
>
> Key: SPARK-39200
> URL: https://issues.apache.org/jira/browse/SPARK-39200
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Rajendra Gujja
>Assignee: Apache Spark
>Priority: Major
>
> When executor decommissioning and fallback storage are enabled, shuffle 
> reads fail with `FetchFailedException: Stream is corrupted`.
> ref: https://issues.apache.org/jira/browse/SPARK-18105 (search for 
> decommission)
>  
> This happens when the shuffle block is bigger than what `inputstream.read` 
> can read in one attempt. The code path does not read the block fully 
> (`readFully`), and the partial read causes the exception.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39200) Stream is corrupted Exception while fetching the blocks from fallback storage system

2022-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39200:


Assignee: (was: Apache Spark)

> Stream is corrupted Exception while fetching the blocks from fallback storage 
> system
> 
>
> Key: SPARK-39200
> URL: https://issues.apache.org/jira/browse/SPARK-39200
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Rajendra Gujja
>Priority: Major
>
> When executor decommissioning and fallback storage are enabled, shuffle 
> reads fail with `FetchFailedException: Stream is corrupted`.
> ref: https://issues.apache.org/jira/browse/SPARK-18105 (search for 
> decommission)
>  
> This happens when the shuffle block is bigger than what `inputstream.read` 
> can read in one attempt. The code path does not read the block fully 
> (`readFully`), and the partial read causes the exception.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40142) Make pyspark.sql.functions examples self-contained

2022-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17608012#comment-17608012
 ] 

Apache Spark commented on SPARK-40142:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37959

> Make pyspark.sql.functions examples self-contained
> --
>
> Key: SPARK-40142
> URL: https://issues.apache.org/jira/browse/SPARK-40142
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
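
The description is empty, but the title's goal is that each docstring example in pyspark.sql.functions builds its own input instead of relying on an implicit DataFrame. A small illustrative example of that "self-contained" style (our own, not taken from the patch):
{code:python}
# A self-contained example: it creates its own SparkSession and DataFrame,
# so the snippet can be copy-pasted and run (or doctested) on its own.
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(4.0,), (9.0,)], ["value"])
df.select(sf.sqrt("value").alias("root")).show()
{code}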




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40142) Make pyspark.sql.functions examples self-contained

2022-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17608011#comment-17608011
 ] 

Apache Spark commented on SPARK-40142:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37959

> Make pyspark.sql.functions examples self-contained
> --
>
> Key: SPARK-40142
> URL: https://issues.apache.org/jira/browse/SPARK-40142
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39200) Stream is corrupted Exception while fetching the blocks from fallback storage system

2022-09-21 Thread Frank Yin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17608008#comment-17608008
 ] 

Frank Yin commented on SPARK-39200:
---

We've seen this exception as well. Is there a patch coming? 

> Stream is corrupted Exception while fetching the blocks from fallback storage 
> system
> 
>
> Key: SPARK-39200
> URL: https://issues.apache.org/jira/browse/SPARK-39200
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Rajendra Gujja
>Priority: Major
>
> When executor decommissioning and fallback storage are enabled, shuffle 
> reads fail with `FetchFailedException: Stream is corrupted`.
> ref: https://issues.apache.org/jira/browse/SPARK-18105 (search for 
> decommission)
>  
> This happens when the shuffle block is bigger than what `inputstream.read` 
> can read in one attempt. The code path does not read the block fully 
> (`readFully`), and the partial read causes the exception.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40303) The performance will be worse after codegen

2022-09-21 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-40303.
-
Resolution: Won't Fix

Issue fixed by [JDK-8159720|https://bugs.openjdk.org/browse/JDK-8159720].

> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: TestApiBenchmark.scala, TestApis.java, 
> TestParameters.java
>
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach { cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
>     withSQLConf(
>       "spark.sql.codegen.wholeStage" -> "true",
>       "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>       spark.read.parquet(dir).selectExpr(selectExps: _*).write.format("noop").mode("Overwrite").save()
>     }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
>     withSQLConf(
>       "spark.sql.codegen.wholeStage" -> "false",
>       "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>       spark.read.parquet(dir).selectExpr(selectExps: _*).write.format("noop").mode("Overwrite").save()
>     }
>   }
>   benchmark.run()
> }
> {code}
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Benchmark count distinct:          Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
> -----------------------------------------------------------------------------------------------------------------
> 60 count distinct with codegen            628146          628146           0         0.0      314072.8       1.0X
> 60 count distinct without codegen         147635          147635           0         0.0       73817.5       4.3X
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40524) local mode with resource scheduling can hang

2022-09-21 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-40524:
--
Summary: local mode with resource scheduling can hang  (was: local mode 
with resource scheduling should just fail)

> local mode with resource scheduling can hang
> 
>
> Key: SPARK-40524
> URL: https://issues.apache.org/jira/browse/SPARK-40524
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> If you try to run Spark in local mode and request custom resources like 
> GPUs, Spark will hang. Resource scheduling isn't supported in local mode, so 
> just removing the request for resources fixes the issue, but it's really 
> confusing to users since it just hangs.
>  
> ie to reproduce:
> spark-sql --conf spark.executor.resource.gpu.amount=1 --conf 
> spark.task.resource.gpu.amount=1
> Run:
> select 1
> result: hangs
> To fix run:
> spark-sql 
>  
> spark-sql> select 1;
> 1
> Time taken: 2.853 seconds, Fetched 1 row(s)
>  
> It would be nice if we just failed to start or threw an exception when using 
> those options in local mode.
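
To make the requested behavior concrete, here is a hypothetical user-side guard that fails fast instead of hanging; the eventual fix would live in Spark's own scheduler/validation code, and the configuration check below is only a sketch.
{code:python}
# Hypothetical fail-fast guard: refuse to start in local mode when custom
# resources are requested, rather than hanging. Not the actual Spark fix.
from pyspark.sql import SparkSession

def build_local_session(conf_pairs):
    resource_keys = [k for k in conf_pairs
                     if k.startswith("spark.executor.resource.")
                     or k.startswith("spark.task.resource.")]
    if resource_keys:
        raise ValueError(
            f"custom resource requests {resource_keys} are not supported "
            "with a local master; remove them or run against a cluster")
    builder = SparkSession.builder.master("local[*]")
    for k, v in conf_pairs.items():
        builder = builder.config(k, v)
    return builder.getOrCreate()

# build_local_session({"spark.task.resource.gpu.amount": "1"})  # raises instead of hanging
{code}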



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40524) local mode with resource scheduling should just fail

2022-09-21 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-40524:
-

 Summary: local mode with resource scheduling should just fail
 Key: SPARK-40524
 URL: https://issues.apache.org/jira/browse/SPARK-40524
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Thomas Graves


If you try to run Spark in local mode and request custom resources like GPUs, 
Spark will hang. Resource scheduling isn't supported in local mode, so just 
removing the request for resources fixes the issue, but it's really confusing to 
users since it just hangs.

 

ie to reproduce:

spark-sql --conf spark.executor.resource.gpu.amount=1 --conf 
spark.task.resource.gpu.amount=1

Run:

select 1

result: hangs

To fix run:

spark-sql 

 

spark-sql> select 1;
1
Time taken: 2.853 seconds, Fetched 1 row(s)

 

It would be nice if we just failed to start or threw an exception when using 
those options in local mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40523) pyspark dataframe methods (i.e. show()) won't run in VSCode debug console

2022-09-21 Thread Eli (Jira)
Eli created SPARK-40523:
---

 Summary: pyspark dataframe methods (i.e. show()) won't run in 
VSCode debug console
 Key: SPARK-40523
 URL: https://issues.apache.org/jira/browse/SPARK-40523
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.2
Reporter: Eli


When debugging PySpark code in VSCode with the Python debugger, while execution 
is paused on a breakpoint you can issue statements/expressions in the VSCode 
debug console to inspect a DataFrame's contents, etc.

However, statements involving DataFrame operations always get stuck, and the 
debugger then throws a timeout error in the debug console.

 

This issue was initially reported on Stack Overflow: 
[https://stackoverflow.com/questions/65739467/pyspark-dataframe-methods-i-e-show-can-not-be-printed-in-vs-code-debug-cons]

There are some workaround suggestions in that thread as well.

OS: win 10
VSCode: 1.64.0

Python extension in VScode: v2022.4.1



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40457) upgrade jackson data mapper to latest

2022-09-21 Thread Bjørn Jørgensen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607949#comment-17607949
 ] 

Bjørn Jørgensen edited comment on SPARK-40457 at 9/21/22 7:49 PM:
--

[~bilna123]
Yes, there is no version to upgrade to, see 
https://github.com/bjornjorgensen/spark/security/dependabot/1, and it is for 
Hadoop version 2.

But did you find a new version, and can you test it with Hadoop version 2?

Edit:
Have a look at 

https://issues.apache.org/jira/browse/HADOOP-17225?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=17305360#comment-17305360
 



was (Author: bjornjorgensen):
[~bilna123]
Yes, there are no version to upgrade to 
https://github.com/bjornjorgensen/spark/security/dependabot/1 and it's for 
hadoop version 2. 

But do you find a new version and can you test it with hadoop version 2? 

> upgrade jackson data mapper to latest 
> --
>
> Key: SPARK-40457
> URL: https://issues.apache.org/jira/browse/SPARK-40457
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bilna
>Priority: Major
>
> Upgrade  jackson-mapper-asl to the latest to resolve CVE-2019-10172



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40457) upgrade jackson data mapper to latest

2022-09-21 Thread Bjørn Jørgensen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607949#comment-17607949
 ] 

Bjørn Jørgensen commented on SPARK-40457:
-

[~bilna123]
Yes, there is no version to upgrade to, see 
https://github.com/bjornjorgensen/spark/security/dependabot/1, and it is for 
Hadoop version 2.

But did you find a new version, and can you test it with Hadoop version 2?

> upgrade jackson data mapper to latest 
> --
>
> Key: SPARK-40457
> URL: https://issues.apache.org/jira/browse/SPARK-40457
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bilna
>Priority: Major
>
> Upgrade  jackson-mapper-asl to the latest to resolve CVE-2019-10172



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40522) Upgrade Apache Kafka from 3.2.1 to 3.2.3

2022-09-21 Thread Bjørn Jørgensen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørn Jørgensen updated SPARK-40522:

Summary: Upgrade Apache Kafka from 3.2.1 to 3.2.3  (was: Upgrade kafka from 
3.2.1 to 3.2.3)

> Upgrade Apache Kafka from 3.2.1 to 3.2.3
> 
>
> Key: SPARK-40522
> URL: https://issues.apache.org/jira/browse/SPARK-40522
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [Memory Allocation with Excessive Size Value SNYK-JAVA-ORGAPACHEKAFKA-3027430 
> |https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40522) Upgrade kafka from 3.2.1 to 3.2.3

2022-09-21 Thread Bjørn Jørgensen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørn Jørgensen updated SPARK-40522:

Summary: Upgrade kafka from 3.2.1 to 3.2.3  (was: Upgrade kafka from 3.2.1 
to 3.2.2)

> Upgrade kafka from 3.2.1 to 3.2.3
> -
>
> Key: SPARK-40522
> URL: https://issues.apache.org/jira/browse/SPARK-40522
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [Memory Allocation with Excessive Size Value SNYK-JAVA-ORGAPACHEKAFKA-3027430 
> |https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607918#comment-17607918
 ] 

Dongjoon Hyun edited comment on SPARK-40508 at 9/21/22 5:47 PM:


Previously, you were in the `Contributor` and `Administrator` groups. I have 
additionally added you to the `Committer` group, just to make sure.


was (Author: dongjoon):
Previously, you are in `Contributor` and `Administrator`. I added `Committer` 
group to you additionally to make it sure.

> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.4.0
>
>
> When running a Spark application against Spark 3.3, I see the following:
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607918#comment-17607918
 ] 

Dongjoon Hyun commented on SPARK-40508:
---

Previously, you are in `Contributor` and `Administrator`. I added `Committer` 
group to you additionally to make it sure.

> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.4.0
>
>
> When running spark application against spark 3.3, I see the following :
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-21 Thread Sun Chao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607919#comment-17607919
 ] 

Sun Chao commented on SPARK-40508:
--

Great to know. Thanks!

> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.4.0
>
>
> When running spark application against spark 3.3, I see the following :
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607917#comment-17607917
 ] 

Dongjoon Hyun commented on SPARK-40508:
---

Ya, the merge script sometimes hit the corner cases. BTW, [~sunchao] , you are 
already in the Apache Spark Admin group. You can add a user.

- [https://issues.apache.org/jira/plugins/servlet/project-config/SPARK/roles]

> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.4.0
>
>
> When running spark application against spark 3.3, I see the following :
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40522) Upgrade kafka from 3.2.1 to 3.2.2

2022-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40522:


Assignee: Apache Spark

> Upgrade kafka from 3.2.1 to 3.2.2
> -
>
> Key: SPARK-40522
> URL: https://issues.apache.org/jira/browse/SPARK-40522
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Apache Spark
>Priority: Major
>
> [Memory Allocation with Excessive Size Value SNYK-JAVA-ORGAPACHEKAFKA-3027430 
> |https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40522) Upgrade kafka from 3.2.1 to 3.2.2

2022-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607910#comment-17607910
 ] 

Apache Spark commented on SPARK-40522:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/37958

> Upgrade kafka from 3.2.1 to 3.2.2
> -
>
> Key: SPARK-40522
> URL: https://issues.apache.org/jira/browse/SPARK-40522
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [Memory Allocation with Excessive Size Value SNYK-JAVA-ORGAPACHEKAFKA-3027430 
> |https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40522) Upgrade kafka from 3.2.1 to 3.2.2

2022-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40522:


Assignee: (was: Apache Spark)

> Upgrade kafka from 3.2.1 to 3.2.2
> -
>
> Key: SPARK-40522
> URL: https://issues.apache.org/jira/browse/SPARK-40522
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [Memory Allocation with Excessive Size Value SNYK-JAVA-ORGAPACHEKAFKA-3027430 
> |https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40474) Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

2022-09-21 Thread Xiaonan Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaonan Yang updated SPARK-40474:
-
Description: 
In this ticket https://issues.apache.org/jira/browse/SPARK-39469, we introduced 
the support of date type in CSV schema inference. The schema inference behavior 
on date time columns now is:
 * For a column only containing dates, we will infer it as Date type
 * For a column only containing timestamps, we will infer it as Timestamp type
 * For a column containing a mixture of dates and timestamps, we will infer it 
as Timestamp type

However, we found that we are too ambitious on the last scenario, to support 
which we have introduced much complexity in code and caused a lot of 
performance concerns. Thus, we want to simplify and correct the behavior of the 
last scenario as:
 * For a column containing a mixture of dates and timestamps
 ** If user specifies timestamp format, it will always be inferred as 
`StringType`
 ** If no timestamp format specified by user, we will try inferring it as 
`TimestampType` if possible, otherwise it will be inferred as `StringType`

  was:
In this ticket https://issues.apache.org/jira/browse/SPARK-39469, we introduced 
the support of date type in CSV schema inference. The schema inference behavior 
on date time columns now is:
 * For a column only containing dates, we will infer it as Date type
 * For a column only containing timestamps, we will infer it as Timestamp type
 * For a column containing a mixture of dates and timestamps, we will infer it 
as Timestamp type

However, we found that we are too ambitious on the last scenario, to support 
which we have introduced much complexity in code and caused a lot of 
performance concerns. Thus, we want to simplify and correct the behavior of the 
last scenario as:
 * For a column containing a mixture of dates and timestamps
 ** If user specifies timestamp format, it will always be inferred as 
`{{{}StringType`{}}}.
 ** If no timestamp format specified by user, we will try inferring it as 
`{{{}TimestampType`{}}} if possible, otherwise it will be inferred as 
`{{{}StringType`{}}}


> Correct CSV schema inference and data parsing behavior on columns with mixed 
> dates and timestamps
> -
>
> Key: SPARK-40474
> URL: https://issues.apache.org/jira/browse/SPARK-40474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xiaonan Yang
>Priority: Major
>
> In this ticket https://issues.apache.org/jira/browse/SPARK-39469, we 
> introduced the support of date type in CSV schema inference. The schema 
> inference behavior on date time columns now is:
>  * For a column only containing dates, we will infer it as Date type
>  * For a column only containing timestamps, we will infer it as Timestamp type
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as Timestamp type
> However, we found that we are too ambitious on the last scenario, to support 
> which we have introduced much complexity in code and caused a lot of 
> performance concerns. Thus, we want to simplify and correct the behavior of 
> the last scenario as:
>  * For a column containing a mixture of dates and timestamps
>  ** If user specifies timestamp format, it will always be inferred as 
> `StringType`
>  ** If no timestamp format specified by user, we will try inferring it as 
> `TimestampType` if possible, otherwise it will be inferred as `StringType`
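
For readers of this thread, a hedged Scala sketch of how the rules above could be
observed in practice. It assumes the Spark 3.4 semantics proposed in this ticket
(so results may differ on other versions), and the file path and sample data are
illustrative only:

{code:scala}
// Hedged example of the inference rules described above; assumes the Spark 3.4
// behavior proposed in this ticket, so results may differ on other versions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("csv-mixed-dates").getOrCreate()
import spark.implicits._

val path = "/tmp/mixed_dates.csv" // illustrative path
// One column ("ts") mixing a plain date and a full timestamp.
Seq("ts", "2022-09-21", "2022-09-21 10:08:44")
  .toDF("value").coalesce(1)
  .write.mode("overwrite").text(path)

// Without a user-supplied timestampFormat, Spark tries TimestampType for the
// mixed column, falling back to StringType if parsing is not possible.
spark.read.option("header", "true").option("inferSchema", "true")
  .csv(path).printSchema()

// With an explicit timestampFormat, the mixed column is inferred as StringType.
spark.read.option("header", "true").option("inferSchema", "true")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .csv(path).printSchema()
{code}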



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40474) Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

2022-09-21 Thread Xiaonan Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaonan Yang updated SPARK-40474:
-
Description: 
In this ticket https://issues.apache.org/jira/browse/SPARK-39469, we introduced 
the support of date type in CSV schema inference. The schema inference behavior 
on date time columns now is:
 * For a column only containing dates, we will infer it as Date type
 * For a column only containing timestamps, we will infer it as Timestamp type
 * For a column containing a mixture of dates and timestamps, we will infer it 
as Timestamp type

However, we found that we are too ambitious on the last scenario, to support 
which we have introduced much complexity in code and caused a lot of 
performance concerns. Thus, we want to simplify and correct the behavior of the 
last scenario as:
 * For a column containing a mixture of dates and timestamps
 ** If user specifies timestamp format, it will always be inferred as 
`{{{}StringType`{}}}.
 ** If no timestamp format specified by user, we will try inferring it as 
`{{{}TimestampType`{}}} if possible, otherwise it will be inferred as 
`{{{}StringType`{}}}

  was:
In this ticket https://issues.apache.org/jira/browse/SPARK-39469, we introduced 
the support of date type in CSV schema inference. The schema inference behavior 
on date time columns now is:
 * For a column only containing dates, we will infer it as Date type
 * For a column only containing timestamps, we will infer it as Timestamp type
 * For a column containing a mixture of dates and timestamps, we will infer it 
as Timestamp type

However, we found that we are too ambitious on the last scenario, to support 
which we have introduced much complexity in code and caused a lot of 
performance concerns. Thus, we want to simplify the behavior of the last 
scenario as:
 * For a column containing a mixture of dates and timestamps, we will infer it 
as String type


> Correct CSV schema inference and data parsing behavior on columns with mixed 
> dates and timestamps
> -
>
> Key: SPARK-40474
> URL: https://issues.apache.org/jira/browse/SPARK-40474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xiaonan Yang
>Priority: Major
>
> In this ticket https://issues.apache.org/jira/browse/SPARK-39469, we 
> introduced the support of date type in CSV schema inference. The schema 
> inference behavior on date time columns now is:
>  * For a column only containing dates, we will infer it as Date type
>  * For a column only containing timestamps, we will infer it as Timestamp type
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as Timestamp type
> However, we found that we are too ambitious on the last scenario, to support 
> which we have introduced much complexity in code and caused a lot of 
> performance concerns. Thus, we want to simplify and correct the behavior of 
> the last scenario as:
>  * For a column containing a mixture of dates and timestamps
>  ** If user specifies timestamp format, it will always be inferred as 
> `{{{}StringType`{}}}.
>  ** If no timestamp format specified by user, we will try inferring it as 
> `{{{}TimestampType`{}}} if possible, otherwise it will be inferred as 
> `{{{}StringType`{}}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-40341) Implement `Rolling.median`.

2022-09-21 Thread Artsiom Yudovin (Jira)


[ https://issues.apache.org/jira/browse/SPARK-40341 ]


Artsiom Yudovin deleted comment on SPARK-40341:
-

was (Author: ayudovin):
I'm working on this

> Implement `Rolling.median`.
> ---
>
> Key: SPARK-40341
> URL: https://issues.apache.org/jira/browse/SPARK-40341
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
>
> We should implement `Rolling.median` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.median.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40341) Implement `Rolling.median`.

2022-09-21 Thread Artsiom Yudovin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607908#comment-17607908
 ] 

Artsiom Yudovin commented on SPARK-40341:
-

I'm working on this

> Implement `Rolling.median`.
> ---
>
> Key: SPARK-40341
> URL: https://issues.apache.org/jira/browse/SPARK-40341
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
>
> We should implement `Rolling.median` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.median.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-21 Thread Sun Chao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607902#comment-17607902
 ] 

Sun Chao commented on SPARK-40508:
--

Oh, thanks [~viirya] ! For some reason the merge script was throwing error at 
me:
{code:java}

response text = {"errorMessages":[],"errors":{"assignee":"User 
'yuzhih...@gmail.com' cannot be assigned issues."}}
Error assigning JIRA, try again (or leave blank and fix manually)
JIRA is unassigned, choose assignee
 {code}

> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.4.0
>
>
> When running spark application against spark 3.3, I see the following :
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-21 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607900#comment-17607900
 ] 

L. C. Hsieh commented on SPARK-40508:
-

[~csun] Seems he is already in contributor list. I just assigned this ticket to 
him.

> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.4.0
>
>
> When running spark application against spark 3.3, I see the following :
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-21 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-40508:
---

Assignee: Ted Yu

> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.4.0
>
>
> When running spark application against spark 3.3, I see the following :
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40474) Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

2022-09-21 Thread Xiaonan Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaonan Yang updated SPARK-40474:
-
Summary: Correct CSV schema inference and data parsing behavior on columns 
with mixed dates and timestamps  (was: Infer columns with mixed date and 
timestamp as String in CSV schema inference)

> Correct CSV schema inference and data parsing behavior on columns with mixed 
> dates and timestamps
> -
>
> Key: SPARK-40474
> URL: https://issues.apache.org/jira/browse/SPARK-40474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xiaonan Yang
>Priority: Major
>
> In this ticket https://issues.apache.org/jira/browse/SPARK-39469, we 
> introduced the support of date type in CSV schema inference. The schema 
> inference behavior on date time columns now is:
>  * For a column only containing dates, we will infer it as Date type
>  * For a column only containing timestamps, we will infer it as Timestamp type
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as Timestamp type
> However, we found that we are too ambitious on the last scenario, to support 
> which we have introduced much complexity in code and caused a lot of 
> performance concerns. Thus, we want to simplify the behavior of the last 
> scenario as:
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as String type



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40521) PartitionsAlreadyExistException in Hive V1 Command V1 reports all partitions instead of the conflicting partition

2022-09-21 Thread Serge Rielau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607897#comment-17607897
 ] 

Serge Rielau commented on SPARK-40521:
--

Hive does return the offending partition. We just need to dig it out (see the 
attached screenshots: Screen Shot 2022-09-21 at 10.08.44 AM.png and Screen Shot 
2022-09-21 at 10.08.52 AM.png).
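
As an aside for readers of this thread: the exception text quoted in the
description below carries the conflicting partition values, so "digging it out"
could, in principle, be as simple as parsing that message. The Scala sketch below
is illustrative only (the message literal is abbreviated from the description;
this is not the actual fix):

{code:scala}
// Illustrative-only sketch: extract the conflicting partition values from the
// metastore's AlreadyExistsException message quoted in the issue description.
val msg = "AlreadyExistsException(message:Partition already exists: " +
  "Partition(values:[2], dbName:default, tableName:t, createTime:0, ...))"

val valuesPattern = """Partition\(values:\[([^\]]*)\]""".r
val conflicting: Option[String] = valuesPattern.findFirstMatchIn(msg).map(_.group(1))

println(conflicting) // Some(2) -- only the partition that actually conflicts
{code}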

> PartitionsAlreadyExistException in Hive V1 Command V1 reports all partitions 
> instead of the conflicting partition
> -
>
> Key: SPARK-40521
> URL: https://issues.apache.org/jira/browse/SPARK-40521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Minor
> Attachments: Screen Shot 2022-09-21 at 10.08.44 AM.png, Screen Shot 
> 2022-09-21 at 10.08.52 AM.png
>
>
> PartitionsAlreadyExistException in Hive V1 Command V1 reports all partitions 
> instead of the conflicting partition
> When I run:
> AlterTableAddPartitionSuiteBase for Hive
> The test: partition already exists
> Fails in my local build ONLY in that mode because it reports two 
> partitions as conflicting where there should be only one. In all other modes 
> the test succeeds.
> The test is passing on master because the test does not check the partitions 
> themselves.
> Repro on master: Note that c1 = 1 does not already exist. It should NOT be 
> listed 
> create table t(c1 int, c2 int) partitioned by (c1);
> alter table t add partition (c1 = 2);
> alter table t add partition (c1 = 1) partition (c1 = 2);
> 22/09/21 09:30:09 ERROR Hive: AlreadyExistsException(message:Partition 
> already exists: Partition(values:[2], dbName:default, tableName:t, 
> createTime:0, lastAccessTime:0, 
> sd:StorageDescriptor(cols:[FieldSchema(name:c2, type:int, comment:null)], 
> location:file:/Users/serge.rielau/spark/spark-warehouse/t/c1=2, 
> inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> parameters:\{serialization.format=1}), bucketCols:[], sortCols:[], 
> parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
> skewedColValueLocationMaps:{}), storedAsSubDirectories:false), 
> parameters:null))
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.startAddPartition(HiveMetaStore.java:2744)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_core(HiveMetaStore.java:2442)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_req(HiveMetaStore.java:2560)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>  at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148)
>  at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
>  at com.sun.proxy.$Proxy31.add_partitions_req(Unknown Source)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.add_partitions(HiveMetaStoreClient.java:625)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>  at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
>  at com.sun.proxy.$Proxy32.add_partitions(Unknown Source)
>  at org.apache.hadoop.hive.ql.metadata.Hive.createPartitions(Hive.java:2103)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.createPartitions(HiveShim.scala:763)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createPartitions$1(HiveClientImpl.scala:631)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:296)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
>  at 
> 

[jira] [Updated] (SPARK-40521) PartitionsAlreadyExistException in Hive V1 Command V1 reports all partitions instead of the conflicting partition

2022-09-21 Thread Serge Rielau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Serge Rielau updated SPARK-40521:
-
Attachment: Screen Shot 2022-09-21 at 10.08.52 AM.png
Screen Shot 2022-09-21 at 10.08.44 AM.png

> PartitionsAlreadyExistException in Hive V1 Command V1 reports all partitions 
> instead of the conflicting partition
> -
>
> Key: SPARK-40521
> URL: https://issues.apache.org/jira/browse/SPARK-40521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Minor
> Attachments: Screen Shot 2022-09-21 at 10.08.44 AM.png, Screen Shot 
> 2022-09-21 at 10.08.52 AM.png
>
>
> PartitionsAlreadyExistException in Hive V1 Command V1 reports all partitions 
> instead of the conflicting partition
> When I run:
> AlterTableAddPartitionSuiteBase for Hive
> The test: partition already exists
> Fails in my local build ONLY in that mode because it reports two 
> partitions as conflicting where there should be only one. In all other modes 
> the test succeeds.
> The test is passing on master because the test does not check the partitions 
> themselves.
> Repro on master: Note that c1 = 1 does not already exist. It should NOT be 
> listed 
> create table t(c1 int, c2 int) partitioned by (c1);
> alter table t add partition (c1 = 2);
> alter table t add partition (c1 = 1) partition (c1 = 2);
> 22/09/21 09:30:09 ERROR Hive: AlreadyExistsException(message:Partition 
> already exists: Partition(values:[2], dbName:default, tableName:t, 
> createTime:0, lastAccessTime:0, 
> sd:StorageDescriptor(cols:[FieldSchema(name:c2, type:int, comment:null)], 
> location:file:/Users/serge.rielau/spark/spark-warehouse/t/c1=2, 
> inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> parameters:\{serialization.format=1}), bucketCols:[], sortCols:[], 
> parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
> skewedColValueLocationMaps:{}), storedAsSubDirectories:false), 
> parameters:null))
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.startAddPartition(HiveMetaStore.java:2744)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_core(HiveMetaStore.java:2442)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_req(HiveMetaStore.java:2560)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>  at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148)
>  at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
>  at com.sun.proxy.$Proxy31.add_partitions_req(Unknown Source)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.add_partitions(HiveMetaStoreClient.java:625)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>  at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
>  at com.sun.proxy.$Proxy32.add_partitions(Unknown Source)
>  at org.apache.hadoop.hive.ql.metadata.Hive.createPartitions(Hive.java:2103)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.createPartitions(HiveShim.scala:763)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createPartitions$1(HiveClientImpl.scala:631)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:296)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createPartitions(HiveClientImpl.scala:624)
>  at 
> 

[jira] [Commented] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607895#comment-17607895
 ] 

Apache Spark commented on SPARK-40508:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/37957

> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Priority: Major
> Fix For: 3.4.0
>
>
> When running spark application against spark 3.3, I see the following :
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607893#comment-17607893
 ] 

Apache Spark commented on SPARK-40508:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/37957

> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Priority: Major
> Fix For: 3.4.0
>
>
> When running spark application against spark 3.3, I see the following :
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-21 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607869#comment-17607869
 ] 

Chao Sun commented on SPARK-40508:
--

[~dongjoon][~viirya] could you add [~yuzhih...@gmail.com] to the contributor 
list? I can't assign this to him.

> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Priority: Major
> Fix For: 3.4.0
>
>
> When running spark application against spark 3.3, I see the following :
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-21 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-40508.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37952
[https://github.com/apache/spark/pull/37952]

> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Priority: Major
> Fix For: 3.4.0
>
>
> When running spark application against spark 3.3, I see the following :
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.
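
For context, a minimal illustrative sketch of the relaxation described above,
written with stand-in types rather than Spark's internal partitioning classes:
anything the planner does not recognize falls back to an "unknown" case instead
of raising IllegalArgumentException.

{code:scala}
// Illustrative-only sketch of the behavior change described above.
// The types below are stand-ins, not Spark's internal partitioning classes.
sealed trait ReportedPartitioning
final case class KeyGrouped(keys: Seq[String]) extends ReportedPartitioning
final case class Custom(name: String) extends ReportedPartitioning

sealed trait PlannerPartitioning
final case class KeyGroupedPlan(keys: Seq[String]) extends PlannerPartitioning
case object UnknownPlan extends PlannerPartitioning

// Before the fix (conceptually): anything unrecognized threw an exception.
def classifyStrict(p: ReportedPartitioning): PlannerPartitioning = p match {
  case KeyGrouped(keys) => KeyGroupedPlan(keys)
  case other => throw new IllegalArgumentException(
    s"Unsupported data source V2 partitioning type: ${other.getClass.getSimpleName}")
}

// After the fix (conceptually): unrecognized partitionings fall back to Unknown.
def classifyRelaxed(p: ReportedPartitioning): PlannerPartitioning = p match {
  case KeyGrouped(keys) => KeyGroupedPlan(keys)
  case _                => UnknownPlan
}
{code}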



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40521) PartitionsAlreadyExistException in Hive V1 Command V1 reports all partitions instead of the conflicting partition

2022-09-21 Thread Serge Rielau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Serge Rielau updated SPARK-40521:
-
Description: 
PartitionsAlreadyExistException in Hive V1 Command V1 reports all partitions 
instead of the conflicting partition

When I run:
AlterTableAddPartitionSuiteBase for Hive
The test: partition already exists
Fails in my local build ONLY in that mode because it reports two partitions 
as conflicting where there should be only one. In all other modes the test 
succeeds.
The test is passing on master because the test does not check the partitions 
themselves.

Repro on master: Note that c1 = 1 does not already exist. It should NOT be 
listed 

create table t(c1 int, c2 int) partitioned by (c1);

alter table t add partition (c1 = 2);

alter table t add partition (c1 = 1) partition (c1 = 2);

22/09/21 09:30:09 ERROR Hive: AlreadyExistsException(message:Partition already 
exists: Partition(values:[2], dbName:default, tableName:t, createTime:0, 
lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:c2, type:int, 
comment:null)], location:file:/Users/serge.rielau/spark/spark-warehouse/t/c1=2, 
inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
parameters:\{serialization.format=1}), bucketCols:[], sortCols:[], 
parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:null))

 at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.startAddPartition(HiveMetaStore.java:2744)

 at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_core(HiveMetaStore.java:2442)

 at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_req(HiveMetaStore.java:2560)

 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)

 at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

 at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.base/java.lang.reflect.Method.invoke(Method.java:566)

 at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148)

 at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)

 at com.sun.proxy.$Proxy31.add_partitions_req(Unknown Source)

 at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.add_partitions(HiveMetaStoreClient.java:625)

 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)

 at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

 at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.base/java.lang.reflect.Method.invoke(Method.java:566)

 at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)

 at com.sun.proxy.$Proxy32.add_partitions(Unknown Source)

 at org.apache.hadoop.hive.ql.metadata.Hive.createPartitions(Hive.java:2103)

 at 
org.apache.spark.sql.hive.client.Shim_v0_13.createPartitions(HiveShim.scala:763)

 at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createPartitions$1(HiveClientImpl.scala:631)

 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)

 at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:296)

 at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)

 at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)

 at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)

 at 
org.apache.spark.sql.hive.client.HiveClientImpl.createPartitions(HiveClientImpl.scala:624)

 at 
org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createPartitions$1(HiveExternalCatalog.scala:1039)

 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)

 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)

 at 
org.apache.spark.sql.hive.HiveExternalCatalog.createPartitions(HiveExternalCatalog.scala:1021)

 at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createPartitions(ExternalCatalogWithListener.scala:201)

 at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createPartitions(SessionCatalog.scala:1169)

 at 
org.apache.spark.sql.execution.command.AlterTableAddPartitionCommand.$anonfun$run$17(ddl.scala:514)

 at 
org.apache.spark.sql.execution.command.AlterTableAddPartitionCommand.$anonfun$run$17$adapted(ddl.scala:513)

 at 

[jira] [Comment Edited] (SPARK-40427) Add error classes for LIMIT/OFFSET CheckAnalysis failures

2022-09-21 Thread Franck Thang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607824#comment-17607824
 ] 

Franck Thang edited comment on SPARK-40427 at 9/21/22 3:33 PM:
---

Hi [~dtenedor] ,

I think this is a duplicated ticket of  SPARK-40208


was (Author: stelyus):
Hi [~dtenedor] ,

I think this is a duplicated ticket with this one SPARK-40208

> Add error classes for LIMIT/OFFSET CheckAnalysis failures
> -
>
> Key: SPARK-40427
> URL: https://issues.apache.org/jira/browse/SPARK-40427
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40427) Add error classes for LIMIT/OFFSET CheckAnalysis failures

2022-09-21 Thread Franck Thang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607824#comment-17607824
 ] 

Franck Thang commented on SPARK-40427:
--

Hi [~dtenedor] ,

I think this is a duplicated ticket with this one SPARK-40208

> Add error classes for LIMIT/OFFSET CheckAnalysis failures
> -
>
> Key: SPARK-40427
> URL: https://issues.apache.org/jira/browse/SPARK-40427
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40522) Upgrade kafka from 3.2.1 to 3.2.2

2022-09-21 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørn Jørgensen updated SPARK-40522:

Description: [Memory Allocation with Excessive Size Value 
SNYK-JAVA-ORGAPACHEKAFKA-3027430 
|https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430]  (was: 
[https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430  Memory 
Allocation with Excessive Size Value
SNYK-JAVA-ORGAPACHEKAFKA-3027430]

)

> Upgrade kafka from 3.2.1 to 3.2.2
> -
>
> Key: SPARK-40522
> URL: https://issues.apache.org/jira/browse/SPARK-40522
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [Memory Allocation with Excessive Size Value SNYK-JAVA-ORGAPACHEKAFKA-3027430 
> |https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40522) Upgrade kafka from 3.2.1 to 3.2.2

2022-09-21 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørn Jørgensen updated SPARK-40522:

Description: 
[https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430  Memory 
Allocation with Excessive Size Value
SNYK-JAVA-ORGAPACHEKAFKA-3027430]



  was:
[https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430|Memory 
Allocation with Excessive Size Value
SNYK-JAVA-ORGAPACHEKAFKA-3027430]




> Upgrade kafka from 3.2.1 to 3.2.2
> -
>
> Key: SPARK-40522
> URL: https://issues.apache.org/jira/browse/SPARK-40522
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430  Memory 
> Allocation with Excessive Size Value
> SNYK-JAVA-ORGAPACHEKAFKA-3027430]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40522) Upgrade kafka from 3.2.1 to 3.2.2

2022-09-21 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørn Jørgensen updated SPARK-40522:

Description: 
[https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430|Memory 
Allocation with Excessive Size Value
SNYK-JAVA-ORGAPACHEKAFKA-3027430]



  was:
[https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430|Memory 
Allocation with Excessive Size Value
SNYK-JAVA-ORGAPACHEKAFKA-3027430]


> Upgrade kafka from 3.2.1 to 3.2.2
> -
>
> Key: SPARK-40522
> URL: https://issues.apache.org/jira/browse/SPARK-40522
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430|Memory 
> Allocation with Excessive Size Value
> SNYK-JAVA-ORGAPACHEKAFKA-3027430]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40522) Upgrade kafka from 3.2.1 to 3.2.2

2022-09-21 Thread Jira
Bjørn Jørgensen created SPARK-40522:
---

 Summary: Upgrade kafka from 3.2.1 to 3.2.2
 Key: SPARK-40522
 URL: https://issues.apache.org/jira/browse/SPARK-40522
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build
Affects Versions: 3.4.0
Reporter: Bjørn Jørgensen


[https://security.snyk.io/vuln/SNYK-JAVA-ORGAPACHEKAFKA-3027430|Memory 
Allocation with Excessive Size Value
SNYK-JAVA-ORGAPACHEKAFKA-3027430]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40521) PartitionsAlreadyExistException in Hive V1 Command V1 reports all partitions instead of the conflicting partition

2022-09-21 Thread Serge Rielau (Jira)
Serge Rielau created SPARK-40521:


 Summary: PartitionsAlreadyExistException in Hive V1 Command V1 
reports all partitions instead of the conflicting partition
 Key: SPARK-40521
 URL: https://issues.apache.org/jira/browse/SPARK-40521
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Serge Rielau


PartitionsAlreadyExistException in Hive V1 Command V1 reports all partitions 
instead of the conflicting partition

When I run:
AlterTableAddPartitionSuiteBase for Hive
The test: partition already exists
Fails in my local build ONLY in that mode because it reports two partitions 
as conflicting where there should be only one. In all other modes the test 
succeeds.
The test is passing on master because the test does not check the partitions 
themselves.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40490) `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321

2022-09-21 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-40490.
---
Fix Version/s: 3.4.0
 Assignee: Yang Jie
   Resolution: Fixed

> `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile`  reload 
> after  SPARK-17321
> 
>
> Key: SPARK-40490
> URL: https://issues.apache.org/jira/browse/SPARK-40490
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.0
>
>
> After SPARK-17321, YarnShuffleService persists data to the local shuffle state 
> db and reloads it from that db only when the Yarn NodeManager starts with 
> `YarnConfiguration#NM_RECOVERY_ENABLED = true`. However, 
> `YarnShuffleIntegrationSuite` does not set this config, and its default value 
> is false, so the suite neither triggers data persistence to the db nor 
> verifies the reload of that data.
>  
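
For reference, a hedged sketch of the NodeManager recovery settings the
description refers to; the constants come from Hadoop's YarnConfiguration, and
the recovery directory value is illustrative only:

{code:scala}
// Hedged sketch of the configuration the description says the suite never sets.
import org.apache.hadoop.yarn.conf.YarnConfiguration

val yarnConf = new YarnConfiguration()
// Default is false, so the shuffle service never persists/reloads executor state.
yarnConf.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true)
yarnConf.set(YarnConfiguration.NM_RECOVERY_DIR, "/tmp/nm-recovery") // illustrative path
{code}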



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40520) Add a script to generate DOI manifest

2022-09-21 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-40520:
---

 Summary: Add a script to generate DOI manifest
 Key: SPARK-40520
 URL: https://issues.apache.org/jira/browse/SPARK-40520
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 3.4.0
Reporter: Yikun Jiang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40516) Add official image dockerfile for Spark v3.3.0

2022-09-21 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-40516:

Description: 
Example: [https://github.com/Yikun/spark-docker/tree/master/3.3.0]

Test: 
https://github.com/Yikun/spark-docker/blob/master/.github/workflows/build_3.3.0.yaml

 

  was:
Example: [https://github.com/Yikun/spark-docker/tree/master/3.3.0]

 


> Add official image dockerfile for Spark v3.3.0
> --
>
> Key: SPARK-40516
> URL: https://issues.apache.org/jira/browse/SPARK-40516
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, PySpark, SparkR
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> Example: [https://github.com/Yikun/spark-docker/tree/master/3.3.0]
> Test: 
> https://github.com/Yikun/spark-docker/blob/master/.github/workflows/build_3.3.0.yaml
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40519) Add "Publish workflow" to help release apache/spark image

2022-09-21 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-40519:
---

 Summary: Add "Publish workflow" to help release apache/spark image
 Key: SPARK-40519
 URL: https://issues.apache.org/jira/browse/SPARK-40519
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 3.4.0
Reporter: Yikun Jiang


Example

[https://github.com/Yikun/spark-docker/blob/master/.github/workflows/publish.yaml]

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40516) Add official image dockerfile for Spark v3.3.0

2022-09-21 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-40516:

Description: 
Example: [https://github.com/Yikun/spark-docker/tree/master/3.3.0]

 

> Add official image dockerfile for Spark v3.3.0
> --
>
> Key: SPARK-40516
> URL: https://issues.apache.org/jira/browse/SPARK-40516
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, PySpark, SparkR
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> Example: [https://github.com/Yikun/spark-docker/tree/master/3.3.0]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40517) Add DOI manifest file for Spark Docker Official Image

2022-09-21 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-40517:
---

 Summary: Add DOI manifest file for Spark Docker Official Image
 Key: SPARK-40517
 URL: https://issues.apache.org/jira/browse/SPARK-40517
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes
Affects Versions: 3.4.0
Reporter: Yikun Jiang


[https://github.com/docker-library/official-images/tree/master/library]

Example: [https://github.com/Yikun/official-images/pull/5]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40518) Add Spark Docker Official Image doc

2022-09-21 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-40518:
---

 Summary: Add Spark Docker Official Image doc
 Key: SPARK-40518
 URL: https://issues.apache.org/jira/browse/SPARK-40518
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 3.4.0
Reporter: Yikun Jiang


[https://github.com/docker-library/docs]

Example: https://github.com/Yikun/docker-library-docs/pull/1



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40516) Add official image dockerfile for Spark v3.3.0

2022-09-21 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-40516:
---

 Summary: Add official image dockerfile for Spark v3.3.0
 Key: SPARK-40516
 URL: https://issues.apache.org/jira/browse/SPARK-40516
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes, PySpark, SparkR
Affects Versions: 3.4.0
Reporter: Yikun Jiang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40515) Add apache/spark-docker repo

2022-09-21 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-40515:
---

 Summary: Add apache/spark-docker repo
 Key: SPARK-40515
 URL: https://issues.apache.org/jira/browse/SPARK-40515
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 3.4.0
Reporter: Yikun Jiang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40175) Converting Tuple2 to Scala Map via `.toMap` is slow

2022-09-21 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40175:
-
Priority: Minor  (was: Major)

> Converting Tuple2 to Scala Map via `.toMap` is slow
> ---
>
> Key: SPARK-40175
> URL: https://issues.apache.org/jira/browse/SPARK-40175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.3.0, 3.2.2, 3.3.1
>Reporter: caican
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2022-08-22-14-58-26-491.png, 
> image-2022-08-22-14-58-53-046.png
>
>
> Converting Tuple2 to Scala Map via `.toMap` is slow
> !image-2022-08-22-14-58-53-046.png!
> !image-2022-08-22-14-58-26-491.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40494) Optimize the performance of `keys.zipWithIndex.toMap` code pattern

2022-09-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40494.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37940
[https://github.com/apache/spark/pull/37940]

> Optimize the performance of `keys.zipWithIndex.toMap` code pattern 
> ---
>
> Key: SPARK-40494
> URL: https://issues.apache.org/jira/browse/SPARK-40494
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> Similar to SPARK-40175, a manual `while` loop can be used to optimize the 
> performance of the `keys.zipWithIndex.toMap` code pattern in Spark.
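
As an illustration only (a hedged sketch; the helper name and its call sites are assumptions, not the actual Spark change), the pattern looks like this:

{code:scala}
import scala.collection.mutable

// keys.zipWithIndex.toMap first materializes an intermediate collection of
// (key, index) tuples; a manual while loop fills the map directly instead.
def zipWithIndexToMap[K](keys: Array[K]): Map[K, Int] = {
  val builder = mutable.HashMap.empty[K, Int]
  var i = 0
  while (i < keys.length) {
    builder(keys(i)) = i
    i += 1
  }
  builder.toMap
}
{code}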



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40494) Optimize the performance of `keys.zipWithIndex.toMap` code pattern

2022-09-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40494:
---

Assignee: Yang Jie

> Optimize the performance of `keys.zipWithIndex.toMap` code pattern 
> ---
>
> Key: SPARK-40494
> URL: https://issues.apache.org/jira/browse/SPARK-40494
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Similar to SPARK-40175, a manual `while` loop can be used to optimize the 
> performance of the `keys.zipWithIndex.toMap` code pattern in Spark.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40514) Python related tests need to check the Python version

2022-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607606#comment-17607606
 ] 

Apache Spark commented on SPARK-40514:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37956

> Python related tests need to check the Python version
> -
>
> Key: SPARK-40514
> URL: https://issues.apache.org/jira/browse/SPARK-40514
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Spark 3.4 supports Python 3.7+, but the unit tests only check that `python3` 
> exists; they do not check the Python version.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40514) Python related tests need to check the Python version

2022-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40514:


Assignee: Apache Spark

> Python related tests need to check the Python version
> -
>
> Key: SPARK-40514
> URL: https://issues.apache.org/jira/browse/SPARK-40514
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Spark 3.4 supports Python 3.7+, but the unit tests only check that `python3` 
> exists; they do not check the Python version.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40514) Python related tests need to check the Python version

2022-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40514:


Assignee: (was: Apache Spark)

> Python related tests need to check the Python version
> -
>
> Key: SPARK-40514
> URL: https://issues.apache.org/jira/browse/SPARK-40514
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Spark 3.4 supports Python 3.7+, but the unit tests only check that `python3` 
> exists; they do not check the Python version.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40514) Python related tests need to check the Python version

2022-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607605#comment-17607605
 ] 

Apache Spark commented on SPARK-40514:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37956

> Python related tests need to check the Python version
> -
>
> Key: SPARK-40514
> URL: https://issues.apache.org/jira/browse/SPARK-40514
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Spark 3.4 supports Python 3.7+, but the unit tests only check that `python3` 
> exists; they do not check the Python version.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40498) Implement `kendall` and `min_periods` in `Series.corr`

2022-09-21 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40498:
-

Assignee: Ruifeng Zheng

> Implement `kendall` and `min_periods` in `Series.corr`
> --
>
> Key: SPARK-40498
> URL: https://issues.apache.org/jira/browse/SPARK-40498
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40498) Implement `kendall` and `min_periods` in `Series.corr`

2022-09-21 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-40498.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37945
[https://github.com/apache/spark/pull/37945]

> Implement `kendall` and `min_periods` in `Series.corr`
> --
>
> Key: SPARK-40498
> URL: https://issues.apache.org/jira/browse/SPARK-40498
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40514) Python related tests need to check the Python version

2022-09-21 Thread Yang Jie (Jira)
Yang Jie created SPARK-40514:


 Summary: Python related tests need to check the Python version
 Key: SPARK-40514
 URL: https://issues.apache.org/jira/browse/SPARK-40514
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.4.0
Reporter: Yang Jie


Spark 3.4 supports Python 3.7+, but the unit tests only check that `python3` 
exists; they do not check the Python version.
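
As an illustration (not the actual test helper; the function name is an assumption), one way a test could verify the interpreter version rather than just the presence of `python3`:

{code:scala}
import scala.sys.process._
import scala.util.Try

// `python3 --version` prints e.g. "Python 3.10.6"; require at least 3.7.
def isPythonVersionSupported(minMajor: Int = 3, minMinor: Int = 7): Boolean = {
  Try("python3 --version".!!.trim).toOption.exists { out =>
    val parts = out.stripPrefix("Python ").split("\\.")
    Try((parts(0).toInt, parts(1).toInt)).toOption.exists { case (major, minor) =>
      major > minMajor || (major == minMajor && minor >= minMinor)
    }
  }
}
{code}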

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40513) SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-40513:

Issue Type: New Feature  (was: Umbrella)

> SPIP: Support Docker Official Image for Spark
> -
>
> Key: SPARK-40513
> URL: https://issues.apache.org/jira/browse/SPARK-40513
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, PySpark, SparkR
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>  Labels: SPIP
>
> This SPIP is proposed to add [Docker Official 
> Image(DOI)|https://github.com/docker-library/official-images] to ensure the 
> Spark Docker images meet the quality standards for Docker images, to provide 
> these Docker images for users who want to use Apache Spark via Docker image.
> There are also several [Apache projects that release the Docker Official 
> Images|https://hub.docker.com/search?q=apache_filter=official], such 
> as: [flink|https://hub.docker.com/_/flink], 
> [storm|https://hub.docker.com/_/storm], [solr|https://hub.docker.com/_/solr], 
> [zookeeper|https://hub.docker.com/_/zookeeper], 
> [httpd|https://hub.docker.com/_/httpd] (with 50M+ to 1B+ download for each). 
> From the huge download statistics, we can see the real demands of users, and 
> from the support of other apache projects, we should also be able to do it.
> After support:
>  * The Dockerfile will still be maintained by the Apache Spark community and 
> reviewed by Docker.
>  * The images will be maintained by the Docker community to ensure the 
> quality standards for Docker images of the Docker community.
> It will also reduce the extra docker images maintenance effort (such as 
> frequently rebuilding, image security update) of the Apache Spark community.
>  
> SPIP DOC: 
> [https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o]
> DISCUSS: [https://lists.apache.org/thread/l1793y5224n8bqkp3s6ltgkykso4htb3]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40513) SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-40513:

Issue Type: Umbrella  (was: Bug)

> SPIP: Support Docker Official Image for Spark
> -
>
> Key: SPARK-40513
> URL: https://issues.apache.org/jira/browse/SPARK-40513
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes, PySpark, SparkR
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> This SPIP is proposed to add [Docker Official 
> Image(DOI)|https://github.com/docker-library/official-images] to ensure the 
> Spark Docker images meet the quality standards for Docker images, to provide 
> these Docker images for users who want to use Apache Spark via Docker image.
> There are also several [Apache projects that release the Docker Official 
> Images|https://hub.docker.com/search?q=apache_filter=official], such 
> as: [flink|https://hub.docker.com/_/flink], 
> [storm|https://hub.docker.com/_/storm], [solr|https://hub.docker.com/_/solr], 
> [zookeeper|https://hub.docker.com/_/zookeeper], 
> [httpd|https://hub.docker.com/_/httpd] (with 50M+ to 1B+ download for each). 
> From the huge download statistics, we can see the real demands of users, and 
> from the support of other apache projects, we should also be able to do it.
> After support:
>  * The Dockerfile will still be maintained by the Apache Spark community and 
> reviewed by Docker.
>  * The images will be maintained by the Docker community to ensure the 
> quality standards for Docker images of the Docker community.
> It will also reduce the extra docker images maintenance effort (such as 
> frequently rebuilding, image security update) of the Apache Spark community.
>  
> SPIP DOC: 
> [https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o]
> DISCUSS: [https://lists.apache.org/thread/l1793y5224n8bqkp3s6ltgkykso4htb3]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40513) SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-40513:

Labels: SPIP  (was: )

> SPIP: Support Docker Official Image for Spark
> -
>
> Key: SPARK-40513
> URL: https://issues.apache.org/jira/browse/SPARK-40513
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes, PySpark, SparkR
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>  Labels: SPIP
>
> This SPIP is proposed to add [Docker Official 
> Image(DOI)|https://github.com/docker-library/official-images] to ensure the 
> Spark Docker images meet the quality standards for Docker images, to provide 
> these Docker images for users who want to use Apache Spark via Docker image.
> There are also several [Apache projects that release the Docker Official 
> Images|https://hub.docker.com/search?q=apache_filter=official], such 
> as: [flink|https://hub.docker.com/_/flink], 
> [storm|https://hub.docker.com/_/storm], [solr|https://hub.docker.com/_/solr], 
> [zookeeper|https://hub.docker.com/_/zookeeper], 
> [httpd|https://hub.docker.com/_/httpd] (with 50M+ to 1B+ download for each). 
> From the huge download statistics, we can see the real demands of users, and 
> from the support of other apache projects, we should also be able to do it.
> After support:
>  * The Dockerfile will still be maintained by the Apache Spark community and 
> reviewed by Docker.
>  * The images will be maintained by the Docker community to ensure the 
> quality standards for Docker images of the Docker community.
> It will also reduce the extra docker images maintenance effort (such as 
> frequently rebuilding, image security update) of the Apache Spark community.
>  
> SPIP DOC: 
> [https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o]
> DISCUSS: [https://lists.apache.org/thread/l1793y5224n8bqkp3s6ltgkykso4htb3]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40513) SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-40513:

Description: 
This SPIP is proposed to add [Docker Official 
Image(DOI)|https://github.com/docker-library/official-images] to ensure the 
Spark Docker images meet the quality standards for Docker images, to provide 
these Docker images for users who want to use Apache Spark via Docker image.

There are also several [Apache projects that release the Docker Official 
Images|https://hub.docker.com/search?q=apache_filter=official], such as: 
[flink|https://hub.docker.com/_/flink], [storm|https://hub.docker.com/_/storm], 
[solr|https://hub.docker.com/_/solr], 
[zookeeper|https://hub.docker.com/_/zookeeper], 
[httpd|https://hub.docker.com/_/httpd] (with 50M+ to 1B+ download for each). 
From the huge download statistics, we can see the real demands of users, and 
from the support of other apache projects, we should also be able to do it.

After support:
 * The Dockerfile will still be maintained by the Apache Spark community and 
reviewed by Docker.

 * The images will be maintained by the Docker community to ensure the quality 
standards for Docker images of the Docker community.

It will also reduce the extra docker images maintenance effort (such as 
frequently rebuilding, image security update) of the Apache Spark community.

 

SPIP DOC: 
[https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o]

DISCUSS: [https://lists.apache.org/thread/l1793y5224n8bqkp3s6ltgkykso4htb3]

  was:
This SPIP is proposed to add [Docker Official 
Image(DOI)|https://github.com/docker-library/official-images] to ensure the 
Spark Docker images meet the quality standards for Docker images, to provide 
these Docker images for users who want to use Apache Spark via Docker image.

There are also several [Apache projects that release the Docker Official 
Images|https://hub.docker.com/search?q=apache_filter=official], such as: 
[flink|https://hub.docker.com/_/flink], [storm|https://hub.docker.com/_/storm], 
[solr|https://hub.docker.com/_/solr], 
[zookeeper|https://hub.docker.com/_/zookeeper], 
[httpd|https://hub.docker.com/_/httpd] (with 50M+ to 1B+ download for each). 
From the huge download statistics, we can see the real demands of users, and 
from the support of other apache projects, we should also be able to do it.

After support:
 * The Dockerfile will still be maintained by the Apache Spark community and 
reviewed by Docker.

 * The images will be maintained by the Docker community to ensure the quality 
standards for Docker images of the Docker community.

It will also reduce the extra docker images maintenance effort (such as 
frequently rebuilding, image security update) of the Apache Spark community.

DISCUSS: https://lists.apache.org/thread/l1793y5224n8bqkp3s6ltgkykso4htb3


> SPIP: Support Docker Official Image for Spark
> -
>
> Key: SPARK-40513
> URL: https://issues.apache.org/jira/browse/SPARK-40513
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark, SparkR
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> This SPIP is proposed to add [Docker Official 
> Image(DOI)|https://github.com/docker-library/official-images] to ensure the 
> Spark Docker images meet the quality standards for Docker images, to provide 
> these Docker images for users who want to use Apache Spark via Docker image.
> There are also several [Apache projects that release the Docker Official 
> Images|https://hub.docker.com/search?q=apache_filter=official], such 
> as: [flink|https://hub.docker.com/_/flink], 
> [storm|https://hub.docker.com/_/storm], [solr|https://hub.docker.com/_/solr], 
> [zookeeper|https://hub.docker.com/_/zookeeper], 
> [httpd|https://hub.docker.com/_/httpd] (with 50M+ to 1B+ download for each). 
> From the huge download statistics, we can see the real demands of users, and 
> from the support of other apache projects, we should also be able to do it.
> After support:
>  * The Dockerfile will still be maintained by the Apache Spark community and 
> reviewed by Docker.
>  * The images will be maintained by the Docker community to ensure the 
> quality standards for Docker images of the Docker community.
> It will also reduce the extra docker images maintenance effort (such as 
> frequently rebuilding, image security update) of the Apache Spark community.
>  
> SPIP DOC: 
> [https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o]
> DISCUSS: [https://lists.apache.org/thread/l1793y5224n8bqkp3s6ltgkykso4htb3]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, 

[jira] [Updated] (SPARK-40513) SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-40513:

Description: 
This SPIP is proposed to add [Docker Official 
Image(DOI)|https://github.com/docker-library/official-images] to ensure the 
Spark Docker images meet the quality standards for Docker images, to provide 
these Docker images for users who want to use Apache Spark via Docker image.

There are also several [Apache projects that release the Docker Official 
Images|https://hub.docker.com/search?q=apache_filter=official], such as: 
[flink|https://hub.docker.com/_/flink], [storm|https://hub.docker.com/_/storm], 
[solr|https://hub.docker.com/_/solr], 
[zookeeper|https://hub.docker.com/_/zookeeper], 
[httpd|https://hub.docker.com/_/httpd] (with 50M+ to 1B+ download for each). 
From the huge download statistics, we can see the real demands of users, and 
from the support of other apache projects, we should also be able to do it.

After support:
 * The Dockerfile will still be maintained by the Apache Spark community and 
reviewed by Docker.

 * The images will be maintained by the Docker community to ensure the quality 
standards for Docker images of the Docker community.

It will also reduce the extra docker images maintenance effort (such as 
frequently rebuilding, image security update) of the Apache Spark community.

DISCUSS: https://lists.apache.org/thread/l1793y5224n8bqkp3s6ltgkykso4htb3

> SPIP: Support Docker Official Image for Spark
> -
>
> Key: SPARK-40513
> URL: https://issues.apache.org/jira/browse/SPARK-40513
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark, SparkR
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> This SPIP is proposed to add [Docker Official 
> Image(DOI)|https://github.com/docker-library/official-images] to ensure the 
> Spark Docker images meet the quality standards for Docker images, to provide 
> these Docker images for users who want to use Apache Spark via Docker image.
> There are also several [Apache projects that release the Docker Official 
> Images|https://hub.docker.com/search?q=apache_filter=official], such 
> as: [flink|https://hub.docker.com/_/flink], 
> [storm|https://hub.docker.com/_/storm], [solr|https://hub.docker.com/_/solr], 
> [zookeeper|https://hub.docker.com/_/zookeeper], 
> [httpd|https://hub.docker.com/_/httpd] (with 50M+ to 1B+ download for each). 
> From the huge download statistics, we can see the real demands of users, and 
> from the support of other apache projects, we should also be able to do it.
> After support:
>  * The Dockerfile will still be maintained by the Apache Spark community and 
> reviewed by Docker.
>  * The images will be maintained by the Docker community to ensure the 
> quality standards for Docker images of the Docker community.
> It will also reduce the extra docker images maintenance effort (such as 
> frequently rebuilding, image security update) of the Apache Spark community.
> DISCUSS: https://lists.apache.org/thread/l1793y5224n8bqkp3s6ltgkykso4htb3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40513) SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-40513:
---

 Summary: SPIP: Support Docker Official Image for Spark
 Key: SPARK-40513
 URL: https://issues.apache.org/jira/browse/SPARK-40513
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, PySpark, SparkR
Affects Versions: 3.4.0
Reporter: Yikun Jiang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40512) Upgrade pandas to 1.5.0

2022-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607527#comment-17607527
 ] 

Apache Spark commented on SPARK-40512:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/37955

> Upgrade pandas to 1.5.0
> ---
>
> Key: SPARK-40512
> URL: https://issues.apache.org/jira/browse/SPARK-40512
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas 1.5.0 was released on Sep 19, 2022.
>  
> We should update our infra and docs to support it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns

2022-09-21 Thread Joost Farla (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34805 ]


Joost Farla deleted comment on SPARK-34805:
-

was (Author: JIRAUSER295969):
[~cloud_fan] I was running into the exact same issue using Spark v3.3.0. It 
looks like the fix was merged into the 3.3 branch (on March 21st), but was not 
yet released as part of v3.3. It is also not mentioned in the release notes. Is 
that possible? Thanks in advance!

> PySpark loses metadata in DataFrame fields when selecting nested columns
> 
>
> Key: SPARK-34805
> URL: https://issues.apache.org/jira/browse/SPARK-34805
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Mark Ressler
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: jsonMetadataTest.py, nested_columns_metadata.scala
>
>
> For a DataFrame schema with nested StructTypes, where metadata is set for 
> fields in the schema, that metadata is lost when a DataFrame selects nested 
> fields.  For example, suppose
> {code:java}
> df.schema.fields[0].dataType.fields[0].metadata
> {code}
> returns a non-empty dictionary, then
> {code:java}
> df.select('Field0.SubField0').schema.fields[0].metadata{code}
> returns an empty dictionary, where "Field0" is the name of the first field in 
> the DataFrame and "SubField0" is the name of the first nested field under 
> "Field0".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40512) Upgrade pandas to 1.5.0

2022-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40512:


Assignee: Apache Spark

> Upgrade pandas to 1.5.0
> ---
>
> Key: SPARK-40512
> URL: https://issues.apache.org/jira/browse/SPARK-40512
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> pandas 1.5.0 was released on Sep 19, 2022.
>  
> We should update our infra and docs to support it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40512) Upgrade pandas to 1.5.0

2022-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40512:


Assignee: (was: Apache Spark)

> Upgrade pandas to 1.5.0
> ---
>
> Key: SPARK-40512
> URL: https://issues.apache.org/jira/browse/SPARK-40512
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas 1.5.0 was released on Sep 19, 2022.
>  
> We should update our infra and docs to support it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40502) Support dataframe API use jdbc data source in PySpark

2022-09-21 Thread CaoYu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607523#comment-17607523
 ] 

CaoYu edited comment on SPARK-40502 at 9/21/22 6:07 AM:


I am a teacher. I recently designed a basic Python language course with a big 
data focus.

PySpark is one of the practical cases, but it only uses simple RDD code to 
complete basic data processing work, and using a JDBC data source is part of 
the course.

 

Because the course is very basic, simple RDD code is suitable as an example. 
If you use DataFrame, you need to explain more concepts, which is not friendly 
to novice students.

DataFrames (Spark SQL) will be used in a future advanced course.

So I hope that extracting JDBC data can be done through the RDD API.

 

 

 


was (Author: javacaoyu):
I am a teacher. I recently designed a basic Python language course with a big 
data focus.

PySpark is one of the practical cases, but it only uses simple RDD code to 
complete basic data processing work, and using a JDBC data source is part of 
the course.

DataFrames (Spark SQL) will be used in a future advanced course.
So I hope the RDD API will have JDBC data source support.

 

 

> Support dataframe API use jdbc data source in PySpark
> -
>
> Key: SPARK-40502
> URL: https://issues.apache.org/jira/browse/SPARK-40502
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: CaoYu
>Priority: Major
>
> When I use PySpark, I want to get data from a MySQL database, so I want to 
> use JdbcRDD as in Java/Scala.
> But that is not supported in PySpark.
>  
> For some reasons, I can't use the DataFrame API and can only use the RDD 
> API, even though I know the DataFrame API can read from a JDBC source fairly 
> well.
>  
> So I want to implement functionality that lets an RDD read data from a JDBC 
> source in PySpark.
>  
> *But I don't know whether that is necessary for PySpark, so we can discuss 
> it.*
>  
> *If it is necessary for PySpark, I want to contribute to Spark.*
> *I hope this Jira task can be assigned to me, so I can start working to 
> implement it.*
>  
> *If not, please close this Jira task.*
>  
>  
> *Thanks a lot.*
>  
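
For reference, a minimal sketch of the existing Scala JdbcRDD API that this request asks to mirror in PySpark (the connection URL, table, bounds, and credentials below are illustrative assumptions):

{code:scala}
import java.sql.{DriverManager, ResultSet}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

val sc = new SparkContext(
  new SparkConf().setAppName("jdbc-rdd-sketch").setMaster("local[*]"))

// JdbcRDD splits the bound parameters of the query into numPartitions ranges,
// so each partition reads its own slice of the table.
val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(
    "jdbc:mysql://localhost:3306/testdb", "user", "password"),
  "SELECT id, name FROM people WHERE ? <= id AND id <= ?",
  lowerBound = 1L,
  upperBound = 1000L,
  numPartitions = 4,
  mapRow = (rs: ResultSet) => (rs.getLong("id"), rs.getString("name")))

rows.take(10).foreach(println)
{code}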



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40502) Support dataframe API use jdbc data source in PySpark

2022-09-21 Thread CaoYu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607524#comment-17607524
 ] 

CaoYu commented on SPARK-40502:
---

When I designed the Python Flink course, I found that PyFlink did not have the 
operators sum/min/minBy/max/maxBy.

So I submitted PRs to the Flink community and provided the Python 
implementation of these operators (FLINK-26609, FLINK-26728).

So, again, if a JDBC data source is what PySpark needs, I'd love to implement 
it and I have the time to do so.

> Support dataframe API use jdbc data source in PySpark
> -
>
> Key: SPARK-40502
> URL: https://issues.apache.org/jira/browse/SPARK-40502
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: CaoYu
>Priority: Major
>
> When I use PySpark, I want to get data from a MySQL database, so I want to 
> use JdbcRDD as in Java/Scala.
> But that is not supported in PySpark.
>  
> For some reasons, I can't use the DataFrame API and can only use the RDD 
> API, even though I know the DataFrame API can read from a JDBC source fairly 
> well.
>  
> So I want to implement functionality that lets an RDD read data from a JDBC 
> source in PySpark.
>  
> *But I don't know whether that is necessary for PySpark, so we can discuss 
> it.*
>  
> *If it is necessary for PySpark, I want to contribute to Spark.*
> *I hope this Jira task can be assigned to me, so I can start working to 
> implement it.*
>  
> *If not, please close this Jira task.*
>  
>  
> *Thanks a lot.*
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40512) Upgrade pandas to 1.5.0

2022-09-21 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40512:
---

 Summary: Upgrade pandas to 1.5.0
 Key: SPARK-40512
 URL: https://issues.apache.org/jira/browse/SPARK-40512
 Project: Spark
  Issue Type: Improvement
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


pandas 1.5.0 was released on Sep 19, 2022.

 

We should update our infra and docs to support it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org