[jira] [Updated] (SPARK-45672) Provide a unified user-facing schema for state format versions in state data source - reader

2023-10-25 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-45672:
-
Parent: SPARK-45511
Issue Type: Sub-task  (was: Improvement)

> Provide a unified user-facing schema for state format versions in state data 
> source - reader
> 
>
> Key: SPARK-45672
> URL: https://issues.apache.org/jira/browse/SPARK-45672
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> As of now, except for stream-stream join with the joinSide option specified, 
> the state data source provides the state "as is" from the state store. This 
> means the state data source will expose a different schema for operators 
> that have multiple state format versions.
> From the users' perspective, they do not care about the state format version, 
> and hence may be confused if the state data source produces different schemas.
> That said, we could probably consider defining and providing the same 
> user-facing schema for each operator.
> *Note that this would need further discussion* before coming up with code, 
> because there is a clear trade-off: it creates a strong coupling between the 
> state data source and the implementation of the stateful operators. Also, as 
> for the argument about the non-predictable output schema, users can call 
> printSchema() to see the output schema prior to running the query.






[jira] [Created] (SPARK-45672) Provide a unified user-facing schema for state format versions in state data source - reader

2023-10-25 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-45672:


 Summary: Provide a unified user-facing schema for state format 
versions in state data source - reader
 Key: SPARK-45672
 URL: https://issues.apache.org/jira/browse/SPARK-45672
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Jungtaek Lim


As of now, except for stream-stream join with the joinSide option specified, 
the state data source provides the state "as is" from the state store. This 
means the state data source will expose a different schema for operators that 
have multiple state format versions.

From the users' perspective, they do not care about the state format version, 
and hence may be confused if the state data source produces different schemas.

That said, we could probably consider defining and providing the same 
user-facing schema for each operator.

*Note that this would need further discussion* before coming up with code, 
because there is a clear trade-off: it creates a strong coupling between the 
state data source and the implementation of the stateful operators. Also, as 
for the argument about the non-predictable output schema, users can call 
printSchema() to see the output schema prior to running the query.
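For illustration, a minimal sketch of how a user would inspect the reader's output schema today; this assumes the state data source reader from SPARK-45511 with a "statestore" format name, and the checkpoint path below is made up:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("state-schema-check").getOrCreate()

// Read the state of one stateful operator from a (hypothetical) checkpoint location.
val stateDf = spark.read
  .format("statestore")
  .option("operatorId", "0")          // which stateful operator in the query
  .load("/tmp/checkpoints/my-query")

// Today the key/value schema printed here depends on the operator's state format
// version; this ticket proposes presenting one unified schema per operator instead.
stateDf.printSchema()
{code}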






[jira] [Created] (SPARK-45671) Implement an option similar to corrupt record column in State Data Source Reader

2023-10-25 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-45671:


 Summary: Implement an option similar to corrupt record column in 
State Data Source Reader
 Key: SPARK-45671
 URL: https://issues.apache.org/jira/browse/SPARK-45671
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Jungtaek Lim


Querying the state will most likely fail if the underlying state file is 
corrupted. There is also the case where the raw binary data read from the 
state file does not fit the state schema and ends up in an exception or fatal 
error at runtime.

(We can't catch the case where the data is loaded with an incorrect schema if 
it does not throw an exception; we cannot validate the schema for every piece 
of data.)

To handle the above cases without failing, we want to provide state rows for 
valid rows while also providing the binary data for corrupted rows (like we do 
for CSV/JSON) if users specify an option.
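For reference, a minimal sketch of the analogous corrupt-record handling that CSV already offers, followed by what the state data source counterpart might look like; the state-source option name "corruptRecordColumn" is purely illustrative, not a decided API:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("corrupt-record-sketch").getOrCreate()

// Existing CSV behavior this ticket points to: rows that fail to parse are kept,
// with the raw text placed into a dedicated column.
val csvDf = spark.read
  .schema("id INT, value STRING, _corrupt_record STRING")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("/tmp/input.csv")

// Hypothetical equivalent for the state data source reader (option name made up):
// val stateDf = spark.read.format("statestore")
//   .option("corruptRecordColumn", "_corrupt_state")
//   .load("/tmp/checkpoints/my-query")
{code}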






[jira] [Commented] (SPARK-38723) Test the error class: CONCURRENT_QUERY

2023-10-25 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779748#comment-17779748
 ] 

Jungtaek Lim commented on SPARK-38723:
--

(I just asked the PR author about their Jira account. I'll reassign once I 
hear back.)

> Test the error class: CONCURRENT_QUERY
> --
>
> Key: SPARK-38723
> URL: https://issues.apache.org/jira/browse/SPARK-38723
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: pull-request-available, starter
> Fix For: 4.0.0
>
>
> Add at least one test for the error class *CONCURRENT_QUERY* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def concurrentQueryInstanceError(): Throwable = {
> new SparkConcurrentModificationException("CONCURRENT_QUERY", Array.empty)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class






[jira] [Resolved] (SPARK-38723) Test the error class: CONCURRENT_QUERY

2023-10-25 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-38723.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43405
[https://github.com/apache/spark/pull/43405]

> Test the error class: CONCURRENT_QUERY
> --
>
> Key: SPARK-38723
> URL: https://issues.apache.org/jira/browse/SPARK-38723
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: pull-request-available, starter
> Fix For: 4.0.0
>
>
> Add at least one test for the error class *CONCURRENT_QUERY* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def concurrentQueryInstanceError(): Throwable = {
> new SparkConcurrentModificationException("CONCURRENT_QUERY", Array.empty)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class
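A minimal sketch of what such a test could look like inside QueryExecutionErrorsSuite, following the checkError pattern used in the error suites; `triggerConcurrentQuery()` is a hypothetical placeholder for whatever concurrent execution actually raises the error:

{code:scala}
// Illustrative only; the real test needs an actual reproduction of the
// CONCURRENT_QUERY condition in place of triggerConcurrentQuery().
test("CONCURRENT_QUERY: concurrent modification of the query") {
  val e = intercept[SparkConcurrentModificationException] {
    triggerConcurrentQuery()
  }
  checkError(
    exception = e,
    errorClass = "CONCURRENT_QUERY",
    parameters = Map.empty)
}
{code}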






[jira] [Resolved] (SPARK-45663) Replace `IterableOnceOps#aggregate` with `IterableOnceOps#foldLeft`

2023-10-25 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-45663.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43527
[https://github.com/apache/spark/pull/43527]

> Replace `IterableOnceOps#aggregate` with `IterableOnceOps#foldLeft`
> ---
>
> Key: SPARK-45663
> URL: https://issues.apache.org/jira/browse/SPARK-45663
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
> @deprecated("`aggregate` is not relevant for sequential collections. Use 
> `foldLeft(z)(seqop)` instead.", "2.13.0")
> def aggregate[B](z: => B)(seqop: (B, A) => B, combop: (B, B) => B): B = 
> foldLeft(z)(seqop) {code}
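For context, a minimal sketch of the mechanical replacement this ticket performs; since the deprecated sequential `aggregate` ignores `combop` and simply delegates to `foldLeft`, the rewrite is behavior-preserving:

{code:scala}
val nums = Iterator(1, 2, 3, 4)

// Before (deprecated since Scala 2.13):
// val total = nums.aggregate(0)(_ + _, _ + _)

// After: equivalent result with foldLeft.
val total = nums.foldLeft(0)(_ + _)   // 10
{code}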






[jira] [Assigned] (SPARK-45663) Replace `IterableOnceOps#aggregate` with `IterableOnceOps#foldLeft`

2023-10-25 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-45663:


Assignee: Yang Jie

> Replace `IterableOnceOps#aggregate` with `IterableOnceOps#foldLeft`
> ---
>
> Key: SPARK-45663
> URL: https://issues.apache.org/jira/browse/SPARK-45663
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> @deprecated("`aggregate` is not relevant for sequential collections. Use 
> `foldLeft(z)(seqop)` instead.", "2.13.0")
> def aggregate[B](z: => B)(seqop: (B, A) => B, combop: (B, B) => B): B = 
> foldLeft(z)(seqop) {code}






[jira] [Assigned] (SPARK-44407) Clean up the compilation warnings related to `it will become a keyword in Scala 3`

2023-10-25 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-44407:


Assignee: Yang Jie

> Clean up the compilation warnings related to `it will become a keyword in 
> Scala 3`
> --
>
> Key: SPARK-44407
> URL: https://issues.apache.org/jira/browse/SPARK-44407
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>
> {code:java}
> [warn] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/JavaTypeInferenceSuite.scala:74:21:
>  [deprecation @  | origin= | version=2.13.7] Wrap `enum` in backticks to use 
> it as an identifier, it will become a keyword in Scala 3.
> [warn]   @BeanProperty var enum: java.time.Month = _ {code}
> {{enum}} will become a keyword in Scala 3; this also applies to {{export}} 
> and {{given}}.
>  
> Scala 2.13
> {code:java}
> Welcome to Scala 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.8).
> Type in expressions for evaluation. Or try :help.
> scala> val enum: Int = 1
>            ^
>        warning: Wrap `enum` in backticks to use it as an identifier, it will 
> become a keyword in Scala 3. [quickfixable]
> val enum: Int = 1
> scala> val export: Int = 1
>            ^
>        warning: Wrap `export` in backticks to use it as an identifier, it 
> will become a keyword in Scala 3. [quickfixable]
> val export: Int = 1
> scala> val given: Int = 1
>            ^
>        warning: Wrap `given` in backticks to use it as an identifier, it will 
> become a keyword in Scala 3. [quickfixable]
> val given: Int = 1 {code}
>  
> Scala 3
>  
> {code:java}
> Welcome to Scala 3.3.1 (17.0.8, Java OpenJDK 64-Bit Server VM).
> Type in expressions for evaluation. Or try :help.
> scala> val enum: Int = 1
> -- [E032] Syntax Error: 
> 1 |val enum: Int = 1
>   |    ^^^^
>   |    pattern expected
>   |
>   | longer explanation available when compiling with `-explain`
> scala> val export: Int = 1
> -- [E032] Syntax Error: 
> 1 |val export: Int = 1
>   |    ^^^^^^
>   |    pattern expected
>   |
>   | longer explanation available when compiling with `-explain`
> scala> val given: Int = 1
> -- [E040] Syntax Error: 
> 1 |val given: Int = 1
>   |         ^
>   |         an identifier expected, but ':' found
>   |
>   | longer explanation available when compiling with `-explain` {code}
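A minimal sketch of the cleanup itself, assuming the usual fix of wrapping the soon-to-be keywords in backticks; the field comes from the JavaTypeInferenceSuite warning quoted above, and the wrapper class name is arbitrary:

{code:scala}
import scala.beans.BeanProperty

class KeywordCleanupExample {
  // Before (warns on 2.13, breaks on Scala 3):  @BeanProperty var enum: java.time.Month = _
  // After: backticks keep the identifiers valid in both Scala 2.13 and Scala 3.
  @BeanProperty var `enum`: java.time.Month = _
  var `export`: Int = 1
  var `given`: Int = 1
}
{code}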






[jira] [Resolved] (SPARK-44407) Clean up the compilation warnings related to `it will become a keyword in Scala 3`

2023-10-25 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-44407.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43529
[https://github.com/apache/spark/pull/43529]

> Clean up the compilation warnings related to `it will become a keyword in 
> Scala 3`
> --
>
> Key: SPARK-44407
> URL: https://issues.apache.org/jira/browse/SPARK-44407
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
> [warn] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/JavaTypeInferenceSuite.scala:74:21:
>  [deprecation @  | origin= | version=2.13.7] Wrap `enum` in backticks to use 
> it as an identifier, it will become a keyword in Scala 3.
> [warn]   @BeanProperty var enum: java.time.Month = _ {code}
> {{enum}} will become a keyword in Scala 3; this also applies to {{export}} 
> and {{given}}.
>  
> Scala 2.13
> {code:java}
> Welcome to Scala 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.8).
> Type in expressions for evaluation. Or try :help.
> scala> val enum: Int = 1
>            ^
>        warning: Wrap `enum` in backticks to use it as an identifier, it will 
> become a keyword in Scala 3. [quickfixable]
> val enum: Int = 1
> scala> val export: Int = 1
>            ^
>        warning: Wrap `export` in backticks to use it as an identifier, it 
> will become a keyword in Scala 3. [quickfixable]
> val export: Int = 1
> scala> val given: Int = 1
>            ^
>        warning: Wrap `given` in backticks to use it as an identifier, it will 
> become a keyword in Scala 3. [quickfixable]
> val given: Int = 1 {code}
>  
> Scala 3
>  
> {code:java}
> Welcome to Scala 3.3.1 (17.0.8, Java OpenJDK 64-Bit Server VM).
> Type in expressions for evaluation. Or try :help.
> scala> val enum: Int = 1
> -- [E032] Syntax Error: 
> 1 |val enum: Int = 1
>   |    ^^^^
>   |    pattern expected
>   |
>   | longer explanation available when compiling with `-explain`
> scala> val export: Int = 1
> -- [E032] Syntax Error: 
> 1 |val export: Int = 1
>   |    ^^^^^^
>   |    pattern expected
>   |
>   | longer explanation available when compiling with `-explain`
> scala> val given: Int = 1
> -- [E040] Syntax Error: 
> 1 |val given: Int = 1
>   |         ^
>   |         an identifier expected, but ':' found
>   |
>   | longer explanation available when compiling with `-explain` {code}






[jira] [Updated] (SPARK-45670) SparkSubmit does not support --total-executor-cores when deploying on K8s

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45670:
---
Labels: pull-request-available  (was: )

> SparkSubmit does not support --total-executor-cores when deploying on K8s
> -
>
> Key: SPARK-45670
> URL: https://issues.apache.org/jira/browse/SPARK-45670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.3.3, 3.4.1, 3.5.0
>Reporter: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45670) SparkSubmit does not support --total-executor-cores when deploying on K8s

2023-10-25 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-45670:
-

 Summary: SparkSubmit does not support --total-executor-cores when 
deploying on K8s
 Key: SPARK-45670
 URL: https://issues.apache.org/jira/browse/SPARK-45670
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 3.5.0, 3.4.1, 3.3.3
Reporter: Cheng Pan









[jira] [Updated] (SPARK-43923) [CONNECT] Post listenerBus events during ExecutePlanRequest

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-43923:
---
Labels: pull-request-available  (was: )

> [CONNECT] Post listenerBus events during ExecutePlanRequest
> ---
>
> Key: SPARK-43923
> URL: https://issues.apache.org/jira/browse/SPARK-43923
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Jean-Francois Desjeans Gauthier
>Assignee: Jean-Francois Desjeans Gauthier
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0, 4.0.0
>
>
> Post events SparkListenerConnectOperationStarted, 
> SparkListenerConnectOperationParsed, SparkListenerConnectOperationCanceled,  
> SparkListenerConnectOperationFailed, SparkListenerConnectOperationFinished, 
> SparkListenerConnectOperationClosed & SparkListenerConnectSessionClosed.
> Mirror what is currently available in HiveThriftServer2EventManager.
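For illustration, a hedged sketch of how a user could observe these events once they are posted; the event class names come from the list above, the listener class name is hypothetical, and matching on the class name avoids assuming the events' exact constructors:

{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Logs any Spark Connect operation/session event posted on the listener bus.
class ConnectEventLogger extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = {
    val name = event.getClass.getSimpleName
    if (name.startsWith("SparkListenerConnect")) {
      println(s"Received Connect event: $name")
    }
  }
}

// spark.sparkContext.addSparkListener(new ConnectEventLogger())
{code}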






[jira] [Updated] (SPARK-45669) Ensure the continuity of rolling log index

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45669:
---
Labels: pull-request-available  (was: )

> Ensure the continuity of rolling log index
> --
>
> Key: SPARK-45669
> URL: https://issues.apache.org/jira/browse/SPARK-45669
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: shuyouZZ
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the log file index is incremented before `initLogFile()` runs. 
> When `rollEventLogFile()` is called, the index is incremented first, and only 
> then is the new log file created to initialize the event log writer.
>  
> If the log file creation fails, `initLogFile()` throws an exception and the 
> method stops. The log file index will still be incremented the next time 
> `rollEventLogFile()` is called, which causes the file indices to become 
> discontinuous, and EventLogFileReader can no longer read the event log files 
> normally.
>  
> Therefore, we need to update the logic here to ensure the continuity of the 
> rolling log index.






[jira] [Updated] (SPARK-45667) Clean up the deprecated API usage related to `IterableOnceExtensionMethods`.

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45667:
---
Labels: pull-request-available  (was: )

> Clean up the deprecated API usage related to `IterableOnceExtensionMethods`.
> 
>
> Key: SPARK-45667
> URL: https://issues.apache.org/jira/browse/SPARK-45667
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-45668) Improve the assert message in `RollingEventLogFilesFileReader`

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45668:
---
Labels: pull-request-available  (was: )

> Improve the assert message in `RollingEventLogFilesFileReader`
> --
>
> Key: SPARK-45668
> URL: https://issues.apache.org/jira/browse/SPARK-45668
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: shuyouZZ
>Priority: Minor
>  Labels: pull-request-available
>
> Currently the assert message in `RollingEventLogFilesFileReader` is not 
> clear. When an assertion fails, it's difficult to find the event log file 
> that has problems.
> In this ticket, we will update the following two places to make the assert 
> messages clearer:
>  * *val files*: add the `rootPath` field to the assert message.
>  * *val eventLogFiles*: add a new method `findMissingIndices` and print only 
> the missing event log indices instead of all indices.






[jira] [Resolved] (SPARK-45665) Uses different ORACLE_DOCKER_IMAGE_NAME in the scheduled builds in other branches

2023-10-25 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-45665.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43496
[https://github.com/apache/spark/pull/43496]

> Uses different ORACLE_DOCKER_IMAGE_NAME in the scheduled builds in other 
> branches
> -
>
> Key: SPARK-45665
> URL: https://issues.apache.org/jira/browse/SPARK-45665
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-45665) Uses different ORACLE_DOCKER_IMAGE_NAME in the scheduled builds in other branches

2023-10-25 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-45665:


Assignee: Yang Jie

> Uses different ORACLE_DOCKER_IMAGE_NAME in the scheduled builds in other 
> branches
> -
>
> Key: SPARK-45665
> URL: https://issues.apache.org/jira/browse/SPARK-45665
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45669) Ensure the continuity of rolling log index

2023-10-25 Thread shuyouZZ (Jira)
shuyouZZ created SPARK-45669:


 Summary: Ensure the continuity of rolling log index
 Key: SPARK-45669
 URL: https://issues.apache.org/jira/browse/SPARK-45669
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: shuyouZZ


Currently, the log file index is incremented before `initLogFile()` runs. When 
`rollEventLogFile()` is called, the index is incremented first, and only then 
is the new log file created to initialize the event log writer.
 
If the log file creation fails, `initLogFile()` throws an exception and the 
method stops. The log file index will still be incremented the next time 
`rollEventLogFile()` is called, which causes the file indices to become 
discontinuous, and EventLogFileReader can no longer read the event log files 
normally.
 
Therefore, we need to update the logic here to ensure the continuity of the 
rolling log index.
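A hedged sketch of the ordering problem and one possible fix, with the real writer internals simplified away:

{code:scala}
// Simplified stand-in for the rolling event log writer; not the actual Spark code.
object RollingIndexSketch {
  private var index = 0L

  private def initLogFile(i: Long): Unit = {
    // create event log file #i; may throw, e.g. on a transient filesystem error
  }

  // Buggy ordering: the index advances even when file creation fails,
  // leaving a gap in the sequence of rolled files.
  def rollEventLogFileBuggy(): Unit = {
    index += 1
    initLogFile(index)
  }

  // Fixed ordering: only commit the new index once the file exists.
  def rollEventLogFileFixed(): Unit = {
    val next = index + 1
    initLogFile(next)
    index = next
  }
}
{code}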






[jira] [Updated] (SPARK-45668) Improve the assert message in `RollingEventLogFilesFileReader`

2023-10-25 Thread shuyouZZ (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shuyouZZ updated SPARK-45668:
-
Description: 
Currently the assert message in `RollingEventLogFilesFileReader` is not clear. 
When an assertion fails, it's difficult to find the event log file that has 
problems.

In this ticket, we will update the following two places to make the assert 
messages clearer:
 * *val files*: add the `rootPath` field to the assert message.
 * *val eventLogFiles*: add a new method `findMissingIndices` and print only 
the missing event log indices instead of all indices.
  was:
Currently the assert message in RollingEventLogFilesFileReader ** is not clear. 
When the assertion fails, it's difficult to find the event log file that is 
having problems.

In this ticket, we will update the following two places to make the assert 
message clearer.
 * {*}val files{*}. Add `rootPath{*}`{*} field in the assert message.
 * {*}val eventLogFiles{*}. Add new method `findMissingIndices`, only print the 
missing event log indices instead of all indices.


> Improve the assert message in `RollingEventLogFilesFileReader`
> --
>
> Key: SPARK-45668
> URL: https://issues.apache.org/jira/browse/SPARK-45668
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: shuyouZZ
>Priority: Minor
>
> Currently the assert message in `RollingEventLogFilesFileReader` is not 
> clear. When an assertion fails, it's difficult to find the event log file 
> that has problems.
> In this ticket, we will update the following two places to make the assert 
> messages clearer:
>  * *val files*: add the `rootPath` field to the assert message.
>  * *val eventLogFiles*: add a new method `findMissingIndices` and print only 
> the missing event log indices instead of all indices.






[jira] [Created] (SPARK-45668) Improve the assert message in `RollingEventLogFilesFileReader`

2023-10-25 Thread shuyouZZ (Jira)
shuyouZZ created SPARK-45668:


 Summary: Improve the assert message in 
`RollingEventLogFilesFileReader`
 Key: SPARK-45668
 URL: https://issues.apache.org/jira/browse/SPARK-45668
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: shuyouZZ


Currently the assert message in `RollingEventLogFilesFileReader` is not clear. 
When an assertion fails, it's difficult to find the event log file that has 
problems.

In this ticket, we will update the following two places to make the assert 
messages clearer (see the sketch below):
 * *val files*: add the `rootPath` field to the assert message.
 * *val eventLogFiles*: add a new method `findMissingIndices` and print only 
the missing event log indices instead of all indices.
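An illustrative sketch of the clearer assertions described above; only `rootPath`, `files`, and `eventLogFiles` come from the ticket, the rest is made up for the example:

{code:scala}
// Compute which indices are missing from an otherwise contiguous range.
def findMissingIndices(indices: Seq[Long]): Seq[Long] =
  if (indices.isEmpty) Seq.empty
  else (indices.min to indices.max).filterNot(indices.contains)

// assert(files.nonEmpty,
//   s"Event log directory $rootPath does not contain any event log files")
// assert(missing.isEmpty,
//   s"Event log indices under $rootPath are not continuous; " +
//     s"missing indices: ${findMissingIndices(indices).mkString(", ")}")
{code}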






[jira] [Resolved] (SPARK-45650) fix dev/mina get scala 2.12

2023-10-25 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-45650.
--
Resolution: Not A Problem

> fix dev/mina get scala 2.12 
> 
>
> Key: SPARK-45650
> URL: https://issues.apache.org/jira/browse/SPARK-45650
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: tangjiafu
>Priority: Major
>
> Currently, when the CI executes ./dev/mima, it generates an incompatibility 
> error related to Scala 2.12. Sorry, I don't know how to fix it.
> [info] [launcher] getting org.scala-sbt sbt 1.9.3  (this may take some 
> time)...
> [info] [launcher] getting Scala 2.12.18 (for sbt)...






[jira] [Commented] (SPARK-45650) fix dev/mina get scala 2.12

2023-10-25 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779724#comment-17779724
 ] 

Yang Jie commented on SPARK-45650:
--

{code:java}
[error] spark-sql-api: Failed binary compatibility check against 
org.apache.spark:spark-sql-api_2.13:3.5.0! Found 6 potential problems (filtered 
28)
[error]  * interface org.apache.spark.sql.types.DoubleType#DoubleAsIfIntegral 
does not have a correspondent in current version
[error]    filter with: 
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.types.DoubleType$DoubleAsIfIntegral")
[error]  * object org.apache.spark.sql.types.DoubleType#DoubleAsIfIntegral does 
not have a correspondent in current version
[error]    filter with: 
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.types.DoubleType$DoubleAsIfIntegral$")
[error]  * interface org.apache.spark.sql.types.DoubleType#DoubleIsConflicted 
does not have a correspondent in current version
[error]    filter with: 
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.types.DoubleType$DoubleIsConflicted")
[error]  * interface org.apache.spark.sql.types.FloatType#FloatAsIfIntegral 
does not have a correspondent in current version
[error]    filter with: 
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.types.FloatType$FloatAsIfIntegral")
[error]  * object org.apache.spark.sql.types.FloatType#FloatAsIfIntegral does 
not have a correspondent in current version
[error]    filter with: 
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.types.FloatType$FloatAsIfIntegral$")
[error]  * interface org.apache.spark.sql.types.FloatType#FloatIsConflicted 
does not have a correspondent in current version
[error]    filter with: 
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.types.FloatType$FloatIsConflicted")
 {code}
[https://github.com/laglangyue/spark/actions/runs/6614029427/job/17963169741]

 

As the compilation log says, this is because your PR changes broke the related 
API compatibility. You need to modify the code to maintain API compatibility, 
or add the corresponding rules to MimaExcludes to skip the check. This is not 
a bug in the MiMa script, so I will close this issue.
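For reference, a sketch of what the second option looks like, using the filter strings suggested in the log above; the exact sequence these rules are appended to in project/MimaExcludes.scala may differ:

{code:scala}
import com.typesafe.tools.mima.core._

// Example exclusions built from the "filter with:" hints printed by MiMa.
lazy val exampleExcludes = Seq(
  ProblemFilters.exclude[MissingClassProblem](
    "org.apache.spark.sql.types.DoubleType$DoubleAsIfIntegral"),
  ProblemFilters.exclude[MissingClassProblem](
    "org.apache.spark.sql.types.FloatType$FloatAsIfIntegral")
)
{code}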

> fix dev/mina get scala 2.12 
> 
>
> Key: SPARK-45650
> URL: https://issues.apache.org/jira/browse/SPARK-45650
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: tangjiafu
>Priority: Major
>
> Currently, when the CI executes ./dev/mima, it generates an incompatibility 
> error related to Scala 2.12. Sorry, I don't know how to fix it.
> [info] [launcher] getting org.scala-sbt sbt 1.9.3  (this may take some 
> time)...
> [info] [launcher] getting Scala 2.12.18 (for sbt)...






[jira] [Created] (SPARK-45667) Clean up the deprecated API usage related to `IterableOnceExtensionMethods`.

2023-10-25 Thread Yang Jie (Jira)
Yang Jie created SPARK-45667:


 Summary: Clean up the deprecated API usage related to 
`IterableOnceExtensionMethods`.
 Key: SPARK-45667
 URL: https://issues.apache.org/jira/browse/SPARK-45667
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, Spark Core, SQL
Affects Versions: 4.0.0
Reporter: Yang Jie









[jira] [Updated] (SPARK-45665) Uses different ORACLE_DOCKER_IMAGE_NAME in the scheduled builds in other branches

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45665:
---
Labels: pull-request-available  (was: )

> Uses different ORACLE_DOCKER_IMAGE_NAME in the scheduled builds in other 
> branches
> -
>
> Key: SPARK-45665
> URL: https://issues.apache.org/jira/browse/SPARK-45665
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-45635) Cleanup unused import for PySpark testing

2023-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45635:


Assignee: Haejoon Lee

> Cleanup unused import for PySpark testing
> -
>
> Key: SPARK-45635
> URL: https://issues.apache.org/jira/browse/SPARK-45635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> Cleanup unused import for PySpark testing






[jira] [Resolved] (SPARK-45635) Cleanup unused import for PySpark testing

2023-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45635.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43489
[https://github.com/apache/spark/pull/43489]

> Cleanup unused import for PySpark testing
> -
>
> Key: SPARK-45635
> URL: https://issues.apache.org/jira/browse/SPARK-45635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Cleanup unused import for PySpark testing






[jira] [Resolved] (SPARK-45634) Remove `get_dtype_counts` from Pandas API on Spark

2023-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45634.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43488
[https://github.com/apache/spark/pull/43488]

> Remove `get_dtype_counts` from Pandas API on Spark
> --
>
> Key: SPARK-45634
> URL: https://issues.apache.org/jira/browse/SPARK-45634
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The internal API get_dtype_counts is no longer used by the Pandas API on Spark.






[jira] [Updated] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-19335:
---
Labels: pull-request-available  (was: )

> Spark should support doing an efficient DataFrame Upsert via JDBC
> -
>
> Key: SPARK-19335
> URL: https://issues.apache.org/jira/browse/SPARK-19335
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ilya Ganelin
>Priority: Minor
>  Labels: pull-request-available
>
> Doing a database update, as opposed to an insert, is useful, particularly 
> when working with streaming applications, which may require revisions to 
> previously stored data. 
> Spark DataFrames/Datasets do not currently support an update feature via the 
> JDBC writer, which allows only Overwrite or Append.
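For context, a hedged sketch of the workaround people use today: hand-rolling the upsert with foreachPartition and a database-specific MERGE/ON CONFLICT statement. The connection string, table, and column names below are made up, and `df` is assumed to be an existing DataFrame with matching columns:

{code:scala}
import java.sql.DriverManager
import org.apache.spark.sql.Row

// Upsert each partition through plain JDBC (PostgreSQL-style ON CONFLICT shown).
val upsertPartition: Iterator[Row] => Unit = { rows =>
  val conn = DriverManager.getConnection("jdbc:postgresql://db-host/db", "user", "pass")
  val stmt = conn.prepareStatement(
    "INSERT INTO events (id, payload) VALUES (?, ?) " +
      "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload")
  try {
    rows.foreach { row =>
      stmt.setInt(1, row.getAs[Int]("id"))
      stmt.setString(2, row.getAs[String]("payload"))
      stmt.addBatch()
    }
    stmt.executeBatch()
  } finally {
    stmt.close()
    conn.close()
  }
}

df.foreachPartition(upsertPartition)
{code}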






[jira] [Assigned] (SPARK-45634) Remove `get_dtype_counts` from Pandas API on Spark

2023-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45634:


Assignee: Haejoon Lee

> Remove `get_dtype_counts` from Pandas API on Spark
> --
>
> Key: SPARK-45634
> URL: https://issues.apache.org/jira/browse/SPARK-45634
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> The internal API get_dtype_counts is no longer used by the Pandas API on Spark.






[jira] [Assigned] (SPARK-45136) Improve ClosureCleaner to support closures defined in Ammonite REPL

2023-10-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-45136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell reassigned SPARK-45136:
-

Assignee: Vsevolod Stepanov  (was: Herman van Hövell)

> Improve ClosureCleaner to support closures defined in Ammonite REPL
> ---
>
> Key: SPARK-45136
> URL: https://issues.apache.org/jira/browse/SPARK-45136
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Vsevolod Stepanov
>Assignee: Vsevolod Stepanov
>Priority: Major
>  Labels: pull-request-available
>
> ConnectRepl uses the Ammonite REPL with CodeClassWrapper to run Scala code. 
> This means that each code cell is wrapped into a separate object. If multiple 
> variables are defined in the same cell / code block, closures will capture 
> extra variables, increasing the serialized UDF payload size or making it 
> non-serializable.
> For example, this code
> {code:java}
> // cell 1 
> {
>   val x = 100
>   val y = new NonSerializable
> }
> // cell 2
> spark.range(10).map(i => i + x).agg(sum("value")).collect(){code}
> will fail because the lambda captures both `x` and `y`, as they're defined in 
> the same wrapper object.






[jira] [Resolved] (SPARK-45136) Improve ClosureCleaner to support closures defined in Ammonite REPL

2023-10-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-45136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-45136.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

> Improve ClosureCleaner to support closures defined in Ammonite REPL
> ---
>
> Key: SPARK-45136
> URL: https://issues.apache.org/jira/browse/SPARK-45136
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Vsevolod Stepanov
>Assignee: Vsevolod Stepanov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> ConnectRepl uses the Ammonite REPL with CodeClassWrapper to run Scala code. 
> This means that each code cell is wrapped into a separate object. If multiple 
> variables are defined in the same cell / code block, closures will capture 
> extra variables, increasing the serialized UDF payload size or making it 
> non-serializable.
> For example, this code
> {code:java}
> // cell 1 
> {
>   val x = 100
>   val y = new NonSerializable
> }
> // cell 2
> spark.range(10).map(i => i + x).agg(sum("value")).collect(){code}
> will fail because the lambda captures both `x` and `y`, as they're defined in 
> the same wrapper object.






[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken

2023-10-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-45658:

Target Version/s:   (was: 3.5.1)

> Canonicalization of DynamicPruningSubquery is broken
> 
>
> Key: SPARK-45658
> URL: https://issues.apache.org/jira/browse/SPARK-45658
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Asif
>Priority: Major
>
> The canonicalization of the buildKeys (Seq[Expression]) in the class 
> DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by 
> calling 
> buildKeys.map(_.canonicalized)
> The above results in incorrect canonicalization, as it does not normalize the 
> exprIds relative to the buildQuery output.
> The fix is to use the output of buildQuery (a LogicalPlan) to normalize the 
> buildKeys expressions, as given below, using the standard approach:
> buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output))
> Will be filing a PR and a bug test for the same.
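The before/after from the description, laid out side by side for clarity (as stated by the reporter; not verified against a merged patch):

{code:scala}
// Before: exprIds are canonicalized without reference to the build-side plan.
buildKeys.map(_.canonicalized)

// Proposed fix: normalize the expressions against buildQuery's output first.
buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output))
{code}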






[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken

2023-10-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-45658:

Affects Version/s: (was: 3.5.1)

> Canonicalization of DynamicPruningSubquery is broken
> 
>
> Key: SPARK-45658
> URL: https://issues.apache.org/jira/browse/SPARK-45658
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Major
>
> The canonicalization of the buildKeys (Seq[Expression]) in the class 
> DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by 
> calling 
> buildKeys.map(_.canonicalized)
> The above results in incorrect canonicalization, as it does not normalize the 
> exprIds relative to the buildQuery output.
> The fix is to use the output of buildQuery (a LogicalPlan) to normalize the 
> buildKeys expressions, as given below, using the standard approach:
> buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output))
> Will be filing a PR and a bug test for the same.






[jira] [Assigned] (SPARK-45661) Add toNullable in StructType, MapType and ArrayType

2023-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45661:


Assignee: Hyukjin Kwon

> Add toNullable in StructType, MapType and ArrayType
> ---
>
> Key: SPARK-45661
> URL: https://issues.apache.org/jira/browse/SPARK-45661
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> Add {{StructType.toNullable}} (and the equivalents on MapType and ArrayType) 
> to return nullable schemas. See 
> https://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe
> as an example.
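For illustration, a sketch of what such a helper boils down to when written by hand against the public types; this is not the merged API, just the recursive idea:

{code:scala}
import org.apache.spark.sql.types._

// Recursively mark every field, array element, and map value as nullable.
def toNullable(dt: DataType): DataType = dt match {
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = toNullable(f.dataType), nullable = true)))
  case ArrayType(elementType, _) =>
    ArrayType(toNullable(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    MapType(toNullable(keyType), toNullable(valueType), valueContainsNull = true)
  case other => other
}
{code}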






[jira] [Resolved] (SPARK-45661) Add toNullable in StructType, MapType and ArrayType

2023-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45661.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43523
[https://github.com/apache/spark/pull/43523]

> Add toNullable in StructType, MapType and ArrayType
> ---
>
> Key: SPARK-45661
> URL: https://issues.apache.org/jira/browse/SPARK-45661
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add {{StructType.toNullable}} (and the equivalents on MapType and ArrayType) 
> to return nullable schemas. See 
> https://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe
> as an example.






[jira] [Updated] (SPARK-44440) Use thread pool to perform maintenance activity for hdfs/rocksdb state store providers

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44440:
---
Labels: pull-request-available  (was: )

> Use thread pool to perform maintenance activity for hdfs/rocksdb state store 
> providers
> --
>
> Key: SPARK-44440
> URL: https://issues.apache.org/jira/browse/SPARK-44440
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Eric Marnadi
>Priority: Major
>  Labels: pull-request-available
>
> Use thread pool to perform maintenance activity for hdfs/rocksdb state store 
> providers
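A hedged sketch of the general idea only (not Spark's actual maintenance code): instead of one maintenance loop walking every loaded provider, the work is submitted to a small pool so a slow provider does not delay the others. The pool size and helper names are made up:

{code:scala}
import java.util.concurrent.Executors

// Each task stands in for one provider's maintenance work (snapshotting, cleanup, ...).
val maintenancePool = Executors.newFixedThreadPool(4)

def scheduleMaintenance(providerTasks: Seq[() => Unit]): Unit = {
  providerTasks.foreach { task =>
    maintenancePool.submit(new Runnable {
      override def run(): Unit = task()
    })
  }
}
{code}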






[jira] [Updated] (SPARK-44124) Upgrade AWS SDK to v2

2023-10-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44124:
--
Parent: (was: SPARK-44111)
Issue Type: Improvement  (was: Sub-task)

> Upgrade AWS SDK to v2
> -
>
> Key: SPARK-44124
> URL: https://issues.apache.org/jira/browse/SPARK-44124
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Here is a design doc:
> https://docs.google.com/document/d/1nGWbGTqxuFBG2ftfYYXxzrkipINILfWCOwse36yg7Ig/edit?usp=sharing






[jira] [Updated] (SPARK-45652) SPJ: Handle empty input partitions after dynamic filtering

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45652:
---
Labels: pull-request-available  (was: )

> SPJ: Handle empty input partitions after dynamic filtering
> --
>
> Key: SPARK-45652
> URL: https://issues.apache.org/jira/browse/SPARK-45652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Chao Sun
>Priority: Major
>  Labels: pull-request-available
>
> When the number of input partitions becomes 0 after dynamic filtering in 
> {{BatchScanExec}}, SPJ currently fails with the error:
> {code}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:529)
>   at scala.None$.get(Option.scala:527)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135)
>   at 
> org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD$lzycompute(BosonBatchScanExec.scala:28)
>   at 
> org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD(BosonBatchScanExec.scala:28)
>   at 
> org.apache.spark.sql.boson.BosonBatchScanExec.doExecuteColumnar(BosonBatchScanExec.scala:33)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:218)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:521)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:222)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> {code}
> This is because {{groupPartitions}} will return {{None}} for this case.
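A hedged sketch of the failure shape, with the internals boiled down to plain Scala; the real groupPartitions operates on input partitions and partition keys, not integers:

{code:scala}
// Stand-in for the real grouping logic: it returns None when there is nothing to group.
def groupPartitions(parts: Seq[Int]): Option[Seq[Seq[Int]]] =
  if (parts.isEmpty) None else Some(parts.grouped(2).toSeq)

val afterDynamicFiltering: Seq[Int] = Seq.empty   // every partition was pruned

// groupPartitions(afterDynamicFiltering).get     // java.util.NoSuchElementException: None.get
val safe = groupPartitions(afterDynamicFiltering).getOrElse(Seq.empty)  // one possible guard
{code}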






[jira] [Updated] (SPARK-29615) Add insertInto method with byName parameter in DataFrameWriter

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-29615:
---
Labels: pull-request-available  (was: )

> Add insertInto method with byName parameter in DataFrameWriter
> --
>
> Key: SPARK-29615
> URL: https://issues.apache.org/jira/browse/SPARK-29615
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hu Fuwang
>Priority: Major
>  Labels: pull-request-available
>
> Currently, insertion through the DataFrameWriter.insertInto method ignores 
> the column names and just uses position-based resolution. As DataFrameWriter 
> only has one public insertInto method, Spark users may not check the 
> description of this method and assume Spark will match the columns by name. 
> In such a case, the wrong column may be used as the partition column, which 
> may cause problems (e.g. a huge number of files/folders may be created in the 
> Hive table's temporary location).
> We propose adding a new insertInto method in DataFrameWriter with a byName 
> parameter for Spark users to specify whether to match columns by name.
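For illustration, a minimal sketch of the positional pitfall and of the proposed overload; the target table is hypothetical, and the byName overload is exactly what this ticket proposes, not an existing API:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("insertInto-byName-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "2023-10-25")).toDF("value", "dt")

// Today: columns are matched strictly by position, so a reordered DataFrame
// silently writes values into the wrong target columns (e.g. the partition column).
df.select("dt", "value").write.insertInto("target_table")

// Proposed: opt into by-name resolution.
// df.select("dt", "value").write.insertInto("target_table", byName = true)
{code}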






[jira] [Updated] (SPARK-45666) spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.

2023-10-25 Thread dinesh sachdev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dinesh sachdev updated SPARK-45666:
---
Description: 
We had a requirement to write a custom 
org.apache.kafka.clients.producer.Partitioner to use with the Kafka data source 
available in the "{{{}spark-sql-kafka-0-10_2.11{}}}" package.

Ideally, properties set on the producer are available to the Partitioner method 
[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs).

However, we realized that custom properties passed as options to the 
Kafka-format DataFrameWriter are not available to the Partitioner, whether or 
not we prefix the property with "kafka.".

Only the configs listed at 
[https://kafka.apache.org/documentation/#producerconfigs] are passed to the 
Partitioner, but in some cases custom properties are required to initialize the 
Partitioner.

Thus, there should be a way to set custom properties as options on the Kafka 
data source, not just producer configs. Otherwise, a custom partitioner cannot 
be initialized and implemented as needed.

For example:
_df.write.format("kafka").option("Kafka.customproperty1", 
"value1").option("kafka.partitioner.class", "com.mycustom.ipartitioner")_

_package com.mycustom;_
_import java.util.Map;_
_import org.apache.kafka.clients.producer.Partitioner;_
_public class ipartitioner implements Partitioner {_

_@Override_
_public void configure(Map<String, ?> configs) {_

_System.out.println(configs); // customproperty1 is missing here but should be available._

_}_
_..._
_..._

_}_
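
For reference, a self-contained Scala sketch of such a custom partitioner is 
below (the class name and the property name are illustrative, not taken from 
the report). As described above, configure() currently only receives the 
standard producer configs, so the custom option never arrives:

{code:scala}
import java.util.{Map => JMap}
import java.util.concurrent.ThreadLocalRandom
import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster

// Illustrative custom partitioner; "customproperty1" is a hypothetical option name.
class MyCustomPartitioner extends Partitioner {

  override def configure(configs: JMap[String, _]): Unit = {
    // As reported, an option such as "kafka.customproperty1" set on the
    // DataFrameWriter does not show up here; only the standard producer configs do.
    println(configs)
  }

  override def partition(topic: String, key: AnyRef, keyBytes: Array[Byte],
                         value: AnyRef, valueBytes: Array[Byte], cluster: Cluster): Int = {
    // Placeholder strategy: pick a random partition of the topic.
    val partitions = cluster.partitionsForTopic(topic).size()
    if (partitions <= 0) 0 else ThreadLocalRandom.current().nextInt(partitions)
  }

  override def close(): Unit = ()
}
{code}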

 

  was:
We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -

[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)

 
 
But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

For example - 
_df.write.format("kafka").option("Kafka.customproperty1", 
"value1").option("kafka.partitioner.class", "com.mycustom.ipartitioner")_

..
...
.
_package com.mycustom;_
_import java.util.Map;_
_import org.apache.kafka.clients.producer.Partitioner;_
_public class ipartitioner implements Partitioner {_

_@Override_
_public void configure(Map<String, ?> configs) {_

_System.out.println(configs); // customproperty1 is missing here but should be available._

_}_
_..._
_..._

_}_

 


> spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.
> ---
>
> Key: SPARK-45666
> URL: https://issues.apache.org/jira/browse/SPARK-45666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dinesh sachdev
>Priority: Major
>
> We had a requirement to write Custom 
> org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
> available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "
> Ideally, properties set as part of Producer are available to  Partitioner 
> method -
> [configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
>  configs)
>  
>  
> But, we realized that Custom properties passed as options to Kafka format 
> DataFrameWriter are not available to Partitioner whether we append that 
> property with literal "kafka." or not.
> Only, Configs listed on - 
> [https://kafka.apache.org/documentation/#producerconfigs] were passed to 
> Partitioner. But, in some cases it is required to pass custom properties for 
> initialization of Partitioner. 
> Thus, 

[jira] [Updated] (SPARK-45666) spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.

2023-10-25 Thread dinesh sachdev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dinesh sachdev updated SPARK-45666:
---
Description: 
We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -

[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)

 
 
But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

For example - 
_df.write.format("kafka").option("Kafka.customproperty1", 
"value1").option("kafka.partitioner.class", "com.mycustom.ipartitioner")_

..
...
.
_package com.mycustom;_
_import java.util.Map;_
_import org.apache.kafka.clients.producer.Partitioner;_
_public class ipartitioner implements Partitioner {_

_@Override_
_public void configure(Map<String, ?> configs) {_

_System.out.println(configs); // customproperty1 is missing here but should be available._

_}_
_..._
_..._

_}_

 

  was:
We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -

[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)

 
 
But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

For example - 
_df.write.format("kafka").option("Kafka.customproperty1", 
"value1").option("kafka.partitioner.class", "com.mycustom.ipartitioner")_

..
...
.
_package com.mycustom;_
_import java.util.Map;_
_import org.apache.kafka.clients.producer.Partitioner;_
_public class ipartitioner implements Partitioner {_

_@Override_
_public void configure(Map<String, ?> configs) {_

_System.out.println(configs); // Kafka.customproperty1 is missing here but should be available._

_}_
_..._
_..._

_}_

 


> spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.
> ---
>
> Key: SPARK-45666
> URL: https://issues.apache.org/jira/browse/SPARK-45666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dinesh sachdev
>Priority: Major
>
> We had a requirement to write Custom 
> org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
> available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "
> Ideally, properties set as part of Producer are available to  Partitioner 
> method -
> [configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
>  configs)
>  
>  
> But, we realized that Custom properties passed as options to Kafka format 
> DataFrameWriter are not available to Partitioner whether we append that 
> property with literal "kafka." or not.
> Only, Configs listed on - 
> [https://kafka.apache.org/documentation/#producerconfigs] were passed to 
> Partitioner. But, in some cases it is required to pass custom properties for 
> initialization of Partitioner. 
> Thus, there should be provision to set custom 

[jira] [Updated] (SPARK-45666) spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.

2023-10-25 Thread dinesh sachdev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dinesh sachdev updated SPARK-45666:
---
Description: 
We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -

[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)

 
 
But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

For example - 
_df.write.format("kafka").option("Kafka.customproperty1", 
"value1").option("kafka.partitioner.class", "com.mycustom.ipartitioner")_

..
...
.
_package com.mycustom;_
_import java.util.Map;_
_import org.apache.kafka.clients.producer.Partitioner;_
_public class ipartitioner implements Partitioner {_

_@Override_
_public void configure(Map<String, ?> configs) {_

_System.out.println(configs); // Kafka.customproperty1 is missing here but should be available._

_}_

_}_

 

  was:
We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -

[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)

 
 
But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

For example - 
_df.write.format("kafka").option("Kafka.customproperty1", 
"value1").option("kafka.partitioner.class", "com.mycustom.ipartitioner")_



_public class ipartitioner implements Partitioner {_

_public void configure(Map<String, ?> configs) {_

_System.out.println(configs); // Kafka.customproperty1 is missing here but should be available._

_}_

_}_

 


> spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.
> ---
>
> Key: SPARK-45666
> URL: https://issues.apache.org/jira/browse/SPARK-45666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dinesh sachdev
>Priority: Major
>
> We had a requirement to write Custom 
> org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
> available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "
> Ideally, properties set as part of Producer are available to  Partitioner 
> method -
> [configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
>  configs)
>  
>  
> But, we realized that Custom properties passed as options to Kafka format 
> DataFrameWriter are not available to Partitioner whether we append that 
> property with literal "kafka." or not.
> Only, Configs listed on - 
> [https://kafka.apache.org/documentation/#producerconfigs] were passed to 
> Partitioner. But, in some cases it is required to pass custom properties for 
> initialization of Partitioner. 
> Thus, there should be provision to set custom properties as options with 
> Kafka Data Source not just producer configs. Otherwise, custom partitioner 
> can't be 

[jira] [Updated] (SPARK-45666) spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.

2023-10-25 Thread dinesh sachdev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dinesh sachdev updated SPARK-45666:
---
Description: 
We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -

[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)

 
 
But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

For example - 
_df.write.format("kafka").option("Kafka.customproperty1", 
"value1").option("kafka.partitioner.class", "com.mycustom.ipartitioner")_

..
...
.
_package com.mycustom;_
_import java.util.Map;_
_import org.apache.kafka.clients.producer.Partitioner;_
_public class ipartitioner implements Partitioner {_

_@Override_
_public void configure(Map<String, ?> configs) {_

_System.out.println(configs); // Kafka.customproperty1 is missing here but should be available._

_}_
_..._
_..._

_}_

 

  was:
We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -

[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)

 
 
But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

For example - 
_df.write.format("kafka").option("Kafka.customproperty1", 
"value1").option("kafka.partitioner.class", "com.mycustom.ipartitioner")_

..
...
.
_package com.mycustom;_
_import java.util.Map;_
_import org.apache.kafka.clients.producer.Partitioner;_
_public class ipartitioner implements Partitioner {_

_@Override_
_public void configure(Map<String, ?> configs) {_

_System.out.println(configs); // Kafka.customproperty1 is missing here but should be available._

_}_

_}_

 


> spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.
> ---
>
> Key: SPARK-45666
> URL: https://issues.apache.org/jira/browse/SPARK-45666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dinesh sachdev
>Priority: Major
>
> We had a requirement to write Custom 
> org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
> available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "
> Ideally, properties set as part of Producer are available to  Partitioner 
> method -
> [configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
>  configs)
>  
>  
> But, we realized that Custom properties passed as options to Kafka format 
> DataFrameWriter are not available to Partitioner whether we append that 
> property with literal "kafka." or not.
> Only, Configs listed on - 
> [https://kafka.apache.org/documentation/#producerconfigs] were passed to 
> Partitioner. But, in some cases it is required to pass custom properties for 
> initialization of Partitioner. 
> Thus, there should be provision to set custom 

[jira] [Updated] (SPARK-45666) spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.

2023-10-25 Thread dinesh sachdev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dinesh sachdev updated SPARK-45666:
---
Description: 
We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -

[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)

 
 
But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

For example - 
_df.write.format("kafka").option("Kafka.customproperty1", 
"value1").option("kafka.partitioner.class", "com.mycustom.ipartitioner")_



_public class ipartitioner implements Partitioner {_

_public void configure(Map<String, ?> configs) {_

_System.out.println(configs); // Kafka.customproperty1 is missing here but should be available._

_}_

_}_

 

  was:
We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -

[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)

 
 
But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

For example - 
_df.write.format("kafka").option("Kafka.customproperty1", 
"value1").option("kafka.partitioner.class", "com.mycustom.ipartitioner")_


_public class ipartitioner implements Partitioner {_

_public void configure(Map<String, ?> configs) {_

_System.out.println(configs); // Kafka.customproperty1 is missing here but should be available._

_}_

_}_

 


> spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.
> ---
>
> Key: SPARK-45666
> URL: https://issues.apache.org/jira/browse/SPARK-45666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dinesh sachdev
>Priority: Major
>
> We had a requirement to write Custom 
> org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
> available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "
> Ideally, properties set as part of Producer are available to  Partitioner 
> method -
> [configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
>  configs)
>  
>  
> But, we realized that Custom properties passed as options to Kafka format 
> DataFrameWriter are not available to Partitioner whether we append that 
> property with literal "kafka." or not.
> Only, Configs listed on - 
> [https://kafka.apache.org/documentation/#producerconfigs] were passed to 
> Partitioner. But, in some cases it is required to pass custom properties for 
> initialization of Partitioner. 
> Thus, there should be provision to set custom properties as options with 
> Kafka Data Source not just producer configs. Otherwise, custom partitioner 
> can't be initialized and implemented as per need.
>  
> For example - 
> 

[jira] [Updated] (SPARK-45666) spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.

2023-10-25 Thread dinesh sachdev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dinesh sachdev updated SPARK-45666:
---
Description: 
We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -

[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)

 
 
But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

For example - 
_df.write.format("kafka").option("Kafka.customproperty1", 
"value1").option("kafka.partitioner.class", "com.mycustom.ipartitioner")_


_public class ipartitioner implements Partitioner {_

_public void configure(Map<String, ?> configs) {_

_System.out.println(configs); // Kafka.customproperty1 is missing here but should be available._

_}_

_}_

 

  was:
We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -

[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)

 
 
But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

 

 


> spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.
> ---
>
> Key: SPARK-45666
> URL: https://issues.apache.org/jira/browse/SPARK-45666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dinesh sachdev
>Priority: Major
>
> We had a requirement to write Custom 
> org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
> available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "
> Ideally, properties set as part of Producer are available to  Partitioner 
> method -
> [configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
>  configs)
>  
>  
> But, we realized that Custom properties passed as options to Kafka format 
> DataFrameWriter are not available to Partitioner whether we append that 
> property with literal "kafka." or not.
> Only, Configs listed on - 
> [https://kafka.apache.org/documentation/#producerconfigs] were passed to 
> Partitioner. But, in some cases it is required to pass custom properties for 
> initialization of Partitioner. 
> Thus, there should be provision to set custom properties as options with 
> Kafka Data Source not just producer configs. Otherwise, custom partitioner 
> can't be initialized and implemented as per need.
>  
> For example - 
> _df.write.format("kafka").option("Kafka.customproperty1", 
> "value1").option("kafka.partitioner.class", "com.mycustom.ipartitioner")_
> _public class ipartitioner implements Partitioner {_
> _public void configure(Map<String, ?> configs) {_
> _System.out.println(configs); // Kafka.customproperty1 is missing here but 
> should be available._
> _}_
> _}_
>  



--
This message was sent by Atlassian Jira

[jira] [Updated] (SPARK-45666) spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.

2023-10-25 Thread dinesh sachdev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dinesh sachdev updated SPARK-45666:
---
Description: 
We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -

[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)

 
 
But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

 

 

  was:
We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -
|{{[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)}}|

But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

 

 


> spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.
> ---
>
> Key: SPARK-45666
> URL: https://issues.apache.org/jira/browse/SPARK-45666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dinesh sachdev
>Priority: Major
>
> We had a requirement to write Custom 
> org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
> available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "
> Ideally, properties set as part of Producer are available to  Partitioner 
> method -
> [configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
>  configs)
>  
>  
> But, we realized that Custom properties passed as options to Kafka format 
> DataFrameWriter are not available to Partitioner whether we append that 
> property with literal "kafka." or not.
> Only, Configs listed on - 
> [https://kafka.apache.org/documentation/#producerconfigs] were passed to 
> Partitioner. But, in some cases it is required to pass custom properties for 
> initialization of Partitioner. 
> Thus, there should be provision to set custom properties as options with 
> Kafka Data Source not just producer configs. Otherwise, custom partitioner 
> can't be initialized and implemented as per need.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45666) spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.

2023-10-25 Thread dinesh sachdev (Jira)
dinesh sachdev created SPARK-45666:
--

 Summary: spark-sql-kafka-0-10_2.11 - Custom Configuration's for 
Partitioner not set.
 Key: SPARK-45666
 URL: https://issues.apache.org/jira/browse/SPARK-45666
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
 Environment: We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -
|{{[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)}}|

But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

 

 
Reporter: dinesh sachdev






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45666) spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.

2023-10-25 Thread dinesh sachdev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dinesh sachdev updated SPARK-45666:
---
Description: 
We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -
|{{[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)}}|

But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

 

 
Environment: (was: We had a requirement to write Custom 
org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "

Ideally, properties set as part of Producer are available to  Partitioner 
method -
|{{[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
 configs)}}|

But, we realized that Custom properties passed as options to Kafka format 
DataFrameWriter are not available to Partitioner whether we append that 
property with literal "kafka." or not.

Only, Configs listed on - 
[https://kafka.apache.org/documentation/#producerconfigs] were passed to 
Partitioner. But, in some cases it is required to pass custom properties for 
initialization of Partitioner. 

Thus, there should be provision to set custom properties as options with Kafka 
Data Source not just producer configs. Otherwise, custom partitioner can't be 
initialized and implemented as per need.

 

 

 )

> spark-sql-kafka-0-10_2.11 - Custom Configuration's for Partitioner not set.
> ---
>
> Key: SPARK-45666
> URL: https://issues.apache.org/jira/browse/SPARK-45666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dinesh sachdev
>Priority: Major
>
> We had a requirement to write Custom 
> org.apache.kafka.clients.producer.Partitioner to use with Kafka Data Source 
> available with package "{{{}spark-sql-kafka-0-10_2.11{}}} "
> Ideally, properties set as part of Producer are available to  Partitioner 
> method -
> |{{[configure|https://kafka.apache.org/24/javadoc/org/apache/kafka/common/Configurable.html#configure-java.util.Map-]([Map|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html?is-external=true]<[String|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html?is-external=true],?>
>  configs)}}|
> But, we realized that Custom properties passed as options to Kafka format 
> DataFrameWriter are not available to Partitioner whether we append that 
> property with literal "kafka." or not.
> Only, Configs listed on - 
> [https://kafka.apache.org/documentation/#producerconfigs] were passed to 
> Partitioner. But, in some cases it is required to pass custom properties for 
> initialization of Partitioner. 
> Thus, there should be provision to set custom properties as options with 
> Kafka Data Source not just producer configs. Otherwise, custom partitioner 
> can't be initialized and implemented as per need.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45660) Re-use Literal objects when replacing timestamps in the ComputeCurrentTime rule

2023-10-25 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-45660.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43524
[https://github.com/apache/spark/pull/43524]

> Re-use Literal objects when replacing timestamps in the ComputeCurrentTime 
> rule
> ---
>
> Key: SPARK-45660
> URL: https://issues.apache.org/jira/browse/SPARK-45660
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.5.0
>Reporter: Jan-Ole Sasse
>Assignee: Jan-Ole Sasse
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The ComputeCurrentTime optimizer rule produces unique timestamp Literals for 
> the current-time expressions of a query. For CurrentDate and LocalTimestamp 
> expressions, however, the Literal objects themselves are not re-used; an 
> equal object is created for each instance. This can cost an unnecessary 
> amount of memory when there are many such Literal objects.
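
As a rough, hedged illustration of the reuse idea (not the actual rule code; 
all names below are invented), caching one literal value per timezone lets 
every occurrence share a single object instead of allocating equal copies:

{code:scala}
import java.time.{LocalDate, ZoneId}
import scala.collection.mutable

// Stand-in for Catalyst's Literal, just to show object identity being shared.
final case class DateLiteral(epochDay: Long)

object CurrentDateLiterals {
  private val cache = mutable.HashMap.empty[String, DateLiteral]

  // Compute the literal once per timezone and hand back the same instance afterwards.
  def forTimeZone(tz: String): DateLiteral = synchronized {
    cache.getOrElseUpdate(tz, DateLiteral(LocalDate.now(ZoneId.of(tz)).toEpochDay))
  }
}

// forTimeZone("UTC") eq forTimeZone("UTC")  // true: the same object is reused
{code}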



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45660) Re-use Literal objects when replacing timestamps in the ComputeCurrentTime rule

2023-10-25 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-45660:


Assignee: Jan-Ole Sasse

> Re-use Literal objects when replacing timestamps in the ComputeCurrentTime 
> rule
> ---
>
> Key: SPARK-45660
> URL: https://issues.apache.org/jira/browse/SPARK-45660
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.5.0
>Reporter: Jan-Ole Sasse
>Assignee: Jan-Ole Sasse
>Priority: Minor
>  Labels: pull-request-available
>
> The ComputeCurrentTime optimizer rule produces unique timestamp Literals for 
> the current-time expressions of a query. For CurrentDate and LocalTimestamp 
> expressions, however, the Literal objects themselves are not re-used; an 
> equal object is created for each instance. This can cost an unnecessary 
> amount of memory when there are many such Literal objects.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45545) Pass SSLOptions wherever we create a SparkTransportConf

2023-10-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45545:
--
Summary: Pass SSLOptions wherever we create a SparkTransportConf  (was: 
[CORE] Pass SSLOptions wherever we create a SparkTransportConf)

> Pass SSLOptions wherever we create a SparkTransportConf
> ---
>
> Key: SPARK-45545
> URL: https://issues.apache.org/jira/browse/SPARK-45545
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This ensures that we can properly inherit SSL options and use them 
> everywhere, and adds tests to ensure things work.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40154) PySpark: DataFrame.cache docstring gives wrong storage level

2023-10-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40154:
-
Priority: Trivial  (was: Minor)

> PySpark: DataFrame.cache docstring gives wrong storage level
> 
>
> Key: SPARK-40154
> URL: https://issues.apache.org/jira/browse/SPARK-40154
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Paul Staab
>Assignee: Paul Staab
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
>
> The docstring of the `DataFrame.cache()` method currently states that it uses 
> a serialized storage level
> {code:java}
> Persists the :class:`DataFrame` with the default storage level 
> (`MEMORY_AND_DISK`).
> [...]
> -The default storage level has changed to `MEMORY_AND_DISK` to match 
> Scala in 2.0.{code}
> while `DataFrame.persist()` states that it uses a deserialized storage level
> {code:java}
> If no storage level is specified defaults to (`MEMORY_AND_DISK_DESER`)
> [...]
> The default storage level has changed to `MEMORY_AND_DISK_DESER` to match 
> Scala in 3.0.{code}
>  
> However, in practice both `.cache()` and `.persist()` use deserialized 
> storage levels:
> {code:java}
> import pyspark
> from pyspark.sql import SparkSession
> from pyspark import StorageLevel
> print(pyspark.__version__)
> # 3.3.0
> spark = SparkSession.builder.master("local[2]").getOrCreate()
> df = spark.createDataFrame(zip(["A"] * 1000, ["B"] * 1000), ["col_a", 
> "col_b"])
> df = df.cache()
> df.count()
> # Storage level in Spark UI: "Disk Memory Deserialized 1x Replicated"
> df = spark.createDataFrame(zip(["A"] * 1000, ["B"] * 1000), ["col_a", 
> "col_b"])
> df = df.persist()
> df.count()
> # Storage level in Spark UI: "Disk Memory Deserialized 1x Replicated"
> df = spark.createDataFrame(zip(["A"] * 1000, ["B"] * 1000), ["col_a", 
> "col_b"])
> df = df.persist(StorageLevel.MEMORY_AND_DISK)
> df.count()
> # Storage level in Spark UI: "Disk Memory Serialized 1x Replicated"{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40154) PySpark: DataFrame.cache docstring gives wrong storage level

2023-10-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40154.
--
Fix Version/s: 3.5.1
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 43229
[https://github.com/apache/spark/pull/43229]

> PySpark: DataFrame.cache docstring gives wrong storage level
> 
>
> Key: SPARK-40154
> URL: https://issues.apache.org/jira/browse/SPARK-40154
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Paul Staab
>Assignee: Paul Staab
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0, 3.4.2
>
>
> The docstring of the `DataFrame.cache()` method currently states that it uses 
> a serialized storage level
> {code:java}
> Persists the :class:`DataFrame` with the default storage level 
> (`MEMORY_AND_DISK`).
> [...]
> -The default storage level has changed to `MEMORY_AND_DISK` to match 
> Scala in 2.0.{code}
> while `DataFrame.persist()` states that it uses a deserialized storage level
> {code:java}
> If no storage level is specified defaults to (`MEMORY_AND_DISK_DESER`)
> [...]
> The default storage level has changed to `MEMORY_AND_DISK_DESER` to match 
> Scala in 3.0.{code}
>  
> However, in practice both `.cache()` and `.persist()` use deserialized 
> storage levels:
> {code:java}
> import pyspark
> from pyspark.sql import SparkSession
> from pyspark import StorageLevel
> print(pyspark.__version__)
> # 3.3.0
> spark = SparkSession.builder.master("local[2]").getOrCreate()
> df = spark.createDataFrame(zip(["A"] * 1000, ["B"] * 1000), ["col_a", 
> "col_b"])
> df = df.cache()
> df.count()
> # Storage level in Spark UI: "Disk Memory Deserialized 1x Replicated"
> df = spark.createDataFrame(zip(["A"] * 1000, ["B"] * 1000), ["col_a", 
> "col_b"])
> df = df.persist()
> df.count()
> # Storage level in Spark UI: "Disk Memory Deserialized 1x Replicated"
> df = spark.createDataFrame(zip(["A"] * 1000, ["B"] * 1000), ["col_a", 
> "col_b"])
> df = df.persist(StorageLevel.MEMORY_AND_DISK)
> df.count()
> # Storage level in Spark UI: "Disk Memory Serialized 1x Replicated"{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40154) PySpark: DataFrame.cache docstring gives wrong storage level

2023-10-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40154:


Assignee: Paul Staab

> PySpark: DataFrame.cache docstring gives wrong storage level
> 
>
> Key: SPARK-40154
> URL: https://issues.apache.org/jira/browse/SPARK-40154
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Paul Staab
>Assignee: Paul Staab
>Priority: Minor
>  Labels: pull-request-available
>
> The docstring of the `DataFrame.cache()` method currently states that it uses 
> a serialized storage level
> {code:java}
> Persists the :class:`DataFrame` with the default storage level 
> (`MEMORY_AND_DISK`).
> [...]
> -The default storage level has changed to `MEMORY_AND_DISK` to match 
> Scala in 2.0.{code}
> while `DataFrame.persist()` states that it uses a deserialized storage level
> {code:java}
> If no storage level is specified defaults to (`MEMORY_AND_DISK_DESER`)
> [...]
> The default storage level has changed to `MEMORY_AND_DISK_DESER` to match 
> Scala in 3.0.{code}
>  
> However, in practice both `.cache()` and `.persist()` use deserialized 
> storage levels:
> {code:java}
> import pyspark
> from pyspark.sql import SparkSession
> from pyspark import StorageLevel
> print(pyspark.__version__)
> # 3.3.0
> spark = SparkSession.builder.master("local[2]").getOrCreate()
> df = spark.createDataFrame(zip(["A"] * 1000, ["B"] * 1000), ["col_a", 
> "col_b"])
> df = df.cache()
> df.count()
> # Storage level in Spark UI: "Disk Memory Deserialized 1x Replicated"
> df = spark.createDataFrame(zip(["A"] * 1000, ["B"] * 1000), ["col_a", 
> "col_b"])
> df = df.persist()
> df.count()
> # Storage level in Spark UI: "Disk Memory Deserialized 1x Replicated"
> df = spark.createDataFrame(zip(["A"] * 1000, ["B"] * 1000), ["col_a", 
> "col_b"])
> df = df.persist(StorageLevel.MEMORY_AND_DISK)
> df.count()
> # Storage level in Spark UI: "Disk Memory Serialized 1x Replicated"{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44124) Upgrade AWS SDK to v2

2023-10-25 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779460#comment-17779460
 ] 

Steve Loughran commented on SPARK-44124:


Good document.

# I think you could consider cutting the kinesis module entirely. Its lack of 
maintenance and bug reports indicates nobody uses it.
# We don't have a timetable for a 3.4.x release. End of year for the beta phase 
has been discussed, but I'm not going to manage that release. There are a lot 
of other changes there, and we are still trying to stabilize that v2 code in 
HADOOP-18886, with some fairly major regressions surfacing (HADOOP-18945).

One strategy:
# If you cut kinesis, then only the k8s tests use the SDK, and you don't need 
to explicitly redistribute either SDK jar.
# Remove the exclusion of the v1 aws-sdk from the hadoop-cloud module imports, 
and then all Hadoop releases built on v1 get that jar included in the distro.
# If anyone builds with v2, they'll get the exact v2 release that the s3a 
connector has been built with. This matters because the v2 SDK is a fairly 
major migration, and any bug report against the s3a connector which doesn't use 
the SDK version we released against is going to be left to the submitter to 
troubleshoot.


> Upgrade AWS SDK to v2
> -
>
> Key: SPARK-44124
> URL: https://issues.apache.org/jira/browse/SPARK-44124
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Here is a design doc:
> https://docs.google.com/document/d/1nGWbGTqxuFBG2ftfYYXxzrkipINILfWCOwse36yg7Ig/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45209) Flame Graph Support For Executor Thread Dump Page

2023-10-25 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-45209:


Assignee: Kent Yao

> Flame Graph Support For Executor Thread Dump Page
> -
>
> Key: SPARK-45209
> URL: https://issues.apache.org/jira/browse/SPARK-45209
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Web UI
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45209) Flame Graph Support For Executor Thread Dump Page

2023-10-25 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-45209.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42988
[https://github.com/apache/spark/pull/42988]

> Flame Graph Support For Executor Thread Dump Page
> -
>
> Key: SPARK-45209
> URL: https://issues.apache.org/jira/browse/SPARK-45209
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Web UI
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45665) Uses different ORACLE_DOCKER_IMAGE_NAME in the scheduled builds in other branches

2023-10-25 Thread Yang Jie (Jira)
Yang Jie created SPARK-45665:


 Summary: Uses different ORACLE_DOCKER_IMAGE_NAME in the scheduled 
builds in other branches
 Key: SPARK-45665
 URL: https://issues.apache.org/jira/browse/SPARK-45665
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45650) fix dev/mina get scala 2.12

2023-10-25 Thread tangjiafu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779431#comment-17779431
 ] 

tangjiafu commented on SPARK-45650:
---

In this PR ( https://github.com/apache/spark/pull/43456 ) I hit this error, and 
I mistakenly thought it was a problem with the Scala version. I am not familiar 
with this and am still researching. If it turns out not to be this issue, I 
will close it later.

> fix dev/mina get scala 2.12 
> 
>
> Key: SPARK-45650
> URL: https://issues.apache.org/jira/browse/SPARK-45650
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: tangjiafu
>Priority: Major
>
> Now, when the CI executes ./dev/mina, it generates an incompatibility error 
> with Scala 2.12. Sorry, I don't know how to fix it.
> [info] [launcher] getting org.scala-sbt sbt 1.9.3  (this may take some 
> time)...
> [info] [launcher] getting Scala 2.12.18 (for sbt)...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45664) Introduce a mapper for orc compression codecs

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45664:
---
Labels: pull-request-available  (was: )

> Introduce a mapper for orc compression codecs
> -
>
> Key: SPARK-45664
> URL: https://issues.apache.org/jira/browse/SPARK-45664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, Spark supports all the ORC compression codecs, but the codecs 
> supported by ORC and those supported by Spark do not map one-to-one, because 
> Spark introduces two extra codec names, none and 
> UNCOMPRESSED.
> There are a lot of magic strings copied from the ORC compression codecs, so 
> developers need to maintain their consistency manually. It is easy to make 
> mistakes, and it reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45664) Introduce a mapper for orc compression codecs

2023-10-25 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45664:
---
Description: 
Currently, Spark supports all the ORC compression codecs, but the codecs 
supported by ORC and those supported by Spark do not map one-to-one, because 
Spark introduces two extra codec names, none and UNCOMPRESSED.

There are a lot of magic strings copied from the ORC compression codecs, so 
developers need to maintain their consistency manually. It is easy to make 
mistakes, and it reduces development efficiency.

  was:
Currently, Spark supports all the ORC compression codecs, but the codecs 
supported by ORC and those supported by Spark do not map one-to-one, because 
Spark introduces two extra codec names, none and UNCOMPRESSED.

There are a lot of magic strings copied from the Parquet compression codecs, so 
developers need to maintain their consistency manually. This is easy to get 
wrong and reduces development efficiency.


> Introduce a mapper for orc compression codecs
> -
>
> Key: SPARK-45664
> URL: https://issues.apache.org/jira/browse/SPARK-45664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, Spark supports all the ORC compression codecs, but the codecs 
> supported by ORC and those supported by Spark do not map one-to-one, because 
> Spark introduces two extra codec names, none and UNCOMPRESSED.
> There are a lot of magic strings copied from the ORC compression codecs, so 
> developers need to maintain their consistency manually. This is easy to get 
> wrong and reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45664) Introduce a mapper for orc compression codecs

2023-10-25 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45664:
---
Description: 
Currently, Spark supports all the ORC compression codecs, but the codecs 
supported by ORC and those supported by Spark do not map one-to-one, because 
Spark introduces two extra codec names, none and UNCOMPRESSED.

There are a lot of magic strings copied from the Parquet compression codecs, so 
developers need to maintain their consistency manually. This is easy to get 
wrong and reduces development efficiency.

> Introduce a mapper for orc compression codecs
> -
>
> Key: SPARK-45664
> URL: https://issues.apache.org/jira/browse/SPARK-45664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, Spark supports all the ORC compression codecs, but the codecs 
> supported by ORC and those supported by Spark do not map one-to-one, because 
> Spark introduces two extra codec names, none and UNCOMPRESSED.
> There are a lot of magic strings copied from the Parquet compression codecs, 
> so developers need to maintain their consistency manually. This is easy to 
> get wrong and reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45481) Introduce a mapper for parquet compression codecs

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45481:
--

Assignee: Apache Spark  (was: Jiaan Geng)

> Introduce a mapper for parquet compression codecs
> -
>
> Key: SPARK-45481
> URL: https://issues.apache.org/jira/browse/SPARK-45481
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Currently, Spark supports all the Parquet compression codecs, but the codecs 
> supported by Parquet and those supported by Spark do not map one-to-one, 
> because Spark introduces a fake compression codec, none.
> There are a lot of magic strings copied from the Parquet compression codecs, 
> so developers need to maintain their consistency manually. This is easy to 
> get wrong and reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45481) Introduce a mapper for parquet compression codecs

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45481:
--

Assignee: Jiaan Geng  (was: Apache Spark)

> Introduce a mapper for parquet compression codecs
> -
>
> Key: SPARK-45481
> URL: https://issues.apache.org/jira/browse/SPARK-45481
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, Spark supports all the Parquet compression codecs, but the codecs 
> supported by Parquet and those supported by Spark do not map one-to-one, 
> because Spark introduces a fake compression codec, none.
> There are a lot of magic strings copied from the Parquet compression codecs, 
> so developers need to maintain their consistency manually. This is easy to 
> get wrong and reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45481) Introduce a mapper for parquet compression codecs

2023-10-25 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45481:
---
Description: 
Currently, Spark supports all the Parquet compression codecs, but the codecs 
supported by Parquet and those supported by Spark do not map one-to-one, 
because Spark introduces a fake compression codec, none.

There are a lot of magic strings copied from the Parquet compression codecs, so 
developers need to maintain their consistency manually. This is easy to get 
wrong and reduces development efficiency.

  was:
Currently, Spark supports most of the Parquet compression codecs, but the 
codecs supported by Parquet and those supported by Spark do not map one-to-one.

There are a lot of magic strings copied from the Parquet compression codecs, so 
developers need to maintain their consistency manually. This is easy to get 
wrong and reduces development efficiency.


> Introduce a mapper for parquet compression codecs
> -
>
> Key: SPARK-45481
> URL: https://issues.apache.org/jira/browse/SPARK-45481
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, Spark supports all the Parquet compression codecs, but the codecs 
> supported by Parquet and those supported by Spark do not map one-to-one, 
> because Spark introduces a fake compression codec, none.
> There are a lot of magic strings copied from the Parquet compression codecs, 
> so developers need to maintain their consistency manually. This is easy to 
> get wrong and reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45663) Replace `IterableOnceOps#aggregate` with `IterableOnceOps#foldLeft`

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45663:
---
Labels: pull-request-available  (was: )

> Replace `IterableOnceOps#aggregate` with `IterableOnceOps#foldLeft`
> ---
>
> Key: SPARK-45663
> URL: https://issues.apache.org/jira/browse/SPARK-45663
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> @deprecated("`aggregate` is not relevant for sequential collections. Use 
> `foldLeft(z)(seqop)` instead.", "2.13.0")
> def aggregate[B](z: => B)(seqop: (B, A) => B, combop: (B, B) => B): B = 
> foldLeft(z)(seqop) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45663) Replace `IterableOnceOps#aggregate` with `IterableOnceOps#foldLeft`

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45663:
--

Assignee: (was: Apache Spark)

> Replace `IterableOnceOps#aggregate` with `IterableOnceOps#foldLeft`
> ---
>
> Key: SPARK-45663
> URL: https://issues.apache.org/jira/browse/SPARK-45663
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> @deprecated("`aggregate` is not relevant for sequential collections. Use 
> `foldLeft(z)(seqop)` instead.", "2.13.0")
> def aggregate[B](z: => B)(seqop: (B, A) => B, combop: (B, B) => B): B = 
> foldLeft(z)(seqop) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45663) Replace `IterableOnceOps#aggregate` with `IterableOnceOps#foldLeft`

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45663:
--

Assignee: Apache Spark

> Replace `IterableOnceOps#aggregate` with `IterableOnceOps#foldLeft`
> ---
>
> Key: SPARK-45663
> URL: https://issues.apache.org/jira/browse/SPARK-45663
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> @deprecated("`aggregate` is not relevant for sequential collections. Use 
> `foldLeft(z)(seqop)` instead.", "2.13.0")
> def aggregate[B](z: => B)(seqop: (B, A) => B, combop: (B, B) => B): B = 
> foldLeft(z)(seqop) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45629) Fix `Implicit definition should have explicit type`

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45629:
--

Assignee: (was: Apache Spark)

> Fix `Implicit definition should have explicit type`
> ---
>
> Key: SPARK-45629
> URL: https://issues.apache.org/jira/browse/SPARK-45629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> [error] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/core/src/main/scala/org/apache/spark/deploy/FaultToleranceTest.scala:343:16:
>  Implicit definition should have explicit type (inferred 
> org.json4s.DefaultFormats.type) [quickfixable]
> [error] Applicable -Wconf / @nowarn filters for this fatal warning: msg=<part 
> of the message>, cat=other-implicit-type, 
> site=org.apache.spark.deploy.TestMasterInfo.formats
> [error]   implicit val formats = org.json4s.DefaultFormats
> [error]   {code}
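
For reference, the usual fix for this warning is simply to give the implicit an 
explicit type annotation; a sketch of the pattern (not the exact patch) is:

{code:java}
import org.json4s.{DefaultFormats, Formats}

// Before: the inferred type org.json4s.DefaultFormats.type triggers the warning.
// implicit val formats = org.json4s.DefaultFormats

// After: the implicit definition carries an explicit type.
implicit val formats: Formats = DefaultFormats
{code}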



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45629) Fix `Implicit definition should have explicit type`

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45629:
--

Assignee: Apache Spark

> Fix `Implicit definition should have explicit type`
> ---
>
> Key: SPARK-45629
> URL: https://issues.apache.org/jira/browse/SPARK-45629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> [error] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/core/src/main/scala/org/apache/spark/deploy/FaultToleranceTest.scala:343:16:
>  Implicit definition should have explicit type (inferred 
> org.json4s.DefaultFormats.type) [quickfixable]
> [error] Applicable -Wconf / @nowarn filters for this fatal warning: msg=<part 
> of the message>, cat=other-implicit-type, 
> site=org.apache.spark.deploy.TestMasterInfo.formats
> [error]   implicit val formats = org.json4s.DefaultFormats
> [error]   {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45661) Add toNullable in StructType, MapType and ArrayType

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45661:
--

Assignee: Apache Spark

> Add toNullable in StructType, MapType and ArrayType
> ---
>
> Key: SPARK-45661
> URL: https://issues.apache.org/jira/browse/SPARK-45661
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Add {{StructType.toNullable}} to return nullable schemas. See 
> https://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe
>  as an example.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45661) Add toNullable in StructType, MapType and ArrayType

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45661:
--

Assignee: (was: Apache Spark)

> Add toNullable in StructType, MapType and ArrayType
> ---
>
> Key: SPARK-45661
> URL: https://issues.apache.org/jira/browse/SPARK-45661
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> {{StructType.toNullable}} to return nullable schemas. See 
> https://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe
>  as an example.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45629) Fix `Implicit definition should have explicit type`

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45629:
---
Labels: pull-request-available  (was: )

> Fix `Implicit definition should have explicit type`
> ---
>
> Key: SPARK-45629
> URL: https://issues.apache.org/jira/browse/SPARK-45629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> [error] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/core/src/main/scala/org/apache/spark/deploy/FaultToleranceTest.scala:343:16:
>  Implicit definition should have explicit type (inferred 
> org.json4s.DefaultFormats.type) [quickfixable]
> [error] Applicable -Wconf / @nowarn filters for this fatal warning: msg=<part 
> of the message>, cat=other-implicit-type, 
> site=org.apache.spark.deploy.TestMasterInfo.formats
> [error]   implicit val formats = org.json4s.DefaultFormats
> [error]   {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45606) Release restrictions on multi-layer runtime filter

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45606:
--

Assignee: Jiaan Geng  (was: Apache Spark)

> Release restrictions on multi-layer runtime filter
> --
>
> Key: SPARK-45606
> URL: https://issues.apache.org/jira/browse/SPARK-45606
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> Before https://issues.apache.org/jira/browse/SPARK-41674, Spark only supported 
> inserting a runtime filter on the application side of a shuffle join at a 
> single layer. Since it was considered not worthwhile to insert more runtime 
> filters when one side of the shuffle join already has one, Spark restricted it.
> After https://issues.apache.org/jira/browse/SPARK-41674, Spark supports 
> inserting a runtime filter for one side of any shuffle join across multiple 
> layers, so the restrictions on multi-layer runtime filters look outdated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45606) Release restrictions on multi-layer runtime filter

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45606:
--

Assignee: Apache Spark  (was: Jiaan Geng)

> Release restrictions on multi-layer runtime filter
> --
>
> Key: SPARK-45606
> URL: https://issues.apache.org/jira/browse/SPARK-45606
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Before https://issues.apache.org/jira/browse/SPARK-41674, Spark only supported 
> inserting a runtime filter on the application side of a shuffle join at a 
> single layer. Since it was considered not worthwhile to insert more runtime 
> filters when one side of the shuffle join already has one, Spark restricted it.
> After https://issues.apache.org/jira/browse/SPARK-41674, Spark supports 
> inserting a runtime filter for one side of any shuffle join across multiple 
> layers, so the restrictions on multi-layer runtime filters look outdated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45606) Release restrictions on multi-layer runtime filter

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45606:
--

Assignee: Jiaan Geng  (was: Apache Spark)

> Release restrictions on multi-layer runtime filter
> --
>
> Key: SPARK-45606
> URL: https://issues.apache.org/jira/browse/SPARK-45606
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> Before https://issues.apache.org/jira/browse/SPARK-41674, Spark only supported 
> inserting a runtime filter on the application side of a shuffle join at a 
> single layer. Since it was considered not worthwhile to insert more runtime 
> filters when one side of the shuffle join already has one, Spark restricted it.
> After https://issues.apache.org/jira/browse/SPARK-41674, Spark supports 
> inserting a runtime filter for one side of any shuffle join across multiple 
> layers, so the restrictions on multi-layer runtime filters look outdated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45606) Release restrictions on multi-layer runtime filter

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45606:
--

Assignee: Apache Spark  (was: Jiaan Geng)

> Release restrictions on multi-layer runtime filter
> --
>
> Key: SPARK-45606
> URL: https://issues.apache.org/jira/browse/SPARK-45606
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Before https://issues.apache.org/jira/browse/SPARK-41674, Spark only supported 
> inserting a runtime filter on the application side of a shuffle join at a 
> single layer. Since it was considered not worthwhile to insert more runtime 
> filters when one side of the shuffle join already has one, Spark restricted it.
> After https://issues.apache.org/jira/browse/SPARK-41674, Spark supports 
> inserting a runtime filter for one side of any shuffle join across multiple 
> layers, so the restrictions on multi-layer runtime filters look outdated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39910) DataFrameReader API cannot read files from hadoop archives (.har)

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-39910:
--

Assignee: (was: Apache Spark)

> DataFrameReader API cannot read files from hadoop archives (.har)
> -
>
> Key: SPARK-39910
> URL: https://issues.apache.org/jira/browse/SPARK-39910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
>Reporter: Christophe Préaud
>Priority: Minor
>  Labels: DataFrameReader, pull-request-available
>
> Reading a file from a Hadoop archive using the DataFrameReader API returns 
> an empty Dataset:
> {code:java}
> scala> val df = 
> spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")
> df: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> df.count
> res7: Long = 0 {code}
>  
> On the other hand, reading the same file, from the same hadoop archive, but 
> using the RDD API yields the correct result:
> {code:java}
> scala> val df = 
> sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value")
> df: org.apache.spark.sql.DataFrame = [value: string]
> scala> df.count
> res8: Long = 5589 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39910) DataFrameReader API cannot read files from hadoop archives (.har)

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-39910:
--

Assignee: Apache Spark

> DataFrameReader API cannot read files from hadoop archives (.har)
> -
>
> Key: SPARK-39910
> URL: https://issues.apache.org/jira/browse/SPARK-39910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
>Reporter: Christophe Préaud
>Assignee: Apache Spark
>Priority: Minor
>  Labels: DataFrameReader, pull-request-available
>
> Reading a file from a Hadoop archive using the DataFrameReader API returns 
> an empty Dataset:
> {code:java}
> scala> val df = 
> spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")
> df: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> df.count
> res7: Long = 0 {code}
>  
> On the other hand, reading the same file, from the same hadoop archive, but 
> using the RDD API yields the correct result:
> {code:java}
> scala> val df = 
> sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value")
> df: org.apache.spark.sql.DataFrame = [value: string]
> scala> df.count
> res8: Long = 5589 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44124) Upgrade AWS SDK to v2

2023-10-25 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779407#comment-17779407
 ] 

Lantao Jin commented on SPARK-44124:


Is it possible to convert this JIRA from a sub-task into a top-level issue so 
that more sub-tasks can be added under it? [~dongjoon]

> Upgrade AWS SDK to v2
> -
>
> Key: SPARK-44124
> URL: https://issues.apache.org/jira/browse/SPARK-44124
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Here is a design doc:
> https://docs.google.com/document/d/1nGWbGTqxuFBG2ftfYYXxzrkipINILfWCOwse36yg7Ig/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44124) Upgrade AWS SDK to v2

2023-10-25 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779405#comment-17779405
 ] 

Lantao Jin commented on SPARK-44124:


Added a design doc (more like a plan) to the description.

> Upgrade AWS SDK to v2
> -
>
> Key: SPARK-44124
> URL: https://issues.apache.org/jira/browse/SPARK-44124
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Here is a design doc:
> https://docs.google.com/document/d/1nGWbGTqxuFBG2ftfYYXxzrkipINILfWCOwse36yg7Ig/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44124) Upgrade AWS SDK to v2

2023-10-25 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-44124:
---
Description: 
Here is a design doc:
https://docs.google.com/document/d/1nGWbGTqxuFBG2ftfYYXxzrkipINILfWCOwse36yg7Ig/edit?usp=sharing

> Upgrade AWS SDK to v2
> -
>
> Key: SPARK-44124
> URL: https://issues.apache.org/jira/browse/SPARK-44124
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Here is a design doc:
> https://docs.google.com/document/d/1nGWbGTqxuFBG2ftfYYXxzrkipINILfWCOwse36yg7Ig/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45656) Fix observation when named observations with the same name on different datasets.

2023-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45656:


Assignee: Takuya Ueshin

> Fix observation when named observations with the same name on different 
> datasets.
> -
>
> Key: SPARK-45656
> URL: https://issues.apache.org/jira/browse/SPARK-45656
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45656) Fix observation when named observations with the same name on different datasets.

2023-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45656.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43519
[https://github.com/apache/spark/pull/43519]

> Fix observation when named observations with the same name on different 
> datasets.
> -
>
> Key: SPARK-45656
> URL: https://issues.apache.org/jira/browse/SPARK-45656
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45662) Pyspark union method throw 'pyspark.sql.utils.IllegalArgumentException: requirement failed'

2023-10-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779397#comment-17779397
 ] 

Hyukjin Kwon commented on SPARK-45662:
--

Can you provide a reproducer, please?

> Pyspark  union method throw 'pyspark.sql.utils.IllegalArgumentException: 
> requirement failed'
> 
>
> Key: SPARK-45662
> URL: https://issues.apache.org/jira/browse/SPARK-45662
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Yi Zhu
>Priority: Major
> Attachments: image-2023-10-25-15-01-28-896.png
>
>
> !image-2023-10-25-15-01-28-896.png|width=1795,height=194!
>  
>  
> The code is 
> {code:java}
>     output_df = spark.createDataFrame([], output_schema())
>     output_df.printSchema()
>     print(output_df.count()) {code}
> I have checked that the schemas are the same; `final` comes from a SQL query.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45663) Replace `IterableOnceOps#aggregate` with `IterableOnceOps#foldLeft`

2023-10-25 Thread Yang Jie (Jira)
Yang Jie created SPARK-45663:


 Summary: Replace `IterableOnceOps#aggregate` with 
`IterableOnceOps#foldLeft`
 Key: SPARK-45663
 URL: https://issues.apache.org/jira/browse/SPARK-45663
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, Spark Core
Affects Versions: 4.0.0
Reporter: Yang Jie


{code:java}
@deprecated("`aggregate` is not relevant for sequential collections. Use 
`foldLeft(z)(seqop)` instead.", "2.13.0")
def aggregate[B](z: => B)(seqop: (B, A) => B, combop: (B, B) => B): B = 
foldLeft(z)(seqop) {code}
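
For sequential collections the rewrite is mechanical, as the deprecation message 
says; a minimal before/after sketch:

{code:java}
val xs = Seq(1, 2, 3, 4)

// Before (deprecated since Scala 2.13): combop is never used on a sequential collection.
// val sum = xs.aggregate(0)(_ + _, _ + _)

// After: foldLeft with the same zero value and the same seqop.
val sum = xs.foldLeft(0)(_ + _) // 10
{code}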



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45650) fix dev/mima get scala 2.12

2023-10-25 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779383#comment-17779383
 ] 

Yang Jie commented on SPARK-45650:
--

Do you have a more detailed error stack? `[info] [launcher] getting Scala 
2.12.18 (for sbt)...` appears because sbt 1.x itself uses Scala 2.12; it won't 
affect the result of dev/mima.

> fix dev/mima get scala 2.12 
> 
>
> Key: SPARK-45650
> URL: https://issues.apache.org/jira/browse/SPARK-45650
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: tangjiafu
>Priority: Major
>
> Now, when the CI executes ./dev/mima, it generates an incompatibility error 
> with Scala 2.12. Sorry, I don't know how to fix it.
> [info] [launcher] getting org.scala-sbt sbt 1.9.3  (this may take some 
> time)...
> [info] [launcher] getting Scala 2.12.18 (for sbt)...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45660) Re-use Literal objects when replacing timestamps in the ComputeCurrentTime rule

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45660:
---
Labels: pull-request-available  (was: )

> Re-use Literal objects when replacing timestamps in the ComputeCurrentTime 
> rule
> ---
>
> Key: SPARK-45660
> URL: https://issues.apache.org/jira/browse/SPARK-45660
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.5.0
>Reporter: Jan-Ole Sasse
>Priority: Minor
>  Labels: pull-request-available
>
> The ComputeCurrentTime optimizer rule produces unique timestamp Literals for 
> the current-time expressions of a query. For CurrentDate and LocalTimestamp, 
> however, the Literal objects are not re-used: an equal object is created for 
> each instance. This can cost unnecessarily much memory when there are many 
> such Literal objects.
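
A minimal sketch of the idea, with illustrative names only (not the actual rule 
change): cache the constructed Literal per (value, data type) pair so that every 
occurrence shares a single instance.

{code:java}
import scala.collection.mutable
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.DataType

// Illustrative only: hand out one shared Literal per (value, dataType) instead
// of building an equal-but-distinct object for every expression that needs it.
val literalCache = mutable.Map.empty[(Any, DataType), Literal]

def sharedLiteral(value: Any, dataType: DataType): Literal =
  literalCache.getOrElseUpdate((value, dataType), Literal(value, dataType))
{code}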



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45662) Pyspark union method throw 'pyspark.sql.utils.IllegalArgumentException: requirement failed'

2023-10-25 Thread Yi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779370#comment-17779370
 ] 

Yi Zhu commented on SPARK-45662:


Ping [~gurwls223], could you take a look? I have checked that the environment 
is correct and Python is 3.6.

> Pyspark  union method throw 'pyspark.sql.utils.IllegalArgumentException: 
> requirement failed'
> 
>
> Key: SPARK-45662
> URL: https://issues.apache.org/jira/browse/SPARK-45662
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Yi Zhu
>Priority: Major
> Attachments: image-2023-10-25-15-01-28-896.png
>
>
> !image-2023-10-25-15-01-28-896.png|width=1795,height=194!
>  
>  
> The code is 
> {code:java}
>     output_df = spark.createDataFrame([], output_schema())
>     output_df.printSchema()
>     print(output_df.count()) {code}
> I have checked that the schemas are the same; `final` comes from a SQL query.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45662) Pyspark union method throw 'pyspark.sql.utils.IllegalArgumentException: requirement failed'

2023-10-25 Thread Yi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhu updated SPARK-45662:
---
Description: 
!image-2023-10-25-15-01-28-896.png|width=1795,height=194!

 

 

The code is 
{code:java}
    output_df = spark.createDataFrame([], output_schema())
    output_df.printSchema()
    print(output_df.count()) {code}
I have checked that the schemas are the same; `final` comes from a SQL query.

 

  was:!image-2023-10-25-15-01-28-896.png|width=1795,height=194!


> Pyspark  union method throw 'pyspark.sql.utils.IllegalArgumentException: 
> requirement failed'
> 
>
> Key: SPARK-45662
> URL: https://issues.apache.org/jira/browse/SPARK-45662
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Yi Zhu
>Priority: Major
> Attachments: image-2023-10-25-15-01-28-896.png
>
>
> !image-2023-10-25-15-01-28-896.png|width=1795,height=194!
>  
>  
> The code is 
> {code:java}
>     output_df = spark.createDataFrame([], output_schema())
>     output_df.printSchema()
>     print(output_df.count()) {code}
> I have checked that the schemas are the same; `final` comes from a SQL query.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45662) Pyspark union method throw 'pyspark.sql.utils.IllegalArgumentException: requirement failed'

2023-10-25 Thread Yi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhu updated SPARK-45662:
---
Attachment: image-2023-10-25-15-01-28-896.png

> Pyspark  union method throw 'pyspark.sql.utils.IllegalArgumentException: 
> requirement failed'
> 
>
> Key: SPARK-45662
> URL: https://issues.apache.org/jira/browse/SPARK-45662
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Yi Zhu
>Priority: Major
> Attachments: image-2023-10-25-15-01-28-896.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45662) Pyspark union method throw 'pyspark.sql.utils.IllegalArgumentException: requirement failed'

2023-10-25 Thread Yi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhu updated SPARK-45662:
---
Description: !image-2023-10-25-15-01-28-896.png|width=1795,height=194!

> Pyspark  union method throw 'pyspark.sql.utils.IllegalArgumentException: 
> requirement failed'
> 
>
> Key: SPARK-45662
> URL: https://issues.apache.org/jira/browse/SPARK-45662
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Yi Zhu
>Priority: Major
> Attachments: image-2023-10-25-15-01-28-896.png
>
>
> !image-2023-10-25-15-01-28-896.png|width=1795,height=194!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45662) Pyspark union method throw 'pyspark.sql.utils.IllegalArgumentException: requirement failed'

2023-10-25 Thread Yi Zhu (Jira)
Yi Zhu created SPARK-45662:
--

 Summary: Pyspark  union method throw 
'pyspark.sql.utils.IllegalArgumentException: requirement failed'
 Key: SPARK-45662
 URL: https://issues.apache.org/jira/browse/SPARK-45662
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.1
Reporter: Yi Zhu
 Attachments: image-2023-10-25-15-01-28-896.png





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45640) Fix flaky ProtobufCatalystDataConversionSuite

2023-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-45640:
-
Fix Version/s: 3.4.2
   3.5.1

> Fix flaky ProtobufCatalystDataConversionSuite
> -
>
> Key: SPARK-45640
> URL: https://issues.apache.org/jira/browse/SPARK-45640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45588) Minor scaladoc improvement in StreamingForeachBatchHelper

2023-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-45588:
-
Fix Version/s: 3.4.2
   3.5.1

> Minor scaladoc improvement in StreamingForeachBatchHelper
> -
>
> Key: SPARK-45588
> URL: https://issues.apache.org/jira/browse/SPARK-45588
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
>
> Document RunnerCleaner in StreamingForeachBatchHelper.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45661) Add toNullable in StructType, MapType and ArrayType

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45661:
---
Labels: pull-request-available  (was: )

> Add toNullable in StructType, MapType and ArrayType
> ---
>
> Key: SPARK-45661
> URL: https://issues.apache.org/jira/browse/SPARK-45661
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> Add {{StructType.toNullable}} to return nullable schemas. See 
> https://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe
>  as an example.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45661) Add toNullable in StructType, MapType and ArrayType

2023-10-25 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-45661:


 Summary: Add toNullable in StructType, MapType and ArrayType
 Key: SPARK-45661
 URL: https://issues.apache.org/jira/browse/SPARK-45661
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, SQL
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


Add {{StructType.toNullable}} to return nullable schemas. See 
https://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe
 as an example.
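
A rough sketch of what such a helper could look like on the Scala side; the name 
and placement are assumptions, not the final API:

{code:java}
import org.apache.spark.sql.types._

// Assumed helper for illustration: recursively mark every struct field, array
// element and map value as nullable.
def toNullable(dt: DataType): DataType = dt match {
  case st: StructType =>
    StructType(st.fields.map(f =>
      f.copy(dataType = toNullable(f.dataType), nullable = true)))
  case ArrayType(elementType, _) =>
    ArrayType(toNullable(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    MapType(toNullable(keyType), toNullable(valueType), valueContainsNull = true)
  case other => other
}
{code}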



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43108) org.apache.spark.storage.StorageStatus NotSerializableException when try to access StorageStatus in a MapPartitionsFunction

2023-10-25 Thread Berg Lloyd-Haig (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779355#comment-17779355
 ] 

Berg Lloyd-Haig commented on SPARK-43108:
-

We are hitting this when reading from Snowflake using Spark 3.4 and then 
printing some data to stdout or writing to Parquet.

It only occurs on the first invocation; after that the error no longer appears.

> org.apache.spark.storage.StorageStatus NotSerializableException when try to 
> access StorageStatus in a MapPartitionsFunction
> ---
>
> Key: SPARK-43108
> URL: https://issues.apache.org/jira/browse/SPARK-43108
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: surender godara
>Priority: Minor
>
> When you try to access the *storage status 
> (org.apache.spark.storage.StorageStatus)* inside a MapPartitionsFunction, the 
> getStorageStatus method throws a NotSerializableException, because the 
> StorageStatus object is not serializable.
> Here is an example code snippet that demonstrates how to access the storage 
> status inside a MapPartitionsFunction in Spark:
> {code:java}
> StorageStatus[] storageStatus = 
> SparkEnv.get().blockManager().master().getStorageStatus();{code}
> *Error stacktrace --*
> {code:java}
> Caused by: java.io.NotSerializableException: 
> org.apache.spark.storage.StorageStatus
> Serialization stack:
>     - object not serializable (class: org.apache.spark.storage.StorageStatus, 
> value: org.apache.spark.storage.StorageStatus@715b4e82)
>     - element of array (index: 0)
>     - array (class [Lorg.apache.spark.storage.StorageStatus;, size 2)
>     at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
>     at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
>     at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>     at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:286)
>     at 
> org.apache.spark.rpc.netty.RemoteNettyRpcCallContext.send(NettyRpcCallContext.scala:64)
>     at 
> org.apache.spark.rpc.netty.NettyRpcCallContext.reply(NettyRpcCallContext.scala:32)
>     at 
> org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:156)
>     at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
>     at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
>     at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>     at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>     at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>     at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>     at 
> java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264){code}
> *Steps to reproduce*
> step 1: Initialize a Spark session in standalone mode.
> step 2: Create a Dataset using the SparkSession and load data.
> step 3: Define the MapPartitionsFunction on the Dataset and get the storage 
> status inside it.
> Here is the code snippet of the MapPartitionsFunction:
>  
> {code:java}
> df = df.mapPartitions(new MapPartitionsFunction() {
>             @Override
>             public Iterator call(Iterator input) throws Exception {
>                 StorageStatus[] storageStatus = 
> SparkEnv.get().blockManager().master().getStorageStatus();
>                 return input;
>             }
>         }, RowEncoder.apply(df.schema()));
> {code}
>  
> Step 4: Submit the Spark job. 
>  
> *Solution -*
> Implement the Serializable interface for 
> org.apache.spark.storage.StorageStatus.
>  
>  
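
Until such a change lands, one possible workaround (a sketch, assuming the 
values are only needed read-only inside the task) is to call getStorageStatus on 
the driver and capture only plain serializable values in the closure:

{code:java}
import org.apache.spark.SparkEnv

// Workaround sketch, not the Serializable change proposed above: query the
// storage status on the driver and keep only simple serializable values.
val executorIds: Seq[String] =
  SparkEnv.get.blockManager.master.getStorageStatus
    .map(_.blockManagerId.executorId)
    .toSeq

// executorIds (a plain Seq[String]) can then be referenced inside the
// MapPartitionsFunction without hitting the NotSerializableException.
{code}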



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"

2023-10-25 Thread Adi Wehrli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adi Wehrli updated SPARK-45644:
---
Description: 
I do not really know whether this is a bug, but I am at the end of my knowledge.

A Spark job ran successfully with Spark 3.2.x and 3.3.x. 

But after upgrading to 3.4.1 (as well as 3.5.0), running the same job with the 
same data now always fails with:
{code}
scala.Some is not a valid external type for schema of array
{code}

The corresponding stacktrace is:
{code}
2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
worker for task 0.0 in stage 0.0 (TID 0)"
java.lang.RuntimeException: scala.Some is not a valid external type for schema 
of array
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
 ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
 ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
~[scala-library-2.12.15.jar:?]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
~[scala-library-2.12.15.jar:?]
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
 ~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
 ~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.scheduler.Task.run(Task.scala:141) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
 ~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
 ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
 ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
~[spark-core_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
[spark-core_2.12-3.5.0.jar:3.5.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
worker for task 1.0 in stage 0.0 (TID 1)"
java.lang.RuntimeException: scala.Some is not a valid external type for schema 
of array
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
 ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at 
org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
 ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at 

[jira] [Updated] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"

2023-10-25 Thread Adi Wehrli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adi Wehrli updated SPARK-45644:
---
Issue Type: Bug  (was: Question)

> After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException 
> "scala.Some is not a valid external type for schema of array"
> --
>
> Key: SPARK-45644
> URL: https://issues.apache.org/jira/browse/SPARK-45644
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Adi Wehrli
>Priority: Major
>
> A Spark job ran successfully with Spark 3.2.x and 3.3.x. 
> But after upgrading to 3.4.1 (as well as 3.5.0), running the same job with 
> the same data now always fails with:
> {code}
> scala.Some is not a valid external type for schema of array
> {code}
> The corresponding stacktrace is:
> {code}
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
> worker for task 0.0 in stage 0.0 (TID 0)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:141) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
> [spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
> worker for task 1.0 in stage 0.0 (TID 1)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> 
