[jira] [Commented] (SPARK-49442) Complete Metadata requests on each micro batch causing Kafka issues
[ https://issues.apache.org/jira/browse/SPARK-49442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877266#comment-17877266 ] vipin Kumar commented on SPARK-49442: - [~kabhwan] we applied *metadata.max.age.ms* directly in the Kafka consumer config, not through the Spark config.
> Complete Metadata requests on each micro batch causing Kafka issues
> ---
>
> Key: SPARK-49442
> URL: https://issues.apache.org/jira/browse/SPARK-49442
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.3.2
> Reporter: vipin Kumar
> Priority: Major
> Labels: Kafka, spark-streaming-kafka
>
> We have noticed that Spark issues complete metadata requests on each micro batch, which causes a high volume of metadata requests at small micro-batch intervals.
>
> For example, with a Kafka topic of 1900 partitions and a 10-second micro batch we see on the order of ~{*}360K{*} metadata requests / sec.
> With the same job at a 60-second micro batch we observe *~60K* metadata requests.
>
> Metadata requests are controlled by *metadata.max.age.ms* (5 minutes by default), but this config has no effect on the Spark consumers; we still see this huge number of requests.
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-49442) Complete Metadata requests on each micro batch causing Kafka issues
[ https://issues.apache.org/jira/browse/SPARK-49442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877266#comment-17877266 ] vipin Kumar edited comment on SPARK-49442 at 8/28/24 7:29 AM: -- Thanks [~kabhwan] for the quick reply. We applied *metadata.max.age.ms* directly in the Kafka consumer config, not through the Spark config; we will apply it through the Spark config as well.
was (Author: vipin77): [~kabhwan] we applied *metadata.max.age.ms* directly in the Kafka consumer config, not through the Spark config.
> Complete Metadata requests on each micro batch causing Kafka issues
> ---
>
> Key: SPARK-49442
> URL: https://issues.apache.org/jira/browse/SPARK-49442
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.3.2
> Reporter: vipin Kumar
> Priority: Major
> Labels: Kafka, spark-streaming-kafka
>
> We have noticed that Spark issues complete metadata requests on each micro batch, which causes a high volume of metadata requests at small micro-batch intervals.
>
> For example, with a Kafka topic of 1900 partitions and a 10-second micro batch we see on the order of ~{*}360K{*} metadata requests / sec.
> With the same job at a 60-second micro batch we observe *~60K* metadata requests.
>
> Metadata requests are controlled by *metadata.max.age.ms* (5 minutes by default), but this config has no effect on the Spark consumers; we still see this huge number of requests.
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-49442) Complete Metadata requests on each micro batch causing Kafka issues
[ https://issues.apache.org/jira/browse/SPARK-49442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877266#comment-17877266 ] vipin Kumar edited comment on SPARK-49442 at 8/28/24 7:29 AM: -- Thanks [~kabhwan] for the quick reply. We applied *metadata.max.age.ms* directly in the Kafka consumer config, not through the Spark config; we will check the other config as well.
was (Author: vipin77): Thanks [~kabhwan] for the quick reply. We applied *metadata.max.age.ms* directly in the Kafka consumer config, not through the Spark config; we will apply it through the Spark config as well.
> Complete Metadata requests on each micro batch causing Kafka issues
> ---
>
> Key: SPARK-49442
> URL: https://issues.apache.org/jira/browse/SPARK-49442
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.3.2
> Reporter: vipin Kumar
> Priority: Major
> Labels: Kafka, spark-streaming-kafka
>
> We have noticed that Spark issues complete metadata requests on each micro batch, which causes a high volume of metadata requests at small micro-batch intervals.
>
> For example, with a Kafka topic of 1900 partitions and a 10-second micro batch we see on the order of ~{*}360K{*} metadata requests / sec.
> With the same job at a 60-second micro batch we observe *~60K* metadata requests.
>
> Metadata requests are controlled by *metadata.max.age.ms* (5 minutes by default), but this config has no effect on the Spark consumers; we still see this huge number of requests.
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
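For context on the exchange above: Kafka consumer properties such as *metadata.max.age.ms* only reach the consumers that Spark manages when they are passed as "kafka."-prefixed options on the Kafka source; a standalone consumer config is not picked up. The snippet below is a minimal, hypothetical sketch of that wiring — the broker address, topic name, and checkpoint path are placeholders, not details from the ticket.

{code:scala}
// Hedged sketch, not from the ticket: pass consumer properties to the Kafka
// source with a "kafka." prefix so Spark forwards them to its own consumers.
// Broker address, topic, and checkpoint path are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-metadata-age-sketch").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-1:9092")   // placeholder
  .option("subscribe", "events")                        // placeholder topic
  .option("kafka.metadata.max.age.ms", "300000")        // forwarded to the Kafka consumer
  .load()

stream.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/events")  // placeholder
  .start()
  .awaitTermination()
{code}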
[jira] [Assigned] (SPARK-49439) Fix the pretty name of the `FromProtobuf` & `ToProtobuf` expression
[ https://issues.apache.org/jira/browse/SPARK-49439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-49439: Assignee: BingKun Pan > Fix the pretty name of the `FromProtobuf` & `ToProtobuf` expression > > > Key: SPARK-49439 > URL: https://issues.apache.org/jira/browse/SPARK-49439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49439) Fix the pretty name of the `FromProtobuf` & `ToProtobuf` expression
[ https://issues.apache.org/jira/browse/SPARK-49439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-49439. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47901 [https://github.com/apache/spark/pull/47901] > Fix the pretty name of the `FromProtobuf` & `ToProtobuf` expression > > > Key: SPARK-49439 > URL: https://issues.apache.org/jira/browse/SPARK-49439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49438) Fix the pretty name of the `FromAvro` & `ToAvro` expression
[ https://issues.apache.org/jira/browse/SPARK-49438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-49438. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47900 [https://github.com/apache/spark/pull/47900] > Fix the pretty name of the `FromAvro` & `ToAvro` expression > > > Key: SPARK-49438 > URL: https://issues.apache.org/jira/browse/SPARK-49438 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49438) Fix the pretty name of the `FromAvro` & `ToAvro` expression
[ https://issues.apache.org/jira/browse/SPARK-49438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-49438: Assignee: BingKun Pan > Fix the pretty name of the `FromAvro` & `ToAvro` expression > > > Key: SPARK-49438 > URL: https://issues.apache.org/jira/browse/SPARK-49438 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49383) Support Transpose DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-49383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49383: -- Assignee: Apache Spark > Support Transpose DataFrame API > --- > > Key: SPARK-49383 > URL: https://issues.apache.org/jira/browse/SPARK-49383 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > Support Transpose as Scala/Python DataFrame API in both Spark Connect and > Classic Spark. > Transposing data is a crucial operation in data analysis, enabling the > transformation of rows into columns. This operation is widely used in tools > like pandas and numpy, allowing for more flexible data manipulation and > visualization. > While Apache Spark supports unpivot and pivot operations, it currently lacks > a built-in transpose function. Implementing a transpose operation in Spark > would enhance its data processing capabilities, aligning it with the > functionalities available in pandas and numpy, and further empowering users > in their data analysis workflows. > Please see > [https://docs.google.com/document/d/1QSmG81qQ-muab0UOeqgDAELqF7fJTH8GnxCJF4Ir-kA/edit] > for a detailed design. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
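As a rough illustration of the feature proposed above, the sketch below shows what the Scala side of such an API could look like. The method shape (a no-argument transpose() that takes the first column as the source of the new column names, similar to pandas.DataFrame.transpose) is an assumption for illustration only; the authoritative signature is whatever the linked design doc and PR define.

{code:scala}
// Hedged sketch of the proposed DataFrame transpose API; treat df.transpose()
// below as illustrative rather than the final signature.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("transpose-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("apples", 3, 5), ("oranges", 7, 2)).toDF("fruit", "day1", "day2")

// Rows become columns; the first column is assumed to supply the new column names.
val transposed = df.transpose()
transposed.show()
{code}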
[jira] [Created] (SPARK-49443) Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for Variant Objects
Harsh Motwani created SPARK-49443:
-
Summary: Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for Variant Objects
Key: SPARK-49443
URL: https://issues.apache.org/jira/browse/SPARK-49443
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.0.0
Reporter: Harsh Motwani

Casts from structs to variant objects should not be legal, since variant objects are unordered bags of key-value pairs while structs are ordered sets of elements of fixed types. Therefore, casts between structs and variant objects do not behave like casts between structs. Example (produced by Serge Rielau):

{code:java}
scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2') as struct<b string, c int>)").show()
+------------------------+
|named_struct(c, 1, b, 2)|
+------------------------+
|                  {1, 2}|
+------------------------+

Passing a struct into VARIANT loses the position

scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2')::variant as struct<b string, c int>)").show()
+-----------------------------------------+
|CAST(named_struct(c, 1, b, 2) AS VARIANT)|
+-----------------------------------------+
|                                   {2, 1}|
+-----------------------------------------+
{code}

Casts from maps to variant objects should also not be legal, since they represent completely orthogonal data types. Maps can represent a variable number of key-value pairs based on just a key type and a value type in the schema, but in objects the schema (produced by the schema_of_variant expressions) has a type corresponding to each value in the object. Objects can have values of different types while maps cannot, and objects can only have string keys while maps can also have complex keys.

We should therefore prohibit the existing behavior of allowing explicit casts from structs and maps to variants, as the variant spec currently only supports an object type, which is only remotely compatible with structs and maps. We should introduce a new expression that converts schemas containing structs and maps to variants. We will call it `to_variant_object`.

Also, the schema_of_variant and schema_of_variant_agg expressions currently print STRUCT when Variant Objects are observed. We should correct that to OBJECT.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
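To make the proposal above concrete, here is a hedged, spark-shell-style sketch of how the proposed expression and the corrected schema_of_variant output might be exercised. The expression name comes from the ticket text, while its exact signature and output are whatever the linked PR finally implements.

{code:scala}
// Hedged sketch of the proposed behavior; to_variant_object's final semantics
// are defined by the PR, so this is illustrative only.
// Explicit conversion of a struct value to a VARIANT object:
spark.sql("SELECT to_variant_object(named_struct('c', 1, 'b', '2'))").show(false)

// schema_of_variant reporting OBJECT (rather than STRUCT) for a variant object:
spark.sql("SELECT schema_of_variant(to_variant_object(named_struct('c', 1, 'b', '2')))").show(false)
{code}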
[jira] [Updated] (SPARK-49443) Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for Variant Objects
[ https://issues.apache.org/jira/browse/SPARK-49443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49443: --- Labels: pull-request-available (was: ) > Implement to_variant_object expression and make schema_of_variant expressions > print OBJECT for for Variant Objects > -- > > Key: SPARK-49443 > URL: https://issues.apache.org/jira/browse/SPARK-49443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Harsh Motwani >Priority: Major > Labels: pull-request-available > > Cast from structs to variant objects should not be legal since variant > objects are unordered bags of key-value pairs while structs are ordered sets > of elements of fixed types. Therefore, casts between structs and variant > objects do not behave like casts between structs. Example (produced by Serge > Rielau): > {code:java} > scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2') as struct c int>)").show() > ++ > |named_struct(c, 1, b, 2)| > ++ > |{1, 2}| > ++ > Passing a struct into VARIANT loses the position > scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2')::variant as > struct)").show() > +-+ > |CAST(named_struct(c, 1, b, 2) AS VARIANT)| > +-+ > |{2, 1}| > +-+ > {code} > Casts from maps to variant objects should also not be legal since they > represent completely orthogonal data types. Maps can represent a variable > number of key value pairs based on just a key and value type in the schema > but in objects, the schema (produced by schema_of_variant expressions) will > have a type corresponding to each value in the object. Objects can have > values of different types while maps cannot and objects can only have string > keys while maps can also have complex keys. > We should therefore prohibit the existing behavior of allowing explicit casts > from structs and maps to variants as the variant spec currently only supports > an object type which is remotely compatible with structs and maps. We should > introduce a new expression that converts schemas containing structs and maps > to variants. We will call it `to_variant_object`. > Also, schema_of_variant and schema_of_variant_agg expressions currently print > STRUCT when Variant Objects are observed. We should also correct that to > OBJECT. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49443) Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for Variant Objects
[ https://issues.apache.org/jira/browse/SPARK-49443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49443: -- Assignee: Apache Spark > Implement to_variant_object expression and make schema_of_variant expressions > print OBJECT for for Variant Objects > -- > > Key: SPARK-49443 > URL: https://issues.apache.org/jira/browse/SPARK-49443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Harsh Motwani >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > Cast from structs to variant objects should not be legal since variant > objects are unordered bags of key-value pairs while structs are ordered sets > of elements of fixed types. Therefore, casts between structs and variant > objects do not behave like casts between structs. Example (produced by Serge > Rielau): > {code:java} > scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2') as struct c int>)").show() > ++ > |named_struct(c, 1, b, 2)| > ++ > |{1, 2}| > ++ > Passing a struct into VARIANT loses the position > scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2')::variant as > struct)").show() > +-+ > |CAST(named_struct(c, 1, b, 2) AS VARIANT)| > +-+ > |{2, 1}| > +-+ > {code} > Casts from maps to variant objects should also not be legal since they > represent completely orthogonal data types. Maps can represent a variable > number of key value pairs based on just a key and value type in the schema > but in objects, the schema (produced by schema_of_variant expressions) will > have a type corresponding to each value in the object. Objects can have > values of different types while maps cannot and objects can only have string > keys while maps can also have complex keys. > We should therefore prohibit the existing behavior of allowing explicit casts > from structs and maps to variants as the variant spec currently only supports > an object type which is remotely compatible with structs and maps. We should > introduce a new expression that converts schemas containing structs and maps > to variants. We will call it `to_variant_object`. > Also, schema_of_variant and schema_of_variant_agg expressions currently print > STRUCT when Variant Objects are observed. We should also correct that to > OBJECT. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49443) Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for Variant Objects
[ https://issues.apache.org/jira/browse/SPARK-49443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49443: -- Assignee: (was: Apache Spark) > Implement to_variant_object expression and make schema_of_variant expressions > print OBJECT for for Variant Objects > -- > > Key: SPARK-49443 > URL: https://issues.apache.org/jira/browse/SPARK-49443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Harsh Motwani >Priority: Major > Labels: pull-request-available > > Cast from structs to variant objects should not be legal since variant > objects are unordered bags of key-value pairs while structs are ordered sets > of elements of fixed types. Therefore, casts between structs and variant > objects do not behave like casts between structs. Example (produced by Serge > Rielau): > {code:java} > scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2') as struct c int>)").show() > ++ > |named_struct(c, 1, b, 2)| > ++ > |{1, 2}| > ++ > Passing a struct into VARIANT loses the position > scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2')::variant as > struct)").show() > +-+ > |CAST(named_struct(c, 1, b, 2) AS VARIANT)| > +-+ > |{2, 1}| > +-+ > {code} > Casts from maps to variant objects should also not be legal since they > represent completely orthogonal data types. Maps can represent a variable > number of key value pairs based on just a key and value type in the schema > but in objects, the schema (produced by schema_of_variant expressions) will > have a type corresponding to each value in the object. Objects can have > values of different types while maps cannot and objects can only have string > keys while maps can also have complex keys. > We should therefore prohibit the existing behavior of allowing explicit casts > from structs and maps to variants as the variant spec currently only supports > an object type which is remotely compatible with structs and maps. We should > introduce a new expression that converts schemas containing structs and maps > to variants. We will call it `to_variant_object`. > Also, schema_of_variant and schema_of_variant_agg expressions currently print > STRUCT when Variant Objects are observed. We should also correct that to > OBJECT. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49410) Update collation benchmarks
[ https://issues.apache.org/jira/browse/SPARK-49410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49410: -- Assignee: Apache Spark > Update collation benchmarks > --- > > Key: SPARK-49410 > URL: https://issues.apache.org/jira/browse/SPARK-49410 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49410) Update collation benchmarks
[ https://issues.apache.org/jira/browse/SPARK-49410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49410: -- Assignee: (was: Apache Spark) > Update collation benchmarks > --- > > Key: SPARK-49410 > URL: https://issues.apache.org/jira/browse/SPARK-49410 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43242) diagnoseCorruption should not throw Unexpected type of BlockId for ShuffleBlockBatchId
[ https://issues.apache.org/jira/browse/SPARK-43242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-43242. - Fix Version/s: 4.0.0 Assignee: Zhang Liang Resolution: Fixed > diagnoseCorruption should not throw Unexpected type of BlockId for > ShuffleBlockBatchId > -- > > Key: SPARK-43242 > URL: https://issues.apache.org/jira/browse/SPARK-43242 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.4 >Reporter: Zhang Liang >Assignee: Zhang Liang >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Some of our spark app throw "Unexpected type of BlockId" exception as shown > below > According to BlockId.scala, we can found format such as > *shuffle_12_5868_518_523* is type of `ShuffleBlockBatchId`, which is not > handled properly in `ShuffleBlockFetcherIterator.diagnoseCorruption`. > > Moreover, the new exception thrown in `diagnose` swallow the real exception > in certain cases. Since diagnoseCorruption is always used in exception > handling as a side dish, I think it shouldn't throw exception at all > > {code:java} > 23/03/07 03:01:24,485 [task-result-getter-1] WARN TaskSetManager: Lost task > 104.0 in stage 36.0 (TID 6169): java.lang.IllegalArgumentException: > Unexpected type of BlockId, shuffle_12_5868_518_523 at > org.apache.spark.storage.ShuffleBlockFetcherIterator.diagnoseCorruption(ShuffleBlockFetcherIterator.scala:1079)at > > org.apache.spark.storage.BufferReleasingInputStream.$anonfun$tryOrFetchFailedException$1(ShuffleBlockFetcherIterator.scala:1314) > at scala.Option.map(Option.scala:230)at > org.apache.spark.storage.BufferReleasingInputStream.tryOrFetchFailedException(ShuffleBlockFetcherIterator.scala:1313) > at > org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:1299) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at > java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at > java.io.BufferedInputStream.read(BufferedInputStream.java:345) at > java.io.DataInputStream.read(DataInputStream.java:149) at > org.sparkproject.guava.io.ByteStreams.read(ByteStreams.java:899) at > org.sparkproject.guava.io.ByteStreams.readFully(ByteStreams.java:733) at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:496) at > scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.sort_addToSorter_0$(Unknown > Source) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) > at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:82) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:1065) > at > 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:1024) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:1201) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:1240) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225) > at > org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$a
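As a side note on the block-id format discussed in the ticket above, a small hedged sketch: Spark's own BlockId factory (a developer API) parses the four-number shuffle id from the error into a ShuffleBlockBatchId rather than a ShuffleBlockId, which is exactly the case diagnoseCorruption did not handle.

{code:scala}
// Hedged illustration: parse the block-id string from the error with Spark's
// BlockId factory; a four-number shuffle id yields a ShuffleBlockBatchId.
import org.apache.spark.storage.BlockId

val id = BlockId("shuffle_12_5868_518_523")
println(id.getClass.getSimpleName)  // expected: ShuffleBlockBatchId
{code}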
[jira] [Created] (SPARK-49444) Univocity parser handles ArrayIndexOutOfBounds exception
Vladan Vasić created SPARK-49444:
-
Summary: Univocity parser handles ArrayIndexOutOfBounds exception
Key: SPARK-49444
URL: https://issues.apache.org/jira/browse/SPARK-49444
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.3
Reporter: Vladan Vasić

The current implementation of `UnivocityParser` throws an `ArrayIndexOutOfBounds` exception when parsing a CSV record with more columns than the maximum set in the options. This case was reproduced in the `UnivocityParserSuite`.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
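For readers who want to see how the limit in question is reached from the public CSV reader, here is a hedged sketch. The input data and the cap of 3 are illustrative placeholders, and whether the failure surfaces exactly as an ArrayIndexOutOfBounds error depends on the internal code path this ticket describes.

{code:scala}
// Hedged sketch: "maxColumns" is forwarded to the underlying univocity parser,
// so a record with more fields than the limit trips the parser. The data and
// the cap of 3 are placeholders, not taken from the ticket.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("univocity-max-columns-sketch").getOrCreate()
import spark.implicits._

// Header has 3 columns; the second record has 4, exceeding maxColumns = 3.
val csvLines = Seq("a,b,c", "1,2,3,4").toDS()

val df = spark.read
  .option("header", "true")
  .option("maxColumns", "3")
  .option("mode", "PERMISSIVE")  // the improvement is to degrade gracefully here
  .csv(csvLines)

df.show()
{code}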
[jira] [Assigned] (SPARK-49119) Fix the inconsistency of syntax `show columns` between v1 and v2
[ https://issues.apache.org/jira/browse/SPARK-49119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49119: -- Assignee: Apache Spark > Fix the inconsistency of syntax `show columns` between v1 and v2 > > > Key: SPARK-49119 > URL: https://issues.apache.org/jira/browse/SPARK-49119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49444) Univocity parser handles ArrayIndexOutOfBounds exception
[ https://issues.apache.org/jira/browse/SPARK-49444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49444: --- Labels: pull-request-available (was: ) > Univocity parser handles ArrayIndexOutOfBounds exception > > > Key: SPARK-49444 > URL: https://issues.apache.org/jira/browse/SPARK-49444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Vladan Vasić >Priority: Minor > Labels: pull-request-available > > The current implementation of `UnivocityParser` throws > `ArrayIndexOutOfBounds` exception when parsing a csv record with more columns > than set in options as maximum. This case was reproduced in the > `UnivocityParserSuite`. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49444) Univocity parser handles ArrayIndexOutOfBounds exception
[ https://issues.apache.org/jira/browse/SPARK-49444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49444: -- Assignee: Apache Spark > Univocity parser handles ArrayIndexOutOfBounds exception > > > Key: SPARK-49444 > URL: https://issues.apache.org/jira/browse/SPARK-49444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Vladan Vasić >Assignee: Apache Spark >Priority: Minor > Labels: pull-request-available > > The current implementation of `UnivocityParser` throws > `ArrayIndexOutOfBounds` exception when parsing a csv record with more columns > than set in options as maximum. This case was reproduced in the > `UnivocityParserSuite`. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49444) Univocity parser handles ArrayIndexOutOfBounds exception
[ https://issues.apache.org/jira/browse/SPARK-49444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49444: -- Assignee: (was: Apache Spark) > Univocity parser handles ArrayIndexOutOfBounds exception > > > Key: SPARK-49444 > URL: https://issues.apache.org/jira/browse/SPARK-49444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Vladan Vasić >Priority: Minor > Labels: pull-request-available > > The current implementation of `UnivocityParser` throws > `ArrayIndexOutOfBounds` exception when parsing a csv record with more columns > than set in options as maximum. This case was reproduced in the > `UnivocityParserSuite`. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-49444) Univocity parser handles ArrayIndexOutOfBounds exception
[ https://issues.apache.org/jira/browse/SPARK-49444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877301#comment-17877301 ] ASF GitHub Bot commented on SPARK-49444: User 'vladanvasi-db' has created a pull request for this issue: https://github.com/apache/spark/pull/47906 > Univocity parser handles ArrayIndexOutOfBounds exception > > > Key: SPARK-49444 > URL: https://issues.apache.org/jira/browse/SPARK-49444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Vladan Vasić >Priority: Minor > > The current implementation of `UnivocityParser` throws > `ArrayIndexOutOfBounds` exception when parsing a csv record with more columns > than set in options as maximum. This case was reproduced in the > `UnivocityParserSuite`. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-49444) Univocity parser handles ArrayIndexOutOfBounds exception
[ https://issues.apache.org/jira/browse/SPARK-49444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877302#comment-17877302 ] ASF GitHub Bot commented on SPARK-49444: User 'vladanvasi-db' has created a pull request for this issue: https://github.com/apache/spark/pull/47906 > Univocity parser handles ArrayIndexOutOfBounds exception > > > Key: SPARK-49444 > URL: https://issues.apache.org/jira/browse/SPARK-49444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Vladan Vasić >Priority: Minor > Labels: pull-request-available > > The current implementation of `UnivocityParser` throws > `ArrayIndexOutOfBounds` exception when parsing a csv record with more columns > than set in options as maximum. This case was reproduced in the > `UnivocityParserSuite`. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-49442) Complete Metadata requests on each micro batch causing Kafka issues
[ https://issues.apache.org/jira/browse/SPARK-49442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877314#comment-17877314 ] vipin Kumar commented on SPARK-49442: - Hi [~kabhwan]
*I don't know whether the massive requests are from driver vs executor.* We are seeing these requests from all the executors, and they are evenly distributed.
*SQL config "spark.sql.streaming.kafka.useDeprecatedOffsetFetching" to "false"?* Setting this has no effect on the requests.
> Complete Metadata requests on each micro batch causing Kafka issues
> ---
>
> Key: SPARK-49442
> URL: https://issues.apache.org/jira/browse/SPARK-49442
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.3.2
> Reporter: vipin Kumar
> Priority: Major
> Labels: Kafka, spark-streaming-kafka
>
> We have noticed that Spark issues complete metadata requests on each micro batch, which causes a high volume of metadata requests at small micro-batch intervals.
>
> For example, with a Kafka topic of 1900 partitions and a 10-second micro batch we see on the order of ~{*}360K{*} metadata requests / sec.
> With the same job at a 60-second micro batch we observe *~60K* metadata requests.
>
> Metadata requests are controlled by *metadata.max.age.ms* (5 minutes by default), but this config has no effect on the Spark consumers; we still see this huge number of requests.
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
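For reference, the configuration mentioned above is toggled as shown below; "false" selects the AdminClient-based (non-deprecated) offset-fetching path, and whether that changes the metadata request volume is precisely what this ticket is discussing. This is a sketch only, assuming a SparkSession named spark is already in scope.

{code:scala}
// Sketch only: switch Structured Streaming's Kafka offset fetching to the
// AdminClient-based (non-deprecated) path before starting the query.
spark.conf.set("spark.sql.streaming.kafka.useDeprecatedOffsetFetching", "false")
{code}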
[jira] [Updated] (SPARK-46037) When Left Join build Left, ShuffledHashJoinExec may result in incorrect results
[ https://issues.apache.org/jira/browse/SPARK-46037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-46037: - Priority: Blocker (was: Minor) > When Left Join build Left, ShuffledHashJoinExec may result in incorrect > results > --- > > Key: SPARK-46037 > URL: https://issues.apache.org/jira/browse/SPARK-46037 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: mcdull_zhang >Priority: Blocker > Labels: correctness, pull-request-available > > When Left Join build Left and codegen is turned off, ShuffledHashJoinExec may > have incorrect results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49445) Support show tooltip in the progress bar of UI
dzcxzl created SPARK-49445: -- Summary: Support show tooltip in the progress bar of UI Key: SPARK-49445 URL: https://issues.apache.org/jira/browse/SPARK-49445 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 4.0.0 Reporter: dzcxzl -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49445) Support show tooltip in the progress bar of UI
[ https://issues.apache.org/jira/browse/SPARK-49445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49445: --- Labels: pull-request-available (was: ) > Support show tooltip in the progress bar of UI > -- > > Key: SPARK-49445 > URL: https://issues.apache.org/jira/browse/SPARK-49445 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 4.0.0 >Reporter: dzcxzl >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-49409) CONNECT_SESSION_PLAN_CACHE_SIZE is too small for certain programming patterns
[ https://issues.apache.org/jira/browse/SPARK-49409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877366#comment-17877366 ] Changgyoo Park commented on SPARK-49409: Yes, because there is "a" case where 5 is insufficient: unrelated data frames are interleaved between very complicated dependent data frames. I'm pretty sure that just increasing the default value is not the best idea; ideally, the analysed plan should be stored on the client side (this will be super difficult, I know), removing the plan cache completely. Until then, increasing it to ~16 would cover many more cases.
> CONNECT_SESSION_PLAN_CACHE_SIZE is too small for certain programming patterns
> -
>
> Key: SPARK-49409
> URL: https://issues.apache.org/jira/browse/SPARK-49409
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Changgyoo Park
> Priority: Major
>
> Example:
>
> ```
> df_1 = df_a.filter(col('X').isNotNull())
> df_2 = df_b.filter(col('SAFE_SU_Conv').isNotNull())
>
> df_x = ...
> for _ in range(0, 5):
>     df_x = df_x.select(...)
> ...
> df_3 = df_1.join(df_2, ...)
> ```
> => df_x completely invalidates all the cached entries.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49029) Create a shared interface for Dataset
[ https://issues.apache.org/jira/browse/SPARK-49029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-49029. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47882 [https://github.com/apache/spark/pull/47882] > Create a shared interface for Dataset > - > > Key: SPARK-49029 > URL: https://issues.apache.org/jira/browse/SPARK-49029 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Create a shared Dataset interface in org.apache.spark.sql.api. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34638) Spark SQL reads unnecessary nested fields (another type of pruning case)
[ https://issues.apache.org/jira/browse/SPARK-34638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877372#comment-17877372 ] Jiri Humpolicek commented on SPARK-34638: - Hi, I have now tested a similar example in spark-3.5.1, but I suppose the result will be the same in all versions after the fix in 3.2.0. Example:
1) Loading data
{code:scala}
val jsonStr = """{
  "items": [
    {"itemId": 1, "itemId2": 1, "itemData": "a"},
    {"itemId": 2, "itemId2": 2, "itemData": "b"}
  ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}
2) Read query with explain
{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select(explode($"items").as('item)).select($"item.itemId", $"item.itemData").explain(true)
// ReadSchema: struct>>
{code}
So it seems that when I use more than one field from the structure after explode, the resulting query reads the whole structure instead of only the fields I accessed.
> Spark SQL reads unnecessary nested fields (another type of pruning case)
>
>
> Key: SPARK-34638
> URL: https://issues.apache.org/jira/browse/SPARK-34638
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Jiri Humpolicek
> Assignee: L. C. Hsieh
> Priority: Major
> Fix For: 3.2.0
>
> Based on this [SPARK-29721|https://issues.apache.org/jira/browse/SPARK-29721]
> I found another nested fields pruning case.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
>   "items": [
>     {"itemId": 1, "itemData": "a"},
>     {"itemId": 2, "itemData": "b"}
>   ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select(explode($"items").as('item)).select($"item.itemId").explain(true)
> // ReadSchema: struct>>
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49446) Upgrade jetty to 11.0.23
Yang Jie created SPARK-49446: Summary: Upgrade jetty to 11.0.23 Key: SPARK-49446 URL: https://issues.apache.org/jira/browse/SPARK-49446 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49446) Upgrade jetty to 11.0.23
[ https://issues.apache.org/jira/browse/SPARK-49446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49446: --- Labels: pull-request-available (was: ) > Upgrade jetty to 11.0.23 > > > Key: SPARK-49446 > URL: https://issues.apache.org/jira/browse/SPARK-49446 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-49442) Complete Metadata requests on each micro batch causing Kafka issues
[ https://issues.apache.org/jira/browse/SPARK-49442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877397#comment-17877397 ] Jungtaek Lim commented on SPARK-49442: -- OK, that's unrelated. We haven't received any other report of this kind of issue. I recommend providing a minimal reproducer, e.g. an Apache Spark cluster and an Apache Kafka cluster (no vendor version and no cloud service version) with 3-5 topic partitions, then increasing the number of topic partitions and demonstrating that the metadata requests increase linearly, along with a detailed explanation of how you capture the requests. If you are relying on a vendor rather than building the cluster on your own, it would be best to contact their support.
> Complete Metadata requests on each micro batch causing Kafka issues
> ---
>
> Key: SPARK-49442
> URL: https://issues.apache.org/jira/browse/SPARK-49442
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.3.2
> Reporter: vipin Kumar
> Priority: Major
> Labels: Kafka, spark-streaming-kafka
>
> We have noticed that Spark issues complete metadata requests on each micro batch, which causes a high volume of metadata requests at small micro-batch intervals.
>
> For example, with a Kafka topic of 1900 partitions and a 10-second micro batch we see on the order of ~{*}360K{*} metadata requests / sec.
> With the same job at a 60-second micro batch we observe *~60K* metadata requests.
>
> Metadata requests are controlled by *metadata.max.age.ms* (5 minutes by default), but this config has no effect on the Spark consumers; we still see this huge number of requests.
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45745) Extremely slow execution of sum of columns in Spark 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-45745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins resolved SPARK-45745. --- Resolution: Duplicate > Extremely slow execution of sum of columns in Spark 3.4.1 > - > > Key: SPARK-45745 > URL: https://issues.apache.org/jira/browse/SPARK-45745 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.1 >Reporter: Javier >Priority: Major > > We are in the process of upgrading some pySpark jobs from Spark 3.1.2 to > Spark 3.4.1 and some code that was running fine is now basically never ending > even for small dataframes. > We have simplified the problematic piece of code and the minimum pySpark > example below shows the issue: > {code:java} > n_cols = 50 > data = [{f"col{i}": i for i in range(n_cols)} for _ in range(5)] > df_data = sql_context.createDataFrame(data) > df_data = df_data.withColumn( > "col_sum", sum([F.col(f"col{i}") for i in range(n_cols)]) > ) > df_data.show(10, False) {code} > Basically, this code with Spark 3.1.2 runs fine but with 3.4.1 the > computation time seems to explode when the value of `n_cols` is bigger than > about 25 columns. A colleague suggested that it could be related to the limit > of 22 elements in a tuple in Scala 2.13 > (https://www.scala-lang.org/api/current/scala/Tuple22.html), since the 25 > columns are suspiciously close to this. Is there any known defect in the > logical plan optimization in 3.4.1? Or is this kind of operations (sum of > multiple columns) supposed to be implemented differently? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49313) Upgrade `DB2` & `MySQL` & `Postgres` & `Mariadb` docker image version
[ https://issues.apache.org/jira/browse/SPARK-49313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-49313: - Assignee: BingKun Pan > Upgrade `DB2` & `MySQL` & `Postgres` & `Mariadb` docker image version > - > > Key: SPARK-49313 > URL: https://issues.apache.org/jira/browse/SPARK-49313 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49313) Upgrade `DB2` & `MySQL` & `Postgres` & `Mariadb` docker image version
[ https://issues.apache.org/jira/browse/SPARK-49313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-49313. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47809 [https://github.com/apache/spark/pull/47809] > Upgrade `DB2` & `MySQL` & `Postgres` & `Mariadb` docker image version > - > > Key: SPARK-49313 > URL: https://issues.apache.org/jira/browse/SPARK-49313 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49399) Add examples for different Spark image types
[ https://issues.apache.org/jira/browse/SPARK-49399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-49399. --- Fix Version/s: kubernetes-operator-0.1.0 Resolution: Fixed Issue resolved by pull request 108 [https://github.com/apache/spark-kubernetes-operator/pull/108] > Add examples for different Spark image types > > > Key: SPARK-49399 > URL: https://issues.apache.org/jira/browse/SPARK-49399 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Assignee: Zhou JIANG >Priority: Major > Labels: pull-request-available > Fix For: kubernetes-operator-0.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49399) Add examples for different Spark image types
[ https://issues.apache.org/jira/browse/SPARK-49399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-49399: - Assignee: Zhou JIANG > Add examples for different Spark image types > > > Key: SPARK-49399 > URL: https://issues.apache.org/jira/browse/SPARK-49399 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Assignee: Zhou JIANG >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49399) Add `pi-scala.yaml` and `pyspark-pi.yaml`
[ https://issues.apache.org/jira/browse/SPARK-49399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-49399: -- Summary: Add `pi-scala.yaml` and `pyspark-pi.yaml` (was: Add examples for different Spark image types) > Add `pi-scala.yaml` and `pyspark-pi.yaml` > - > > Key: SPARK-49399 > URL: https://issues.apache.org/jira/browse/SPARK-49399 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Assignee: Zhou JIANG >Priority: Major > Labels: pull-request-available > Fix For: kubernetes-operator-0.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49447) Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less than 100
[ https://issues.apache.org/jira/browse/SPARK-49447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-49447: -- Description: The default value is `1s` (=1000 ms). Usually, a small value like `1` happens when users make a mistake and forget to add the unit, `s`. > Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less > than 100 > --- > > Key: SPARK-49447 > URL: https://issues.apache.org/jira/browse/SPARK-49447 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > The default value is `1s` (=1000 ms). Usually, a small value like `1` happens > when users make a mistake and forget to add the unit, `s`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
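As a small illustration of the failure mode described above (a sketch, not part of the linked change): leaving out the unit produces a tiny delay, while the intended value carries the `s` suffix.
{code:scala}
import org.apache.spark.SparkConf

// Sketch: the mistake vs. the intended setting.
val mistaken = new SparkConf().set("spark.kubernetes.allocation.batch.delay", "1")  // no unit; read as 1 ms per the description
val intended = new SparkConf().set("spark.kubernetes.allocation.batch.delay", "1s") // the documented 1 s default, made explicit
{code}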
[jira] [Updated] (SPARK-49447) Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less than 100
[ https://issues.apache.org/jira/browse/SPARK-49447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49447: --- Labels: pull-request-available (was: ) > Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less > than 100 > --- > > Key: SPARK-49447 > URL: https://issues.apache.org/jira/browse/SPARK-49447 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > The default value is `1s` (=1000). Usually, a small value like `1` happens > when users do mistakes and forget to add the unit, `s`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49447) Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less than 100
[ https://issues.apache.org/jira/browse/SPARK-49447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-49447: - Assignee: Dongjoon Hyun > Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less > than 100 > --- > > Key: SPARK-49447 > URL: https://issues.apache.org/jira/browse/SPARK-49447 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > The default value is `1s` (=1000). Usually, a small value like `1` happens > when users do mistakes and forget to add the unit, `s`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48781) Add Catalog APIs for loading stored procedures
[ https://issues.apache.org/jira/browse/SPARK-48781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48781. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47190 [https://github.com/apache/spark/pull/47190] > Add Catalog APIs for loading stored procedures > -- > > Key: SPARK-48781 > URL: https://issues.apache.org/jira/browse/SPARK-48781 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Add new connector catalog APIs for loading stored procedures. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48781) Add Catalog APIs for loading stored procedures
[ https://issues.apache.org/jira/browse/SPARK-48781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48781: - Assignee: Anton Okolnychyi > Add Catalog APIs for loading stored procedures > -- > > Key: SPARK-48781 > URL: https://issues.apache.org/jira/browse/SPARK-48781 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Labels: pull-request-available > > Add new connector catalog APIs for loading stored procedures. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41262) Enable canChangeCachedPlanOutputPartitioning by default
[ https://issues.apache.org/jira/browse/SPARK-41262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-41262: --- Labels: pull-request-available (was: ) > Enable canChangeCachedPlanOutputPartitioning by default > --- > > Key: SPARK-41262 > URL: https://issues.apache.org/jira/browse/SPARK-41262 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > Labels: pull-request-available > > Remove the `internal` tag of > `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning`, and tune it from > false to true by default to make AQE work with cached plan. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48400) Promote `PrometheusServlet` to `DeveloperApi`
[ https://issues.apache.org/jira/browse/SPARK-48400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48400. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46716 [https://github.com/apache/spark/pull/46716] > Promote `PrometheusServlet` to `DeveloperApi` > - > > Key: SPARK-48400 > URL: https://issues.apache.org/jira/browse/SPARK-48400 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Assignee: Zhou JIANG >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45923) Spark Kubernetes Operator
[ https://issues.apache.org/jira/browse/SPARK-45923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45923. --- Fix Version/s: 4.0.0 Resolution: Fixed > Spark Kubernetes Operator > - > > Key: SPARK-45923 > URL: https://issues.apache.org/jira/browse/SPARK-45923 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou Jiang >Assignee: Zhou Jiang >Priority: Major > Labels: SPIP, pull-request-available > Fix For: 4.0.0 > > > We would like to develop a Java-based Kubernetes operator for Apache Spark. > Following the operator pattern > (https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark > users may manage applications and related components seamlessly using native > tools like kubectl. The primary goal is to simplify the Spark user experience > on Kubernetes, minimizing the learning curve and operational complexities and > therefore enable users to focus on the Spark application development. > Ideally, it would reside in a separate repository (like Spark docker or Spark > connect golang) and be loosely connected to the Spark release cycle while > supporting multiple Spark versions. > SPIP doc: > [https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE|https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE/edit#heading=h.hhham7siu2vi] > Dev email discussion : > [https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45923) Spark Kubernetes Operator
[ https://issues.apache.org/jira/browse/SPARK-45923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45923: -- Labels: SPIP releasenotes (was: SPIP pull-request-available) > Spark Kubernetes Operator > - > > Key: SPARK-45923 > URL: https://issues.apache.org/jira/browse/SPARK-45923 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou Jiang >Assignee: Zhou Jiang >Priority: Major > Labels: SPIP, releasenotes > Fix For: 4.0.0 > > > We would like to develop a Java-based Kubernetes operator for Apache Spark. > Following the operator pattern > (https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark > users may manage applications and related components seamlessly using native > tools like kubectl. The primary goal is to simplify the Spark user experience > on Kubernetes, minimizing the learning curve and operational complexities and > therefore enable users to focus on the Spark application development. > Ideally, it would reside in a separate repository (like Spark docker or Spark > connect golang) and be loosely connected to the Spark release cycle while > supporting multiple Spark versions. > SPIP doc: > [https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE|https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE/edit#heading=h.hhham7siu2vi] > Dev email discussion : > [https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49448) Spark Connect ExecuteThreadRunner promise will always complete with success.
[ https://issues.apache.org/jira/browse/SPARK-49448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LIU updated SPARK-49448: Description: {code:java} //代码占位符 {code} private class ExecutionThread(onCompletionPromise: Promise[Unit]) extends Thread(s"SparkConnectExecuteThread_opId=${executeHolder.operationId}") \{ override def run(): Unit = { try { execute() onCompletionPromise.success(()) } catch \{ case NonFatal(e) => onCompletionPromise.failure(e) } } } execute method end with ErrorUtils.handleError() function call. if any excetion throw. it will not catch by promise. is it better to catch real exceptions with promises instead of. was: private class ExecutionThread(onCompletionPromise: Promise[Unit]) extends Thread(s"SparkConnectExecuteThread_opId=${executeHolder.operationId}") { override def run(): Unit = { try { execute() onCompletionPromise.success(()) } catch { case NonFatal(e) => onCompletionPromise.failure(e) } } } execute method end with ErrorUtils.handleError() function call. if any excetion throw. it will not catch by promise. is it better to catch real exceptions with promises instead of. > Spark Connect ExecuteThreadRunner promise will always complete with success. > > > Key: SPARK-49448 > URL: https://issues.apache.org/jira/browse/SPARK-49448 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: LIU >Priority: Minor > > {code:java} > //代码占位符 > {code} > private class ExecutionThread(onCompletionPromise: Promise[Unit]) extends > Thread(s"SparkConnectExecuteThread_opId=${executeHolder.operationId}") \{ > override def run(): Unit = { try { execute() onCompletionPromise.success(()) > } catch \{ case NonFatal(e) => onCompletionPromise.failure(e) } } } > > execute method end with ErrorUtils.handleError() function call. if any > excetion throw. it will not catch by promise. is it better to catch real > exceptions with promises instead of. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49448) Spark Connect ExecuteThreadRunner promise will always complete with success.
LIU created SPARK-49448: --- Summary: Spark Connect ExecuteThreadRunner promise will always complete with success. Key: SPARK-49448 URL: https://issues.apache.org/jira/browse/SPARK-49448 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: LIU private class ExecutionThread(onCompletionPromise: Promise[Unit]) extends Thread(s"SparkConnectExecuteThread_opId=${executeHolder.operationId}") { override def run(): Unit = { try { execute() onCompletionPromise.success(()) } catch { case NonFatal(e) => onCompletionPromise.failure(e) } } } execute method end with ErrorUtils.handleError() function call. if any excetion throw. it will not catch by promise. is it better to catch real exceptions with promises instead of. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49448) Spark Connect ExecuteThreadRunner promise will always complete with success.
[ https://issues.apache.org/jira/browse/SPARK-49448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LIU updated SPARK-49448: Description: {code:java} private class ExecutionThread(onCompletionPromise: Promise[Unit]) extends Thread(s"SparkConnectExecuteThread_opId=${executeHolder.operationId}") { override def run(): Unit = { try { execute() onCompletionPromise.success(()) } catch { case NonFatal(e) => onCompletionPromise.failure(e) } } }{code} The execute method ends with an ErrorUtils.handleError() call; if an exception is thrown there, it is not caught by the promise. Would it be better to complete the promise with the real exception instead? If wanted, I will submit this change. was: {code:java} //代码占位符 {code} private class ExecutionThread(onCompletionPromise: Promise[Unit]) extends Thread(s"SparkConnectExecuteThread_opId=${executeHolder.operationId}") \{ override def run(): Unit = { try { execute() onCompletionPromise.success(()) } catch \{ case NonFatal(e) => onCompletionPromise.failure(e) } } } execute method end with ErrorUtils.handleError() function call. if any excetion throw. it will not catch by promise. is it better to catch real exceptions with promises instead of. > Spark Connect ExecuteThreadRunner promise will always complete with success. > > > Key: SPARK-49448 > URL: https://issues.apache.org/jira/browse/SPARK-49448 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: LIU >Priority: Minor > > {code:java} > private class ExecutionThread(onCompletionPromise: Promise[Unit]) > extends > Thread(s"SparkConnectExecuteThread_opId=${executeHolder.operationId}") { > override def run(): Unit = { > try { > execute() > onCompletionPromise.success(()) > } catch { > case NonFatal(e) => > onCompletionPromise.failure(e) > } > } > }{code} > > The execute method ends with an ErrorUtils.handleError() call; if an exception > is thrown there, it is not caught by the promise. Would it be better to > complete the promise with the real exception instead? If wanted, I will submit this change. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
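A minimal, self-contained sketch of what the reporter appears to be proposing (the class and names below are illustrative, not the actual Spark Connect code): complete the promise with the real outcome of the work, so a thrown exception is propagated instead of being reported as success.
{code:scala}
import scala.concurrent.Promise
import scala.util.Try

// Illustrative analogue: the promise is completed with whatever the work produced,
// success or failure, rather than unconditionally with success.
class ExecutionThread(onCompletionPromise: Promise[Unit])(work: () => Unit)
    extends Thread("ExecuteThread") {
  override def run(): Unit = onCompletionPromise.complete(Try(work()))
}
{code}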
[jira] [Updated] (SPARK-49449) Remove string and binary from metadata in spec
[ https://issues.apache.org/jira/browse/SPARK-49449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49449: --- Labels: pull-request-available (was: ) > Remove string and binary from metadata in spec > -- > > Key: SPARK-49449 > URL: https://issues.apache.org/jira/browse/SPARK-49449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: David Cashman >Priority: Major > Labels: pull-request-available > > We never supported the string-from-metadata or binary-from-metadata. Remove > them for now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49450) Improve normalised collation names
[ https://issues.apache.org/jira/browse/SPARK-49450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihailo Milosevic updated SPARK-49450: -- Parent: (was: SPARK-46830) Issue Type: Improvement (was: Sub-task) > Improve normalised collation names > -- > > Key: SPARK-49450 > URL: https://issues.apache.org/jira/browse/SPARK-49450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49421) Create a shared RelationalGroupedDataset interface
[ https://issues.apache.org/jira/browse/SPARK-49421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49421: --- Labels: pull-request-available (was: ) > Create a shared RelationalGroupedDataset interface > -- > > Key: SPARK-49421 > URL: https://issues.apache.org/jira/browse/SPARK-49421 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 > Environment: Not sure if we should do this. Connect and Classic have > different semantics, so unification is a bit tricky. >Reporter: Herman van Hövell >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49421) Create a shared RelationalGroupedDataset interface
[ https://issues.apache.org/jira/browse/SPARK-49421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-49421: - Assignee: Herman van Hövell > Create a shared RelationalGroupedDataset interface > -- > > Key: SPARK-49421 > URL: https://issues.apache.org/jira/browse/SPARK-49421 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 > Environment: Not sure if we should do this. Connect and Classic have > different semantics, so unification is a bit tricky. >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49419) Create a shared DataFrameStatFunctions interface
[ https://issues.apache.org/jira/browse/SPARK-49419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-49419: - Assignee: Herman van Hövell > Create a shared DataFrameStatFunctions interface > > > Key: SPARK-49419 > URL: https://issues.apache.org/jira/browse/SPARK-49419 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49450) Improve normalised collation names
[ https://issues.apache.org/jira/browse/SPARK-49450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49450: --- Labels: pull-request-available (was: ) > Improve normalised collation names > -- > > Key: SPARK-49450 > URL: https://issues.apache.org/jira/browse/SPARK-49450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49422) Create a shared KeyValueGroupedDataset interface
[ https://issues.apache.org/jira/browse/SPARK-49422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell updated SPARK-49422: -- Description: This dhpou > Create a shared KeyValueGroupedDataset interface > > > Key: SPARK-49422 > URL: https://issues.apache.org/jira/browse/SPARK-49422 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 > Environment: Not sure if we should do this. Connect and Classic have > different semantics, so unification is a bit tricky. >Reporter: Herman van Hövell >Priority: Major > > This dhpou -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49422) Create a shared KeyValueGroupedDataset interface
[ https://issues.apache.org/jira/browse/SPARK-49422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell updated SPARK-49422: -- Environment: (was: Not sure if we should do this. Connect and Classic have different semantics, so unification is a bit tricky.) > Create a shared KeyValueGroupedDataset interface > > > Key: SPARK-49422 > URL: https://issues.apache.org/jira/browse/SPARK-49422 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Herman van Hövell >Priority: Major > > This dhpou -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49422) Create a shared KeyValueGroupedDataset interface
[ https://issues.apache.org/jira/browse/SPARK-49422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-49422: - Assignee: Herman van Hövell > Create a shared KeyValueGroupedDataset interface > > > Key: SPARK-49422 > URL: https://issues.apache.org/jira/browse/SPARK-49422 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > This dhpou -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49422) Create a shared KeyValueGroupedDataset interface
[ https://issues.apache.org/jira/browse/SPARK-49422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell updated SPARK-49422: -- Description: This should also implement RelationalGroupedDataset.as[K: Encoder, T: Encoder]: KeyValueGroupedDataset[K, T]. (was: This dhpou) > Create a shared KeyValueGroupedDataset interface > > > Key: SPARK-49422 > URL: https://issues.apache.org/jira/browse/SPARK-49422 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > This should also implement RelationalGroupedDataset.as[K: Encoder, T: > Encoder]: KeyValueGroupedDataset[K, T]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49451) Allow duplicate keys in parse_json.
Chenhao Li created SPARK-49451: -- Summary: Allow duplicate keys in parse_json. Key: SPARK-49451 URL: https://issues.apache.org/jira/browse/SPARK-49451 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Chenhao Li -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49423) Consolidate Observation into a single class in sql/api
[ https://issues.apache.org/jira/browse/SPARK-49423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-49423: - Assignee: Herman van Hövell > Consolidate Observation into a single class in sql/api > -- > > Key: SPARK-49423 > URL: https://issues.apache.org/jira/browse/SPARK-49423 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 > Environment: Not sure if we should do this. Connect and Classic have > different semantics, so unification is a bit tricky. >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Move the implementation specific bits out of the class, and only keep the > Observation class. While we are at it, let's also replace the homegrown > threading stuff by futures. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49451) Allow duplicate keys in parse_json.
[ https://issues.apache.org/jira/browse/SPARK-49451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49451: --- Labels: pull-request-available (was: ) > Allow duplicate keys in parse_json. > --- > > Key: SPARK-49451 > URL: https://issues.apache.org/jira/browse/SPARK-49451 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Chenhao Li >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49423) Consolidate Observation into a single class in sql/api
[ https://issues.apache.org/jira/browse/SPARK-49423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49423: --- Labels: pull-request-available (was: ) > Consolidate Observation into a single class in sql/api > -- > > Key: SPARK-49423 > URL: https://issues.apache.org/jira/browse/SPARK-49423 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 > Environment: Not sure if we should do this. Connect and Classic have > different semantics, so unification is a bit tricky. >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Labels: pull-request-available > > Move the implementation specific bits out of the class, and only keep the > Observation class. While we are at it, let's also replace the homegrown > threading stuff by futures. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49425) Create a shared DataFrameWriter interface
[ https://issues.apache.org/jira/browse/SPARK-49425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49425: --- Labels: pull-request-available (was: ) > Create a shared DataFrameWriter interface > - > > Key: SPARK-49425 > URL: https://issues.apache.org/jira/browse/SPARK-49425 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Herman van Hövell >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49447) Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less than 100
[ https://issues.apache.org/jira/browse/SPARK-49447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-49447. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47913 [https://github.com/apache/spark/pull/47913] > Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less > than 100 > --- > > Key: SPARK-49447 > URL: https://issues.apache.org/jira/browse/SPARK-49447 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > The default value is `1s` (=1000). Usually, a small value like `1` happens > when users do mistakes and forget to add the unit, `s`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46995) Allow AQE coalesce final stage in SQL cached plan
[ https://issues.apache.org/jira/browse/SPARK-46995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46995: --- Labels: pull-request-available (was: ) > Allow AQE coalesce final stage in SQL cached plan > - > > Key: SPARK-46995 > URL: https://issues.apache.org/jira/browse/SPARK-46995 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Ziqi Liu >Priority: Major > Labels: pull-request-available > > [https://github.com/apache/spark/pull/43435] and > [https://github.com/apache/spark/pull/43760] are fixing a correctness issue > which will be triggered when AQE applied on cached query plan, specifically, > when AQE coalescing the final result stage of the cached plan. > > The current semantic of > {{spark.sql.optimizer.canChangeCachedPlanOutputPartitioning}} > ([source > code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L403-L411]): > * when true, we enable AQE, but disable coalescing final stage > ({*}default{*}) > * when false, we disable AQE > > But let’s revisit the semantic of this config: actually for caller the only > thing that matters is whether we change the output partitioning of the cached > plan. And we should only try to apply AQE if possible. Thus we want to > modify the semantic of > {{spark.sql.optimizer.canChangeCachedPlanOutputPartitioning}} > * when true, we enable AQE and allow coalescing final: this might lead to > perf regression, because it introduce extra shuffle > * when false, we enable AQE, but disable coalescing final stage. *(this is > actually the `true` semantic of old behavior)* > Also, to keep the default behavior unchanged, we might want to flip the > default value of > {{spark.sql.optimizer.canChangeCachedPlanOutputPartitioning}} to `false` > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
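For concreteness, a sketch of how a caller would exercise the flag discussed above (the configuration name comes from the description; the cached query itself is a made-up example): with the flag set to true, AQE is allowed to change the cached plan's output partitioning, including coalescing its final stage.
{code:scala}
import org.apache.spark.sql.functions.col

// Sketch: pick the behavior before the plan is cached.
spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")

val cached = spark.range(0, 1000000)
  .repartition(col("id"))   // shuffle whose final stage AQE may now coalesce
  .cache()

cached.count()              // materializes the cache under the chosen setting
{code}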
[jira] [Created] (SPARK-49453) spark-kubernetes-operator-dynamic-configuration ConfigMap Data Overriding Failure
Qi Tan created SPARK-49453: -- Summary: spark-kubernetes-operator-dynamic-configuration ConfigMap Data Overriding Failure Key: SPARK-49453 URL: https://issues.apache.org/jira/browse/SPARK-49453 Project: Spark Issue Type: Sub-task Components: Kubernetes Affects Versions: 4.0.0 Reporter: Qi Tan I have a values.yaml as below: operatorConfiguration: dynamicConfig: enable: true create: true data: spark.kubernetes.operator.watchedNamespaces: "default, spark-1" helm install spark-kubernetes-operator --create-namespace -f build-tools/helm/spark-kubernetes-operator/values.yaml -f tests/e2e/helm/dynamic-config-values.yaml build-tools/helm/spark-kubernetes-operator/ The generated ConfigMap data field does not contain the line spark.kubernetes.operator.watchedNamespaces: "default, spark-1". Note that if you run helm install with --dry-run, the record exists. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49454) Avoid double normalization in the cache process
Xinyi Yu created SPARK-49454: Summary: Avoid double normalization in the cache process Key: SPARK-49454 URL: https://issues.apache.org/jira/browse/SPARK-49454 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Xinyi Yu There is an issue introduced in [#46465|https://github.com/apache/spark/pull/46465], which is that normalization is applied twice during the cache process. Some normalization rules may not be idempotent, so applying them repeatedly may break the plan shape and cause an unexpected cache miss. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46037) When Left Join build Left, ShuffledHashJoinExec may result in incorrect results
[ https://issues.apache.org/jira/browse/SPARK-46037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46037. - Fix Version/s: 4.0.0 3.5.3 Resolution: Fixed Issue resolved by pull request 47905 [https://github.com/apache/spark/pull/47905] > When Left Join build Left, ShuffledHashJoinExec may result in incorrect > results > --- > > Key: SPARK-46037 > URL: https://issues.apache.org/jira/browse/SPARK-46037 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: mcdull_zhang >Assignee: mcdull_zhang >Priority: Blocker > Labels: correctness, pull-request-available > Fix For: 4.0.0, 3.5.3 > > > When Left Join build Left and codegen is turned off, ShuffledHashJoinExec may > have incorrect results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46037) When Left Join build Left, ShuffledHashJoinExec may result in incorrect results
[ https://issues.apache.org/jira/browse/SPARK-46037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-46037: --- Assignee: mcdull_zhang > When Left Join build Left, ShuffledHashJoinExec may result in incorrect > results > --- > > Key: SPARK-46037 > URL: https://issues.apache.org/jira/browse/SPARK-46037 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: mcdull_zhang >Assignee: mcdull_zhang >Priority: Blocker > Labels: correctness, pull-request-available > > When Left Join build Left and codegen is turned off, ShuffledHashJoinExec may > have incorrect results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49446) Upgrade jetty to 11.0.23
[ https://issues.apache.org/jira/browse/SPARK-49446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-49446. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47912 [https://github.com/apache/spark/pull/47912] > Upgrade jetty to 11.0.23 > > > Key: SPARK-49446 > URL: https://issues.apache.org/jira/browse/SPARK-49446 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42879) Spark SQL reads unnecessary nested fields
[ https://issues.apache.org/jira/browse/SPARK-42879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiri Humpolicek updated SPARK-42879: Affects Version/s: 3.5.2 > Spark SQL reads unnecessary nested fields > - > > Key: SPARK-42879 > URL: https://issues.apache.org/jira/browse/SPARK-42879 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2, 3.5.2 >Reporter: Jiri Humpolicek >Priority: Major > > When we use more than one field from structure after explode, all fields will > be read. > Example: > 1) Loading data > {code:scala} > val jsonStr = """{ > "items": [ >{"itemId": 1, "itemData1": "a", "itemData2": 11}, >{"itemId": 2, "itemData1": "b", "itemData2": 22} > ] > }""" > val df = spark.read.json(Seq(jsonStr).toDS) > df.write.format("parquet").mode("overwrite").saveAsTable("persisted") > {code} > 2) read query with explain > {code:scala} > val read = spark.table("persisted") > spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true) > read > .select(explode('items).as('item)) > .select($"item.itemId", $"item.itemData1") > .explain > // ReadSchema: > struct>> > {code} > We use only *itemId* and *itemData1* fields from structure in array, but read > schema contains *itemData2* field as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49455) Refactor `StagingInMemoryTableCatalog` to override the non-deprecated functions
Yang Jie created SPARK-49455: Summary: Refactor `StagingInMemoryTableCatalog` to override the non-deprecated functions Key: SPARK-49455 URL: https://issues.apache.org/jira/browse/SPARK-49455 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49455) Refactor `StagingInMemoryTableCatalog` to override the non-deprecated functions
[ https://issues.apache.org/jira/browse/SPARK-49455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49455: --- Labels: pull-request-available (was: ) > Refactor `StagingInMemoryTableCatalog` to override the non-deprecated > functions > --- > > Key: SPARK-49455 > URL: https://issues.apache.org/jira/browse/SPARK-49455 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49456) Spark website doesn't properly scroll to hash links
Neil Ramaswamy created SPARK-49456: -- Summary: Spark website doesn't properly scroll to hash links Key: SPARK-49456 URL: https://issues.apache.org/jira/browse/SPARK-49456 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 4.0.0 Reporter: Neil Ramaswamy On the version-specific Spark documentation, if you click a header, the page will scroll past the actual content, hiding it. For example, if you go to [this link|https://spark.apache.org/docs/latest/#downloading], you'll probably notice the page scroll past "Downloads". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34638) Spark SQL reads unnecessary nested fields (another type of pruning case)
[ https://issues.apache.org/jira/browse/SPARK-34638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877595#comment-17877595 ] Jiri Humpolicek commented on SPARK-34638: - [~viirya] Do you think it would be possible to do that? I think it will be great feature when spark reads only necessary fields from query in general way. In case of rich nested structures it could safe huge amount of resources. I found unresolved improvement for this more general case from last year https://issues.apache.org/jira/browse/SPARK-42879 . > Spark SQL reads unnecessary nested fields (another type of pruning case) > > > Key: SPARK-34638 > URL: https://issues.apache.org/jira/browse/SPARK-34638 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Jiri Humpolicek >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > Based on this [SPARK-29721|https://issues.apache.org/jira/browse/SPARK-29721] > I found another nested fields pruning case. > Example: > 1) Loading data > {code:scala} > val jsonStr = """{ > "items": [ >{"itemId": 1, "itemData": "a"}, >{"itemId": 2, "itemData": "b"} > ] > }""" > val df = spark.read.json(Seq(jsonStr).toDS) > df.write.format("parquet").mode("overwrite").saveAsTable("persisted") > {code} > 2) read query with explain > {code:scala} > val read = spark.table("persisted") > spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true) > read.select(explode($"items").as('item)).select($"item.itemId").explain(true) > // ReadSchema: struct>> > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34638) Spark SQL reads unnecessary nested fields (another type of pruning case)
[ https://issues.apache.org/jira/browse/SPARK-34638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877595#comment-17877595 ] Jiri Humpolicek edited comment on SPARK-34638 at 8/29/24 6:03 AM: -- [~viirya] Do you think it would be possible to do that? I think it will be great feature when spark reads only necessary fields from query in general way. In case of rich nested structures it could save huge amount of resources. I found unresolved improvement for this more general case from last year https://issues.apache.org/jira/browse/SPARK-42879 . was (Author: yuryn): [~viirya] Do you think it would be possible to do that? I think it will be great feature when spark reads only necessary fields from query in general way. In case of rich nested structures it could safe huge amount of resources. I found unresolved improvement for this more general case from last year https://issues.apache.org/jira/browse/SPARK-42879 . > Spark SQL reads unnecessary nested fields (another type of pruning case) > > > Key: SPARK-34638 > URL: https://issues.apache.org/jira/browse/SPARK-34638 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Jiri Humpolicek >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > Based on this [SPARK-29721|https://issues.apache.org/jira/browse/SPARK-29721] > I found another nested fields pruning case. > Example: > 1) Loading data > {code:scala} > val jsonStr = """{ > "items": [ >{"itemId": 1, "itemData": "a"}, >{"itemId": 2, "itemData": "b"} > ] > }""" > val df = spark.read.json(Seq(jsonStr).toDS) > df.write.format("parquet").mode("overwrite").saveAsTable("persisted") > {code} > 2) read query with explain > {code:scala} > val read = spark.table("persisted") > spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true) > read.select(explode($"items").as('item)).select($"item.itemId").explain(true) > // ReadSchema: struct>> > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49456) Spark website doesn't properly scroll to hash links
[ https://issues.apache.org/jira/browse/SPARK-49456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49456: --- Labels: pull-request-available (was: ) > Spark website doesn't properly scroll to hash links > > > Key: SPARK-49456 > URL: https://issues.apache.org/jira/browse/SPARK-49456 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Neil Ramaswamy >Priority: Major > Labels: pull-request-available > > On the version-specific Spark documentation, if you click a header, the page > will scroll past the actual content, hiding it. For example, if you go to > [this link|https://spark.apache.org/docs/latest/#downloading], you'll > probably notice the page scroll past "Downloads". > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49457) Remove uncommon curl option --retry-all-errors
Cheng Pan created SPARK-49457: - Summary: Remove uncommon curl option --retry-all-errors Key: SPARK-49457 URL: https://issues.apache.org/jira/browse/SPARK-49457 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49457) Remove uncommon curl option --retry-all-errors
[ https://issues.apache.org/jira/browse/SPARK-49457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49457: --- Labels: pull-request-available (was: ) > Remove uncommon curl option --retry-all-errors > -- > > Key: SPARK-49457 > URL: https://issues.apache.org/jira/browse/SPARK-49457 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49259) Size based partition creation during kafka read
[ https://issues.apache.org/jira/browse/SPARK-49259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49259: --- Labels: pull-request-available (was: ) > Size based partition creation during kafka read > --- > > Key: SPARK-49259 > URL: https://issues.apache.org/jira/browse/SPARK-49259 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Subham Singhal >Priority: Minor > Labels: pull-request-available > > Currently Spark + Kafka structured streaming provides the *minPartitions* config > to create more partitions than Kafka has. This is helpful for increasing > parallelism, but this value cannot be changed dynamically. > It would be better to increase Spark partitions dynamically based on input > size: if the input size is high, create more partitions. We can take *avg msg > size* and *maxBytesPerPartition* as input and dynamically create partitions > to handle varying loads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
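For context, a sketch of the existing static knob the description refers to; maxBytesPerPartition is only the proposed option and does not exist today. Broker and topic are placeholders.
{code:scala}
// Sketch: today's static minPartitions option on the Kafka source.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
  .option("subscribe", "events")                     // placeholder topic
  .option("minPartitions", "64")                     // fixed up front; cannot adapt to input size
  // .option("maxBytesPerPartition", "134217728")    // proposed in this ticket; not an existing option
  .load()
{code}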
[jira] [Updated] (SPARK-49453) spark-kubernetes-operator-dynamic-configuration ConfigMap Data Overriding Failure
[ https://issues.apache.org/jira/browse/SPARK-49453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49453: --- Labels: pull-request-available (was: ) > spark-kubernetes-operator-dynamic-configuration ConfigMap Data Overriding > Failure > - > > Key: SPARK-49453 > URL: https://issues.apache.org/jira/browse/SPARK-49453 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Qi Tan >Priority: Trivial > Labels: pull-request-available > > I have a value.yaml as below: > operatorConfiguration: > dynamicConfig: > enable: true > create: true > data: > spark.kubernetes.operator.watchedNamespaces: "default, spark-1" > helm install spark-kubernetes-operator --create-namespace -f > build-tools/helm/spark-kubernetes-operator/values.yaml -f > tests/e2e/helm/dynamic-config-values.yaml > build-tools/helm/spark-kubernetes-operator/ > The generated configmap data field does not contains the line > spark.kubernetes.operator.watchedNamespaces: "default, spark-1". Note that if > you run helm install --dry-run, the record exist -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org