[jira] [Commented] (SPARK-49442) Complete Metadata requests on each micro batch causing Kafka issues
[ https://issues.apache.org/jira/browse/SPARK-49442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877266#comment-17877266 ] vipin Kumar commented on SPARK-49442: - [~kabhwan] we applied *metadata.max.age.ms* directly in the Kafka consumer config, not through the Spark config.
> Complete Metadata requests on each micro batch causing Kafka issues
> ---
>
> Key: SPARK-49442
> URL: https://issues.apache.org/jira/browse/SPARK-49442
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.3.2
> Reporter: vipin Kumar
> Priority: Major
> Labels: Kafka, spark-streaming-kafka
>
> We have noticed that Spark issues complete metadata requests on each micro batch, which causes a high volume of metadata requests at small micro-batch intervals.
>
> For example, with a Kafka topic of 1900 partitions and a 10-second micro batch we see on the order of ~{*}360K{*} metadata requests / sec.
> With the same job at a 60-second micro batch we observe *~60K* metadata requests.
>
> Metadata requests are controlled by *metadata.max.age.ms* (5 minutes by default), but this config has no effect on the Spark consumers; we still see this huge number of requests.
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-49442) Complete Metadata requests on each micro batch causing Kafka issues
[ https://issues.apache.org/jira/browse/SPARK-49442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877266#comment-17877266 ] vipin Kumar edited comment on SPARK-49442 at 8/28/24 7:29 AM: -- Thanks [~kabhwan] for the quick reply. We applied *metadata.max.age.ms* directly in the Kafka consumer config, not through the Spark config; we will apply it through the Spark config as well.
was (Author: vipin77): [~kabhwan] we applied *metadata.max.age.ms* directly in the Kafka consumer config, not through the Spark config.
> Complete Metadata requests on each micro batch causing Kafka issues
> ---
>
> Key: SPARK-49442
> URL: https://issues.apache.org/jira/browse/SPARK-49442
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.3.2
> Reporter: vipin Kumar
> Priority: Major
> Labels: Kafka, spark-streaming-kafka
>
> We have noticed that Spark issues complete metadata requests on each micro batch, which causes a high volume of metadata requests at small micro-batch intervals.
>
> For example, with a Kafka topic of 1900 partitions and a 10-second micro batch we see on the order of ~{*}360K{*} metadata requests / sec.
> With the same job at a 60-second micro batch we observe *~60K* metadata requests.
>
> Metadata requests are controlled by *metadata.max.age.ms* (5 minutes by default), but this config has no effect on the Spark consumers; we still see this huge number of requests.
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-49442) Complete Metadata requests on each micro batch causing Kafka issues
[ https://issues.apache.org/jira/browse/SPARK-49442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877266#comment-17877266 ] vipin Kumar edited comment on SPARK-49442 at 8/28/24 7:29 AM: -- Thanks [~kabhwan] for the quick reply. We applied *metadata.max.age.ms* directly in the Kafka consumer config, not through the Spark config; we will check the other config as well.
was (Author: vipin77): Thanks [~kabhwan] for the quick reply. We applied *metadata.max.age.ms* directly in the Kafka consumer config, not through the Spark config; we will apply it through the Spark config as well.
> Complete Metadata requests on each micro batch causing Kafka issues
> ---
>
> Key: SPARK-49442
> URL: https://issues.apache.org/jira/browse/SPARK-49442
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.3.2
> Reporter: vipin Kumar
> Priority: Major
> Labels: Kafka, spark-streaming-kafka
>
> We have noticed that Spark issues complete metadata requests on each micro batch, which causes a high volume of metadata requests at small micro-batch intervals.
>
> For example, with a Kafka topic of 1900 partitions and a 10-second micro batch we see on the order of ~{*}360K{*} metadata requests / sec.
> With the same job at a 60-second micro batch we observe *~60K* metadata requests.
>
> Metadata requests are controlled by *metadata.max.age.ms* (5 minutes by default), but this config has no effect on the Spark consumers; we still see this huge number of requests.
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
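For context on the exchange above: Kafka consumer properties such as *metadata.max.age.ms* only reach the consumers that Spark manages when they are passed as "kafka."-prefixed options on the Kafka source; a standalone consumer config is not picked up. The snippet below is a minimal, hypothetical sketch of that wiring — the broker address, topic name, and checkpoint path are placeholders, not details from the ticket.

{code:scala}
// Hedged sketch, not from the ticket: pass consumer properties to the Kafka
// source with a "kafka." prefix so Spark forwards them to its own consumers.
// Broker address, topic, and checkpoint path are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-metadata-age-sketch").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-1:9092")   // placeholder
  .option("subscribe", "events")                        // placeholder topic
  .option("kafka.metadata.max.age.ms", "300000")        // forwarded to the Kafka consumer
  .load()

stream.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/events")  // placeholder
  .start()
  .awaitTermination()
{code}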
[jira] [Assigned] (SPARK-49439) Fix the pretty name of the `FromProtobuf` & `ToProtobuf` expression
[ https://issues.apache.org/jira/browse/SPARK-49439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-49439: Assignee: BingKun Pan > Fix the pretty name of the `FromProtobuf` & `ToProtobuf` expression > > > Key: SPARK-49439 > URL: https://issues.apache.org/jira/browse/SPARK-49439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49439) Fix the pretty name of the `FromProtobuf` & `ToProtobuf` expression
[ https://issues.apache.org/jira/browse/SPARK-49439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-49439. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47901 [https://github.com/apache/spark/pull/47901] > Fix the pretty name of the `FromProtobuf` & `ToProtobuf` expression > > > Key: SPARK-49439 > URL: https://issues.apache.org/jira/browse/SPARK-49439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49438) Fix the pretty name of the `FromAvro` & `ToAvro` expression
[ https://issues.apache.org/jira/browse/SPARK-49438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-49438. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47900 [https://github.com/apache/spark/pull/47900] > Fix the pretty name of the `FromAvro` & `ToAvro` expression > > > Key: SPARK-49438 > URL: https://issues.apache.org/jira/browse/SPARK-49438 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49438) Fix the pretty name of the `FromAvro` & `ToAvro` expression
[ https://issues.apache.org/jira/browse/SPARK-49438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-49438: Assignee: BingKun Pan > Fix the pretty name of the `FromAvro` & `ToAvro` expression > > > Key: SPARK-49438 > URL: https://issues.apache.org/jira/browse/SPARK-49438 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49383) Support Transpose DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-49383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49383: -- Assignee: Apache Spark > Support Transpose DataFrame API > --- > > Key: SPARK-49383 > URL: https://issues.apache.org/jira/browse/SPARK-49383 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > Support Transpose as Scala/Python DataFrame API in both Spark Connect and > Classic Spark. > Transposing data is a crucial operation in data analysis, enabling the > transformation of rows into columns. This operation is widely used in tools > like pandas and numpy, allowing for more flexible data manipulation and > visualization. > While Apache Spark supports unpivot and pivot operations, it currently lacks > a built-in transpose function. Implementing a transpose operation in Spark > would enhance its data processing capabilities, aligning it with the > functionalities available in pandas and numpy, and further empowering users > in their data analysis workflows. > Please see > [https://docs.google.com/document/d/1QSmG81qQ-muab0UOeqgDAELqF7fJTH8GnxCJF4Ir-kA/edit] > for a detailed design. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
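As a rough illustration of the feature proposed above, the sketch below shows what the Scala side of such an API could look like. The method shape (a no-argument transpose() that takes the first column as the source of the new column names, similar to pandas.DataFrame.transpose) is an assumption for illustration only; the authoritative signature is whatever the linked design doc and PR define.

{code:scala}
// Hedged sketch of the proposed DataFrame transpose API; treat df.transpose()
// below as illustrative rather than the final signature.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("transpose-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("apples", 3, 5), ("oranges", 7, 2)).toDF("fruit", "day1", "day2")

// Rows become columns; the first column is assumed to supply the new column names.
val transposed = df.transpose()
transposed.show()
{code}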
[jira] [Created] (SPARK-49443) Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for Variant Objects
Harsh Motwani created SPARK-49443:
-
Summary: Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for Variant Objects
Key: SPARK-49443
URL: https://issues.apache.org/jira/browse/SPARK-49443
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.0.0
Reporter: Harsh Motwani

Casts from structs to variant objects should not be legal, since variant objects are unordered bags of key-value pairs while structs are ordered sets of elements of fixed types. Therefore, casts between structs and variant objects do not behave like casts between structs. Example (produced by Serge Rielau):

{code:java}
scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2') as struct<b string, c int>)").show()
+------------------------+
|named_struct(c, 1, b, 2)|
+------------------------+
|                  {1, 2}|
+------------------------+

Passing a struct into VARIANT loses the position

scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2')::variant as struct<b string, c int>)").show()
+-----------------------------------------+
|CAST(named_struct(c, 1, b, 2) AS VARIANT)|
+-----------------------------------------+
|                                   {2, 1}|
+-----------------------------------------+
{code}

Casts from maps to variant objects should also not be legal, since they represent completely orthogonal data types. Maps can represent a variable number of key-value pairs based on just a key type and a value type in the schema, but in objects the schema (produced by the schema_of_variant expressions) has a type corresponding to each value in the object. Objects can have values of different types while maps cannot, and objects can only have string keys while maps can also have complex keys.

We should therefore prohibit the existing behavior of allowing explicit casts from structs and maps to variants, as the variant spec currently only supports an object type, which is only remotely compatible with structs and maps. We should introduce a new expression that converts schemas containing structs and maps to variants. We will call it `to_variant_object`.

Also, the schema_of_variant and schema_of_variant_agg expressions currently print STRUCT when Variant Objects are observed. We should correct that to OBJECT.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
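To make the proposal above concrete, here is a hedged, spark-shell-style sketch of how the proposed expression and the corrected schema_of_variant output might be exercised. The expression name comes from the ticket text, while its exact signature and output are whatever the linked PR finally implements.

{code:scala}
// Hedged sketch of the proposed behavior; to_variant_object's final semantics
// are defined by the PR, so this is illustrative only.
// Explicit conversion of a struct value to a VARIANT object:
spark.sql("SELECT to_variant_object(named_struct('c', 1, 'b', '2'))").show(false)

// schema_of_variant reporting OBJECT (rather than STRUCT) for a variant object:
spark.sql("SELECT schema_of_variant(to_variant_object(named_struct('c', 1, 'b', '2')))").show(false)
{code}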
[jira] [Updated] (SPARK-49443) Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for Variant Objects
[ https://issues.apache.org/jira/browse/SPARK-49443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49443: --- Labels: pull-request-available (was: ) > Implement to_variant_object expression and make schema_of_variant expressions > print OBJECT for for Variant Objects > -- > > Key: SPARK-49443 > URL: https://issues.apache.org/jira/browse/SPARK-49443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Harsh Motwani >Priority: Major > Labels: pull-request-available > > Cast from structs to variant objects should not be legal since variant > objects are unordered bags of key-value pairs while structs are ordered sets > of elements of fixed types. Therefore, casts between structs and variant > objects do not behave like casts between structs. Example (produced by Serge > Rielau): > {code:java} > scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2') as struct c int>)").show() > ++ > |named_struct(c, 1, b, 2)| > ++ > |{1, 2}| > ++ > Passing a struct into VARIANT loses the position > scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2')::variant as > struct)").show() > +-+ > |CAST(named_struct(c, 1, b, 2) AS VARIANT)| > +-+ > |{2, 1}| > +-+ > {code} > Casts from maps to variant objects should also not be legal since they > represent completely orthogonal data types. Maps can represent a variable > number of key value pairs based on just a key and value type in the schema > but in objects, the schema (produced by schema_of_variant expressions) will > have a type corresponding to each value in the object. Objects can have > values of different types while maps cannot and objects can only have string > keys while maps can also have complex keys. > We should therefore prohibit the existing behavior of allowing explicit casts > from structs and maps to variants as the variant spec currently only supports > an object type which is remotely compatible with structs and maps. We should > introduce a new expression that converts schemas containing structs and maps > to variants. We will call it `to_variant_object`. > Also, schema_of_variant and schema_of_variant_agg expressions currently print > STRUCT when Variant Objects are observed. We should also correct that to > OBJECT. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49443) Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for Variant Objects
[ https://issues.apache.org/jira/browse/SPARK-49443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49443: -- Assignee: Apache Spark > Implement to_variant_object expression and make schema_of_variant expressions > print OBJECT for for Variant Objects > -- > > Key: SPARK-49443 > URL: https://issues.apache.org/jira/browse/SPARK-49443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Harsh Motwani >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > Cast from structs to variant objects should not be legal since variant > objects are unordered bags of key-value pairs while structs are ordered sets > of elements of fixed types. Therefore, casts between structs and variant > objects do not behave like casts between structs. Example (produced by Serge > Rielau): > {code:java} > scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2') as struct c int>)").show() > ++ > |named_struct(c, 1, b, 2)| > ++ > |{1, 2}| > ++ > Passing a struct into VARIANT loses the position > scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2')::variant as > struct)").show() > +-+ > |CAST(named_struct(c, 1, b, 2) AS VARIANT)| > +-+ > |{2, 1}| > +-+ > {code} > Casts from maps to variant objects should also not be legal since they > represent completely orthogonal data types. Maps can represent a variable > number of key value pairs based on just a key and value type in the schema > but in objects, the schema (produced by schema_of_variant expressions) will > have a type corresponding to each value in the object. Objects can have > values of different types while maps cannot and objects can only have string > keys while maps can also have complex keys. > We should therefore prohibit the existing behavior of allowing explicit casts > from structs and maps to variants as the variant spec currently only supports > an object type which is remotely compatible with structs and maps. We should > introduce a new expression that converts schemas containing structs and maps > to variants. We will call it `to_variant_object`. > Also, schema_of_variant and schema_of_variant_agg expressions currently print > STRUCT when Variant Objects are observed. We should also correct that to > OBJECT. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49443) Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for Variant Objects
[ https://issues.apache.org/jira/browse/SPARK-49443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49443: -- Assignee: (was: Apache Spark) > Implement to_variant_object expression and make schema_of_variant expressions > print OBJECT for for Variant Objects > -- > > Key: SPARK-49443 > URL: https://issues.apache.org/jira/browse/SPARK-49443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Harsh Motwani >Priority: Major > Labels: pull-request-available > > Cast from structs to variant objects should not be legal since variant > objects are unordered bags of key-value pairs while structs are ordered sets > of elements of fixed types. Therefore, casts between structs and variant > objects do not behave like casts between structs. Example (produced by Serge > Rielau): > {code:java} > scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2') as struct c int>)").show() > ++ > |named_struct(c, 1, b, 2)| > ++ > |{1, 2}| > ++ > Passing a struct into VARIANT loses the position > scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2')::variant as > struct)").show() > +-+ > |CAST(named_struct(c, 1, b, 2) AS VARIANT)| > +-+ > |{2, 1}| > +-+ > {code} > Casts from maps to variant objects should also not be legal since they > represent completely orthogonal data types. Maps can represent a variable > number of key value pairs based on just a key and value type in the schema > but in objects, the schema (produced by schema_of_variant expressions) will > have a type corresponding to each value in the object. Objects can have > values of different types while maps cannot and objects can only have string > keys while maps can also have complex keys. > We should therefore prohibit the existing behavior of allowing explicit casts > from structs and maps to variants as the variant spec currently only supports > an object type which is remotely compatible with structs and maps. We should > introduce a new expression that converts schemas containing structs and maps > to variants. We will call it `to_variant_object`. > Also, schema_of_variant and schema_of_variant_agg expressions currently print > STRUCT when Variant Objects are observed. We should also correct that to > OBJECT. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49410) Update collation benchmarks
[ https://issues.apache.org/jira/browse/SPARK-49410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49410: -- Assignee: Apache Spark > Update collation benchmarks > --- > > Key: SPARK-49410 > URL: https://issues.apache.org/jira/browse/SPARK-49410 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49410) Update collation benchmarks
[ https://issues.apache.org/jira/browse/SPARK-49410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49410: -- Assignee: (was: Apache Spark) > Update collation benchmarks > --- > > Key: SPARK-49410 > URL: https://issues.apache.org/jira/browse/SPARK-49410 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43242) diagnoseCorruption should not throw Unexpected type of BlockId for ShuffleBlockBatchId
[ https://issues.apache.org/jira/browse/SPARK-43242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-43242. - Fix Version/s: 4.0.0 Assignee: Zhang Liang Resolution: Fixed > diagnoseCorruption should not throw Unexpected type of BlockId for > ShuffleBlockBatchId > -- > > Key: SPARK-43242 > URL: https://issues.apache.org/jira/browse/SPARK-43242 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.4 >Reporter: Zhang Liang >Assignee: Zhang Liang >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Some of our spark app throw "Unexpected type of BlockId" exception as shown > below > According to BlockId.scala, we can found format such as > *shuffle_12_5868_518_523* is type of `ShuffleBlockBatchId`, which is not > handled properly in `ShuffleBlockFetcherIterator.diagnoseCorruption`. > > Moreover, the new exception thrown in `diagnose` swallow the real exception > in certain cases. Since diagnoseCorruption is always used in exception > handling as a side dish, I think it shouldn't throw exception at all > > {code:java} > 23/03/07 03:01:24,485 [task-result-getter-1] WARN TaskSetManager: Lost task > 104.0 in stage 36.0 (TID 6169): java.lang.IllegalArgumentException: > Unexpected type of BlockId, shuffle_12_5868_518_523 at > org.apache.spark.storage.ShuffleBlockFetcherIterator.diagnoseCorruption(ShuffleBlockFetcherIterator.scala:1079)at > > org.apache.spark.storage.BufferReleasingInputStream.$anonfun$tryOrFetchFailedException$1(ShuffleBlockFetcherIterator.scala:1314) > at scala.Option.map(Option.scala:230)at > org.apache.spark.storage.BufferReleasingInputStream.tryOrFetchFailedException(ShuffleBlockFetcherIterator.scala:1313) > at > org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:1299) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at > java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at > java.io.BufferedInputStream.read(BufferedInputStream.java:345) at > java.io.DataInputStream.read(DataInputStream.java:149) at > org.sparkproject.guava.io.ByteStreams.read(ByteStreams.java:899) at > org.sparkproject.guava.io.ByteStreams.readFully(ByteStreams.java:733) at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:496) at > scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.sort_addToSorter_0$(Unknown > Source) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) > at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:82) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:1065) > at > 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:1024) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:1201) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:1240) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225) > at > org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$a
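As a side note on the block-id format discussed in the ticket above, a small hedged sketch: Spark's own BlockId factory (a developer API) parses the four-number shuffle id from the error into a ShuffleBlockBatchId rather than a ShuffleBlockId, which is exactly the case diagnoseCorruption did not handle.

{code:scala}
// Hedged illustration: parse the block-id string from the error with Spark's
// BlockId factory; a four-number shuffle id yields a ShuffleBlockBatchId.
import org.apache.spark.storage.BlockId

val id = BlockId("shuffle_12_5868_518_523")
println(id.getClass.getSimpleName)  // expected: ShuffleBlockBatchId
{code}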
[jira] [Created] (SPARK-49444) Univocity parser handles ArrayIndexOutOfBounds exception
Vladan Vasić created SPARK-49444:
-
Summary: Univocity parser handles ArrayIndexOutOfBounds exception
Key: SPARK-49444
URL: https://issues.apache.org/jira/browse/SPARK-49444
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.3
Reporter: Vladan Vasić

The current implementation of `UnivocityParser` throws an `ArrayIndexOutOfBounds` exception when parsing a CSV record with more columns than the maximum set in the options. This case was reproduced in the `UnivocityParserSuite`.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
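For readers who want to see how the limit in question is reached from the public CSV reader, here is a hedged sketch. The input data and the cap of 3 are illustrative placeholders, and whether the failure surfaces exactly as an ArrayIndexOutOfBounds error depends on the internal code path this ticket describes.

{code:scala}
// Hedged sketch: "maxColumns" is forwarded to the underlying univocity parser,
// so a record with more fields than the limit trips the parser. The data and
// the cap of 3 are placeholders, not taken from the ticket.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("univocity-max-columns-sketch").getOrCreate()
import spark.implicits._

// Header has 3 columns; the second record has 4, exceeding maxColumns = 3.
val csvLines = Seq("a,b,c", "1,2,3,4").toDS()

val df = spark.read
  .option("header", "true")
  .option("maxColumns", "3")
  .option("mode", "PERMISSIVE")  // the improvement is to degrade gracefully here
  .csv(csvLines)

df.show()
{code}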
[jira] [Assigned] (SPARK-49119) Fix the inconsistency of syntax `show columns` between v1 and v2
[ https://issues.apache.org/jira/browse/SPARK-49119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49119: -- Assignee: Apache Spark > Fix the inconsistency of syntax `show columns` between v1 and v2 > > > Key: SPARK-49119 > URL: https://issues.apache.org/jira/browse/SPARK-49119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49444) Univocity parser handles ArrayIndexOutOfBounds exception
[ https://issues.apache.org/jira/browse/SPARK-49444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49444: --- Labels: pull-request-available (was: ) > Univocity parser handles ArrayIndexOutOfBounds exception > > > Key: SPARK-49444 > URL: https://issues.apache.org/jira/browse/SPARK-49444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Vladan Vasić >Priority: Minor > Labels: pull-request-available > > The current implementation of `UnivocityParser` throws > `ArrayIndexOutOfBounds` exception when parsing a csv record with more columns > than set in options as maximum. This case was reproduced in the > `UnivocityParserSuite`. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49444) Univocity parser handles ArrayIndexOutOfBounds exception
[ https://issues.apache.org/jira/browse/SPARK-49444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49444: -- Assignee: Apache Spark > Univocity parser handles ArrayIndexOutOfBounds exception > > > Key: SPARK-49444 > URL: https://issues.apache.org/jira/browse/SPARK-49444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Vladan Vasić >Assignee: Apache Spark >Priority: Minor > Labels: pull-request-available > > The current implementation of `UnivocityParser` throws > `ArrayIndexOutOfBounds` exception when parsing a csv record with more columns > than set in options as maximum. This case was reproduced in the > `UnivocityParserSuite`. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49444) Univocity parser handles ArrayIndexOutOfBounds exception
[ https://issues.apache.org/jira/browse/SPARK-49444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-49444: -- Assignee: (was: Apache Spark) > Univocity parser handles ArrayIndexOutOfBounds exception > > > Key: SPARK-49444 > URL: https://issues.apache.org/jira/browse/SPARK-49444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Vladan Vasić >Priority: Minor > Labels: pull-request-available > > The current implementation of `UnivocityParser` throws > `ArrayIndexOutOfBounds` exception when parsing a csv record with more columns > than set in options as maximum. This case was reproduced in the > `UnivocityParserSuite`. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-49444) Univocity parser handles ArrayIndexOutOfBounds exception
[ https://issues.apache.org/jira/browse/SPARK-49444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877301#comment-17877301 ] ASF GitHub Bot commented on SPARK-49444: User 'vladanvasi-db' has created a pull request for this issue: https://github.com/apache/spark/pull/47906 > Univocity parser handles ArrayIndexOutOfBounds exception > > > Key: SPARK-49444 > URL: https://issues.apache.org/jira/browse/SPARK-49444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Vladan Vasić >Priority: Minor > > The current implementation of `UnivocityParser` throws > `ArrayIndexOutOfBounds` exception when parsing a csv record with more columns > than set in options as maximum. This case was reproduced in the > `UnivocityParserSuite`. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-49444) Univocity parser handles ArrayIndexOutOfBounds exception
[ https://issues.apache.org/jira/browse/SPARK-49444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877302#comment-17877302 ] ASF GitHub Bot commented on SPARK-49444: User 'vladanvasi-db' has created a pull request for this issue: https://github.com/apache/spark/pull/47906 > Univocity parser handles ArrayIndexOutOfBounds exception > > > Key: SPARK-49444 > URL: https://issues.apache.org/jira/browse/SPARK-49444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Vladan Vasić >Priority: Minor > Labels: pull-request-available > > The current implementation of `UnivocityParser` throws > `ArrayIndexOutOfBounds` exception when parsing a csv record with more columns > than set in options as maximum. This case was reproduced in the > `UnivocityParserSuite`. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-49442) Complete Metadata requests on each micro batch causing Kafka issues
[ https://issues.apache.org/jira/browse/SPARK-49442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877314#comment-17877314 ] vipin Kumar commented on SPARK-49442: - Hi [~kabhwan]
*I don't know whether the massive requests are from driver vs executor.* We are seeing these requests from all the executors, and they are evenly distributed.
*SQL config "spark.sql.streaming.kafka.useDeprecatedOffsetFetching" to "false"?* Setting this has no effect on the requests.
> Complete Metadata requests on each micro batch causing Kafka issues
> ---
>
> Key: SPARK-49442
> URL: https://issues.apache.org/jira/browse/SPARK-49442
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.3.2
> Reporter: vipin Kumar
> Priority: Major
> Labels: Kafka, spark-streaming-kafka
>
> We have noticed that Spark issues complete metadata requests on each micro batch, which causes a high volume of metadata requests at small micro-batch intervals.
>
> For example, with a Kafka topic of 1900 partitions and a 10-second micro batch we see on the order of ~{*}360K{*} metadata requests / sec.
> With the same job at a 60-second micro batch we observe *~60K* metadata requests.
>
> Metadata requests are controlled by *metadata.max.age.ms* (5 minutes by default), but this config has no effect on the Spark consumers; we still see this huge number of requests.
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
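For reference, the configuration mentioned above is toggled as shown below; "false" selects the AdminClient-based (non-deprecated) offset-fetching path, and whether that changes the metadata request volume is precisely what this ticket is discussing. This is a sketch only, assuming a SparkSession named spark is already in scope.

{code:scala}
// Sketch only: switch Structured Streaming's Kafka offset fetching to the
// AdminClient-based (non-deprecated) path before starting the query.
spark.conf.set("spark.sql.streaming.kafka.useDeprecatedOffsetFetching", "false")
{code}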
[jira] [Updated] (SPARK-46037) When Left Join build Left, ShuffledHashJoinExec may result in incorrect results
[ https://issues.apache.org/jira/browse/SPARK-46037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-46037: - Priority: Blocker (was: Minor) > When Left Join build Left, ShuffledHashJoinExec may result in incorrect > results > --- > > Key: SPARK-46037 > URL: https://issues.apache.org/jira/browse/SPARK-46037 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: mcdull_zhang >Priority: Blocker > Labels: correctness, pull-request-available > > When Left Join build Left and codegen is turned off, ShuffledHashJoinExec may > have incorrect results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49445) Support show tooltip in the progress bar of UI
dzcxzl created SPARK-49445: -- Summary: Support show tooltip in the progress bar of UI Key: SPARK-49445 URL: https://issues.apache.org/jira/browse/SPARK-49445 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 4.0.0 Reporter: dzcxzl -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49445) Support show tooltip in the progress bar of UI
[ https://issues.apache.org/jira/browse/SPARK-49445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49445: --- Labels: pull-request-available (was: ) > Support show tooltip in the progress bar of UI > -- > > Key: SPARK-49445 > URL: https://issues.apache.org/jira/browse/SPARK-49445 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 4.0.0 >Reporter: dzcxzl >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-49409) CONNECT_SESSION_PLAN_CACHE_SIZE is too small for certain programming patterns
[ https://issues.apache.org/jira/browse/SPARK-49409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877366#comment-17877366 ] Changgyoo Park commented on SPARK-49409: Yes, because there is "a" case where 5 is insufficient: unrelated data frames are interleaved between very complicated dependent data frames. I'm pretty sure that just increasing the default value is not the best idea; ideally, the analysed plan should be stored on the client side (this will be super difficult, I know), removing the plan cache completely. Until then, increasing it to ~16 would cover many more cases.
> CONNECT_SESSION_PLAN_CACHE_SIZE is too small for certain programming patterns
> -
>
> Key: SPARK-49409
> URL: https://issues.apache.org/jira/browse/SPARK-49409
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Changgyoo Park
> Priority: Major
>
> Example:
>
> ```
> df_1 = df_a.filter(col('X').isNotNull())
> df_2 = df_b.filter(col('SAFE_SU_Conv').isNotNull())
>
> df_x = ...
> for _ in range(0, 5):
>     df_x = df_x.select(...)
> ...
> df_3 = df_1.join(df_2, ...)
> ```
> => df_x completely invalidates all the cached entries.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49029) Create a shared interface for Dataset
[ https://issues.apache.org/jira/browse/SPARK-49029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-49029. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47882 [https://github.com/apache/spark/pull/47882] > Create a shared interface for Dataset > - > > Key: SPARK-49029 > URL: https://issues.apache.org/jira/browse/SPARK-49029 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Create a shared Dataset interface in org.apache.spark.sql.api. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34638) Spark SQL reads unnecessary nested fields (another type of pruning case)
[ https://issues.apache.org/jira/browse/SPARK-34638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877372#comment-17877372 ] Jiri Humpolicek commented on SPARK-34638: - Hi, I have now tested a similar example in spark-3.5.1, but I suppose the result will be the same in all versions after the fix in 3.2.0. Example:
1) Loading data
{code:scala}
val jsonStr = """{
  "items": [
    {"itemId": 1, "itemId2": 1, "itemData": "a"},
    {"itemId": 2, "itemId2": 2, "itemData": "b"}
  ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}
2) Read query with explain
{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select(explode($"items").as('item)).select($"item.itemId", $"item.itemData").explain(true)
// ReadSchema: struct>>
{code}
So it seems that when I use more than one field from the structure after explode, the resulting query reads the whole structure instead of only the fields I accessed.
> Spark SQL reads unnecessary nested fields (another type of pruning case)
>
>
> Key: SPARK-34638
> URL: https://issues.apache.org/jira/browse/SPARK-34638
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Jiri Humpolicek
> Assignee: L. C. Hsieh
> Priority: Major
> Fix For: 3.2.0
>
> Based on this [SPARK-29721|https://issues.apache.org/jira/browse/SPARK-29721]
> I found another nested fields pruning case.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
>   "items": [
>     {"itemId": 1, "itemData": "a"},
>     {"itemId": 2, "itemData": "b"}
>   ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select(explode($"items").as('item)).select($"item.itemId").explain(true)
> // ReadSchema: struct>>
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49446) Upgrade jetty to 11.0.23
Yang Jie created SPARK-49446: Summary: Upgrade jetty to 11.0.23 Key: SPARK-49446 URL: https://issues.apache.org/jira/browse/SPARK-49446 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49446) Upgrade jetty to 11.0.23
[ https://issues.apache.org/jira/browse/SPARK-49446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49446: --- Labels: pull-request-available (was: ) > Upgrade jetty to 11.0.23 > > > Key: SPARK-49446 > URL: https://issues.apache.org/jira/browse/SPARK-49446 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-49442) Complete Metadata requests on each micro batch causing Kafka issues
[ https://issues.apache.org/jira/browse/SPARK-49442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877397#comment-17877397 ] Jungtaek Lim commented on SPARK-49442: -- OK, that's unrelated. We haven't received any other report of this kind of issue. I recommend providing a minimal reproducer, e.g. an Apache Spark cluster and an Apache Kafka cluster (no vendor version and no cloud service version) with 3-5 topic partitions, then increasing the number of topic partitions and demonstrating that the metadata requests increase linearly, along with a detailed explanation of how you capture the requests. If you are relying on a vendor rather than building the cluster on your own, it would be best to contact their support.
> Complete Metadata requests on each micro batch causing Kafka issues
> ---
>
> Key: SPARK-49442
> URL: https://issues.apache.org/jira/browse/SPARK-49442
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.3.2
> Reporter: vipin Kumar
> Priority: Major
> Labels: Kafka, spark-streaming-kafka
>
> We have noticed that Spark issues complete metadata requests on each micro batch, which causes a high volume of metadata requests at small micro-batch intervals.
>
> For example, with a Kafka topic of 1900 partitions and a 10-second micro batch we see on the order of ~{*}360K{*} metadata requests / sec.
> With the same job at a 60-second micro batch we observe *~60K* metadata requests.
>
> Metadata requests are controlled by *metadata.max.age.ms* (5 minutes by default), but this config has no effect on the Spark consumers; we still see this huge number of requests.
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45745) Extremely slow execution of sum of columns in Spark 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-45745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins resolved SPARK-45745. --- Resolution: Duplicate > Extremely slow execution of sum of columns in Spark 3.4.1 > - > > Key: SPARK-45745 > URL: https://issues.apache.org/jira/browse/SPARK-45745 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.1 >Reporter: Javier >Priority: Major > > We are in the process of upgrading some pySpark jobs from Spark 3.1.2 to > Spark 3.4.1 and some code that was running fine is now basically never ending > even for small dataframes. > We have simplified the problematic piece of code and the minimum pySpark > example below shows the issue: > {code:java} > n_cols = 50 > data = [{f"col{i}": i for i in range(n_cols)} for _ in range(5)] > df_data = sql_context.createDataFrame(data) > df_data = df_data.withColumn( > "col_sum", sum([F.col(f"col{i}") for i in range(n_cols)]) > ) > df_data.show(10, False) {code} > Basically, this code with Spark 3.1.2 runs fine but with 3.4.1 the > computation time seems to explode when the value of `n_cols` is bigger than > about 25 columns. A colleague suggested that it could be related to the limit > of 22 elements in a tuple in Scala 2.13 > (https://www.scala-lang.org/api/current/scala/Tuple22.html), since the 25 > columns are suspiciously close to this. Is there any known defect in the > logical plan optimization in 3.4.1? Or is this kind of operations (sum of > multiple columns) supposed to be implemented differently? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49313) Upgrade `DB2` & `MySQL` & `Postgres` & `Mariadb` docker image version
[ https://issues.apache.org/jira/browse/SPARK-49313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-49313: - Assignee: BingKun Pan > Upgrade `DB2` & `MySQL` & `Postgres` & `Mariadb` docker image version > - > > Key: SPARK-49313 > URL: https://issues.apache.org/jira/browse/SPARK-49313 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49313) Upgrade `DB2` & `MySQL` & `Postgres` & `Mariadb` docker image version
[ https://issues.apache.org/jira/browse/SPARK-49313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-49313. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47809 [https://github.com/apache/spark/pull/47809] > Upgrade `DB2` & `MySQL` & `Postgres` & `Mariadb` docker image version > - > > Key: SPARK-49313 > URL: https://issues.apache.org/jira/browse/SPARK-49313 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49399) Add examples for different Spark image types
[ https://issues.apache.org/jira/browse/SPARK-49399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-49399. --- Fix Version/s: kubernetes-operator-0.1.0 Resolution: Fixed Issue resolved by pull request 108 [https://github.com/apache/spark-kubernetes-operator/pull/108] > Add examples for different Spark image types > > > Key: SPARK-49399 > URL: https://issues.apache.org/jira/browse/SPARK-49399 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Assignee: Zhou JIANG >Priority: Major > Labels: pull-request-available > Fix For: kubernetes-operator-0.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49399) Add examples for different Spark image types
[ https://issues.apache.org/jira/browse/SPARK-49399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-49399: - Assignee: Zhou JIANG > Add examples for different Spark image types > > > Key: SPARK-49399 > URL: https://issues.apache.org/jira/browse/SPARK-49399 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Assignee: Zhou JIANG >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49399) Add `pi-scala.yaml` and `pyspark-pi.yaml`
[ https://issues.apache.org/jira/browse/SPARK-49399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-49399: -- Summary: Add `pi-scala.yaml` and `pyspark-pi.yaml` (was: Add examples for different Spark image types) > Add `pi-scala.yaml` and `pyspark-pi.yaml` > - > > Key: SPARK-49399 > URL: https://issues.apache.org/jira/browse/SPARK-49399 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Assignee: Zhou JIANG >Priority: Major > Labels: pull-request-available > Fix For: kubernetes-operator-0.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49447) Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less than 100
[ https://issues.apache.org/jira/browse/SPARK-49447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-49447: -- Description: The default value is `1s` (=1000 ms). Usually, a small value like `1` happens when users make a mistake and forget to add the unit, `s`. > Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less > than 100 > --- > > Key: SPARK-49447 > URL: https://issues.apache.org/jira/browse/SPARK-49447 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > The default value is `1s` (=1000 ms). Usually, a small value like `1` happens > when users make a mistake and forget to add the unit, `s`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
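As a small illustration of the failure mode described above (a sketch, not part of the linked change): leaving out the unit produces a tiny delay, while the intended value carries the `s` suffix.
{code:scala}
import org.apache.spark.SparkConf

// Sketch: the mistake vs. the intended setting.
val mistaken = new SparkConf().set("spark.kubernetes.allocation.batch.delay", "1")  // no unit; read as 1 ms per the description
val intended = new SparkConf().set("spark.kubernetes.allocation.batch.delay", "1s") // the documented 1 s default, made explicit
{code}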
[jira] [Updated] (SPARK-49447) Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less than 100
[ https://issues.apache.org/jira/browse/SPARK-49447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49447: --- Labels: pull-request-available (was: ) > Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less > than 100 > --- > > Key: SPARK-49447 > URL: https://issues.apache.org/jira/browse/SPARK-49447 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > The default value is `1s` (=1000). Usually, a small value like `1` happens > when users do mistakes and forget to add the unit, `s`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49447) Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less than 100
[ https://issues.apache.org/jira/browse/SPARK-49447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-49447: - Assignee: Dongjoon Hyun > Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less > than 100 > --- > > Key: SPARK-49447 > URL: https://issues.apache.org/jira/browse/SPARK-49447 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > The default value is `1s` (=1000). Usually, a small value like `1` happens > when users do mistakes and forget to add the unit, `s`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48781) Add Catalog APIs for loading stored procedures
[ https://issues.apache.org/jira/browse/SPARK-48781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48781. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47190 [https://github.com/apache/spark/pull/47190] > Add Catalog APIs for loading stored procedures > -- > > Key: SPARK-48781 > URL: https://issues.apache.org/jira/browse/SPARK-48781 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Add new connector catalog APIs for loading stored procedures. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48781) Add Catalog APIs for loading stored procedures
[ https://issues.apache.org/jira/browse/SPARK-48781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48781: - Assignee: Anton Okolnychyi > Add Catalog APIs for loading stored procedures > -- > > Key: SPARK-48781 > URL: https://issues.apache.org/jira/browse/SPARK-48781 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Labels: pull-request-available > > Add new connector catalog APIs for loading stored procedures. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41262) Enable canChangeCachedPlanOutputPartitioning by default
[ https://issues.apache.org/jira/browse/SPARK-41262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-41262: --- Labels: pull-request-available (was: ) > Enable canChangeCachedPlanOutputPartitioning by default > --- > > Key: SPARK-41262 > URL: https://issues.apache.org/jira/browse/SPARK-41262 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > Labels: pull-request-available > > Remove the `internal` tag of > `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning`, and tune it from > false to true by default to make AQE work with cached plan. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48400) Promote `PrometheusServlet` to `DeveloperApi`
[ https://issues.apache.org/jira/browse/SPARK-48400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48400. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46716 [https://github.com/apache/spark/pull/46716] > Promote `PrometheusServlet` to `DeveloperApi` > - > > Key: SPARK-48400 > URL: https://issues.apache.org/jira/browse/SPARK-48400 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Assignee: Zhou JIANG >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45923) Spark Kubernetes Operator
[ https://issues.apache.org/jira/browse/SPARK-45923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45923. --- Fix Version/s: 4.0.0 Resolution: Fixed > Spark Kubernetes Operator > - > > Key: SPARK-45923 > URL: https://issues.apache.org/jira/browse/SPARK-45923 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou Jiang >Assignee: Zhou Jiang >Priority: Major > Labels: SPIP, pull-request-available > Fix For: 4.0.0 > > > We would like to develop a Java-based Kubernetes operator for Apache Spark. > Following the operator pattern > (https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark > users may manage applications and related components seamlessly using native > tools like kubectl. The primary goal is to simplify the Spark user experience > on Kubernetes, minimizing the learning curve and operational complexities and > therefore enable users to focus on the Spark application development. > Ideally, it would reside in a separate repository (like Spark docker or Spark > connect golang) and be loosely connected to the Spark release cycle while > supporting multiple Spark versions. > SPIP doc: > [https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE|https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE/edit#heading=h.hhham7siu2vi] > Dev email discussion : > [https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45923) Spark Kubernetes Operator
[ https://issues.apache.org/jira/browse/SPARK-45923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45923: -- Labels: SPIP releasenotes (was: SPIP pull-request-available) > Spark Kubernetes Operator > - > > Key: SPARK-45923 > URL: https://issues.apache.org/jira/browse/SPARK-45923 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou Jiang >Assignee: Zhou Jiang >Priority: Major > Labels: SPIP, releasenotes > Fix For: 4.0.0 > > > We would like to develop a Java-based Kubernetes operator for Apache Spark. > Following the operator pattern > (https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark > users may manage applications and related components seamlessly using native > tools like kubectl. The primary goal is to simplify the Spark user experience > on Kubernetes, minimizing the learning curve and operational complexities and > therefore enable users to focus on the Spark application development. > Ideally, it would reside in a separate repository (like Spark docker or Spark > connect golang) and be loosely connected to the Spark release cycle while > supporting multiple Spark versions. > SPIP doc: > [https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE|https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE/edit#heading=h.hhham7siu2vi] > Dev email discussion : > [https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49448) Spark Connect ExecuteThreadRunner promise will always complete with success.
[ https://issues.apache.org/jira/browse/SPARK-49448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LIU updated SPARK-49448: Description: {code:java} //代码占位符 {code} private class ExecutionThread(onCompletionPromise: Promise[Unit]) extends Thread(s"SparkConnectExecuteThread_opId=${executeHolder.operationId}") \{ override def run(): Unit = { try { execute() onCompletionPromise.success(()) } catch \{ case NonFatal(e) => onCompletionPromise.failure(e) } } } execute method end with ErrorUtils.handleError() function call. if any excetion throw. it will not catch by promise. is it better to catch real exceptions with promises instead of. was: private class ExecutionThread(onCompletionPromise: Promise[Unit]) extends Thread(s"SparkConnectExecuteThread_opId=${executeHolder.operationId}") { override def run(): Unit = { try { execute() onCompletionPromise.success(()) } catch { case NonFatal(e) => onCompletionPromise.failure(e) } } } execute method end with ErrorUtils.handleError() function call. if any excetion throw. it will not catch by promise. is it better to catch real exceptions with promises instead of. > Spark Connect ExecuteThreadRunner promise will always complete with success. > > > Key: SPARK-49448 > URL: https://issues.apache.org/jira/browse/SPARK-49448 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: LIU >Priority: Minor > > {code:java} > //代码占位符 > {code} > private class ExecutionThread(onCompletionPromise: Promise[Unit]) extends > Thread(s"SparkConnectExecuteThread_opId=${executeHolder.operationId}") \{ > override def run(): Unit = { try { execute() onCompletionPromise.success(()) > } catch \{ case NonFatal(e) => onCompletionPromise.failure(e) } } } > > execute method end with ErrorUtils.handleError() function call. if any > excetion throw. it will not catch by promise. is it better to catch real > exceptions with promises instead of. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49448) Spark Connect ExecuteThreadRunner promise will always complete with success.
LIU created SPARK-49448: --- Summary: Spark Connect ExecuteThreadRunner promise will always complete with success. Key: SPARK-49448 URL: https://issues.apache.org/jira/browse/SPARK-49448 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: LIU private class ExecutionThread(onCompletionPromise: Promise[Unit]) extends Thread(s"SparkConnectExecuteThread_opId=${executeHolder.operationId}") { override def run(): Unit = { try { execute() onCompletionPromise.success(()) } catch { case NonFatal(e) => onCompletionPromise.failure(e) } } } execute method end with ErrorUtils.handleError() function call. if any excetion throw. it will not catch by promise. is it better to catch real exceptions with promises instead of. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49448) Spark Connect ExecuteThreadRunner promise will always complete with success.
[ https://issues.apache.org/jira/browse/SPARK-49448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LIU updated SPARK-49448: Description: {code:java} private class ExecutionThread(onCompletionPromise: Promise[Unit]) extends Thread(s"SparkConnectExecuteThread_opId=${executeHolder.operationId}") { override def run(): Unit = { try { execute() onCompletionPromise.success(()) } catch { case NonFatal(e) => onCompletionPromise.failure(e) } } }{code} The execute method ends with an ErrorUtils.handleError() call; if an exception is thrown there, it is not caught by the promise. Would it be better to complete the promise with the real exception instead? If wanted, I will submit this change. was: {code:java} //代码占位符 {code} private class ExecutionThread(onCompletionPromise: Promise[Unit]) extends Thread(s"SparkConnectExecuteThread_opId=${executeHolder.operationId}") \{ override def run(): Unit = { try { execute() onCompletionPromise.success(()) } catch \{ case NonFatal(e) => onCompletionPromise.failure(e) } } } execute method end with ErrorUtils.handleError() function call. if any excetion throw. it will not catch by promise. is it better to catch real exceptions with promises instead of. > Spark Connect ExecuteThreadRunner promise will always complete with success. > > > Key: SPARK-49448 > URL: https://issues.apache.org/jira/browse/SPARK-49448 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: LIU >Priority: Minor > > {code:java} > private class ExecutionThread(onCompletionPromise: Promise[Unit]) > extends > Thread(s"SparkConnectExecuteThread_opId=${executeHolder.operationId}") { > override def run(): Unit = { > try { > execute() > onCompletionPromise.success(()) > } catch { > case NonFatal(e) => > onCompletionPromise.failure(e) > } > } > }{code} > > The execute method ends with an ErrorUtils.handleError() call; if an exception > is thrown there, it is not caught by the promise. Would it be better to > complete the promise with the real exception instead? If wanted, I will submit this change. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
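A minimal, self-contained sketch of what the reporter appears to be proposing (the class and names below are illustrative, not the actual Spark Connect code): complete the promise with the real outcome of the work, so a thrown exception is propagated instead of being reported as success.
{code:scala}
import scala.concurrent.Promise
import scala.util.Try

// Illustrative analogue: the promise is completed with whatever the work produced,
// success or failure, rather than unconditionally with success.
class ExecutionThread(onCompletionPromise: Promise[Unit])(work: () => Unit)
    extends Thread("ExecuteThread") {
  override def run(): Unit = onCompletionPromise.complete(Try(work()))
}
{code}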
[jira] [Updated] (SPARK-49449) Remove string and binary from metadata in spec
[ https://issues.apache.org/jira/browse/SPARK-49449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49449: --- Labels: pull-request-available (was: ) > Remove string and binary from metadata in spec > -- > > Key: SPARK-49449 > URL: https://issues.apache.org/jira/browse/SPARK-49449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: David Cashman >Priority: Major > Labels: pull-request-available > > We never supported the string-from-metadata or binary-from-metadata. Remove > them for now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49450) Improve normalised collation names
[ https://issues.apache.org/jira/browse/SPARK-49450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihailo Milosevic updated SPARK-49450: -- Parent: (was: SPARK-46830) Issue Type: Improvement (was: Sub-task) > Improve normalised collation names > -- > > Key: SPARK-49450 > URL: https://issues.apache.org/jira/browse/SPARK-49450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49421) Create a shared RelationalGroupedDataset interface
[ https://issues.apache.org/jira/browse/SPARK-49421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49421: --- Labels: pull-request-available (was: ) > Create a shared RelationalGroupedDataset interface > -- > > Key: SPARK-49421 > URL: https://issues.apache.org/jira/browse/SPARK-49421 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 > Environment: Not sure if we should do this. Connect and Classic have > different semantics, so unification is a bit tricky. >Reporter: Herman van Hövell >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49421) Create a shared RelationalGroupedDataset interface
[ https://issues.apache.org/jira/browse/SPARK-49421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-49421: - Assignee: Herman van Hövell > Create a shared RelationalGroupedDataset interface > -- > > Key: SPARK-49421 > URL: https://issues.apache.org/jira/browse/SPARK-49421 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 > Environment: Not sure if we should do this. Connect and Classic have > different semantics, so unification is a bit tricky. >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49419) Create a shared DataFrameStatFunctions interface
[ https://issues.apache.org/jira/browse/SPARK-49419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-49419: - Assignee: Herman van Hövell > Create a shared DataFrameStatFunctions interface > > > Key: SPARK-49419 > URL: https://issues.apache.org/jira/browse/SPARK-49419 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49450) Improve normalised collation names
[ https://issues.apache.org/jira/browse/SPARK-49450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49450: --- Labels: pull-request-available (was: ) > Improve normalised collation names > -- > > Key: SPARK-49450 > URL: https://issues.apache.org/jira/browse/SPARK-49450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49422) Create a shared KeyValueGroupedDataset interface
[ https://issues.apache.org/jira/browse/SPARK-49422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell updated SPARK-49422: -- Description: This dhpou > Create a shared KeyValueGroupedDataset interface > > > Key: SPARK-49422 > URL: https://issues.apache.org/jira/browse/SPARK-49422 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 > Environment: Not sure if we should do this. Connect and Classic have > different semantics, so unification is a bit tricky. >Reporter: Herman van Hövell >Priority: Major > > This dhpou -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49422) Create a shared KeyValueGroupedDataset interface
[ https://issues.apache.org/jira/browse/SPARK-49422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell updated SPARK-49422: -- Environment: (was: Not sure if we should do this. Connect and Classic have different semantics, so unification is a bit tricky.) > Create a shared KeyValueGroupedDataset interface > > > Key: SPARK-49422 > URL: https://issues.apache.org/jira/browse/SPARK-49422 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Herman van Hövell >Priority: Major > > This dhpou -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49422) Create a shared KeyValueGroupedDataset interface
[ https://issues.apache.org/jira/browse/SPARK-49422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-49422: - Assignee: Herman van Hövell > Create a shared KeyValueGroupedDataset interface > > > Key: SPARK-49422 > URL: https://issues.apache.org/jira/browse/SPARK-49422 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > This dhpou -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49422) Create a shared KeyValueGroupedDataset interface
[ https://issues.apache.org/jira/browse/SPARK-49422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell updated SPARK-49422: -- Description: This should also implement RelationalGroupedDataset.as[K: Encoder, T: Encoder]: KeyValueGroupedDataset[K, T]. (was: This dhpou) > Create a shared KeyValueGroupedDataset interface > > > Key: SPARK-49422 > URL: https://issues.apache.org/jira/browse/SPARK-49422 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > This should also implement RelationalGroupedDataset.as[K: Encoder, T: > Encoder]: KeyValueGroupedDataset[K, T]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49451) Allow duplicate keys in parse_json.
Chenhao Li created SPARK-49451: -- Summary: Allow duplicate keys in parse_json. Key: SPARK-49451 URL: https://issues.apache.org/jira/browse/SPARK-49451 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Chenhao Li -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49423) Consolidate Observation into a single class in sql/api
[ https://issues.apache.org/jira/browse/SPARK-49423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-49423: - Assignee: Herman van Hövell > Consolidate Observation into a single class in sql/api > -- > > Key: SPARK-49423 > URL: https://issues.apache.org/jira/browse/SPARK-49423 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 > Environment: Not sure if we should do this. Connect and Classic have > different semantics, so unification is a bit tricky. >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Move the implementation specific bits out of the class, and only keep the > Observation class. While we are at it, let's also replace the homegrown > threading stuff by futures. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49451) Allow duplicate keys in parse_json.
[ https://issues.apache.org/jira/browse/SPARK-49451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49451: --- Labels: pull-request-available (was: ) > Allow duplicate keys in parse_json. > --- > > Key: SPARK-49451 > URL: https://issues.apache.org/jira/browse/SPARK-49451 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Chenhao Li >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49423) Consolidate Observation into a single class in sql/api
[ https://issues.apache.org/jira/browse/SPARK-49423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49423: --- Labels: pull-request-available (was: ) > Consolidate Observation into a single class in sql/api > -- > > Key: SPARK-49423 > URL: https://issues.apache.org/jira/browse/SPARK-49423 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 > Environment: Not sure if we should do this. Connect and Classic have > different semantics, so unification is a bit tricky. >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Labels: pull-request-available > > Move the implementation specific bits out of the class, and only keep the > Observation class. While we are at it, let's also replace the homegrown > threading stuff by futures. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49425) Create a shared DataFrameWriter interface
[ https://issues.apache.org/jira/browse/SPARK-49425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49425: --- Labels: pull-request-available (was: ) > Create a shared DataFrameWriter interface > - > > Key: SPARK-49425 > URL: https://issues.apache.org/jira/browse/SPARK-49425 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Herman van Hövell >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49447) Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less than 100
[ https://issues.apache.org/jira/browse/SPARK-49447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-49447. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47913 [https://github.com/apache/spark/pull/47913] > Fix `spark.kubernetes.allocation.batch.delay` to prevent small values less > than 100 > --- > > Key: SPARK-49447 > URL: https://issues.apache.org/jira/browse/SPARK-49447 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > The default value is `1s` (=1000). Usually, a small value like `1` happens > when users do mistakes and forget to add the unit, `s`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46995) Allow AQE coalesce final stage in SQL cached plan
[ https://issues.apache.org/jira/browse/SPARK-46995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46995: --- Labels: pull-request-available (was: ) > Allow AQE coalesce final stage in SQL cached plan > - > > Key: SPARK-46995 > URL: https://issues.apache.org/jira/browse/SPARK-46995 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Ziqi Liu >Priority: Major > Labels: pull-request-available > > [https://github.com/apache/spark/pull/43435] and > [https://github.com/apache/spark/pull/43760] are fixing a correctness issue > which will be triggered when AQE applied on cached query plan, specifically, > when AQE coalescing the final result stage of the cached plan. > > The current semantic of > {{spark.sql.optimizer.canChangeCachedPlanOutputPartitioning}} > ([source > code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L403-L411]): > * when true, we enable AQE, but disable coalescing final stage > ({*}default{*}) > * when false, we disable AQE > > But let’s revisit the semantic of this config: actually for caller the only > thing that matters is whether we change the output partitioning of the cached > plan. And we should only try to apply AQE if possible. Thus we want to > modify the semantic of > {{spark.sql.optimizer.canChangeCachedPlanOutputPartitioning}} > * when true, we enable AQE and allow coalescing final: this might lead to > perf regression, because it introduce extra shuffle > * when false, we enable AQE, but disable coalescing final stage. *(this is > actually the `true` semantic of old behavior)* > Also, to keep the default behavior unchanged, we might want to flip the > default value of > {{spark.sql.optimizer.canChangeCachedPlanOutputPartitioning}} to `false` > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
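For concreteness, a sketch of how a caller would exercise the flag discussed above (the configuration name comes from the description; the cached query itself is a made-up example): with the flag set to true, AQE is allowed to change the cached plan's output partitioning, including coalescing its final stage.
{code:scala}
import org.apache.spark.sql.functions.col

// Sketch: pick the behavior before the plan is cached.
spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")

val cached = spark.range(0, 1000000)
  .repartition(col("id"))   // shuffle whose final stage AQE may now coalesce
  .cache()

cached.count()              // materializes the cache under the chosen setting
{code}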
[jira] [Created] (SPARK-49453) spark-kubernetes-operator-dynamic-configuration ConfigMap Data Overriding Failure
Qi Tan created SPARK-49453: -- Summary: spark-kubernetes-operator-dynamic-configuration ConfigMap Data Overriding Failure Key: SPARK-49453 URL: https://issues.apache.org/jira/browse/SPARK-49453 Project: Spark Issue Type: Sub-task Components: Kubernetes Affects Versions: 4.0.0 Reporter: Qi Tan I have a values.yaml as below: operatorConfiguration: dynamicConfig: enable: true create: true data: spark.kubernetes.operator.watchedNamespaces: "default, spark-1" helm install spark-kubernetes-operator --create-namespace -f build-tools/helm/spark-kubernetes-operator/values.yaml -f tests/e2e/helm/dynamic-config-values.yaml build-tools/helm/spark-kubernetes-operator/ The generated ConfigMap data field does not contain the line spark.kubernetes.operator.watchedNamespaces: "default, spark-1". Note that if you run helm install with --dry-run, the record exists. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49454) Avoid double normalization in the cache process
Xinyi Yu created SPARK-49454: Summary: Avoid double normalization in the cache process Key: SPARK-49454 URL: https://issues.apache.org/jira/browse/SPARK-49454 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Xinyi Yu There is an issue introduced in [#46465|https://github.com/apache/spark/pull/46465], which is that normalization is applied twice during the cache process. Some normalization rules may not be idempotent, so applying them repeatedly may break the plan shape and cause an unexpected cache miss. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46037) When Left Join build Left, ShuffledHashJoinExec may result in incorrect results
[ https://issues.apache.org/jira/browse/SPARK-46037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46037. - Fix Version/s: 4.0.0 3.5.3 Resolution: Fixed Issue resolved by pull request 47905 [https://github.com/apache/spark/pull/47905] > When Left Join build Left, ShuffledHashJoinExec may result in incorrect > results > --- > > Key: SPARK-46037 > URL: https://issues.apache.org/jira/browse/SPARK-46037 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: mcdull_zhang >Assignee: mcdull_zhang >Priority: Blocker > Labels: correctness, pull-request-available > Fix For: 4.0.0, 3.5.3 > > > When Left Join build Left and codegen is turned off, ShuffledHashJoinExec may > have incorrect results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46037) When Left Join build Left, ShuffledHashJoinExec may result in incorrect results
[ https://issues.apache.org/jira/browse/SPARK-46037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-46037: --- Assignee: mcdull_zhang > When Left Join build Left, ShuffledHashJoinExec may result in incorrect > results > --- > > Key: SPARK-46037 > URL: https://issues.apache.org/jira/browse/SPARK-46037 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: mcdull_zhang >Assignee: mcdull_zhang >Priority: Blocker > Labels: correctness, pull-request-available > > When Left Join build Left and codegen is turned off, ShuffledHashJoinExec may > have incorrect results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49446) Upgrade jetty to 11.0.23
[ https://issues.apache.org/jira/browse/SPARK-49446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-49446. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47912 [https://github.com/apache/spark/pull/47912] > Upgrade jetty to 11.0.23 > > > Key: SPARK-49446 > URL: https://issues.apache.org/jira/browse/SPARK-49446 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42879) Spark SQL reads unnecessary nested fields
[ https://issues.apache.org/jira/browse/SPARK-42879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiri Humpolicek updated SPARK-42879: Affects Version/s: 3.5.2 > Spark SQL reads unnecessary nested fields > - > > Key: SPARK-42879 > URL: https://issues.apache.org/jira/browse/SPARK-42879 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2, 3.5.2 >Reporter: Jiri Humpolicek >Priority: Major > > When we use more than one field from structure after explode, all fields will > be read. > Example: > 1) Loading data > {code:scala} > val jsonStr = """{ > "items": [ >{"itemId": 1, "itemData1": "a", "itemData2": 11}, >{"itemId": 2, "itemData1": "b", "itemData2": 22} > ] > }""" > val df = spark.read.json(Seq(jsonStr).toDS) > df.write.format("parquet").mode("overwrite").saveAsTable("persisted") > {code} > 2) read query with explain > {code:scala} > val read = spark.table("persisted") > spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true) > read > .select(explode('items).as('item)) > .select($"item.itemId", $"item.itemData1") > .explain > // ReadSchema: > struct>> > {code} > We use only *itemId* and *itemData1* fields from structure in array, but read > schema contains *itemData2* field as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49455) Refactor `StagingInMemoryTableCatalog` to override the non-deprecated functions
Yang Jie created SPARK-49455: Summary: Refactor `StagingInMemoryTableCatalog` to override the non-deprecated functions Key: SPARK-49455 URL: https://issues.apache.org/jira/browse/SPARK-49455 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49455) Refactor `StagingInMemoryTableCatalog` to override the non-deprecated functions
[ https://issues.apache.org/jira/browse/SPARK-49455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49455: --- Labels: pull-request-available (was: ) > Refactor `StagingInMemoryTableCatalog` to override the non-deprecated > functions > --- > > Key: SPARK-49455 > URL: https://issues.apache.org/jira/browse/SPARK-49455 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49456) Spark website doesn't properly scroll to hash links
Neil Ramaswamy created SPARK-49456: -- Summary: Spark website doesn't properly scroll to hash links Key: SPARK-49456 URL: https://issues.apache.org/jira/browse/SPARK-49456 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 4.0.0 Reporter: Neil Ramaswamy On the version-specific Spark documentation, if you click a header, the page will scroll past the actual content, hiding it. For example, if you go to [this link|https://spark.apache.org/docs/latest/#downloading], you'll probably notice the page scroll past "Downloads". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34638) Spark SQL reads unnecessary nested fields (another type of pruning case)
[ https://issues.apache.org/jira/browse/SPARK-34638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877595#comment-17877595 ] Jiri Humpolicek commented on SPARK-34638: - [~viirya] Do you think it would be possible to do that? I think it will be great feature when spark reads only necessary fields from query in general way. In case of rich nested structures it could safe huge amount of resources. I found unresolved improvement for this more general case from last year https://issues.apache.org/jira/browse/SPARK-42879 . > Spark SQL reads unnecessary nested fields (another type of pruning case) > > > Key: SPARK-34638 > URL: https://issues.apache.org/jira/browse/SPARK-34638 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Jiri Humpolicek >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > Based on this [SPARK-29721|https://issues.apache.org/jira/browse/SPARK-29721] > I found another nested fields pruning case. > Example: > 1) Loading data > {code:scala} > val jsonStr = """{ > "items": [ >{"itemId": 1, "itemData": "a"}, >{"itemId": 2, "itemData": "b"} > ] > }""" > val df = spark.read.json(Seq(jsonStr).toDS) > df.write.format("parquet").mode("overwrite").saveAsTable("persisted") > {code} > 2) read query with explain > {code:scala} > val read = spark.table("persisted") > spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true) > read.select(explode($"items").as('item)).select($"item.itemId").explain(true) > // ReadSchema: struct>> > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34638) Spark SQL reads unnecessary nested fields (another type of pruning case)
[ https://issues.apache.org/jira/browse/SPARK-34638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877595#comment-17877595 ] Jiri Humpolicek edited comment on SPARK-34638 at 8/29/24 6:03 AM: -- [~viirya] Do you think it would be possible to do that? I think it will be great feature when spark reads only necessary fields from query in general way. In case of rich nested structures it could save huge amount of resources. I found unresolved improvement for this more general case from last year https://issues.apache.org/jira/browse/SPARK-42879 . was (Author: yuryn): [~viirya] Do you think it would be possible to do that? I think it will be great feature when spark reads only necessary fields from query in general way. In case of rich nested structures it could safe huge amount of resources. I found unresolved improvement for this more general case from last year https://issues.apache.org/jira/browse/SPARK-42879 . > Spark SQL reads unnecessary nested fields (another type of pruning case) > > > Key: SPARK-34638 > URL: https://issues.apache.org/jira/browse/SPARK-34638 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Jiri Humpolicek >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > Based on this [SPARK-29721|https://issues.apache.org/jira/browse/SPARK-29721] > I found another nested fields pruning case. > Example: > 1) Loading data > {code:scala} > val jsonStr = """{ > "items": [ >{"itemId": 1, "itemData": "a"}, >{"itemId": 2, "itemData": "b"} > ] > }""" > val df = spark.read.json(Seq(jsonStr).toDS) > df.write.format("parquet").mode("overwrite").saveAsTable("persisted") > {code} > 2) read query with explain > {code:scala} > val read = spark.table("persisted") > spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true) > read.select(explode($"items").as('item)).select($"item.itemId").explain(true) > // ReadSchema: struct>> > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49456) Spark website doesn't properly scroll to hash links
[ https://issues.apache.org/jira/browse/SPARK-49456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49456: --- Labels: pull-request-available (was: ) > Spark website doesn't properly scroll to hash links > > > Key: SPARK-49456 > URL: https://issues.apache.org/jira/browse/SPARK-49456 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Neil Ramaswamy >Priority: Major > Labels: pull-request-available > > On the version-specific Spark documentation, if you click a header, the page > will scroll past the actual content, hiding it. For example, if you go to > [this link|https://spark.apache.org/docs/latest/#downloading], you'll > probably notice the page scroll past "Downloads". > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49457) Remove uncommon curl option --retry-all-errors
Cheng Pan created SPARK-49457: - Summary: Remove uncommon curl option --retry-all-errors Key: SPARK-49457 URL: https://issues.apache.org/jira/browse/SPARK-49457 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49457) Remove uncommon curl option --retry-all-errors
[ https://issues.apache.org/jira/browse/SPARK-49457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49457: --- Labels: pull-request-available (was: ) > Remove uncommon curl option --retry-all-errors > -- > > Key: SPARK-49457 > URL: https://issues.apache.org/jira/browse/SPARK-49457 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49259) Size based partition creation during kafka read
[ https://issues.apache.org/jira/browse/SPARK-49259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49259: --- Labels: pull-request-available (was: ) > Size based partition creation during kafka read > --- > > Key: SPARK-49259 > URL: https://issues.apache.org/jira/browse/SPARK-49259 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Subham Singhal >Priority: Minor > Labels: pull-request-available > > Currently Spark + Kafka structured streaming provides the *minPartitions* config > to create more partitions than Kafka has. This is helpful for increasing > parallelism, but this value cannot be changed dynamically. > It would be better to increase Spark partitions dynamically based on input > size: if the input size is high, create more partitions. We can take *avg msg > size* and *maxBytesPerPartition* as input and dynamically create partitions > to handle varying loads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
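For context, a sketch of the existing static knob the description refers to; maxBytesPerPartition is only the proposed option and does not exist today. Broker and topic are placeholders.
{code:scala}
// Sketch: today's static minPartitions option on the Kafka source.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
  .option("subscribe", "events")                     // placeholder topic
  .option("minPartitions", "64")                     // fixed up front; cannot adapt to input size
  // .option("maxBytesPerPartition", "134217728")    // proposed in this ticket; not an existing option
  .load()
{code}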
[jira] [Updated] (SPARK-49453) spark-kubernetes-operator-dynamic-configuration ConfigMap Data Overriding Failure
[ https://issues.apache.org/jira/browse/SPARK-49453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49453: --- Labels: pull-request-available (was: ) > spark-kubernetes-operator-dynamic-configuration ConfigMap Data Overriding > Failure > - > > Key: SPARK-49453 > URL: https://issues.apache.org/jira/browse/SPARK-49453 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Qi Tan >Priority: Trivial > Labels: pull-request-available > > I have a value.yaml as below: > operatorConfiguration: > dynamicConfig: > enable: true > create: true > data: > spark.kubernetes.operator.watchedNamespaces: "default, spark-1" > helm install spark-kubernetes-operator --create-namespace -f > build-tools/helm/spark-kubernetes-operator/values.yaml -f > tests/e2e/helm/dynamic-config-values.yaml > build-tools/helm/spark-kubernetes-operator/ > The generated configmap data field does not contains the line > spark.kubernetes.operator.watchedNamespaces: "default, spark-1". Note that if > you run helm install --dry-run, the record exist -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org