[jira] [Commented] (SPARK-42755) Factor literal value conversion out to connect-common
[ https://issues.apache.org/jira/browse/SPARK-42755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699184#comment-17699184 ] Apache Spark commented on SPARK-42755: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40375 > Factor literal value conversion out to connect-common > - > > Key: SPARK-42755 > URL: https://issues.apache.org/jira/browse/SPARK-42755 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42755) Factor literal value conversion out to connect-common
[ https://issues.apache.org/jira/browse/SPARK-42755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42755: Assignee: Apache Spark
[jira] [Commented] (SPARK-42755) Factor literal value conversion out to connect-common
[ https://issues.apache.org/jira/browse/SPARK-42755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699183#comment-17699183 ] Apache Spark commented on SPARK-42755: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40375
[jira] [Assigned] (SPARK-42755) Factor literal value conversion out to connect-common
[ https://issues.apache.org/jira/browse/SPARK-42755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42755: Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-42755) Factor literal value conversion out to connect-common
[ https://issues.apache.org/jira/browse/SPARK-42755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-42755: -- Summary: Factor literal value conversion out to connect-common (was: Move literal value conversion to connect-common)
[jira] [Created] (SPARK-42755) Move literal value conversion to connect-common
Ruifeng Zheng created SPARK-42755: - Summary: Move literal value conversion to connect-common Key: SPARK-42755 URL: https://issues.apache.org/jira/browse/SPARK-42755 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Commented] (SPARK-42721) Add an Interceptor to log RPCs in connect-server
[ https://issues.apache.org/jira/browse/SPARK-42721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699153#comment-17699153 ] Apache Spark commented on SPARK-42721: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40374 > Add an Interceptor to log RPCs in connect-server > > > Key: SPARK-42721 > URL: https://issues.apache.org/jira/browse/SPARK-42721 > Project: Spark > Issue Type: Improvement > Components: Connect > Affects Versions: 3.5.0 > Reporter: Raghu Angadi > Assignee: Raghu Angadi > Priority: Major > Fix For: 3.4.1 > > > It would be useful to be able to log RPCs to the connect server during > development. It makes it simpler to see the flow of messages.
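The idea in SPARK-42721 — wrapping RPC handlers so every request/response pair is logged during development — can be sketched generically. This is illustrative Python, not Spark Connect's actual gRPC interceptor; the handler name `execute_plan` is hypothetical:

```python
import logging

logging.basicConfig(level=logging.INFO)

def logging_interceptor(handler):
    """Wrap an RPC handler so each request and response is logged.

    A plain-Python sketch of the interceptor idea; the real server-side
    implementation would hook into gRPC's ServerInterceptor machinery.
    """
    def wrapped(request):
        logging.info("RPC request: %r", request)
        response = handler(request)
        logging.info("RPC response: %r", response)
        return response
    return wrapped

# Hypothetical handler standing in for a Connect RPC endpoint.
def execute_plan(request):
    return {"status": "ok", "plan": request}

execute_plan = logging_interceptor(execute_plan)
```

Because the interceptor only observes and forwards, the handler's behavior is unchanged; the log lines simply surface the flow of messages.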
[jira] [Updated] (SPARK-42754) Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier
[ https://issues.apache.org/jira/browse/SPARK-42754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-42754: --- Description: In Spark 3.4.0 RC4, the Spark History Server's SQL tab incorrectly groups SQL executions when replaying event logs generated by older Spark versions. {*}Reproduction{*}: Run {{./bin/spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=eventlogs}} and execute three non-nested SQL queries: {code:java} sql("select * from range(10)").collect() sql("select * from range(20)").collect() sql("select * from range(30)").collect(){code} Exit the shell and use the Spark History Server to replay this application's UI. In the SQL tab I expect to see three separate queries, but Spark 3.4's history server incorrectly groups the second and third queries as nested queries of the first (see attached screenshot). {*}Root cause{*}: [https://github.com/apache/spark/pull/39268] / SPARK-41752 added a new *non-optional* {{rootExecutionId: Long}} field to the SparkListenerSQLExecutionStart case class. When JsonProtocol deserializes this event it uses the "ignore missing properties" Jackson deserialization option, causing the {{rootExecutionId}} field to be initialized with a default value of {{0}}. The value {{0}} is a legitimate execution ID, so in the deserialized event we have no way to distinguish between the absence of a value and a case where all queries have the first query as the root. {*}Proposed fix{*}: I think we should change this field to be of type {{Option[Long]}}. I believe this is a release blocker for Spark 3.4.0 because we cannot change the type of this new field in a future release without breaking binary compatibility.
[jira] [Created] (SPARK-42754) Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier
Josh Rosen created SPARK-42754: -- Summary: Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier Key: SPARK-42754 URL: https://issues.apache.org/jira/browse/SPARK-42754 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Josh Rosen Attachments: example.png
[jira] [Updated] (SPARK-42754) Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier
[ https://issues.apache.org/jira/browse/SPARK-42754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-42754: --- Attachment: example.png
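The SPARK-42754 root cause — a non-optional field silently defaulting to a legitimate ID — can be illustrated outside of Spark. Here Python dataclasses stand in for the Scala case class and for Jackson's ignore-missing-properties behavior; the class names mirror the ticket but the code is only a sketch of the ambiguity, not Spark's JsonProtocol:

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class SQLExecutionStartV1:
    # Non-optional field as added by SPARK-41752: when JSON written by
    # Spark 3.3 lacks rootExecutionId, deserialization falls back to a
    # default of 0 -- which is itself a legitimate execution ID.
    executionId: int
    rootExecutionId: int = 0

@dataclass
class SQLExecutionStartV2:
    # Proposed fix: Option[Long] in Scala, modeled here as Optional[int],
    # keeps "absent" distinguishable from "root is execution 0".
    executionId: int
    rootExecutionId: Optional[int] = None

# Event-log entry written by an older Spark: no rootExecutionId field.
old_event = json.loads('{"executionId": 3}')

v1 = SQLExecutionStartV1(**old_event)
v2 = SQLExecutionStartV2(**old_event)
assert v1.rootExecutionId == 0      # indistinguishable from a real root ID of 0
assert v2.rootExecutionId is None   # absence is preserved
```

With the `Long` field, every replayed pre-3.4 query claims execution 0 as its root, which is exactly the incorrect nesting seen in the SQL tab; with `Option[Long]`, `None` signals "no root recorded."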
[jira] [Commented] (SPARK-42753) ReusedExchange refers to non-existent node
[ https://issues.apache.org/jira/browse/SPARK-42753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699148#comment-17699148 ] ming95 commented on SPARK-42753: [~steven.chen] Can you provide the reproduction code?
[jira] [Resolved] (SPARK-42721) Add an Interceptor to log RPCs in connect-server
[ https://issues.apache.org/jira/browse/SPARK-42721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42721. --- Fix Version/s: 3.4.1 (was: 3.5.0) Resolution: Fixed
[jira] [Assigned] (SPARK-42721) Add an Interceptor to log RPCs in connect-server
[ https://issues.apache.org/jira/browse/SPARK-42721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-42721: - Assignee: Raghu Angadi
[jira] [Commented] (SPARK-42749) CAST(x as int) does not generate error with overflow
[ https://issues.apache.org/jira/browse/SPARK-42749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699137#comment-17699137 ] Yuming Wang commented on SPARK-42749: - Please enable ANSI mode: {code:sql} spark-sql (default)> set spark.sql.ansi.enabled=true; spark.sql.ansi.enabled true Time taken: 0.088 seconds, Fetched 1 row(s) spark-sql (default)> select cast(7.415246799222789E19 as int); [CAST_OVERFLOW] The value 7.415246799222789E19D of the type "DOUBLE" cannot be cast to "INT" due to an overflow. Use `try_cast` to tolerate overflow and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW] The value 7.415246799222789E19D of the type "DOUBLE" cannot be cast to "INT" due to an overflow. Use `try_cast` to tolerate overflow and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. {code} > CAST(x as int) does not generate error with overflow > > > Key: SPARK-42749 > URL: https://issues.apache.org/jira/browse/SPARK-42749 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.2.1, 3.3.0, 3.3.1, 3.3.2 > Environment: It was tested on a Databricks environment with DBR 10.4 > and above, running Spark v3.2.1 and above. > Reporter: Tjomme Vergauwen > Priority: Major > Attachments: Spark-42749.PNG > > > Hi, > When performing the following code: > {{select cast(7.415246799222789E19 as int)}} > according to the documentation, an error is expected as > {{7.415246799222789E19}} is an overflow value for datatype INT. > However, the value 2147483647 is returned. > The behaviour of the following is correct as it returns NULL: > {{select try_cast(7.415246799222789E19 as int)}} > This results in unexpected behaviour and data corruption.
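The contrast between the two modes in SPARK-42749 can be sketched as a toy model: with ANSI mode off the cast clamps to the INT range (which is why the query returns 2147483647), while ANSI mode raises on overflow. This is a simplified illustration in Python, not Spark's actual cast implementation:

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def legacy_cast_double_to_int(x: float) -> int:
    """Sketch of the non-ANSI (legacy) cast: clamp out-of-range values
    to the INT bounds instead of raising, so 7.415246799222789E19
    comes back as 2147483647."""
    if x > INT_MAX:
        return INT_MAX
    if x < INT_MIN:
        return INT_MIN
    return int(x)  # in-range doubles truncate toward zero

def ansi_cast_double_to_int(x: float) -> int:
    """Sketch of the ANSI-mode cast: overflow raises instead of clamping."""
    if not (INT_MIN <= x <= INT_MAX):
        raise ArithmeticError(f"[CAST_OVERFLOW] {x} cannot be cast to INT")
    return int(x)

assert legacy_cast_double_to_int(7.415246799222789e19) == 2147483647
```

`try_cast` corresponds to a third variant that returns NULL (`None`) on overflow, which is why it behaves correctly in the reporter's example.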
[jira] [Updated] (SPARK-42753) ReusedExchange refers to non-existent node
[ https://issues.apache.org/jira/browse/SPARK-42753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Chen updated SPARK-42753: Summary: ReusedExchange refers to non-existent node (was: ReusedExchange refers to non-existen node)
[jira] [Updated] (SPARK-42753) ReusedExchange refers to non-existen node
[ https://issues.apache.org/jira/browse/SPARK-42753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Chen updated SPARK-42753: Description: There is an AQE issue where, during AQE planning, the Exchange that is being reused can be replaced in the plan tree. So, when we print the query plan, the ReusedExchange will refer to an "unknown" Exchange. An example below: {code:java} (2775) ReusedExchange [Reuses operator id: unknown] Output [3]: [sr_customer_sk#271, sr_store_sk#275, sum#377L]{code} Below is an example to demonstrate the root cause: {code:java} AdaptiveSparkPlan |-- SomeNode X (subquery xxx) |-- Exchange A |-- SomeNode Y |-- Exchange B Subquery:Hosting operator = SomeNode Hosting Expression = xxx dynamicpruning#388 AdaptiveSparkPlan |-- SomeNode M |-- Exchange C |-- SomeNode N |-- Exchange D {code} Step 1: Exchange B is materialized and the QueryStage is added to the stage cache. Step 2: Exchange D reuses Exchange B. Step 3: Exchange C is materialized and the QueryStage is added to the stage cache. Step 4: Exchange A reuses Exchange C. Then the final plan looks like: {code:java} AdaptiveSparkPlan |-- SomeNode X (subquery xxx) |-- Exchange A -> ReusedExchange (reuses Exchange C) Subquery:Hosting operator = SomeNode Hosting Expression = xxx dynamicpruning#388 AdaptiveSparkPlan |-- SomeNode M |-- Exchange C -> PhotonShuffleMapStage |-- SomeNode N |-- Exchange D -> ReusedExchange (reuses Exchange B) {code} As a result, the ReusedExchange (reuses Exchange B) will refer to a non-existent node. This *DOES NOT* affect query execution but will cause the query visualization to malfunction in the following ways: # The ReusedExchange child subtree will still appear in the Spark UI graph but will contain no node IDs. # The ReusedExchange node details in the Explain plan will refer to an UNKNOWN node. Example below. {code:java} (2775) ReusedExchange [Reuses operator id: unknown]{code} # The child exchange and its subtree may be missing from the Explain text completely. No node details or tree string shown.
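The four steps in the SPARK-42753 description can be simulated with a toy stage cache. This is plain Python modeling only the bookkeeping, not Spark's AQE internals; the node labels match the ticket's example but everything else is illustrative:

```python
# Toy model of AQE's stage cache: a materialized exchange is cached under
# its canonicalized form, and a later semantically-identical exchange is
# replaced by a ReusedExchange pointing at the cached node.
stage_cache = {}    # canonical form -> id of the materialized exchange
reuse_target = {}   # ReusedExchange id -> id of the node it reuses

# Main plan holds Exchange A (with Exchange B in its subtree); the subquery
# plan holds Exchange C (with Exchange D). A/C canonicalize alike, as do B/D.
stage_cache["canon_BD"] = "B"                 # step 1: B materialized, cached
reuse_target["D"] = stage_cache["canon_BD"]   # step 2: D -> ReusedExchange(B)
stage_cache["canon_AC"] = "C"                 # step 3: C materialized, cached
reuse_target["A"] = stage_cache["canon_AC"]   # step 4: A -> ReusedExchange(C)

# Replacing A dropped its entire subtree -- including B -- from the final
# plan, so D's reuse target no longer appears anywhere in the printed plan.
final_plan_nodes = {"A", "C", "D"}  # A and D are now ReusedExchange nodes
dangling = {node for node, target in reuse_target.items()
            if target not in final_plan_nodes}
assert dangling == {"D"}  # D still points at the vanished Exchange B
```

Execution is unaffected because D can still read B's materialized shuffle output; only the plan printer, which resolves operator IDs against the final tree, ends up reporting "Reuses operator id: unknown".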
[jira] [Created] (SPARK-42753) ReusedExchange refers to non-existen node
Steven Chen created SPARK-42753: --- Summary: ReusedExchange refers to non-existen node Key: SPARK-42753 URL: https://issues.apache.org/jira/browse/SPARK-42753 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 3.4.0 Reporter: Steven Chen
[jira] [Updated] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage
[ https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42101: -- Affects Version/s: 3.5.0 (was: 3.4.0) > Wrap InMemoryTableScanExec with QueryStage > -- > > Key: SPARK-42101 > URL: https://issues.apache.org/jira/browse/SPARK-42101 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.5.0 > Reporter: XiDuo You > Priority: Major > > The first access to a cached plan that has AQE enabled is tricky: currently, we cannot preserve its output partitioning and ordering. > The whole query plan also misses many optimizations in the AQE framework. Wrapping InMemoryTableScanExec in a query stage can resolve all of these issues.
[jira] [Resolved] (SPARK-42718) Upgrade rocksdbjni to 7.10.2
[ https://issues.apache.org/jira/browse/SPARK-42718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42718. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40337 [https://github.com/apache/spark/pull/40337] > Upgrade rocksdbjni to 7.10.2 > > > Key: SPARK-42718 > URL: https://issues.apache.org/jira/browse/SPARK-42718 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > https://github.com/facebook/rocksdb/releases/tag/v7.10.2 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42718) Upgrade rocksdbjni to 7.10.2
[ https://issues.apache.org/jira/browse/SPARK-42718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42718: - Assignee: Yang Jie > Upgrade rocksdbjni to 7.10.2 > > > Key: SPARK-42718 > URL: https://issues.apache.org/jira/browse/SPARK-42718 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > https://github.com/facebook/rocksdb/releases/tag/v7.10.2 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution
[ https://issues.apache.org/jira/browse/SPARK-42752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42752: Assignee: Apache Spark > Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop > Free" distribution > --- > > Key: SPARK-42752 > URL: https://issues.apache.org/jira/browse/SPARK-42752 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0 > Environment: local >Reporter: Gera Shegalov >Assignee: Apache Spark >Priority: Major > > Reproduction steps: > 1. download a standard "Hadoop Free" build > 2. Start pyspark REPL with Hive support > {code:java} > SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) > ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf > spark.sql.catalogImplementation=hive > {code} > 3. Execute any simple dataframe operation > {code:java} > >>> spark.range(100).show() > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", > line 416, in range > jdf = self._jsparkSession.range(0, int(start), int(step), > int(numPartitions)) > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", > line 1321, in __call__ > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", > line 117, in deco > raise converted from None > pyspark.sql.utils.IllegalArgumentException: > {code} > 4. In fact you can just call spark.conf to trigger this issue > {code:java} > >>> spark.conf > Traceback (most recent call last): > File "", line 1, in > ... 
> {code} > There are probably two issues here: > 1) Hive support should be gracefully disabled if the dependency is not > on the classpath, as claimed by > https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html > 2) but at the very least the user should be able to see the exception to > understand the issue and take action > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
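Why the traceback ends in a bare `pyspark.sql.utils.IllegalArgumentException:` with no text can be sketched in plain Python. This is a hedged stand-in (the class and function names below only imitate the `deco()` conversion in `pyspark/sql/utils.py`, and the empty-message JVM failure is assumed): the Python side rebuilds the exception from the Java message alone and re-raises with `from None`, so an empty Java message leaves the user with neither text nor the Java stack.

```python
import traceback

class IllegalArgumentException(Exception):
    """Stand-in for pyspark.sql.utils.IllegalArgumentException."""

def jvm_call():
    # Pretend the JVM failed with an exception whose message is empty,
    # as assumed to happen while initializing the Hive catalog without
    # the Hive classes on the classpath.
    raise RuntimeError("")

def deco():
    # Imitation of the conversion wrapper: rebuild the exception from
    # the (empty) Java message and drop the informative Java stack.
    try:
        jvm_call()
    except RuntimeError as e:
        converted = IllegalArgumentException(str(e))
        raise converted from None

try:
    deco()
except IllegalArgumentException as e:
    message = str(e)
    rendered = "".join(traceback.format_exception_only(type(e), e))

print(repr(message))        # '' -- nothing for the user to act on
print(rendered.strip())     # just the exception class name
```

This is exactly the second issue in the report: even when graceful degradation is impossible, surfacing the original message (or chaining instead of `from None`) would give the user something to act on.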
[jira] [Commented] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution
[ https://issues.apache.org/jira/browse/SPARK-42752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699108#comment-17699108 ] Apache Spark commented on SPARK-42752: -- User 'gerashegalov' has created a pull request for this issue: https://github.com/apache/spark/pull/40372 > Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop > Free" distribution > --- > > Key: SPARK-42752 > URL: https://issues.apache.org/jira/browse/SPARK-42752 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0 > Environment: local >Reporter: Gera Shegalov >Priority: Major > > Reproduction steps: > 1. download a standard "Hadoop Free" build > 2. Start pyspark REPL with Hive support > {code:java} > SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) > ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf > spark.sql.catalogImplementation=hive > {code} > 3. Execute any simple dataframe operation > {code:java} > >>> spark.range(100).show() > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", > line 416, in range > jdf = self._jsparkSession.range(0, int(start), int(step), > int(numPartitions)) > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", > line 1321, in __call__ > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", > line 117, in deco > raise converted from None > pyspark.sql.utils.IllegalArgumentException: > {code} > 4. In fact you can just call spark.conf to trigger this issue > {code:java} > >>> spark.conf > Traceback (most recent call last): > File "", line 1, in > ... 
> {code} > There are probably two issues here: > 1) Hive support should be gracefully disabled if the dependency is not > on the classpath, as claimed by > https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html > 2) but at the very least the user should be able to see the exception to > understand the issue and take action > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution
[ https://issues.apache.org/jira/browse/SPARK-42752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42752: Assignee: (was: Apache Spark) > Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop > Free" distribution > --- > > Key: SPARK-42752 > URL: https://issues.apache.org/jira/browse/SPARK-42752 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0 > Environment: local >Reporter: Gera Shegalov >Priority: Major > > Reproduction steps: > 1. download a standard "Hadoop Free" build > 2. Start pyspark REPL with Hive support > {code:java} > SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) > ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf > spark.sql.catalogImplementation=hive > {code} > 3. Execute any simple dataframe operation > {code:java} > >>> spark.range(100).show() > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", > line 416, in range > jdf = self._jsparkSession.range(0, int(start), int(step), > int(numPartitions)) > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", > line 1321, in __call__ > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", > line 117, in deco > raise converted from None > pyspark.sql.utils.IllegalArgumentException: > {code} > 4. In fact you can just call spark.conf to trigger this issue > {code:java} > >>> spark.conf > Traceback (most recent call last): > File "", line 1, in > ... 
> {code} > There are probably two issues here: > 1) Hive support should be gracefully disabled if the dependency is not > on the classpath, as claimed by > https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html > 2) but at the very least the user should be able to see the exception to > understand the issue and take action > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution
[ https://issues.apache.org/jira/browse/SPARK-42752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699107#comment-17699107 ] Apache Spark commented on SPARK-42752: -- User 'gerashegalov' has created a pull request for this issue: https://github.com/apache/spark/pull/40372 > Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop > Free" distribution > --- > > Key: SPARK-42752 > URL: https://issues.apache.org/jira/browse/SPARK-42752 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0 > Environment: local >Reporter: Gera Shegalov >Priority: Major > > Reproduction steps: > 1. download a standard "Hadoop Free" build > 2. Start pyspark REPL with Hive support > {code:java} > SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) > ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf > spark.sql.catalogImplementation=hive > {code} > 3. Execute any simple dataframe operation > {code:java} > >>> spark.range(100).show() > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", > line 416, in range > jdf = self._jsparkSession.range(0, int(start), int(step), > int(numPartitions)) > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", > line 1321, in __call__ > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", > line 117, in deco > raise converted from None > pyspark.sql.utils.IllegalArgumentException: > {code} > 4. In fact you can just call spark.conf to trigger this issue > {code:java} > >>> spark.conf > Traceback (most recent call last): > File "", line 1, in > ... 
> {code} > There are probably two issues here: > 1) Hive support should be gracefully disabled if the dependency is not > on the classpath, as claimed by > https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html > 2) but at the very least the user should be able to see the exception to > understand the issue and take action > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution
[ https://issues.apache.org/jira/browse/SPARK-42752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-42752: -- Description: Reproduction steps: 1. download a standard "Hadoop Free" build 2. Start pyspark REPL with Hive support {code:java} SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf spark.sql.catalogImplementation=hive {code} 3. Execute any simple dataframe operation {code:java} >>> spark.range(100).show() Traceback (most recent call last): File "", line 1, in File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", line 416, in range jdf = self._jsparkSession.range(0, int(start), int(step), int(numPartitions)) File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", line 117, in deco raise converted from None pyspark.sql.utils.IllegalArgumentException: {code} 4. In fact you can just call spark.conf to trigger this issue {code:java} >>> spark.conf Traceback (most recent call last): File "", line 1, in ... {code} There are probably two issues here: 1) Hive support should be gracefully disabled if the dependency is not on the classpath, as claimed by https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html 2) but at the very least the user should be able to see the exception to understand the issue and take action was: Reproduction steps: 1. download a standard "Hadoop Free" build 2. Start pyspark REPL with Hive support {code:java} SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf spark.sql.catalogImplementation=hive {code} 3. 
Execute any simple dataframe operation {code:java} >>> spark.range(100).show() Traceback (most recent call last): File "", line 1, in File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", line 416, in range jdf = self._jsparkSession.range(0, int(start), int(step), int(numPartitions)) File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", line 117, in deco raise converted from None pyspark.sql.utils.IllegalArgumentException: >>> spark.conf Traceback (most recent call last): File "", line 1, in File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", line 347, in conf self._conf = RuntimeConfig(self._jsparkSession.conf()) File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", line 117, in deco raise converted from None pyspark.sql.utils.IllegalArgumentException: {code} 4. In fact you can just call spark.conf to trigger this issue {code:java} >>> spark.conf Traceback (most recent call last): File "", line 1, in ... 
{code} There are probably two issues here: 1) Hive support should be gracefully disabled if the dependency is not on the classpath, as claimed by https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html 2) but at the very least the user should be able to see the exception to understand the issue and take action > Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop > Free" distribution > --- > > Key: SPARK-42752 > URL: https://issues.apache.org/jira/browse/SPARK-42752 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0 > Environment: local >Reporter: Gera Shegalov >Priority: Major > > Reproduction steps: > 1. download a standard "Hadoop Free" build > 2. Start pyspark REPL with Hive support > {code:java} > SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) > ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf > spark.sql.catalogImplementation=hive > {code} > 3. Execute any simple dataframe operation > {code:java} > >>> spark.range(100).show() > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", > line 416, in range > jdf = self._jsparkSession.range(0, int(start), int(step), > int(numPartitions)) > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", > line 1321, in
[jira] [Created] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution
Gera Shegalov created SPARK-42752: - Summary: Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution Key: SPARK-42752 URL: https://issues.apache.org/jira/browse/SPARK-42752 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0 Environment: local Reporter: Gera Shegalov Reproduction steps: 1. download a standard "Hadoop Free" build 2. Start pyspark REPL with Hive support {code:java} SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf spark.sql.catalogImplementation=hive {code} 3. Execute any simple dataframe operation {code:java} >>> spark.range(100).show() Traceback (most recent call last): File "", line 1, in File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", line 416, in range jdf = self._jsparkSession.range(0, int(start), int(step), int(numPartitions)) File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", line 117, in deco raise converted from None pyspark.sql.utils.IllegalArgumentException: >>> spark.conf Traceback (most recent call last): File "", line 1, in File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", line 347, in conf self._conf = RuntimeConfig(self._jsparkSession.conf()) File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", line 117, in deco raise converted from None pyspark.sql.utils.IllegalArgumentException: {code} 4. In fact you can just call spark.conf to trigger this issue {code:java} >>> spark.conf Traceback (most recent call last): File "", line 1, in ... 
{code} There are probably two issues here: 1) Hive support should be gracefully disabled if the dependency is not on the classpath, as claimed by https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html 2) but at the very least the user should be able to see the exception to understand the issue and take action -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41498) Union does not propagate Metadata output
[ https://issues.apache.org/jira/browse/SPARK-41498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41498: Assignee: Apache Spark > Union does not propagate Metadata output > > > Key: SPARK-41498 > URL: https://issues.apache.org/jira/browse/SPARK-41498 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1 >Reporter: Fredrik Klauß >Assignee: Apache Spark >Priority: Major > > Currently, the Union operator does not propagate any metadata output. This > makes it impossible to access any metadata if a Union operator is used, even > though the children have the exact same metadata output. > Example: > > {code:java} > val df1 = spark.read.load(path1) > val df2 = spark.read.load(path2) > df1.union(df2).select("_metadata.file_path"). // <-- fails{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
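The reported behavior can be modeled in a few lines of plain Python. This is a toy sketch with hypothetical names, not Spark's Catalyst internals: each child of the union exposes the same metadata columns, but the union node itself exposes none, so resolving `_metadata.file_path` on top of it fails.

```python
class Scan:
    # Metadata columns a file-source scan could expose (illustrative names
    # matching the report's example).
    metadata_output = ["_metadata.file_path", "_metadata.file_size"]

class Union:
    def __init__(self, *children):
        self.children = children

    @property
    def metadata_output(self):
        # Behavior described in the report: nothing is propagated.
        return []

    def metadata_output_fixed(self):
        # The fix the report implies: when every child exposes an
        # identical metadata output, the union can expose it too.
        outs = [tuple(c.metadata_output) for c in self.children]
        return list(outs[0]) if len(set(outs)) == 1 else []

u = Union(Scan(), Scan())
print(u.metadata_output)          # [] -> "_metadata.file_path" unresolvable
print(u.metadata_output_fixed())  # shared metadata columns survive
```

Restricting propagation to the case where all children agree is what makes the change safe: a union of sources with differing metadata still exposes nothing.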
[jira] [Assigned] (SPARK-41498) Union does not propagate Metadata output
[ https://issues.apache.org/jira/browse/SPARK-41498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41498: Assignee: (was: Apache Spark) > Union does not propagate Metadata output > > > Key: SPARK-41498 > URL: https://issues.apache.org/jira/browse/SPARK-41498 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > > Currently, the Union operator does not propagate any metadata output. This > makes it impossible to access any metadata if a Union operator is used, even > though the children have the exact same metadata output. > Example: > > {code:java} > val df1 = spark.read.load(path1) > val df2 = spark.read.load(path2) > df1.union(df2).select("_metadata.file_path"). // <-- fails{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41498) Union does not propagate Metadata output
[ https://issues.apache.org/jira/browse/SPARK-41498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699076#comment-17699076 ] Dongjoon Hyun commented on SPARK-41498: --- This is reverted via https://github.com/apache/spark/commit/164db5ba3c39614017f5ef6428194a442d79b425 > Union does not propagate Metadata output > > > Key: SPARK-41498 > URL: https://issues.apache.org/jira/browse/SPARK-41498 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > > Currently, the Union operator does not propagate any metadata output. This > makes it impossible to access any metadata if a Union operator is used, even > though the children have the exact same metadata output. > Example: > > {code:java} > val df1 = spark.read.load(path1) > val df2 = spark.read.load(path2) > df1.union(df2).select("_metadata.file_path"). // <-- fails{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41498) Union does not propagate Metadata output
[ https://issues.apache.org/jira/browse/SPARK-41498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-41498: -- Fix Version/s: (was: 3.4.0) > Union does not propagate Metadata output > > > Key: SPARK-41498 > URL: https://issues.apache.org/jira/browse/SPARK-41498 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > > Currently, the Union operator does not propagate any metadata output. This > makes it impossible to access any metadata if a Union operator is used, even > though the children have the exact same metadata output. > Example: > > {code:java} > val df1 = spark.read.load(path1) > val df2 = spark.read.load(path2) > df1.union(df2).select("_metadata.file_path"). // <-- fails{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-41498) Union does not propagate Metadata output
[ https://issues.apache.org/jira/browse/SPARK-41498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-41498: --- Assignee: (was: Fredrik Klauß) > Union does not propagate Metadata output > > > Key: SPARK-41498 > URL: https://issues.apache.org/jira/browse/SPARK-41498 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > Fix For: 3.4.0 > > > Currently, the Union operator does not propagate any metadata output. This > makes it impossible to access any metadata if a Union operator is used, even > though the children have the exact same metadata output. > Example: > > {code:java} > val df1 = spark.read.load(path1) > val df2 = spark.read.load(path2) > df1.union(df2).select("_metadata.file_path"). // <-- fails{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42685) optimize byteToString routines
[ https://issues.apache.org/jira/browse/SPARK-42685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42685: -- Affects Version/s: 3.5.0 (was: 3.3.2) > optimize byteToString routines > -- > > Key: SPARK-42685 > URL: https://issues.apache.org/jira/browse/SPARK-42685 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Alkis Evlogimenos >Priority: Minor > Fix For: 3.5.0 > > > {{Utils.byteToString routines are slow because they use BigInt and > BigDecimal. This is causing visible CPU usage (1-2% in scan benchmarks).}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
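The optimization the report asks for can be illustrated with a small sketch. Spark's actual `Utils.bytesToString` is Scala; this Python version only shows the idea of replacing arbitrary-precision `BigInt`/`BigDecimal` arithmetic with bit shifts and ordinary floats. The unit table and the 2x cutover threshold are choices for this sketch, not necessarily Spark's exact formatting rules.

```python
# Binary units from largest to smallest, each as (suffix, bit shift).
_UNITS = [("EiB", 60), ("PiB", 50), ("TiB", 40), ("GiB", 30), ("MiB", 20), ("KiB", 10)]

def byte_to_string(size: int) -> str:
    """Format a byte count using only integer shifts and float division."""
    for unit, shift in _UNITS:
        threshold = 1 << shift
        # Switch to a unit once the value is at least 2x that unit
        # (an arbitrary cutover chosen for this sketch).
        if size >= 2 * threshold:
            return f"{size / threshold:.1f} {unit}"
    return f"{size} B"

print(byte_to_string(5 * 1024 * 1024))  # 5.0 MiB
print(byte_to_string(3072))             # 3.0 KiB
print(byte_to_string(500))              # 500 B
```

Since a `Long` byte count always fits in 63 bits, none of the divisions here can overflow, which is why the arbitrary-precision types are avoidable on this hot path.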
[jira] [Updated] (SPARK-42751) Pyspark.pandas.series.str.findall can't handle tuples that are returned by regex
[ https://issues.apache.org/jira/browse/SPARK-42751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] IonK updated SPARK-42751: - Description: When you use the str.findall accessor method on a ps.series and you're passing a regex pattern that will return match groups, it will return a pyarrow data error. In pandas the result is this: {code:java} df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE) returns [("value", , , , )], [("value", , , , )], [(, , ,"value", )]{code} In pyspark.pandas the result is: {code:java} org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object'.{code} My temporary workaround is using {code:java} df.apply(lambda x: re.findall(regex_pattern, x, flags=re.IGNORECASE)[0]{code} was: When you use the str.findall accessor method on a ps.series and you're passing a regex pattern that will return match groups, it will return a pyarrow data error. In pandas the result is this: {code:java} df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE) returns [("value", , , , )], [("value", , , , )], [(, , ,value , )]{code} In pyspark.pandas the result is: {code:java} org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object'.{code} My temporary workaround is using {code:java} df.apply(lambda x: re.findall(regex_pattern, x, flags=re.IGNORECASE)[0]{code} > Pyspark.pandas.series.str.findall can't handle tuples that are returned by > regex > > > Key: SPARK-42751 > URL: https://issues.apache.org/jira/browse/SPARK-42751 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.2 >Reporter: IonK >Priority: Major > > When you use the str.findall accessor method on a ps.series and you're > passing a regex pattern that will return match groups, it will return a > pyarrow data error. 
> In pandas the result is this: > {code:java} > df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE) > returns > [("value", , , , )], > [("value", , , , )], > [(, , ,"value", )]{code} > > In pyspark.pandas the result is: > {code:java} > org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: > Expected bytes, got a 'tuple' object'.{code} > > My temporary workaround is using > {code:java} > df.apply(lambda x: re.findall(regex_pattern, x, flags=re.IGNORECASE)[0]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
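The Arrow error above is consistent with plain `re.findall` semantics: when a pattern contains more than one capture group, each match is a tuple of strings rather than a string, so a column of such results cannot be serialized as an Arrow string array. A minimal stdlib demonstration (the pattern and input below are made up, not the reporter's data):

```python
import re

# Two capture groups -> findall yields one tuple per match.
pattern = r"(\d+)-(\w+)"
matches = re.findall(pattern, "12-ab 34-cd", flags=re.IGNORECASE)
print(matches)  # [('12', 'ab'), ('34', 'cd')] -- tuples, not strings

# The reporter's workaround keeps only the first match per row,
# which is still a tuple here:
first = re.findall(pattern, "12-ab 34-cd", flags=re.IGNORECASE)[0]
print(first)    # ('12', 'ab')
```

With a single capture group (or none), `findall` returns plain strings, which is the shape the Arrow conversion in pandas-on-Spark can handle.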
[jira] [Updated] (SPARK-42751) Pyspark.pandas.series.str.findall can't handle tuples that are returned by regex
[ https://issues.apache.org/jira/browse/SPARK-42751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] IonK updated SPARK-42751: - Description: When you use the str.findall accessor method on a ps.series and you're passing a regex pattern that will return match groups, it will return a pyarrow data error. In pandas the result is this: {code:java} df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE) returns [("value", , , , )], [("value", , , , )], [(, , ,value , )]{code} In pyspark.pandas the result is: {code:java} org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object'.{code} My temporary workaround is using {code:java} df.apply(lambda x: re.findall(regex_pattern, x, flags=re.IGNORECASE)[0]{code} was: When you use the str.findall accessor method on a ps.series and you're passing a regex pattern that will return match groups, it will return a pyarrow data error. In pandas the result is this: {code:java} df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE) returns [("value", , , , )], [("value", , , , )], [(, , ,value , )]{code} In pyspark.pandas the result is: {code:java} org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object'.{code} > Pyspark.pandas.series.str.findall can't handle tuples that are returned by > regex > > > Key: SPARK-42751 > URL: https://issues.apache.org/jira/browse/SPARK-42751 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.2 >Reporter: IonK >Priority: Major > > When you use the str.findall accessor method on a ps.series and you're > passing a regex pattern that will return match groups, it will return a > pyarrow data error. 
> In pandas the result is this: > {code:java} > df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE) > returns > [("value", , , , )], > [("value", , , , )], > [(, , ,value , )]{code} > > In pyspark.pandas the result is: > {code:java} > org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: > Expected bytes, got a 'tuple' object'.{code} > > My temporary workaround is using > {code:java} > df.apply(lambda x: re.findall(regex_pattern, x, flags=re.IGNORECASE)[0]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42751) Pyspark.pandas.series.str.findall can't handle tuples that are returned by regex
IonK created SPARK-42751: Summary: Pyspark.pandas.series.str.findall can't handle tuples that are returned by regex Key: SPARK-42751 URL: https://issues.apache.org/jira/browse/SPARK-42751 Project: Spark Issue Type: Bug Components: Pandas API on Spark Affects Versions: 3.3.2 Reporter: IonK When you use the str.findall accessor method on a ps.series and you're passing a regex pattern that will return match groups, it will return a pyarrow data error. In pandas the result is this: {code:java} df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE) returns [("value", , , , )], [("value", , , , )], [(, , ,value , )]{code} In pyspark.pandas the result is: org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object'. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42751) Pyspark.pandas.series.str.findall can't handle tuples that are returned by regex
[ https://issues.apache.org/jira/browse/SPARK-42751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] IonK updated SPARK-42751: - Description: When you use the str.findall accessor method on a ps.series and you're passing a regex pattern that will return match groups, it will return a pyarrow data error. In pandas the result is this: {code:java} df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE) returns [("value", , , , )], [("value", , , , )], [(, , ,value , )]{code} In pyspark.pandas the result is: {code:java} org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object'.{code} was: When you use the str.findall accessor method on a ps.series and you're passing a regex pattern that will return match groups, it will return a pyarrow data error. In pandas the result is this: {code:java} df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE) returns [("value", , , , )], [("value", , , , )], [(, , ,value , )]{code} In pyspark.pandas the result is: {code:java} org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object'.{code} > Pyspark.pandas.series.str.findall can't handle tuples that are returned by > regex > > > Key: SPARK-42751 > URL: https://issues.apache.org/jira/browse/SPARK-42751 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.2 >Reporter: IonK >Priority: Major > > When you use the str.findall accessor method on a ps.series and you're > passing a regex pattern that will return match groups, it will return a > pyarrow data error. 
> In pandas the result is this: > {code:java} > df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE) > returns > [("value", , , , )], > [("value", , , , )], > [(, , ,value , )]{code} > > In pyspark.pandas the result is: > {code:java} > org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: > Expected bytes, got a 'tuple' object'.{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42751) Pyspark.pandas.series.str.findall can't handle tuples that are returned by regex
[ https://issues.apache.org/jira/browse/SPARK-42751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] IonK updated SPARK-42751: - Description: When you use the str.findall accessor method on a ps.series and you're passing a regex pattern that will return match groups, it will return a pyarrow data error. In pandas the result is this: {code:java} df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE) returns [("value", , , , )], [("value", , , , )], [(, , ,value , )]{code} In pyspark.pandas the result is: {code:java} org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object'.{code} was: When you use the str.findall accessor method on a ps.series and you're passing a regex pattern that will return match groups, it will return a pyarrow data error. In pandas the result is this: {code:java} df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE) returns [("value", , , , )], [("value", , , , )], [(, , ,value , )]{code} In pyspark.pandas the result is: org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object'. > Pyspark.pandas.series.str.findall can't handle tuples that are returned by > regex > > > Key: SPARK-42751 > URL: https://issues.apache.org/jira/browse/SPARK-42751 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.2 >Reporter: IonK >Priority: Major > > When you use the str.findall accessor method on a ps.series and you're > passing a regex pattern that will return match groups, it will return a > pyarrow data error. 
> In pandas the result is this: > > {code:java} > df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE) > returns > [("value", , , , )], > [("value", , , , )], > [(, , ,value , )]{code} > In pyspark.pandas the result is: > {code:java} > org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError: > Expected bytes, got a 'tuple' object'.{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
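The tuple shapes reported above come from how Python's regex engine behaves when a pattern contains capture groups, which is what pandas `str.findall` relies on. The following stdlib-only sketch illustrates that behavior (it demonstrates the input shape that trips up Arrow serialization in pandas-on-Spark; it is not the Spark fix itself):

```python
import re

# re.findall returns plain strings when the pattern has no capture groups,
# but tuples of group values when it has two or more groups -- the shape
# that pyarrow then rejects with "Expected bytes, got a 'tuple' object".
pattern_no_groups = r"\d+"
pattern_groups = r"(\d+)-(\w+)"

text = "12-ab 34-cd"

flat = re.findall(pattern_no_groups, text)     # ['12', '34']
grouped = re.findall(pattern_groups, text)     # [('12', 'ab'), ('34', 'cd')]
```

Because each element of `grouped` is a tuple rather than a string, any serializer expecting a list of strings per row will fail in the way the report describes.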
[jira] [Updated] (SPARK-42743) Support analyze TimestampNTZ columns
[ https://issues.apache.org/jira/browse/SPARK-42743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-42743: --- Fix Version/s: 3.4.0 (was: 3.5.0) > Support analyze TimestampNTZ columns > > > Key: SPARK-42743 > URL: https://issues.apache.org/jira/browse/SPARK-42743 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42661) CSV Reader - multiline without quoted fields
[ https://issues.apache.org/jira/browse/SPARK-42661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699056#comment-17699056 ] Sean R. Owen commented on SPARK-42661: -- I don't know that multi-line makes sense without quoting in your case, as you have values broken across lines. You should quote > CSV Reader - multiline without quoted fields > > > Key: SPARK-42661 > URL: https://issues.apache.org/jira/browse/SPARK-42661 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.1 > Environment: unquoted data > {code} > NAME,Address,CITY > Atlassian,Level 6 341 George Street > Sydney NSW 2000 Australia,Sydney > Github,88 Colin P Kelly Junior Street > San Francisco CA 94107 USA,San Francisco > {code} > quoted data : > {code} > "NAME","Address","CITY" > "Atlassian","Level 6 341 George Street > Sydney NSW 2000 Australia","Sydney" > "Github","88 Colin P Kelly Junior Street > San Francisco CA 94107 USA","San Francisco" > {code} >Reporter: Florian FERREIRA >Priority: Minor > Attachments: Capture d’écran 2023-03-03 à 12.18.07.png > > > Hello, > We are facing an issue with the CSV format. > When we try to read a "multiline file without quoted fields" the expected > result is not good. > With quoted fields, all is ok. ( cf the screenshot ) > You can reproduce it easily with this code (just replace file path ) : > {code:java} > spark.read.options(Map( > "multiline" -> "true", > "quote" -> "", > "header" -> "true", > )).csv("/Users/fferreira/correct_multiline.csv").show(false) > spark.read.options(Map( > "multiline" -> "true", > "header" -> "true", > )).csv("/Users/fferreira/correct_multiline_with_quote.csv").show(false) > {code} > We continue to investigate on our side. > Thanks you. > !image-2023-03-03-12-11-21-258.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
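The comment's point can be seen with Python's stdlib `csv` module: a field that spans physical lines is only recoverable when it is quoted, because without quotes a parser cannot distinguish a record break from an embedded newline. This is a generic CSV illustration, not Spark's reader:

```python
import csv
import io

# The quoted sample from the issue: the Address field contains a newline,
# and the surrounding quotes are what make the record boundary unambiguous.
quoted = (
    '"NAME","Address","CITY"\n'
    '"Atlassian","Level 6 341 George Street\nSydney NSW 2000 Australia","Sydney"\n'
)

rows = list(csv.reader(io.StringIO(quoted)))
# rows[1][1] keeps the embedded newline; the unquoted variant of this data
# would instead be split into two malformed records.
```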
[jira] [Updated] (SPARK-42685) optimize byteToString routines
[ https://issues.apache.org/jira/browse/SPARK-42685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-42685: - Priority: Minor (was: Major) > optimize byteToString routines > -- > > Key: SPARK-42685 > URL: https://issues.apache.org/jira/browse/SPARK-42685 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.2 >Reporter: Alkis Evlogimenos >Priority: Minor > > {{Utils.byteToString routines are slow because they use BigInt and > BigDecimal. This is causing visible CPU usage (1-2% in scan benchmarks).}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42685) optimize byteToString routines
[ https://issues.apache.org/jira/browse/SPARK-42685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42685. -- Fix Version/s: 3.5.0 Resolution: Fixed Resolved by https://github.com/apache/spark/pull/40301 > optimize byteToString routines > -- > > Key: SPARK-42685 > URL: https://issues.apache.org/jira/browse/SPARK-42685 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.2 >Reporter: Alkis Evlogimenos >Priority: Minor > Fix For: 3.5.0 > > > {{Utils.byteToString routines are slow because they use BigInt and > BigDecimal. This is causing visible CPU usage (1-2% in scan benchmarks).}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42036) Kryo ClassCastException getting task result when JDK versions mismatch
[ https://issues.apache.org/jira/browse/SPARK-42036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42036. -- Resolution: Not A Problem Mismatching java versions would never be supported per se > Kryo ClassCastException getting task result when JDK versions mismatch > -- > > Key: SPARK-42036 > URL: https://issues.apache.org/jira/browse/SPARK-42036 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > {noformat} > 22/12/21 01:27:12 ERROR TaskResultGetter: Exception while getting task result > com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException: > java.lang.Integer cannot be cast to java.nio.ByteBuffer > Serialization trace: > lowerBounds (org.apache.iceberg.GenericDataFile) > taskFiles (org.apache.iceberg.spark.source.SparkWrite$TaskCommit) > writerCommitMessage > (org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144) > {noformat} > Iceberg 1.1 `BaseFile.lowerBounds` is defined as > {code:java} > Map {code} > Driver JDK version: 1.8.0_352 (Azul Systems, Inc.) > Executor JDK version: openjdk version "17.0.5" 2022-10-18 LTS > Kryo version: 4.0.2 > > Same Spark job works when both driver and executors run the same JDK 8 or JDK > 17. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42198) spark.read fails to read filenames with accented characters
[ https://issues.apache.org/jira/browse/SPARK-42198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699044#comment-17699044 ] Sean R. Owen commented on SPARK-42198: -- You would not add /dbfs on Databricks in this case, that's not relevant or the issue. What if you escape the path as if in a URL? > spark.read fails to read filenames with accented characters > --- > > Key: SPARK-42198 > URL: https://issues.apache.org/jira/browse/SPARK-42198 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Tarique Anwer >Priority: Major > > Unable to read filenames with accented characters in the filename. > *Sample error:* > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 43 in > stage 1.0 failed 4 times, most recent failure: Lost task 43.3 in stage 1.0 > (TID 105) (10.139.64.5 executor 0): java.io.FileNotFoundException: > /4842022074360943/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/Amalia471_Magaña874_3912696a-0aef-492e-83ef-468262b82966.xml{code} > > *{{Steps to reproduce error:}}* > {code:java} > %sh > mkdir -p /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass > wget > https://synthetichealth.github.io/synthea-sample-data/downloads/synthea_sample_data_ccda_sep2019.zip > -O ./synthea_sample_data_ccda_sep2019.zip > unzip ./synthea_sample_data_ccda_sep2019.zip -d > /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ > {code} > > {code:java} > spark.conf.set("spark.sql.caseSensitive", "true") > df = ( > spark.read.format('xml') > .option("rowTag", "ClinicalDocument") > .load('/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/') > ){code} > Is there a way to deal with this situation where I don't have control over > the file names for some reason? 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
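The suggestion to escape the path "as if in a URL" can be tried with stdlib `urllib.parse.quote`. Whether this actually resolves the `FileNotFoundException` depends on the filesystem connector in use, so treat this as a hypothetical workaround to experiment with, not a confirmed fix; the path below is a shortened stand-in for the one in the report:

```python
from urllib.parse import quote

# Percent-encode the non-ASCII characters in the file path before
# handing it to spark.read; '/' is kept as a path separator.
path = "/raw_files/synthea_mass/ccda/Magaña874.xml"
escaped = quote(path, safe="/")
# The accented 'ñ' becomes its UTF-8 percent-encoding, %C3%B1.
```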
[jira] [Resolved] (SPARK-42127) Spark 3.3.0, Error with java.io.IOException: Mkdirs failed to create file
[ https://issues.apache.org/jira/browse/SPARK-42127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42127. -- Resolution: Not A Problem No detail here, not obvious that this isn't just a permissions issue > Spark 3.3.0, Error with java.io.IOException: Mkdirs failed to create file > - > > Key: SPARK-42127 > URL: https://issues.apache.org/jira/browse/SPARK-42127 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: shamim >Priority: Major > > 23/01/18 20:23:24 WARN TaskSetManager: Lost task 4.0 in stage 0.0 (TID 4) > (10.64.109.72 executor 0): java.io.IOException: Mkdirs failed to create > file:/var/backup/_temporary/0/_temporary/attempt_202301182023173234741341853025716_0005_m_04_0 > (exists=false, cwd=file:/opt/spark-3.3.0/work/app-20230118202317-0001/0) > at > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:515) > at > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:500) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1081) > at > org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:113) > at > org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:238) > at > org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:126) > at > org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:88) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:136) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > 
at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42425) spark-hadoop-cloud is not provided in the default Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-42425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699042#comment-17699042 ] Sean R. Owen commented on SPARK-42425: -- The docs don't say it's part of the Spark distro; in fact, they tell you to bundle it in your app. It is not bundled on purpose. > spark-hadoop-cloud is not provided in the default Spark distribution > > > Key: SPARK-42425 > URL: https://issues.apache.org/jira/browse/SPARK-42425 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.3.1 >Reporter: Arseniy Tashoyan >Priority: Major > > The library spark-hadoop-cloud is absent in the default Spark distribution > (as well as its dependencies like hadoop-aws). Therefore the dependency > management section described in [Integration with Cloud > Infrastructures|https://spark.apache.org/docs/3.3.1/cloud-integration.html#installation] > is invalid. Actually the libraries for cloud integration are not provided. > A naive workaround would be to add the spark-hadoop-cloud library as a > compile-scope dependency. However, this does not work due to Spark classpath > hierarchy. Spark system classloader does not see classes loaded by the > application classloader. > Therefore a proper fix would be to enable the hadoop-cloud build profile by > default: -Phadoop-cloud -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-42425) spark-hadoop-cloud is not provided in the default Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-42425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42425. -- Resolution: Not A Problem > spark-hadoop-cloud is not provided in the default Spark distribution > > > Key: SPARK-42425 > URL: https://issues.apache.org/jira/browse/SPARK-42425 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.3.1 >Reporter: Arseniy Tashoyan >Priority: Major > > The library spark-hadoop-cloud is absent in the default Spark distribution > (as well as its dependencies like hadoop-aws). Therefore the dependency > management section described in [Integration with Cloud > Infrastructures|https://spark.apache.org/docs/3.3.1/cloud-integration.html#installation] > is invalid. Actually the libraries for cloud integration are not provided. > A naive workaround would be to add the spark-hadoop-cloud library as a > compile-scope dependency. However, this does not work due to Spark classpath > hierarchy. Spark system classloader does not see classes loaded by the > application classloader. > Therefore a proper fix would be to enable the hadoop-cloud build profile by > default: -Phadoop-cloud -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42479) Log4j2 doesn't work with Spark 3.3.0
[ https://issues.apache.org/jira/browse/SPARK-42479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699041#comment-17699041 ] Sean R. Owen commented on SPARK-42479: -- This is because you're pointing at some local path not visible on the workers. It's quite expected > Log4j2 doesn't works with Spark 3.3.0 > - > > Key: SPARK-42479 > URL: https://issues.apache.org/jira/browse/SPARK-42479 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 3.2.1, 3.3.0 >Reporter: Pratik Malani >Priority: Major > > Hi All, > Was trying to run spark application on the cluster mode using log4j2 and > Spark 3.3.0. > When I run the below spark-submit command, only one worker (out of 3) starts > executing the job. > {code:java} > // code placeholder > spark-submit --master spark://spark-master-svc:7077 \ > --conf spark.cores.max=4 \ > --conf spark.sql.broadcastTimeout=3600 \ > --conf spark.executor.cores=1 \ > --jars /opt/spark/work-dir/.jar \ > --deploy-mode cluster \ > --class \ > --properties-file /opt/spark/conf/spark-defaults.conf \ > --conf > spark.driver.extraJavaOptions="-Dcom.amazonaws.sdk.disableCertChecking=true > -Dlog4j.configurationFile=file:/opt/spark/work-dir/.properties" \ > --conf > spark.executor.extraJavaOptions="-Dcom.amazonaws.sdk.disableCertChecking=true > -Dlog4j.configurationFile=file:/opt/spark/work-dir/.properties" \ > --files "/opt/spark/work-dir/.properties" \ > /opt/spark/work-dir/.jar > /opt/spark/work-dir/application.properties >> /var/log/containers/hourly.log > 2>&1 {code} > It means, on only one worker, I can see the driver logs and the other workers > are idle and there are no app or executors logs created on other workers. > Below is the log4j2.properties file being used. 
> {code:java} > // code placeholder > rootLogger.level = INFO > rootLogger.appenderRef.rolling.ref = loggerId > appender.rolling.type = RollingFile > appender.rolling.name = loggerId > appender.rolling.fileName=/var/log/containers/hourly.log > appender.rolling.filePattern=hourly-.%d{MMdd}.log.gz > appender.rolling.layout.type = PatternLayout > appender.rolling.layout.pattern=%d [%t] %-5p (%F:%L) - %m%n > appender.rolling.policies.type = Policies > appender.rolling.policies.size.type = TimeBasedTriggeringPolicy > appender.rolling.strategy.type = DefaultRolloverStrategy > appender.rolling.strategy.max = 5 > logger.spark.name = org.apache.spark > logger.spark.level = WARN > logger.spark.additivity = false > logger.spark.repl.SparkIMain$exprTyper.level = INFO > logger.spark.repl.SparkILoop$SparkILoopInterpreter.level = INFO > # Settings to quiet third party logs that are too verbose > logger.jetty.name = org.eclipse.jetty > logger.jetty.level = WARN > logger.jetty.util.component.AbstractLifeCycle.level = ERROR > logger.parquet.name = org.apache.parquet > logger.parquet.level = ERROR > logger.kafka.name = org.apache.kafka > logger.kafka.level = WARN > logger.kafka.clients.consumer.internals.Fetcher.level=WARN {code} > All log4j2 jars are included in the Spark home classpath under the jars > directory. > * log4j-1.2-api-2.17.2.jar > * log4j-api-2.17.2.jar > * log4j-api-scala_2.12-12.0.jar > * log4j-core-2.17.2.jar > * log4j-slf4j-impl-2.17.2.jar > Can you please check and let me know whether I need to add or update anything > to start the job in cluster mode supporting log4j2 > Note : Things work fine with log4j1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42479) Log4j2 doesn't work with Spark 3.3.0
[ https://issues.apache.org/jira/browse/SPARK-42479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42479. -- Resolution: Not A Problem > Log4j2 doesn't works with Spark 3.3.0 > - > > Key: SPARK-42479 > URL: https://issues.apache.org/jira/browse/SPARK-42479 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 3.2.1, 3.3.0 >Reporter: Pratik Malani >Priority: Major > > Hi All, > Was trying to run spark application on the cluster mode using log4j2 and > Spark 3.3.0. > When I run the below spark-submit command, only one worker (out of 3) starts > executing the job. > {code:java} > // code placeholder > spark-submit --master spark://spark-master-svc:7077 \ > --conf spark.cores.max=4 \ > --conf spark.sql.broadcastTimeout=3600 \ > --conf spark.executor.cores=1 \ > --jars /opt/spark/work-dir/.jar \ > --deploy-mode cluster \ > --class \ > --properties-file /opt/spark/conf/spark-defaults.conf \ > --conf > spark.driver.extraJavaOptions="-Dcom.amazonaws.sdk.disableCertChecking=true > -Dlog4j.configurationFile=file:/opt/spark/work-dir/.properties" \ > --conf > spark.executor.extraJavaOptions="-Dcom.amazonaws.sdk.disableCertChecking=true > -Dlog4j.configurationFile=file:/opt/spark/work-dir/.properties" \ > --files "/opt/spark/work-dir/.properties" \ > /opt/spark/work-dir/.jar > /opt/spark/work-dir/application.properties >> /var/log/containers/hourly.log > 2>&1 {code} > It means, on only one worker, I can see the driver logs and the other workers > are idle and there are no app or executors logs created on other workers. > Below is the log4j2.properties file being used. 
> {code:java} > // code placeholder > rootLogger.level = INFO > rootLogger.appenderRef.rolling.ref = loggerId > appender.rolling.type = RollingFile > appender.rolling.name = loggerId > appender.rolling.fileName=/var/log/containers/hourly.log > appender.rolling.filePattern=hourly-.%d{MMdd}.log.gz > appender.rolling.layout.type = PatternLayout > appender.rolling.layout.pattern=%d [%t] %-5p (%F:%L) - %m%n > appender.rolling.policies.type = Policies > appender.rolling.policies.size.type = TimeBasedTriggeringPolicy > appender.rolling.strategy.type = DefaultRolloverStrategy > appender.rolling.strategy.max = 5 > logger.spark.name = org.apache.spark > logger.spark.level = WARN > logger.spark.additivity = false > logger.spark.repl.SparkIMain$exprTyper.level = INFO > logger.spark.repl.SparkILoop$SparkILoopInterpreter.level = INFO > # Settings to quiet third party logs that are too verbose > logger.jetty.name = org.eclipse.jetty > logger.jetty.level = WARN > logger.jetty.util.component.AbstractLifeCycle.level = ERROR > logger.parquet.name = org.apache.parquet > logger.parquet.level = ERROR > logger.kafka.name = org.apache.kafka > logger.kafka.level = WARN > logger.kafka.clients.consumer.internals.Fetcher.level=WARN {code} > All log4j2 jars are included in the Spark home classpath under the jars > directory. > * log4j-1.2-api-2.17.2.jar > * log4j-api-2.17.2.jar > * log4j-api-scala_2.12-12.0.jar > * log4j-core-2.17.2.jar > * log4j-slf4j-impl-2.17.2.jar > Can you please check and let me know whether I need to add or update anything > to start the job in cluster mode supporting log4j2 > Note : Things work fine with log4j1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42607) [MESOS] OMP_NUM_THREADS not set to number of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42607. -- Resolution: Not A Problem > [MESOS] OMP_NUM_THREADS not set to number of executor cores by default > -- > > Key: SPARK-42607 > URL: https://issues.apache.org/jira/browse/SPARK-42607 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 3.3.2 >Reporter: John Zhuge >Priority: Major > > We could have similar issue to SPARK-42596 (YARN) in Mesos. > Could someone verify? Unfortunately I am not able to due to lack of infra. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42607) [MESOS] OMP_NUM_THREADS not set to number of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699040#comment-17699040 ] Sean R. Owen commented on SPARK-42607: -- I don't think we're touching Mesos at this point - all but deprecated > [MESOS] OMP_NUM_THREADS not set to number of executor cores by default > -- > > Key: SPARK-42607 > URL: https://issues.apache.org/jira/browse/SPARK-42607 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 3.3.2 >Reporter: John Zhuge >Priority: Major > > We could have similar issue to SPARK-42596 (YARN) in Mesos. > Could someone verify? Unfortunately I am not able to due to lack of infra. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42750) Support INSERT INTO by name
Jose Torres created SPARK-42750: --- Summary: Support INSERT INTO by name Key: SPARK-42750 URL: https://issues.apache.org/jira/browse/SPARK-42750 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Jose Torres In some use cases, users have incoming dataframes with fixed column names which might differ from the canonical order. Currently there's no way to handle this easily through the INSERT INTO API - the user has to make sure the columns are in the right order as they would when inserting a tuple. We should add an optional BY NAME clause, such that: INSERT INTO tgt BY NAME takes each column of and inserts it into the column in `tgt` which has the same name according to the configured `resolver` logic. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
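The proposed BY NAME semantics can be modeled in a few lines of plain Python: incoming columns are matched to the target table's columns by name rather than by position. The names below are invented for illustration, and this is a toy model of the proposal, not Spark's resolver logic:

```python
# Align an incoming row (a dict keyed by column name) to the target
# table's column order by matching names, instead of relying on the
# caller to supply values in the canonical positional order.
def align_by_name(row: dict, target_columns: list) -> tuple:
    return tuple(row[col] for col in target_columns)

# Incoming dataframe columns arrive in a different order than the table.
incoming = {"price": 9.99, "order_id": 7}
target = ["order_id", "price"]

aligned = align_by_name(incoming, target)  # values now in table order
```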
[jira] [Commented] (SPARK-42627) Spark: Getting SQLException: Unsupported type -102 reading from Oracle
[ https://issues.apache.org/jira/browse/SPARK-42627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699039#comment-17699039 ] Sean R. Owen commented on SPARK-42627: -- Not enough info here. What type is not supported? > Spark: Getting SQLException: Unsupported type -102 reading from Oracle > -- > > Key: SPARK-42627 > URL: https://issues.apache.org/jira/browse/SPARK-42627 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2 >Reporter: melin >Priority: Major > > > {code:java} > Exception in thread "main" org.apache.spark.SparkSQLException: Unrecognized > SQL type -102 > at > org.apache.spark.sql.errors.QueryExecutionErrors$.unrecognizedSqlTypeError(QueryExecutionErrors.scala:832) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:225) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:308) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:308) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.getQueryOutputSchema(JDBCRDD.scala:70) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:242) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:37) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171) > > 
{code} > oracle driver > {code:java} > > com.oracle.database.jdbc > ojdbc8 > 21.9.0.0 > {code} > > oracle sql: > > {code:java} > CREATE TABLE "ORDERS" > ( "ORDER_ID" NUMBER(9,0) NOT NULL ENABLE, > "ORDER_DATE" TIMESTAMP (3) WITH LOCAL TIME ZONE NOT NULL ENABLE, > "CUSTOMER_NAME" VARCHAR2(255) NOT NULL ENABLE, > "PRICE" NUMBER(10,5) NOT NULL ENABLE, > "PRODUCT_ID" NUMBER(9,0) NOT NULL ENABLE, > "ORDER_STATUS" NUMBER(1,0) NOT NULL ENABLE, > PRIMARY KEY ("ORDER_ID") > USING INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS > STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645 > PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE > DEFAULT CELL_FLASH_CACHE DEFAULT) > TABLESPACE "LOGMINER_TBS" ENABLE, > SUPPLEMENTAL LOG DATA (ALL) COLUMNS > ) SEGMENT CREATION IMMEDIATE > PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING > STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645 > PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE > DEFAULT CELL_FLASH_CACHE DEFAULT) > TABLESPACE "LOGMINER_TBS" > > {code} > [~beliefer] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42714) Sparksql temporary file conflict
[ https://issues.apache.org/jira/browse/SPARK-42714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699038#comment-17699038 ] Sean R. Owen commented on SPARK-42714: -- Not really enough info here. How does it happen? > Sparksql temporary file conflict > > > Key: SPARK-42714 > URL: https://issues.apache.org/jira/browse/SPARK-42714 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2 >Reporter: hao >Priority: Major > > When Spark SQL performs an INSERT OVERWRITE, the name of the intermediate > temporary file is not unique. As a result, when multiple applications write > different partitions to the same partitioned table, they can delete each > other's temporary files, causing task failures. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42749) CAST(x as int) does not generate error with overflow
[ https://issues.apache.org/jira/browse/SPARK-42749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tjomme Vergauwen updated SPARK-42749: - Attachment: Spark-42749.PNG > CAST(x as int) does not generate error with overflow > > > Key: SPARK-42749 > URL: https://issues.apache.org/jira/browse/SPARK-42749 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0, 3.3.1, 3.3.2 > Environment: It was tested on a DataBricks environment with DBR 10.4 > and above, running Spark v3.2.1 and above. >Reporter: Tjomme Vergauwen >Priority: Major > Attachments: Spark-42749.PNG > > > Hi, > When performing the following code: > {{select cast(7.415246799222789E19 as int)}} > according to the documentation, an error is expected as > {{7.415246799222789E19 }}is an overflow value for datatype INT. > However, the value 2147483647 is returned. > The behaviour of the following is correct as it returns NULL: > {{select try_cast(7.415246799222789E19 as int) }} > This results in unexpected behaviour and data corruption. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42749) CAST(x as int) does not generate error with overflow
Tjomme Vergauwen created SPARK-42749: Summary: CAST(x as int) does not generate error with overflow Key: SPARK-42749 URL: https://issues.apache.org/jira/browse/SPARK-42749 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.2, 3.3.1, 3.3.0, 3.2.1 Environment: It was tested on a Databricks environment with DBR 10.4 and above, running Spark v3.2.1 and above. Reporter: Tjomme Vergauwen Hi, When executing the following code: {{select cast(7.415246799222789E19 as int)}} according to the documentation, an error is expected, as {{7.415246799222789E19}} is an overflow value for datatype INT. However, the value 2147483647 is returned. The behaviour of the following is correct, as it returns NULL: {{select try_cast(7.415246799222789E19 as int)}} This results in unexpected behaviour and data corruption. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
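The value 2147483647 the reporter observed is INT_MAX, i.e. a saturating (non-ANSI) cast: out-of-range doubles clamp to the int range instead of raising. In Spark this behaviour is controlled by `spark.sql.ansi.enabled`; with ANSI mode on, the cast raises, and `try_cast` returns NULL. A pure-Python sketch of the three semantics (illustrative only, not Spark's implementation):

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def cast_legacy(x: float) -> int:
    """Non-ANSI semantics: clamp out-of-range values to the int range."""
    if x > INT_MAX:
        return INT_MAX
    if x < INT_MIN:
        return INT_MIN
    return int(x)

def cast_ansi(x: float) -> int:
    """ANSI semantics: overflow is an error."""
    if not (INT_MIN <= x <= INT_MAX):
        raise OverflowError(f"CAST_OVERFLOW: {x} does not fit in INT")
    return int(x)

def try_cast(x: float):
    """try_cast semantics: overflow yields NULL (None)."""
    try:
        return cast_ansi(x)
    except OverflowError:
        return None

assert cast_legacy(7.415246799222789e19) == 2147483647  # the value the reporter saw
assert try_cast(7.415246799222789e19) is None           # matches the try_cast result
```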
[jira] [Commented] (SPARK-41498) Union does not propagate Metadata output
[ https://issues.apache.org/jira/browse/SPARK-41498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699010#comment-17699010 ] Apache Spark commented on SPARK-41498: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/40371 > Union does not propagate Metadata output > > > Key: SPARK-41498 > URL: https://issues.apache.org/jira/browse/SPARK-41498 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1 >Reporter: Fredrik Klauß >Assignee: Fredrik Klauß >Priority: Major > Fix For: 3.4.0 > > > Currently, the Union operator does not propagate any metadata output. This > makes it impossible to access any metadata if a Union operator is used, even > though the children have the exact same metadata output. > Example: > > {code:java} > val df1 = spark.read.load(path1) > val df2 = spark.read.load(path2) > df1.union(df2).select("_metadata.file_path"). // <-- fails{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
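Conceptually, the fix is for Union to expose a metadata column only when every child exposes the same one. A simplified sketch of that propagation rule (a hypothetical helper, not Catalyst's actual implementation):

```python
def union_metadata_output(children_metadata: list) -> list:
    """Propagate only the metadata columns shared by all union children.

    If any child lacks a column, or there are no children, the union
    exposes no metadata, which is the conservative behaviour; the first
    child's ordering is preserved for determinism.
    """
    if not children_metadata:
        return []
    shared = set(children_metadata[0])
    for cols in children_metadata[1:]:
        shared &= set(cols)
    return [c for c in children_metadata[0] if c in shared]

# Both file-source children expose _metadata.file_path, so the union can too.
assert union_metadata_output(
    [["_metadata.file_path"], ["_metadata.file_path"]]
) == ["_metadata.file_path"]
```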
[jira] [Assigned] (SPARK-42620) Add `inclusive` parameter for (DataFrame|Series).between_time
[ https://issues.apache.org/jira/browse/SPARK-42620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42620: Assignee: (was: Apache Spark) > Add `inclusive` parameter for (DataFrame|Series).between_time > - > > Key: SPARK-42620 > URL: https://issues.apache.org/jira/browse/SPARK-42620 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > See https://github.com/pandas-dev/pandas/pull/43248 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42620) Add `inclusive` parameter for (DataFrame|Series).between_time
[ https://issues.apache.org/jira/browse/SPARK-42620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42620: Assignee: Apache Spark > Add `inclusive` parameter for (DataFrame|Series).between_time > - > > Key: SPARK-42620 > URL: https://issues.apache.org/jira/browse/SPARK-42620 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > See https://github.com/pandas-dev/pandas/pull/43248 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42620) Add `inclusive` parameter for (DataFrame|Series).between_time
[ https://issues.apache.org/jira/browse/SPARK-42620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698999#comment-17698999 ] Apache Spark commented on SPARK-42620: -- User 'dzhigimont' has created a pull request for this issue: https://github.com/apache/spark/pull/40370 > Add `inclusive` parameter for (DataFrame|Series).between_time > - > > Key: SPARK-42620 > URL: https://issues.apache.org/jira/browse/SPARK-42620 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > See https://github.com/pandas-dev/pandas/pull/43248 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
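The pandas `inclusive` parameter takes "both", "left", "right", or "neither"; the ticket mirrors that option set in the Pandas API on Spark. A minimal sketch of the boundary rule (pure Python, with times as minutes since midnight for brevity; no midnight wrap-around handling):

```python
def in_window(t: int, start: int, end: int, inclusive: str = "both") -> bool:
    """Return True if t falls between start and end under the boundary rule.

    Mirrors pandas' `inclusive` options; assumes start <= end to keep
    the sketch short.
    """
    if inclusive == "both":
        return start <= t <= end
    if inclusive == "left":
        return start <= t < end
    if inclusive == "right":
        return start < t <= end
    if inclusive == "neither":
        return start < t < end
    raise ValueError(f"inclusive must be 'both', 'left', 'right' or 'neither', got {inclusive!r}")

assert in_window(540, 540, 600)                        # 9:00 in [9:00, 10:00], both ends closed
assert not in_window(600, 540, 600, inclusive="left")  # right end open
```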
[jira] [Commented] (SPARK-42398) refine default column value framework
[ https://issues.apache.org/jira/browse/SPARK-42398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698982#comment-17698982 ] Apache Spark commented on SPARK-42398: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/40369 > refine default column value framework > - > > Key: SPARK-42398 > URL: https://issues.apache.org/jira/browse/SPARK-42398 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42748) Server-side Artifact Management
[ https://issues.apache.org/jira/browse/SPARK-42748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698952#comment-17698952 ] Apache Spark commented on SPARK-42748: -- User 'vicennial' has created a pull request for this issue: https://github.com/apache/spark/pull/40368 > Server-side Artifact Management > --- > > Key: SPARK-42748 > URL: https://issues.apache.org/jira/browse/SPARK-42748 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-42653 implements the client-side > transfer of artifacts to the server but currently, the server does not > process these requests. > > We need to implement a server-side management mechanism to handle storage of > these artifacts on the driver as well as perform further processing (such as > adding jars and moving class files to the right directories) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42748) Server-side Artifact Management
[ https://issues.apache.org/jira/browse/SPARK-42748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42748: Assignee: Apache Spark > Server-side Artifact Management > --- > > Key: SPARK-42748 > URL: https://issues.apache.org/jira/browse/SPARK-42748 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Apache Spark >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-42653 implements the client-side > transfer of artifacts to the server but currently, the server does not > process these requests. > > We need to implement a server-side management mechanism to handle storage of > these artifacts on the driver as well as perform further processing (such as > adding jars and moving class files to the right directories) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42748) Server-side Artifact Management
[ https://issues.apache.org/jira/browse/SPARK-42748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42748: Assignee: (was: Apache Spark) > Server-side Artifact Management > --- > > Key: SPARK-42748 > URL: https://issues.apache.org/jira/browse/SPARK-42748 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-42653 implements the client-side > transfer of artifacts to the server but currently, the server does not > process these requests. > > We need to implement a server-side management mechanism to handle storage of > these artifacts on the driver as well as perform further processing (such as > adding jars and moving class files to the right directories) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42748) Server-side Artifact Management
Venkata Sai Akhil Gudesa created SPARK-42748: Summary: Server-side Artifact Management Key: SPARK-42748 URL: https://issues.apache.org/jira/browse/SPARK-42748 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.4.0 Reporter: Venkata Sai Akhil Gudesa https://issues.apache.org/jira/browse/SPARK-42653 implements the client-side transfer of artifacts to the server but currently, the server does not process these requests. We need to implement a server-side management mechanism to handle storage of these artifacts on the driver as well as perform further processing (such as adding jars and moving class files to the right directories) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
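The "further processing" step described above amounts to routing each uploaded artifact by kind: jars onto the classpath, class files where the REPL classloader looks, everything else kept as a plain session file. A hypothetical routing sketch (the function and directory names are illustrative, not the actual Spark Connect layout):

```python
from pathlib import PurePosixPath

def artifact_target(name: str) -> str:
    """Pick a destination subdirectory for an uploaded artifact by suffix.

    Hypothetical layout: ".jar" artifacts go to a classpath directory,
    ".class" files to the REPL class directory, and anything else to a
    generic session-files directory.
    """
    suffix = PurePosixPath(name).suffix
    if suffix == ".jar":
        return "jars"
    if suffix == ".class":
        return "classes"
    return "files"

assert artifact_target("udfs.jar") == "jars"
assert artifact_target("pkg/MyUdf.class") == "classes"
```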
[jira] [Commented] (SPARK-42747) Fix incorrect internal status of LoR and AFT
[ https://issues.apache.org/jira/browse/SPARK-42747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698943#comment-17698943 ] Apache Spark commented on SPARK-42747: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40367 > Fix incorrect internal status of LoR and AFT > > > Key: SPARK-42747 > URL: https://issues.apache.org/jira/browse/SPARK-42747 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.1.0, 3.2.0, 3.3.0, 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > LoR and AFT applied internal status to optimize prediction/transform, but the > status is not correctly updated in some case: > {code:java} > from pyspark.sql import Row > from pyspark.ml.classification import * > from pyspark.ml.linalg import Vectors > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(0.0, 5.0)), > (0.0, 2.0, Vectors.dense(1.0, 2.0)), > (1.0, 3.0, Vectors.dense(2.0, 1.0)), > (0.0, 4.0, Vectors.dense(3.0, 3.0)), > ], > ["label", "weight", "features"], > ) > lor = LogisticRegression(weightCol="weight") > model = lor.fit(df) > # status changes 1 > for t in [0.0, 0.1, 0.2, 0.5, 1.0]: > model.setThreshold(t).transform(df) > # status changes 2 > [model.setThreshold(t).predict(Vectors.dense(0.0, 5.0)) for t in [0.0, 0.1, > 0.2, 0.5, 1.0]] > for t in [0.0, 0.1, 0.2, 0.5, 1.0]: > print(t) > model.setThreshold(t).transform(df).show() > # <- error results > {code} > results: > {code:java} > 0.0 > +-+--+-+++--+ > |label|weight| features| rawPrediction| probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-+++--+ > 0.1 > +-+--+-+++--+ > |label|weight| features| rawPrediction| 
probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-+++--+ > 0.2 > +-+--+-+++--+ > |label|weight| features| rawPrediction| probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-+++--+ > 0.5 > +-+--+-+++--+ > |label|weight| features| rawPrediction| probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-+++--+ > 1.0 > +-+--+-+++--+ > |label|weight| features| rawPrediction| probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-++---
[jira] [Assigned] (SPARK-42747) Fix incorrect internal status of LoR and AFT
[ https://issues.apache.org/jira/browse/SPARK-42747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42747: Assignee: Apache Spark > Fix incorrect internal status of LoR and AFT > > > Key: SPARK-42747 > URL: https://issues.apache.org/jira/browse/SPARK-42747 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.1.0, 3.2.0, 3.3.0, 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > > LoR and AFT applied internal status to optimize prediction/transform, but the > status is not correctly updated in some case: > {code:java} > from pyspark.sql import Row > from pyspark.ml.classification import * > from pyspark.ml.linalg import Vectors > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(0.0, 5.0)), > (0.0, 2.0, Vectors.dense(1.0, 2.0)), > (1.0, 3.0, Vectors.dense(2.0, 1.0)), > (0.0, 4.0, Vectors.dense(3.0, 3.0)), > ], > ["label", "weight", "features"], > ) > lor = LogisticRegression(weightCol="weight") > model = lor.fit(df) > # status changes 1 > for t in [0.0, 0.1, 0.2, 0.5, 1.0]: > model.setThreshold(t).transform(df) > # status changes 2 > [model.setThreshold(t).predict(Vectors.dense(0.0, 5.0)) for t in [0.0, 0.1, > 0.2, 0.5, 1.0]] > for t in [0.0, 0.1, 0.2, 0.5, 1.0]: > print(t) > model.setThreshold(t).transform(df).show() > # <- error results > {code} > results: > {code:java} > 0.0 > +-+--+-+++--+ > |label|weight| features| rawPrediction| probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-+++--+ > 0.1 > +-+--+-+++--+ > |label|weight| features| rawPrediction| probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 
2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-+++--+ > 0.2 > +-+--+-+++--+ > |label|weight| features| rawPrediction| probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-+++--+ > 0.5 > +-+--+-+++--+ > |label|weight| features| rawPrediction| probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-+++--+ > 1.0 > +-+--+-+++--+ > |label|weight| features| rawPrediction| probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-+++--+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820
[jira] [Assigned] (SPARK-42747) Fix incorrect internal status of LoR and AFT
[ https://issues.apache.org/jira/browse/SPARK-42747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42747: Assignee: (was: Apache Spark) > Fix incorrect internal status of LoR and AFT > > > Key: SPARK-42747 > URL: https://issues.apache.org/jira/browse/SPARK-42747 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.1.0, 3.2.0, 3.3.0, 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > LoR and AFT applied internal status to optimize prediction/transform, but the > status is not correctly updated in some case: > {code:java} > from pyspark.sql import Row > from pyspark.ml.classification import * > from pyspark.ml.linalg import Vectors > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(0.0, 5.0)), > (0.0, 2.0, Vectors.dense(1.0, 2.0)), > (1.0, 3.0, Vectors.dense(2.0, 1.0)), > (0.0, 4.0, Vectors.dense(3.0, 3.0)), > ], > ["label", "weight", "features"], > ) > lor = LogisticRegression(weightCol="weight") > model = lor.fit(df) > # status changes 1 > for t in [0.0, 0.1, 0.2, 0.5, 1.0]: > model.setThreshold(t).transform(df) > # status changes 2 > [model.setThreshold(t).predict(Vectors.dense(0.0, 5.0)) for t in [0.0, 0.1, > 0.2, 0.5, 1.0]] > for t in [0.0, 0.1, 0.2, 0.5, 1.0]: > print(t) > model.setThreshold(t).transform(df).show() > # <- error results > {code} > results: > {code:java} > 0.0 > +-+--+-+++--+ > |label|weight| features| rawPrediction| probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-+++--+ > 0.1 > +-+--+-+++--+ > |label|weight| features| rawPrediction| probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 
2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-+++--+ > 0.2 > +-+--+-+++--+ > |label|weight| features| rawPrediction| probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-+++--+ > 0.5 > +-+--+-+++--+ > |label|weight| features| rawPrediction| probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-+++--+ > 1.0 > +-+--+-+++--+ > |label|weight| features| rawPrediction| probability|prediction| > +-+--+-+++--+ > | 1.0| 1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...| 0.0| > | 0.0| 2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...| 0.0| > | 1.0| 3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 0.0| > | 0.0| 4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...| 0.0| > +-+--+-+++--+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) ---
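The bug pattern in SPARK-42747 is a value cached for prediction speed but not invalidated when the parameter it depends on changes, so every threshold from the loop above produces the same output. A minimal Python sketch of the failure mode and its fix (a hypothetical model, not the Spark ML implementation):

```python
class ThresholdModel:
    """Caches the decision rule derived from `threshold`.

    The buggy variant computes the rule once and keeps serving it after
    setThreshold, which is the stale-state behaviour in the ticket; the
    fix is to drop the cache whenever the parameter changes.
    """
    def __init__(self, threshold: float):
        self._threshold = threshold
        self._cached_rule = None

    def set_threshold(self, t: float) -> "ThresholdModel":
        self._threshold = t
        self._cached_rule = None  # the fix: invalidate cache on parameter change
        return self

    def predict(self, probability: float) -> float:
        if self._cached_rule is None:
            t = self._threshold
            self._cached_rule = lambda p: 1.0 if p > t else 0.0
        return self._cached_rule(probability)

m = ThresholdModel(0.5)
assert m.predict(0.52) == 1.0
assert m.set_threshold(0.9).predict(0.52) == 0.0  # without invalidation this would stay 1.0
```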
[jira] (SPARK-42691) Implement Dataset.semanticHash
[ https://issues.apache.org/jira/browse/SPARK-42691 ] jiaan.geng deleted comment on SPARK-42691: was (Author: beliefer): I will take a look! > Implement Dataset.semanticHash > -- > > Key: SPARK-42691 > URL: https://issues.apache.org/jira/browse/SPARK-42691 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Implement Dataset.semanticHash: > {code:java} > /** > * Returns a `hashCode` of the logical query plan against this [[Dataset]]. > * > * @note Unlike the standard `hashCode`, the hash is calculated against the > query plan > * simplified by tolerating the cosmetic differences such as attribute names. > * @since 3.4.0 > */ > @DeveloperApi > def semanticHash(): Int{code} > This has to be computed on the spark connect server to do this. Please extend > the > AnalyzePlanRequest and AnalyzePlanResponse messages for this. > Also make sure this works in PySpark. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42691) Implement Dataset.semanticHash
[ https://issues.apache.org/jira/browse/SPARK-42691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42691: Assignee: (was: Apache Spark) > Implement Dataset.semanticHash > -- > > Key: SPARK-42691 > URL: https://issues.apache.org/jira/browse/SPARK-42691 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Implement Dataset.semanticHash: > {code:java} > /** > * Returns a `hashCode` of the logical query plan against this [[Dataset]]. > * > * @note Unlike the standard `hashCode`, the hash is calculated against the > query plan > * simplified by tolerating the cosmetic differences such as attribute names. > * @since 3.4.0 > */ > @DeveloperApi > def semanticHash(): Int{code} > This has to be computed on the spark connect server to do this. Please extend > the > AnalyzePlanRequest and AnalyzePlanResponse messages for this. > Also make sure this works in PySpark. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42691) Implement Dataset.semanticHash
[ https://issues.apache.org/jira/browse/SPARK-42691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42691: Assignee: Apache Spark > Implement Dataset.semanticHash > -- > > Key: SPARK-42691 > URL: https://issues.apache.org/jira/browse/SPARK-42691 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Apache Spark >Priority: Major > > Implement Dataset.semanticHash: > {code:java} > /** > * Returns a `hashCode` of the logical query plan against this [[Dataset]]. > * > * @note Unlike the standard `hashCode`, the hash is calculated against the > query plan > * simplified by tolerating the cosmetic differences such as attribute names. > * @since 3.4.0 > */ > @DeveloperApi > def semanticHash(): Int{code} > This has to be computed on the spark connect server to do this. Please extend > the > AnalyzePlanRequest and AnalyzePlanResponse messages for this. > Also make sure this works in PySpark. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42691) Implement Dataset.semanticHash
[ https://issues.apache.org/jira/browse/SPARK-42691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698935#comment-17698935 ] Apache Spark commented on SPARK-42691: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40366 > Implement Dataset.semanticHash > -- > > Key: SPARK-42691 > URL: https://issues.apache.org/jira/browse/SPARK-42691 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Implement Dataset.semanticHash: > {code:java} > /** > * Returns a `hashCode` of the logical query plan against this [[Dataset]]. > * > * @note Unlike the standard `hashCode`, the hash is calculated against the > query plan > * simplified by tolerating the cosmetic differences such as attribute names. > * @since 3.4.0 > */ > @DeveloperApi > def semanticHash(): Int{code} > This has to be computed on the spark connect server to do this. Please extend > the > AnalyzePlanRequest and AnalyzePlanResponse messages for this. > Also make sure this works in PySpark. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
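A semantic hash must ignore cosmetic differences such as attribute names, which is typically achieved by renaming attributes to canonical positional ids before hashing. A toy sketch of that idea over plan strings (illustrative only, not the Catalyst canonicalization):

```python
import hashlib
import re

def semantic_hash(plan: str) -> int:
    """Hash a plan string after canonicalizing attribute names.

    Attributes are assumed to look like `#name`; each distinct one is
    replaced by a positional id so that `#a` vs `#x` no longer matters,
    while literals and operators still do.
    """
    ids = {}
    def canon(m):
        return ids.setdefault(m.group(0), f"#{len(ids)}")
    canonical = re.sub(r"#\w+", canon, plan)
    return int.from_bytes(hashlib.sha256(canonical.encode()).digest()[:4], "big")

# Same plan shape, different attribute names: equal semantic hashes.
assert semantic_hash("Project [#a] Filter (#a > 1)") == semantic_hash("Project [#x] Filter (#x > 1)")
```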
[jira] [Updated] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2
[ https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-42745: --- Description: After SPARK-40086 / SPARK-42049 the following, simple subselect expression containing query: {noformat} select (select sum(id) from t1) {noformat} fails with: {noformat} 09:48:57.645 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 3.0 (TID 3) java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch$lzycompute(BatchScanExec.scala:47) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch(BatchScanExec.scala:47) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.hashCode(BatchScanExec.scala:60) at scala.runtime.Statics.anyHash(Statics.java:122) ... at org.apache.spark.sql.catalyst.trees.TreeNode.hashCode(TreeNode.scala:249) at scala.runtime.Statics.anyHash(Statics.java:122) at scala.collection.mutable.HashTable$HashUtils.elemHashCode(HashTable.scala:416) at scala.collection.mutable.HashTable$HashUtils.elemHashCode$(HashTable.scala:416) at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:44) at scala.collection.mutable.HashTable.addEntry(HashTable.scala:149) at scala.collection.mutable.HashTable.addEntry$(HashTable.scala:148) at scala.collection.mutable.HashMap.addEntry(HashMap.scala:44) at scala.collection.mutable.HashTable.init(HashTable.scala:110) at scala.collection.mutable.HashTable.init$(HashTable.scala:89) at scala.collection.mutable.HashMap.init(HashMap.scala:44) at scala.collection.mutable.HashMap.readObject(HashMap.scala:195) ... 
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:129) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:85) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1520) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {noformat} when DSv2 is enabled. > Improved AliasAwareOutputExpression works with DSv2 > --- > > Key: SPARK-42745 > URL: https://issues.apache.org/jira/browse/SPARK-42745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.4.0 > > > After SPARK-40086 / SPARK-42049 the following, simple subselect expression > containing query: > {noformat} > select (select sum(id) from t1) > {noformat} > fails with: > {noformat} > 09:48:57.645 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 > in stage 3.0 (TID 3) > java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch$lzycompute(BatchScanExec.scala:47) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch(BatchScanExec.scala:47) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.hashCode(BatchScanExec.scala:60) > at scala.runtime.Statics.anyHash(Statics.java:122) > ... 
> at > org.apache.spark.sql.catalyst.trees.TreeNode.hashCode(TreeNode.scala:249) > at scala.runtime.Statics.anyHash(Statics.java:122) > at > scala.collection.mutable.HashTable$HashUtils.elemHashCode(HashTable.scala:416) > at > scala.collection.mutable.HashTable$HashUtils.elemHashCode$(HashTable.scala:416) > at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:44) > at scala.collection.mutable.HashTable.addEntry(HashTable.scala:149) > at scala.collection.mutable.HashTable.addEntry$(HashTable.scala:148) > at scala.collection.mutable.HashMap.addEntry(HashMap.scala:44) > at scala.collection.mutable.HashTable.init(HashTable.scala:110) > at scala.collection.mutable.HashTable.init$(HashTable.scala:89) > at scala.collection.mutable.HashMap.init(HashMap.scala:44) > at scala.collection.mutable.HashMap.readObject(HashMap.scala:195) > ... > at java.io.ObjectInp
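The NullPointerException in the trace above comes from a `hashCode` that dereferences a lazily initialized field while the object is still being deserialized (HashMap.readObject re-hashes its keys before the scan's inputs exist). A Python analogue of the hazard and a guard against it (illustrative only):

```python
class Scan:
    """__hash__ depends on a lazily supplied value.

    If `_table` is still unset mid-deserialization, an unguarded
    implementation would dereference it and fail; falling back to a
    stable identity hash keeps hashing safe until the input exists.
    """
    def __init__(self, table):
        self._table = table

    def __hash__(self) -> int:
        if self._table is None:
            return object.__hash__(self)  # guard: identity hash while unset
        return hash(self._table)

s = Scan(None)  # simulates a half-deserialized instance
_ = hash(s)     # does not crash thanks to the guard
assert hash(Scan("t1")) == hash(Scan("t1"))
```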
[jira] [Assigned] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2
[ https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42745: --- Assignee: Peter Toth > Improved AliasAwareOutputExpression works with DSv2 > --- > > Key: SPARK-42745 > URL: https://issues.apache.org/jira/browse/SPARK-42745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2
[ https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42745. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40364 [https://github.com/apache/spark/pull/40364] > Improved AliasAwareOutputExpression works with DSv2 > --- > > Key: SPARK-42745 > URL: https://issues.apache.org/jira/browse/SPARK-42745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42747) Fix incorrect internal status of LoR and AFT
Ruifeng Zheng created SPARK-42747:
-------------------------------------

             Summary: Fix incorrect internal status of LoR and AFT
                 Key: SPARK-42747
                 URL: https://issues.apache.org/jira/browse/SPARK-42747
             Project: Spark
          Issue Type: Bug
          Components: ML, PySpark
    Affects Versions: 3.3.0, 3.2.0, 3.1.0, 3.4.0
            Reporter: Ruifeng Zheng


LoR and AFT use internal status to optimize prediction/transform, but the status is not correctly updated in some cases:

{code:java}
from pyspark.sql import Row
from pyspark.ml.classification import *
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [
        (1.0, 1.0, Vectors.dense(0.0, 5.0)),
        (0.0, 2.0, Vectors.dense(1.0, 2.0)),
        (1.0, 3.0, Vectors.dense(2.0, 1.0)),
        (0.0, 4.0, Vectors.dense(3.0, 3.0)),
    ],
    ["label", "weight", "features"],
)

lor = LogisticRegression(weightCol="weight")
model = lor.fit(df)

# status changes 1
for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
    model.setThreshold(t).transform(df)

# status changes 2
[model.setThreshold(t).predict(Vectors.dense(0.0, 5.0)) for t in [0.0, 0.1, 0.2, 0.5, 1.0]]

for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
    print(t)
    model.setThreshold(t).transform(df).show()  # <- error results
{code}

results (the prediction column stays 0.0 for every threshold):

{code:java}
0.0
+-----+------+---------+--------------------+--------------------+----------+
|label|weight| features|       rawPrediction|         probability|prediction|
+-----+------+---------+--------------------+--------------------+----------+
|  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
|  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
|  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|       0.0|
|  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|       0.0|
+-----+------+---------+--------------------+--------------------+----------+

0.1
+-----+------+---------+--------------------+--------------------+----------+
|label|weight| features|       rawPrediction|         probability|prediction|
+-----+------+---------+--------------------+--------------------+----------+
|  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
|  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
|  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|       0.0|
|  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|       0.0|
+-----+------+---------+--------------------+--------------------+----------+

0.2
+-----+------+---------+--------------------+--------------------+----------+
|label|weight| features|       rawPrediction|         probability|prediction|
+-----+------+---------+--------------------+--------------------+----------+
|  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
|  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
|  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|       0.0|
|  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|       0.0|
+-----+------+---------+--------------------+--------------------+----------+

0.5
+-----+------+---------+--------------------+--------------------+----------+
|label|weight| features|       rawPrediction|         probability|prediction|
+-----+------+---------+--------------------+--------------------+----------+
|  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
|  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
|  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|       0.0|
|  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|       0.0|
+-----+------+---------+--------------------+--------------------+----------+

1.0
+-----+------+---------+--------------------+--------------------+----------+
|label|weight| features|       rawPrediction|         probability|prediction|
+-----+------+---------+--------------------+--------------------+----------+
|  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
|  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
|  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|       0.0|
|  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|       0.0|
+-----+------+---------+--------------------+--------------------+----------+
{code}

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
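The failure mode reported above is the classic stale-cache pattern: the model caches state derived from a parameter (the threshold), and one code path updates the parameter without invalidating the cache. A minimal, purely illustrative Python sketch of that pattern — the `Model` class and its methods here are invented for illustration, not Spark's actual implementation:

```python
class Model:
    """Toy model that caches a decision rule derived from its threshold."""

    def __init__(self):
        self._threshold = 0.5
        self._cached_rule = None  # derived state; must track _threshold

    def set_threshold_buggy(self, t):
        self._threshold = t  # forgets to drop the cached rule
        return self

    def set_threshold_fixed(self, t):
        self._threshold = t
        self._cached_rule = None  # invalidate derived state on change
        return self

    def predict(self, prob):
        # Lazily (re)build the decision rule from the current threshold.
        if self._cached_rule is None:
            thr = self._threshold
            self._cached_rule = lambda p: 1.0 if p > thr else 0.0
        return self._cached_rule(prob)
```

With the buggy setter, `predict` keeps using the rule built for the old threshold — exactly the behavior shown in the tables above, where every threshold yields the same predictions.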
[jira] [Resolved] (SPARK-42743) Support analyze TimestampNTZ columns
[ https://issues.apache.org/jira/browse/SPARK-42743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-42743. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40362 [https://github.com/apache/spark/pull/40362] > Support analyze TimestampNTZ columns > > > Key: SPARK-42743 > URL: https://issues.apache.org/jira/browse/SPARK-42743 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42743) Support analyze TimestampNTZ columns
[ https://issues.apache.org/jira/browse/SPARK-42743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42743: Assignee: Gengliang Wang (was: Apache Spark) > Support analyze TimestampNTZ columns > > > Key: SPARK-42743 > URL: https://issues.apache.org/jira/browse/SPARK-42743 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42743) Support analyze TimestampNTZ columns
[ https://issues.apache.org/jira/browse/SPARK-42743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42743: Assignee: Apache Spark (was: Gengliang Wang) > Support analyze TimestampNTZ columns > > > Key: SPARK-42743 > URL: https://issues.apache.org/jira/browse/SPARK-42743 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42743) Support analyze TimestampNTZ columns
[ https://issues.apache.org/jira/browse/SPARK-42743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698876#comment-17698876 ] Apache Spark commented on SPARK-42743: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/40362 > Support analyze TimestampNTZ columns > > > Key: SPARK-42743 > URL: https://issues.apache.org/jira/browse/SPARK-42743 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42691) Implement Dataset.semanticHash
[ https://issues.apache.org/jira/browse/SPARK-42691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698872#comment-17698872 ]

jiaan.geng commented on SPARK-42691:
------------------------------------

I will take a look!

> Implement Dataset.semanticHash
> ------------------------------
>
>                 Key: SPARK-42691
>                 URL: https://issues.apache.org/jira/browse/SPARK-42691
>             Project: Spark
>          Issue Type: New Feature
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Herman van Hövell
>            Priority: Major
>
> Implement Dataset.semanticHash:
> {code:java}
> /**
>  * Returns a `hashCode` of the logical query plan against this [[Dataset]].
>  *
>  * @note Unlike the standard `hashCode`, the hash is calculated against the query plan
>  * simplified by tolerating cosmetic differences such as attribute names.
>  * @since 3.4.0
>  */
> @DeveloperApi
> def semanticHash(): Int{code}
> This has to be computed on the Spark Connect server. Please extend the
> AnalyzePlanRequest and AnalyzePlanResponse messages for this.
> Also make sure this works in PySpark.

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
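The `@note` in the quoted scaladoc is the key requirement: the hash must be insensitive to cosmetic differences such as attribute names. A toy plain-Python sketch of that canonicalize-then-hash idea — the plan encoding and function name here are invented for illustration; in Spark the real computation runs server-side over the analyzed logical plan:

```python
def semantic_hash(plan):
    """Hash a toy plan (a list of (operator, [attribute_names]) tuples)
    after renaming attributes by order of first appearance, so plans that
    differ only in attribute names hash identically."""
    renames = {}
    canonical = []
    for op, attrs in plan:
        canon_attrs = []
        for a in attrs:
            renames.setdefault(a, f"attr_{len(renames)}")
            canon_attrs.append(renames[a])
        canonical.append((op, tuple(canon_attrs)))
    return hash(tuple(canonical))

# Same structure, different attribute names -> same semantic hash:
p1 = [("Project", ["a", "b"]), ("Filter", ["a"])]
p2 = [("Project", ["x", "y"]), ("Filter", ["x"])]
```

The Spark Connect version would carry this through a new field on AnalyzePlanRequest/AnalyzePlanResponse rather than computing anything client-side.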
[jira] [Created] (SPARK-42746) Add the LISTAGG() aggregate function
Max Gekk created SPARK-42746:
---------------------------------

             Summary: Add the LISTAGG() aggregate function
                 Key: SPARK-42746
                 URL: https://issues.apache.org/jira/browse/SPARK-42746
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Max Gekk


{{listagg()}} is a common and useful aggregate function that concatenates the string values in a column, optionally in a given order. The systems below already support this function:
* Oracle: https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions089.htm#SQLRF30030
* Snowflake: https://docs.snowflake.com/en/sql-reference/functions/listagg
* Amazon Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html
* Google BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#string_agg

We need to introduce this new aggregate in Spark, both as a regular aggregate and as a window function. Proposed syntax:
{code:sql}
LISTAGG( [ DISTINCT ] <expr> [, <delimiter> ] ) [ WITHIN GROUP ( <orderby_clause> ) ]
{code}

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42746) Add the LISTAGG() aggregate function
[ https://issues.apache.org/jira/browse/SPARK-42746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk updated SPARK-42746:
-----------------------------
    Description: 
{{listagg()}} is a common and useful aggregate function that concatenates the string values in a column, optionally in a given order. The systems below already support this function:
* Oracle: [https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions089.htm#SQLRF30030]
* Snowflake: [https://docs.snowflake.com/en/sql-reference/functions/listagg]
* Amazon Redshift: [https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html]
* Google BigQuery: [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#string_agg]

We need to introduce this new aggregate in Spark, both as a regular aggregate and as a window function. Proposed syntax:
{code:sql}
LISTAGG( [ DISTINCT ] <expr> [, <delimiter> ] ) [ WITHIN GROUP ( <orderby_clause> ) ]
{code}

  was:
{{listagg()}} is a common and useful aggregate function that concatenates the string values in a column, optionally in a given order. The systems below already support this function:
* Oracle: https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions089.htm#SQLRF30030
* Snowflake: https://docs.snowflake.com/en/sql-reference/functions/listagg
* Amazon Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html
* Google BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#string_agg

We need to introduce this new aggregate in Spark, both as a regular aggregate and as a window function. Proposed syntax:
{code:sql}
LISTAGG( [ DISTINCT ] <expr> [, <delimiter> ] ) [ WITHIN GROUP ( <orderby_clause> ) ]
{code}


> Add the LISTAGG() aggregate function
> ------------------------------------
>
>                 Key: SPARK-42746
>                 URL: https://issues.apache.org/jira/browse/SPARK-42746
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Max Gekk
>            Priority: Major
>
> {{listagg()}} is a common and useful aggregate function that concatenates the string values in a column, optionally in a given order. The systems below already support this function:
> * Oracle: [https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions089.htm#SQLRF30030]
> * Snowflake: [https://docs.snowflake.com/en/sql-reference/functions/listagg]
> * Amazon Redshift: [https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html]
> * Google BigQuery: [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#string_agg]
> We need to introduce this new aggregate in Spark, both as a regular aggregate and as a window function.
> Proposed syntax:
> {code:sql}
> LISTAGG( [ DISTINCT ] <expr> [, <delimiter> ] ) [ WITHIN GROUP ( <orderby_clause> ) ]
> {code}

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
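The proposed semantics — optional DISTINCT, a delimiter, and WITHIN GROUP ordering — can be sketched in plain Python (illustrative only, not Spark code; in current Spark a rough equivalent of the un-ordered form is `concat_ws` over `collect_list`):

```python
def listagg(values, delimiter=",", distinct=False, order_key=None):
    """Toy LISTAGG: concatenate non-NULL values with a delimiter,
    optionally de-duplicating and ordering (WITHIN GROUP (ORDER BY ...))."""
    vals = [v for v in values if v is not None]  # aggregates skip NULLs
    if distinct:
        seen, uniq = set(), []
        for v in vals:
            if v not in seen:
                seen.add(v)
                uniq.append(v)
        vals = uniq
    if order_key is not None:
        vals = sorted(vals, key=order_key)
    return delimiter.join(str(v) for v in vals)
```

For example, `listagg(["b", "a", "b"], "|", distinct=True, order_key=str)` corresponds to `LISTAGG(DISTINCT col, '|') WITHIN GROUP (ORDER BY col)`.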
[jira] [Updated] (SPARK-42737) Shuffle files lost with graceful decommission fallback storage enabled
[ https://issues.apache.org/jira/browse/SPARK-42737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yeachan Park updated SPARK-42737: - Description: During testing of graceful decommissioning, the driver logs indicate that shuffle files were lost - `DAGScheduler: Shuffle files lost for executor`: {code:bash} 23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 3 23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(3, 100.96.5.11, 44707, None)) as being decommissioning. 23/03/09 15:22:42 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 1 decommissioned message 23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 1 23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(1, 100.96.5.9, 44491, None)) as being decommissioning. 23/03/09 15:22:42 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 2 decommissioned message 23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 2 23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(2, 100.96.5.10, 39011, None)) as being decommissioning. 23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.5.11: Executor decommission. 23/03/09 15:22:44 INFO ExecutorMonitor: Executor 3 is removed. Remove reason statistics: (gracefully decommissioned: 1, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0). 23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 3 (epoch 0) 23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 1 on 100.96.5.9: Executor decommission. 23/03/09 15:22:44 INFO ExecutorMonitor: Executor 1 is removed. Remove reason statistics: (gracefully decommissioned: 2, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0). 23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 2 on 100.96.5.10: Executor decommission. 
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster. 23/03/09 15:22:44 INFO ExecutorMonitor: Executor 2 is removed. Remove reason statistics: (gracefully decommissioned: 3, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0). 23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, 100.96.5.11, 44707, None) 23/03/09 15:22:44 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor 23/03/09 15:22:44 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 0) 23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 1 (epoch 1) 23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster. 23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, 100.96.5.9, 44491, None) 23/03/09 15:22:45 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor 23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 1) 23/03/09 15:22:45 INFO DAGScheduler: Executor lost: 2 (epoch 2) 23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster. 23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, 100.96.5.10, 39011, None) 23/03/09 15:22:45 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor 23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 2) 23/03/09 15:22:52 INFO BlockManagerMaster: Removal of executor 1 requested 23/03/09 15:22:52 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 1 {code} The decommission logs from the executor also seems to indicate that no shuffle data was necessary to migrate: {code:java} 23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Decommission executor 1. 
23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning 23/03/09 15:22:42 INFO BlockManager: Starting block manager decommissioning process... 23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: Checking to see if we can shutdown. 23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: No running tasks, checking migrations 23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: All blocks not yet migrated. 23/03/09 15:22:43 INFO BlockManagerDecommissioner: Starting block migration 23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all RDD blocks 23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all shuffle blocks 23/03/09 15:22:43 INFO BlockManagerDecommissioner: Start refreshing migratable shuffle blocks 23/03/09 15:22:44 INFO BlockManagerDecommissioner: Attempting to migrate all cached RDD blocks 23/03/09 15:22:44 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are added. In total, 0 shuffles are remained. 23/03/09 15:22:4
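For context, driver and executor logs like the ones above come from runs with graceful decommissioning and fallback storage turned on. The reporter's exact settings are not shown in the issue; an assumed but typical configuration (these property names exist in Spark 3.1+) looks like:

```shell
spark-submit \
  --conf spark.decommission.enabled=true \
  --conf spark.storage.decommission.enabled=true \
  --conf spark.storage.decommission.shuffleBlocks.enabled=true \
  --conf spark.storage.decommission.rddBlocks.enabled=true \
  --conf spark.storage.decommission.fallbackStorage.path=s3a://some-bucket/spark-fallback/ \
  ...
```

With this setup, migratable shuffle blocks are copied to surviving executors or to the fallback path before the executor exits — which is why the "Shuffle files lost for executor" messages alongside "0 of 0 local shuffles are added" look contradictory in the report.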
[jira] [Assigned] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2
[ https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42745: Assignee: (was: Apache Spark) > Improved AliasAwareOutputExpression works with DSv2 > --- > > Key: SPARK-42745 > URL: https://issues.apache.org/jira/browse/SPARK-42745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2
[ https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698851#comment-17698851 ] Apache Spark commented on SPARK-42745: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/40364 > Improved AliasAwareOutputExpression works with DSv2 > --- > > Key: SPARK-42745 > URL: https://issues.apache.org/jira/browse/SPARK-42745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2
[ https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42745: Assignee: Apache Spark > Improved AliasAwareOutputExpression works with DSv2 > --- > > Key: SPARK-42745 > URL: https://issues.apache.org/jira/browse/SPARK-42745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: Peter Toth >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42744) delete uploaded file when job finish
[ https://issues.apache.org/jira/browse/SPARK-42744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698833#comment-17698833 ]

Hu Ziqian commented on SPARK-42744:
-----------------------------------

Added PR: https://github.com/apache/spark/pull/40363

> delete uploaded file when job finish
> ------------------------------------
>
>                 Key: SPARK-42744
>                 URL: https://issues.apache.org/jira/browse/SPARK-42744
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.1.2
>            Reporter: Hu Ziqian
>            Priority: Major
>
> On Kubernetes, Spark uses spark.kubernetes.file.upload.path to upload local
> files to a Hadoop-compatible file system, but Spark never deletes those files.
> In this issue, we add a configuration,
> spark.kubernetes.uploaded.file.delete.on.termination, to have the driver
> delete those files when the job finishes.

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2
[ https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-42745: --- Summary: Improved AliasAwareOutputExpression works with DSv2 (was: Fix NPE after recent AliasAwareOutputExpression changes) > Improved AliasAwareOutputExpression works with DSv2 > --- > > Key: SPARK-42745 > URL: https://issues.apache.org/jira/browse/SPARK-42745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42745) Fix NPE after recent AliasAwareOutputExpression changes
Peter Toth created SPARK-42745: -- Summary: Fix NPE after recent AliasAwareOutputExpression changes Key: SPARK-42745 URL: https://issues.apache.org/jira/browse/SPARK-42745 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0, 3.5.0 Reporter: Peter Toth -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42744) delete uploaded file when job finish
Hu Ziqian created SPARK-42744:
---------------------------------

             Summary: delete uploaded file when job finish
                 Key: SPARK-42744
                 URL: https://issues.apache.org/jira/browse/SPARK-42744
             Project: Spark
          Issue Type: Improvement
          Components: Kubernetes
    Affects Versions: 3.1.2
            Reporter: Hu Ziqian


On Kubernetes, Spark uses spark.kubernetes.file.upload.path to upload local files to a Hadoop-compatible file system, but Spark never deletes those files. In this issue, we add a configuration, spark.kubernetes.uploaded.file.delete.on.termination, to have the driver delete those files when the job finishes.

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
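To make the proposal concrete: spark.kubernetes.file.upload.path is an existing Spark property, while spark.kubernetes.uploaded.file.delete.on.termination is the name proposed in this issue and does not exist in released Spark. A sketch of how a submission using the proposal might look (cluster address, bucket, and file paths are placeholders):

```shell
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.file.upload.path=s3a://some-bucket/spark-upload/ \
  --conf spark.kubernetes.uploaded.file.delete.on.termination=true \
  --files /local/path/app.conf \
  local:///opt/spark/examples/jars/spark-examples.jar
```

Without the new flag, the files staged under the upload path would accumulate in the bucket after each run; with it, the driver would remove them on job termination.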
[jira] [Created] (SPARK-42743) Support analyze TimestampNTZ columns
Gengliang Wang created SPARK-42743: -- Summary: Support analyze TimestampNTZ columns Key: SPARK-42743 URL: https://issues.apache.org/jira/browse/SPARK-42743 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang Assignee: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org