[jira] [Commented] (SPARK-41266) Spark does not parse timestamp strings when using the IN operator
[ https://issues.apache.org/jira/browse/SPARK-41266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643706#comment-17643706 ]

huldar chen commented on SPARK-41266:
-------------------------------------

You can try enabling ANSI compliance:

{code:java}
spark.sql.ansi.enabled=true
{code}

> Spark does not parse timestamp strings when using the IN operator
> -----------------------------------------------------------------
>
> Key: SPARK-41266
> URL: https://issues.apache.org/jira/browse/SPARK-41266
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.1
> Environment: Windows 10, Spark 3.2.1 with Java 11
> Reporter: Laurens Versluis
> Priority: Major
>
> Likely affects more versions; tested only with 3.2.1.
>
> Summary:
> Spark will convert a timestamp string to a timestamp when using the equality operator (=), yet will not do so when using the IN operator.
>
> Details:
> While debugging why a query returned no results, we found that when the WHERE clause uses `=` on a TimestampType column, Spark converts the string literal to a timestamp and filters on it. When using the IN operator (as our query does), however, it casts the column to string instead. We expected the behavior to be consistent, or at least that Spark would recognize that the IN clause operates on a TimestampType column and attempt the timestamp conversion first, before falling back to string comparison.
>
> *Minimal reproducible example:*
> Suppose we have a one-row dataset with the following contents and schema:
>
> {noformat}
> +-------------------+
> |starttime          |
> +-------------------+
> |2019-08-11 19:33:05|
> +-------------------+
>
> root
>  |-- starttime: timestamp (nullable = true)
> {noformat}
>
> Then if we run the following queries, the IN-clause query that uses a timestamp string with timezone information returns no results:
>
> {code:java}
> // Works - Spark casts the argument to a string and the internal representation of the time happens to match it
> singleCol.filter("starttime IN ('2019-08-11 19:33:05')").show();
> // Works
> singleCol.filter("starttime = '2019-08-11 19:33:05'").show();
> // Works
> singleCol.filter("starttime = '2019-08-11T19:33:05Z'").show();
> // Doesn't work
> singleCol.filter("starttime IN ('2019-08-11T19:33:05Z')").show();
> // Works
> singleCol.filter("starttime IN (to_timestamp('2019-08-11T19:33:05Z'))").show();
> {code}
>
> We can see from the physical plan that a cast to string takes place:
>
> {noformat}
> [...] isnotnull(starttime#59),(cast(starttime#59 as string) = 2019-08-11 19:33:05)
> {noformat}
>
> Since the = operator does work, it would be consistent if operators such as IN behaved the same way.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
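The failing case above comes down to comparing the column's string rendering against the literal's text, rather than comparing instants. A plain java.time sketch of the difference (not Spark code; class name is made up, and a UTC session time zone is assumed throughout):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TimestampLiteralSketch {
    public static void main(String[] args) {
        // The stored value, rendered the way a cast to string would print it.
        Instant stored = LocalDateTime.parse("2019-08-11T19:33:05")
                .toInstant(ZoneOffset.UTC);
        String rendered = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                .withZone(ZoneOffset.UTC)
                .format(stored);

        // String comparison, as IN does after cast(starttime as string):
        // the literal's textual form differs, so no row matches.
        System.out.println(rendered.equals("2019-08-11T19:33:05Z")); // false

        // Timestamp comparison, as = does after coercing the literal:
        // both sides denote the same instant.
        System.out.println(stored.equals(Instant.parse("2019-08-11T19:33:05Z"))); // true
    }
}
```

As the comment suggests, ANSI mode (or an explicit to_timestamp) moves the IN comparison from the first case to the second by resolving the literal as a timestamp.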
[jira] [Commented] (SPARK-41001) Connection string support for Python client
[ https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643696#comment-17643696 ]

Apache Spark commented on SPARK-41001:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38931

> Connection string support for Python client
> -------------------------------------------
>
> Key: SPARK-41001
> URL: https://issues.apache.org/jira/browse/SPARK-41001
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Martin Grund
> Priority: Major
> Fix For: 3.4.0
[jira] [Commented] (SPARK-40801) Upgrade Apache Commons Text to 1.10
[ https://issues.apache.org/jira/browse/SPARK-40801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643688#comment-17643688 ]

Apache Spark commented on SPARK-40801:
--------------------------------------

User 'cutiechi' has created a pull request for this issue:
https://github.com/apache/spark/pull/38930

> Upgrade Apache Commons Text to 1.10
> -----------------------------------
>
> Key: SPARK-40801
> URL: https://issues.apache.org/jira/browse/SPARK-40801
> Project: Spark
> Issue Type: Dependency upgrade
> Components: Build
> Affects Versions: 3.4.0
> Reporter: Bjørn Jørgensen
> Assignee: Bjørn Jørgensen
> Priority: Minor
> Fix For: 3.2.3, 3.3.2, 3.4.0
>
> [CVE-2022-42889|https://nvd.nist.gov/vuln/detail/CVE-2022-42889]
[jira] [Resolved] (SPARK-34987) AQE improve: change shuffle hash join to sort merge join when skewed shuffle hash join exists
[ https://issues.apache.org/jira/browse/SPARK-34987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang resolved SPARK-34987.
---------------------------------
    Resolution: Not A Problem

> AQE improve: change shuffle hash join to sort merge join when skewed shuffle hash join exists
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-34987
> URL: https://issues.apache.org/jira/browse/SPARK-34987
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.1, 3.1.1, 3.2.0
> Reporter: exmy
> Priority: Minor
>
> In our production, `spark.sql.join.preferSortMergeJoin` is false by default. AQE currently can only optimize skewed joins for sort merge join; it would be better if we could change a shuffle hash join to a sort merge join when a skewed shuffle hash join exists.
[jira] [Commented] (SPARK-41346) Implement asc and desc methods
[ https://issues.apache.org/jira/browse/SPARK-41346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643686#comment-17643686 ]

Apache Spark commented on SPARK-41346:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38929

> Implement asc and desc methods
> ------------------------------
>
> Key: SPARK-41346
> URL: https://issues.apache.org/jira/browse/SPARK-41346
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Fix For: 3.4.0
[jira] [Commented] (SPARK-41034) Connect DataFrame should require RemoteSparkSession
[ https://issues.apache.org/jira/browse/SPARK-41034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643676#comment-17643676 ]

Apache Spark commented on SPARK-41034:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38928

> Connect DataFrame should require RemoteSparkSession
> ---------------------------------------------------
>
> Key: SPARK-41034
> URL: https://issues.apache.org/jira/browse/SPARK-41034
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-41369) Refactor connect directory structure
[ https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41369:
------------------------------------
    Assignee: Apache Spark

> Refactor connect directory structure
> ------------------------------------
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.3.2, 3.4.0
> Reporter: Venkata Sai Akhil Gudesa
> Assignee: Apache Spark
> Priority: Major
>
> Currently, `spark/connector/connect/` is a single module that contains both the "server"/service as well as the protobuf definitions. However, this module can be split into multiple modules - "server" and "common". This brings the advantage of separating out the protobuf generation from the core "server" module for efficient reuse.
[jira] [Assigned] (SPARK-41369) Refactor connect directory structure
[ https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41369:
------------------------------------
    Assignee: (was: Apache Spark)

> Refactor connect directory structure
> ------------------------------------
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.3.2, 3.4.0
> Reporter: Venkata Sai Akhil Gudesa
> Priority: Major
>
> Currently, `spark/connector/connect/` is a single module that contains both the "server"/service as well as the protobuf definitions. However, this module can be split into multiple modules - "server" and "common". This brings the advantage of separating out the protobuf generation from the core "server" module for efficient reuse.
[jira] [Reopened] (SPARK-41369) Refactor connect directory structure
[ https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reopened SPARK-41369:
----------------------------------
    Assignee: (was: Venkata Sai Akhil Gudesa)

Reverted at https://github.com/apache/spark/commit/324d0909623db5fd5abadcf5e8116a6ba1211ba2

> Refactor connect directory structure
> ------------------------------------
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.3.2, 3.4.0
> Reporter: Venkata Sai Akhil Gudesa
> Priority: Major
>
> Currently, `spark/connector/connect/` is a single module that contains both the "server"/service as well as the protobuf definitions. However, this module can be split into multiple modules - "server" and "common". This brings the advantage of separating out the protobuf generation from the core "server" module for efficient reuse.
[jira] [Updated] (SPARK-41369) Refactor connect directory structure
[ https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-41369:
---------------------------------
    Fix Version/s: (was: 3.4.0)

> Refactor connect directory structure
> ------------------------------------
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.3.2, 3.4.0
> Reporter: Venkata Sai Akhil Gudesa
> Assignee: Venkata Sai Akhil Gudesa
> Priority: Major
>
> Currently, `spark/connector/connect/` is a single module that contains both the "server"/service as well as the protobuf definitions. However, this module can be split into multiple modules - "server" and "common". This brings the advantage of separating out the protobuf generation from the core "server" module for efficient reuse.
[jira] [Resolved] (SPARK-41244) Introducing a Protobuf serializer for UI data on KV store
[ https://issues.apache.org/jira/browse/SPARK-41244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang resolved SPARK-41244.
------------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 38779
https://github.com/apache/spark/pull/38779

> Introducing a Protobuf serializer for UI data on KV store
> ---------------------------------------------------------
>
> Key: SPARK-41244
> URL: https://issues.apache.org/jira/browse/SPARK-41244
> Project: Spark
> Issue Type: Sub-task
> Components: Web UI
> Affects Versions: 3.4.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Major
> Fix For: 3.4.0
>
> Introducing a Protobuf serializer for the KV store, which is three times as fast as the default serializer according to an end-to-end benchmark against RocksDB. To move fast and make review easier, the first PR will cover only the class `JobDataWrapper`.
[jira] [Commented] (SPARK-41369) Refactor connect directory structure
[ https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643667#comment-17643667 ]

Apache Spark commented on SPARK-41369:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38927

> Refactor connect directory structure
> ------------------------------------
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.3.2, 3.4.0
> Reporter: Venkata Sai Akhil Gudesa
> Assignee: Venkata Sai Akhil Gudesa
> Priority: Major
> Fix For: 3.4.0
>
> Currently, `spark/connector/connect/` is a single module that contains both the "server"/service as well as the protobuf definitions. However, this module can be split into multiple modules - "server" and "common". This brings the advantage of separating out the protobuf generation from the core "server" module for efficient reuse.
[jira] [Created] (SPARK-41402) Override nodeName of StringDecode
Xinrong Meng created SPARK-41402:
---------------------------------

             Summary: Override nodeName of StringDecode
                 Key: SPARK-41402
                 URL: https://issues.apache.org/jira/browse/SPARK-41402
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Xinrong Meng

Override nodeName of StringDecode for clarity.
[jira] [Updated] (SPARK-41401) spark2 stagedir can't be change
[ https://issues.apache.org/jira/browse/SPARK-41401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sinlang updated SPARK-41401:
----------------------------
    Description:

I want to use a different staging directory when writing temporary data, but Spark 3 seems to only write under the table path. The spark.yarn.stagingDir parameter only takes effect with Spark 2.

In org.apache.spark.internal.io.FileCommitProtocol:

    def getStagingDir(path: String, jobId: String): Path = {
      new Path(path, ".spark-staging-" + jobId)
    }

  was:

I want to use a different staging directory when writing temporary data, but Spark 3 seems to only write under the table path. The spark.yarn.stagingDir parameter only takes effect with Spark 2.

In FileCommitProtocol:

    def getStagingDir(path: String, jobId: String): Path = {
      new Path(path, ".spark-staging-" + jobId)
    }

> spark2 stagedir can't be change
> -------------------------------
>
> Key: SPARK-41401
> URL: https://issues.apache.org/jira/browse/SPARK-41401
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.2, 3.2.3
> Reporter: sinlang
> Priority: Major
>
> I want to use a different staging directory when writing temporary data, but Spark 3 seems to only write under the table path. The spark.yarn.stagingDir parameter only takes effect with Spark 2.
>
> In org.apache.spark.internal.io.FileCommitProtocol:
>
>     def getStagingDir(path: String, jobId: String): Path = {
>       new Path(path, ".spark-staging-" + jobId)
>     }
[jira] [Updated] (SPARK-41401) spark2 stagedir can't be change
[ https://issues.apache.org/jira/browse/SPARK-41401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sinlang updated SPARK-41401:
----------------------------
    Description:

I want to use a different staging directory when writing temporary data, but Spark 3 seems to only write under the table path. The spark.yarn.stagingDir parameter only takes effect with Spark 2.

In FileCommitProtocol:

    def getStagingDir(path: String, jobId: String): Path = {
      new Path(path, ".spark-staging-" + jobId)
    }

  was:

I want to use a different staging directory when writing temporary data, but Spark 3 seems to only write under the table path. The spark.yarn.stagingDir parameter only takes effect with Spark 2.

!image-2022-12-06-11-31-29-723.png!

> spark2 stagedir can't be change
> -------------------------------
>
> Key: SPARK-41401
> URL: https://issues.apache.org/jira/browse/SPARK-41401
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.2, 3.2.3
> Reporter: sinlang
> Priority: Major
>
> I want to use a different staging directory when writing temporary data, but Spark 3 seems to only write under the table path. The spark.yarn.stagingDir parameter only takes effect with Spark 2.
>
> In FileCommitProtocol:
>
>     def getStagingDir(path: String, jobId: String): Path = {
>       new Path(path, ".spark-staging-" + jobId)
>     }
[jira] [Created] (SPARK-41401) spark2 stagedir can't be change
sinlang created SPARK-41401:
----------------------------

             Summary: spark2 stagedir can't be change
                 Key: SPARK-41401
                 URL: https://issues.apache.org/jira/browse/SPARK-41401
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.2, 3.2.3
            Reporter: sinlang

I want to use a different staging directory when writing temporary data, but Spark 3 seems to only write under the table path. The spark.yarn.stagingDir parameter only takes effect with Spark 2.

!image-2022-12-06-11-31-29-723.png!
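The getStagingDir snippet quoted in this report makes the behavior easy to see: the staging directory is derived from the output path alone. A minimal java.nio sketch of that derivation (illustrative only; the class name, table path, and job id are hypothetical, and java.nio.file.Path stands in for Hadoop's Path):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class StagingDirSketch {
    // Mirrors the quoted getStagingDir: the staging directory is always
    // resolved as a child of the output (table) path, and no staging-dir
    // configuration such as spark.yarn.stagingDir is consulted here.
    static Path getStagingDir(String path, String jobId) {
        return Paths.get(path, ".spark-staging-" + jobId);
    }

    public static void main(String[] args) {
        // Hypothetical table path and job id, for illustration only.
        System.out.println(getStagingDir("/warehouse/db/tbl", "job-0001"));
    }
}
```

Whatever staging root is configured elsewhere, this code path places the temporary directory under the table path itself, which matches the reported behavior.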
[jira] [Commented] (SPARK-41247) Unify the protobuf versions in Spark connect and protobuf connector
[ https://issues.apache.org/jira/browse/SPARK-41247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643638#comment-17643638 ]

Apache Spark commented on SPARK-41247:
--------------------------------------

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38926

> Unify the protobuf versions in Spark connect and protobuf connector
> -------------------------------------------------------------------
>
> Key: SPARK-41247
> URL: https://issues.apache.org/jira/browse/SPARK-41247
> Project: Spark
> Issue Type: Task
> Components: Build, SQL
> Affects Versions: 3.4.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Minor
> Fix For: 3.4.0
>
> Make the two versions consistent.
[jira] [Resolved] (SPARK-41399) Refactor column related tests to test_connect_column
[ https://issues.apache.org/jira/browse/SPARK-41399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-41399.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 38925
https://github.com/apache/spark/pull/38925

> Refactor column related tests to test_connect_column
> ----------------------------------------------------
>
> Key: SPARK-41399
> URL: https://issues.apache.org/jira/browse/SPARK-41399
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
> Fix For: 3.4.0
[jira] [Updated] (SPARK-41399) Refactor column related tests to test_connect_column
[ https://issues.apache.org/jira/browse/SPARK-41399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-41399:
----------------------------------
    Component/s: Tests

> Refactor column related tests to test_connect_column
> ----------------------------------------------------
>
> Key: SPARK-41399
> URL: https://issues.apache.org/jira/browse/SPARK-41399
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, Tests
> Affects Versions: 3.4.0
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-41399) Refactor column related tests to test_connect_column
[ https://issues.apache.org/jira/browse/SPARK-41399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-41399:
-------------------------------------
    Assignee: Rui Wang

> Refactor column related tests to test_connect_column
> ----------------------------------------------------
>
> Key: SPARK-41399
> URL: https://issues.apache.org/jira/browse/SPARK-41399
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
[jira] [Commented] (SPARK-41399) Refactor column related tests to test_connect_column
[ https://issues.apache.org/jira/browse/SPARK-41399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643614#comment-17643614 ]

Apache Spark commented on SPARK-41399:
--------------------------------------

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38925

> Refactor column related tests to test_connect_column
> ----------------------------------------------------
>
> Key: SPARK-41399
> URL: https://issues.apache.org/jira/browse/SPARK-41399
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Rui Wang
> Priority: Major
[jira] [Assigned] (SPARK-41399) Refactor column related tests to test_connect_column
[ https://issues.apache.org/jira/browse/SPARK-41399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41399:
------------------------------------
    Assignee: (was: Apache Spark)

> Refactor column related tests to test_connect_column
> ----------------------------------------------------
>
> Key: SPARK-41399
> URL: https://issues.apache.org/jira/browse/SPARK-41399
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Rui Wang
> Priority: Major
[jira] [Assigned] (SPARK-41399) Refactor column related tests to test_connect_column
[ https://issues.apache.org/jira/browse/SPARK-41399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41399:
------------------------------------
    Assignee: Apache Spark

> Refactor column related tests to test_connect_column
> ----------------------------------------------------
>
> Key: SPARK-41399
> URL: https://issues.apache.org/jira/browse/SPARK-41399
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Rui Wang
> Assignee: Apache Spark
> Priority: Major
[jira] [Created] (SPARK-41400) Split of API classes from Catalyst
Herman van Hövell created SPARK-41400:
--------------------------------------

             Summary: Split of API classes from Catalyst
                 Key: SPARK-41400
                 URL: https://issues.apache.org/jira/browse/SPARK-41400
             Project: Spark
          Issue Type: Task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Herman van Hövell

For the Spark Connect Scala Client we need a couple of classes that currently reside in Catalyst to be moved to a new sql/api project. Concretely, the following classes will be moved:
* Row
* DataType (the entire hierarchy)
* Encoder
[jira] [Created] (SPARK-41399) Refactor column related tests to test_connect_column
Rui Wang created SPARK-41399:
-----------------------------

             Summary: Refactor column related tests to test_connect_column
                 Key: SPARK-41399
                 URL: https://issues.apache.org/jira/browse/SPARK-41399
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Rui Wang
[jira] [Assigned] (SPARK-41369) Refactor connect directory structure
[ https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hövell reassigned SPARK-41369:
-----------------------------------------
    Assignee: Venkata Sai Akhil Gudesa

> Refactor connect directory structure
> ------------------------------------
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.3.2, 3.4.0
> Reporter: Venkata Sai Akhil Gudesa
> Assignee: Venkata Sai Akhil Gudesa
> Priority: Major
> Fix For: 3.4.0
>
> Currently, `spark/connector/connect/` is a single module that contains both the "server"/service as well as the protobuf definitions. However, this module can be split into multiple modules - "server" and "common". This brings the advantage of separating out the protobuf generation from the core "server" module for efficient reuse.
[jira] [Resolved] (SPARK-41369) Refactor connect directory structure
[ https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hövell resolved SPARK-41369.
---------------------------------------
    Fix Version/s: 3.4.0
       Resolution: Resolved

> Refactor connect directory structure
> ------------------------------------
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.3.2, 3.4.0
> Reporter: Venkata Sai Akhil Gudesa
> Priority: Major
> Fix For: 3.4.0
>
> Currently, `spark/connector/connect/` is a single module that contains both the "server"/service as well as the protobuf definitions. However, this module can be split into multiple modules - "server" and "common". This brings the advantage of separating out the protobuf generation from the core "server" module for efficient reuse.
[jira] [Commented] (SPARK-41398) Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match
[ https://issues.apache.org/jira/browse/SPARK-41398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643611#comment-17643611 ]

Apache Spark commented on SPARK-41398:
--------------------------------------

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/38924

> Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match
> ------------------------------------------------------------------------------------------------------
>
> Key: SPARK-41398
> URL: https://issues.apache.org/jira/browse/SPARK-41398
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Chao Sun
> Priority: Major
[jira] [Assigned] (SPARK-41398) Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match
[ https://issues.apache.org/jira/browse/SPARK-41398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41398:
------------------------------------
    Assignee: Apache Spark

> Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match
> ------------------------------------------------------------------------------------------------------
>
> Key: SPARK-41398
> URL: https://issues.apache.org/jira/browse/SPARK-41398
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Chao Sun
> Assignee: Apache Spark
> Priority: Major
[jira] [Assigned] (SPARK-41398) Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match
[ https://issues.apache.org/jira/browse/SPARK-41398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41398:
------------------------------------
    Assignee: (was: Apache Spark)

> Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match
> ------------------------------------------------------------------------------------------------------
>
> Key: SPARK-41398
> URL: https://issues.apache.org/jira/browse/SPARK-41398
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Chao Sun
> Priority: Major
[jira] [Commented] (SPARK-41398) Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match
[ https://issues.apache.org/jira/browse/SPARK-41398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643610#comment-17643610 ] Apache Spark commented on SPARK-41398: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/38924
[jira] [Created] (SPARK-41398) Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match
Chao Sun created SPARK-41398: Summary: Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match Key: SPARK-41398 URL: https://issues.apache.org/jira/browse/SPARK-41398 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.1 Reporter: Chao Sun
[jira] [Assigned] (SPARK-41397) Implement part of string/binary functions
[ https://issues.apache.org/jira/browse/SPARK-41397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41397: Assignee: Apache Spark > Implement part of string/binary functions > - > > Key: SPARK-41397 > URL: https://issues.apache.org/jira/browse/SPARK-41397 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-41397) Implement part of string/binary functions
[ https://issues.apache.org/jira/browse/SPARK-41397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643594#comment-17643594 ] Apache Spark commented on SPARK-41397: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/38921
[jira] [Assigned] (SPARK-41397) Implement part of string/binary functions
[ https://issues.apache.org/jira/browse/SPARK-41397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41397: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-41397) Implement part of String/Binary functions
Xinrong Meng created SPARK-41397: Summary: Implement part of String/Binary functions Key: SPARK-41397 URL: https://issues.apache.org/jira/browse/SPARK-41397 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng
[jira] [Updated] (SPARK-41397) Implement part of string/binary functions
[ https://issues.apache.org/jira/browse/SPARK-41397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-41397: - Summary: Implement part of string/binary functions (was: Implement part of String/Binary functions)
[jira] [Commented] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data
[ https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643588#comment-17643588 ] Apache Spark commented on SPARK-41395: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/38923 > InterpretedMutableProjection can corrupt unsafe buffer when used with decimal > data > -- > > Key: SPARK-41395 > URL: https://issues.apache.org/jira/browse/SPARK-41395 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1, 3.2.3, 3.4.0 >Reporter: Bruce Robbins >Priority: Major > > The following returns the wrong answer: > {noformat} > set spark.sql.codegen.wholeStage=false; > set spark.sql.codegen.factoryMode=NO_CODEGEN; > select max(col1), max(col2) from values > (cast(null as decimal(27,2)), cast(null as decimal(27,2))), > (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2))) > as data(col1, col2); > +-+-+ > |max(col1)|max(col2)| > +-+-+ > |null |239.88 | > +-+-+ > {noformat} > This is because {{InterpretedMutableProjection}} inappropriately uses > {{InternalRow#setNullAt}} to set null for decimal types with precision > > {{Decimal.MAX_LONG_DIGITS}}. > The path to corruption goes like this: > Unsafe buffer at start: > {noformat} > offset/len for offset/len for > 1st decimal 2nd decimal > offset: 0   8   16 (0x10)   24 (0x18)   32 (0x20) > data: 0300 1800 2800 > > {noformat} > When processing the first incoming row ([null, null]), > {{InterpretedMutableProjection}} calls {{setNullAt}} for the decimal types. > As a result, the pointers to the storage areas for the two decimals in the > variable length region get zeroed out. 
> Buffer after projecting first row (null, null): > {noformat} > offset/len for offset/len for > 1st decimal 2nd decimal > offset: 0   8   16 (0x10)   24 (0x18)   32 (0x20) > data: 0300 > > {noformat} > When it's time to project the second row into the buffer, > UnsafeRow#setDecimal uses the zero offsets, which causes > {{UnsafeRow#setDecimal}} to overwrite the null-tracking bit set with decimal > data: > {noformat} > null-tracking > bit area > offset: 0   8   16 (0x10)   24 (0x18)   32 (0x20) > data: 5db4 0200 > > {noformat} > The null-tracking bit set is overwritten with 239.88 (0x5db4) rather than > 245.00 (0x5fb4) because setDecimal indirectly calls setNotNullAt(1), which > turns off the null-tracking bit associated with the field at index 1. > In addition, the decimal at field index 0 is now null because of the > corruption of the null-tracking bit set. > When a decimal type with precision > {{Decimal.MAX_LONG_DIGITS}} is null, > {{InterpretedMutableProjection}} should write a null {{Decimal}} value rather > than call {{setNullAt}} (see.) > This bug could get exercised during codegen fallback. Take for example this > case where I forced codegen to fail for the {{Greatest}} expression: > {noformat} > spark-sql> select max(col1), max(col2) from values > (cast(null as decimal(27,2)), cast(null as decimal(27,2))), > (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2))) > as data(col1, col2); > 22/12/05 08:18:54 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 1: ';' expected instead of 'if' > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 1: ';' expected instead of 'if' > at > org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362) > at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:149) > at org.codehaus.janino.Parser.read(Parser.java:3787) > ... 
> 22/12/05 08:18:56 WARN MutableProjection: Expr codegen error and falling back > to interpreter mode > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 43, Column 1: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 43, Column 1: ';' expected instead of 'boolea
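The corruption mechanism described above can be reproduced outside Spark with a toy model of the row layout: one 8-byte null-tracking word, one fixed-length word per field packing (offset << 32 | length) for variable-length values, then the variable-length region. The sketch below is plain Python and a deliberate simplification — the names, word layout, and 16-byte decimal payload are illustrative assumptions, not Spark's actual UnsafeRow code. It shows how zeroing a field's offset/length word on null makes a later decimal write land on the null-tracking word, and how keeping the pointer while setting only the null bit avoids the corruption.

```python
import struct

WORD = 8
NUM_FIELDS = 2
BITSET = 0                          # word 0: null-tracking bits
FIXED = WORD                        # words 1..NUM_FIELDS: (offset << 32) | length
VAR = FIXED + NUM_FIELDS * WORD     # variable-length region (decimal bytes)

def new_buffer():
    buf = bytearray((1 + NUM_FIELDS + 2 * NUM_FIELDS) * WORD)
    for i in range(NUM_FIELDS):
        off = VAR + i * 2 * WORD    # where field i's 16-byte decimal lives
        struct.pack_into("<q", buf, FIXED + i * WORD, (off << 32) | 16)
    return buf

def set_null_at(buf, i):
    # Buggy path: sets the null bit AND zeroes the offset/length word,
    # losing the pointer into the variable-length region.
    bits = struct.unpack_from("<q", buf, BITSET)[0]
    struct.pack_into("<q", buf, BITSET, bits | (1 << i))
    struct.pack_into("<q", buf, FIXED + i * WORD, 0)

def write_null_decimal(buf, i):
    # Fixed path: set the null bit but leave the offset/length word intact,
    # so a later non-null write still targets the right location.
    bits = struct.unpack_from("<q", buf, BITSET)[0]
    struct.pack_into("<q", buf, BITSET, bits | (1 << i))

def set_decimal(buf, i, payload16):
    # Clears the null bit, then writes the bytes at the stored offset.
    bits = struct.unpack_from("<q", buf, BITSET)[0]
    struct.pack_into("<q", buf, BITSET, bits & ~(1 << i))
    off = struct.unpack_from("<q", buf, FIXED + i * WORD)[0] >> 32
    buf[off:off + 16] = payload16

payload = (24500).to_bytes(16, "little")        # 245.00 scaled by 10^2

buggy = new_buffer()
set_null_at(buggy, 0); set_null_at(buggy, 1)    # project row 1: (null, null)
set_decimal(buggy, 1, payload)                  # project row 2, field 1
# The zeroed offset made the 16 decimal bytes land at offset 0, i.e. on
# top of the null-tracking word; correct bits would be 0b01.
corrupted_bits = struct.unpack_from("<q", buggy, BITSET)[0]

ok = new_buffer()
write_null_decimal(ok, 0); write_null_decimal(ok, 1)
set_decimal(ok, 1, payload)
ok_bits = struct.unpack_from("<q", ok, BITSET)[0]  # only field 0 still null
```

In the buggy buffer the bitset word ends up holding the low bytes of the decimal payload (0x5fb4 here), mirroring how the reporter's null-tracking bit set was clobbered; in the fixed buffer the bitset correctly reads 0b01 after the second row.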
[jira] [Assigned] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data
[ https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41395: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data
[ https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643587#comment-17643587 ] Apache Spark commented on SPARK-41395: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/38923
[jira] [Assigned] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data
[ https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41395: Assignee: Apache Spark
[jira] [Resolved] (SPARK-41394) Skip MemoryProfilerTests when pandas is not installed
[ https://issues.apache.org/jira/browse/SPARK-41394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-41394. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38920 [https://github.com/apache/spark/pull/38920] > Skip MemoryProfilerTests when pandas is not installed > - > > Key: SPARK-41394 > URL: https://issues.apache.org/jira/browse/SPARK-41394 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.4.0 >
[jira] [Assigned] (SPARK-41394) Skip MemoryProfilerTests when pandas is not installed
[ https://issues.apache.org/jira/browse/SPARK-41394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41394: - Assignee: Dongjoon Hyun
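Skipping a suite when an optional dependency such as pandas is absent follows a standard unittest pattern; a minimal sketch (the class and test names here are illustrative, not the actual PySpark MemoryProfilerTests code):

```python
import unittest

# Probe for the optional dependency once at import time.
try:
    import pandas  # noqa: F401
    have_pandas = True
except ImportError:
    have_pandas = False

@unittest.skipIf(not have_pandas, "pandas is not installed")
class MemoryProfilerLikeTests(unittest.TestCase):
    # Placeholder body; the real tests would exercise profiling code
    # that requires pandas.
    def test_profile(self):
        self.assertTrue(True)
```

When pandas is missing, the whole class is reported as skipped rather than erroring out during collection, which is what the ticket asks for.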
[jira] [Updated] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data
[ https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-41395: -- Description: expanded the corruption walk-through and codegen-fallback example; the example session ends with: {noformat} NULL	239.88 <== incorrect result, should be (77.77, 245.00) Time taken: 6.132 seconds, Fetched 1 row(s) spark-sql> {noformat}
[jira] [Commented] (SPARK-41396) Oneof field support and recursive fields
[ https://issues.apache.org/jira/browse/SPARK-41396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643568#comment-17643568 ] Apache Spark commented on SPARK-41396: -- User 'SandishKumarHN' has created a pull request for this issue: https://github.com/apache/spark/pull/38922 > Oneof field support and recursive fields > > > Key: SPARK-41396 > URL: https://issues.apache.org/jira/browse/SPARK-41396 > Project: Spark > Issue Type: Task > Components: Protobuf >Affects Versions: 2.3.0 >Reporter: Sandish Kumar HN >Priority: Major > > we should add support for protobuf OneOf fields to Spark-Protobuf. This will > involve implementing logic to detect when a protobuf message contains a OneOf > field, and to handle it appropriately when using from_protobuf and > to_protobuf. > we should add unit tests to ensure that the implementation of protobuf OneOf > field support is correct. > Users can use protobuf OneOf fields with Spark-protobuf, making it more > complete and useful for processing protobuf data.
[jira] [Assigned] (SPARK-41396) Oneof field support and recursive fields
[ https://issues.apache.org/jira/browse/SPARK-41396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41396: Assignee: Apache Spark
[jira] [Commented] (SPARK-41396) Oneof field support and recursive fields
[ https://issues.apache.org/jira/browse/SPARK-41396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643567#comment-17643567 ] Apache Spark commented on SPARK-41396: -- User 'SandishKumarHN' has created a pull request for this issue: https://github.com/apache/spark/pull/38922
[jira] [Assigned] (SPARK-41396) Oneof field support and recursive fields
[ https://issues.apache.org/jira/browse/SPARK-41396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41396: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-41396) Oneof field support and recursive fields
Sandish Kumar HN created SPARK-41396: - Summary: Oneof field support and recursive fields Key: SPARK-41396 URL: https://issues.apache.org/jira/browse/SPARK-41396 Project: Spark Issue Type: Task Components: Protobuf Affects Versions: 2.3.0 Reporter: Sandish Kumar HN we should add support for protobuf OneOf fields to Spark-Protobuf. This will involve implementing logic to detect when a protobuf message contains a OneOf field, and to handle it appropriately when using from_protobuf and to_protobuf. we should add unit tests to ensure that the implementation of protobuf OneOf field support is correct. Users can use protobuf OneOf fields with Spark-protobuf, making it more complete and useful for processing protobuf data.
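The ticket above asks for OneOf handling in `from_protobuf`/`to_protobuf`. As a rough, Spark-free sketch of the semantics involved (the helper and field names here are hypothetical illustrations, not spark-protobuf's actual API): a protobuf `oneof` guarantees at most one member field is set, so a natural row encoding is one nullable column per member, with only the set member non-null.

```python
# Conceptual sketch in plain Python (no Spark/protobuf dependency).
# A "oneof" allows at most one of its member fields to be set; when
# flattened into a row-like struct, every member becomes a nullable
# column and only the field that was actually set carries a value.
def oneof_to_row(message: dict, oneof_fields: list) -> dict:
    """Map a decoded message onto a struct with one column per oneof member."""
    set_fields = [f for f in oneof_fields if f in message]
    if len(set_fields) > 1:
        # Real protobuf decoders enforce this invariant for us.
        raise ValueError("a oneof can have at most one field set")
    return {f: message.get(f) for f in oneof_fields}

# A message whose oneof {phone, email} had "email" set:
row = oneof_to_row({"email": "a@b.c"}, ["phone", "email"])
```

In this encoding, downstream code distinguishes which branch of the oneof was set by checking which column is non-null.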
[jira] [Updated] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data
[ https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-41395: -- Affects Version/s: 3.3.1 > InterpretedMutableProjection can corrupt unsafe buffer when used with decimal > data > -- > > Key: SPARK-41395 > URL: https://issues.apache.org/jira/browse/SPARK-41395 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1, 3.2.3, 3.4.0 >Reporter: Bruce Robbins >Priority: Major > > The following returns the wrong answer: > {noformat} > set spark.sql.codegen.wholeStage=false; > set spark.sql.codegen.factoryMode=NO_CODEGEN; > select max(col1), max(col2) from values > (cast(null as decimal(27,2)), cast(null as decimal(27,2))), > (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2))) > as data(col1, col2); > +-+-+ > |max(col1)|max(col2)| > +-+-+ > |null |239.88 | > +-+-+ > {noformat} > This is because {{InterpretedMutableProjection}} inappropriately uses > {{InternalRow#setNullAt}} to set null for decimal types with precision > > {{Decimal.MAX_LONG_DIGITS}}. > The path to corruption goes like this: > Unsafe buffer at start: > {noformat} > offset/len for offset/len for > 1st decimal 2nd decimal > offset: 0816 (0x10)24 (0x18) > 32 (0x20) > data: 0300 1800 2800 > > {noformat} > When processing the first incoming row ([null, null]), > {{InterpretedMutableProjection}} calls {{setNullAt}} for the decimal types. > As a result, the pointers to the storage areas for the two decimals in the > variable length region get zeroed out. 
> Buffer after projecting first row (null, null): > {noformat} > offset/len for offset/len for > 1st decimal 2nd decimal > offset: 0816 (0x10)24 (0x18) > 32 (0x20) > data: 0300 > > {noformat} > When it's time to project the second row into the buffer, > UnsafeRow#setDecimal uses the zero offsets, which causes > {{UnsafeRow#setDecimal}} to overwrite the null-tracking bit set with decimal > data: > {noformat} > null-tracking > bit area > offset: 0816 (0x10)24 (0x18) > 32 (0x20) > data: 5db4 0200 > > {noformat} > The null-tracking bit set is overwritten with 239.88 (0x5db4) rather than > 245.00 (0x5fb4) because setDecimal indirectly calls setNotNullAt(1), which > turns off the null-tracking bit associated with the field at index 1. > In addition, the decimal at field index 0 is now null because of the > corruption of the null-tracking bit set. > When a decimal type with precision > {{Decimal.MAX_LONG_DIGITS}} is null, > {{InterpretedMutableProjection}} should write a null {{Decimal}} value rather > than call {{setNullAt}} (see.) > This bug could get exercised during codegen fallback. Take for example this > case where I forcibly made codegen fail for the {{Greatest}} expression: > {noformat} > spark-sql> select max(col1), max(col2) from values > (cast(null as decimal(27,2)), cast(null as decimal(27,2))), > (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2))) > as data(col1, col2); > 22/12/05 08:18:54 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 1: ';' expected instead of 'if' > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 1: ';' expected instead of 'if' > at > org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362) > at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:149) > at org.codehaus.janino.Parser.read(Parser.java:3787) > ... 
> 22/12/05 08:18:56 WARN MutableProjection: Expr codegen error and falling back > to interpreter mode > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 43, Column 1: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 43, Column 1: ';' expected instead of 'boolean' > at > com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) > at > com.google
[jira] [Updated] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data
[ https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-41395: -- Affects Version/s: 3.2.3 > InterpretedMutableProjection can corrupt unsafe buffer when used with decimal > data > -- > > Key: SPARK-41395 > URL: https://issues.apache.org/jira/browse/SPARK-41395 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3, 3.4.0 >Reporter: Bruce Robbins >Priority: Major > > The following returns the wrong answer: > {noformat} > set spark.sql.codegen.wholeStage=false; > set spark.sql.codegen.factoryMode=NO_CODEGEN; > select max(col1), max(col2) from values > (cast(null as decimal(27,2)), cast(null as decimal(27,2))), > (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2))) > as data(col1, col2); > +-+-+ > |max(col1)|max(col2)| > +-+-+ > |null |239.88 | > +-+-+ > {noformat} > This is because {{InterpretedMutableProjection}} inappropriately uses > {{InternalRow#setNullAt}} to set null for decimal types with precision > > {{Decimal.MAX_LONG_DIGITS}}. > The path to corruption goes like this: > Unsafe buffer at start: > {noformat} > offset/len for offset/len for > 1st decimal 2nd decimal > offset: 0816 (0x10)24 (0x18) > 32 (0x20) > data: 0300 1800 2800 > > {noformat} > When processing the first incoming row ([null, null]), > {{InterpretedMutableProjection}} calls {{setNullAt}} for the decimal types. > As a result, the pointers to the storage areas for the two decimals in the > variable length region get zeroed out. 
> Buffer after projecting first row (null, null): > {noformat} > offset/len for offset/len for > 1st decimal 2nd decimal > offset: 0816 (0x10)24 (0x18) > 32 (0x20) > data: 0300 > > {noformat} > When it's time to project the second row into the buffer, > UnsafeRow#setDecimal uses the zero offsets, which causes > {{UnsafeRow#setDecimal}} to overwrite the null-tracking bit set with decimal > data: > {noformat} > null-tracking > bit area > offset: 0816 (0x10)24 (0x18) > 32 (0x20) > data: 5db4 0200 > > {noformat} > The null-tracking bit set is overwritten with 239.88 (0x5db4) rather than > 245.00 (0x5fb4) because setDecimal indirectly calls setNotNullAt(1), which > turns off the null-tracking bit associated with the field at index 1. > In addition, the decimal at field index 0 is now null because of the > corruption of the null-tracking bit set. > When a decimal type with precision > {{Decimal.MAX_LONG_DIGITS}} is null, > {{InterpretedMutableProjection}} should write a null {{Decimal}} value rather > than call {{setNullAt}} (see.) > This bug could get exercised during codegen fallback. Take for example this > case where I forcibly made codegen fail for the {{Greatest}} expression: > {noformat} > spark-sql> select max(col1), max(col2) from values > (cast(null as decimal(27,2)), cast(null as decimal(27,2))), > (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2))) > as data(col1, col2); > 22/12/05 08:18:54 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 1: ';' expected instead of 'if' > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 1: ';' expected instead of 'if' > at > org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362) > at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:149) > at org.codehaus.janino.Parser.read(Parser.java:3787) > ... 
> 22/12/05 08:18:56 WARN MutableProjection: Expr codegen error and falling back > to interpreter mode > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 43, Column 1: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 43, Column 1: ';' expected instead of 'boolean' > at > com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) > at > com.google.common
[jira] [Resolved] (SPARK-41390) Update the script used to generate register function in UDFRegistration
[ https://issues.apache.org/jira/browse/SPARK-41390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41390. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38916 [https://github.com/apache/spark/pull/38916] > Update the script used to generate register function in UDFRegistration > > > Key: SPARK-41390 > URL: https://issues.apache.org/jira/browse/SPARK-41390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > SPARK-35065 uses {{QueryCompilationErrors.invalidFunctionArgumentsError}} > instead of {{throw new AnalysisException(...)}} for the {{register}} function in > {{{}UDFRegistration{}}}, but the script used to generate xx has not been > updated, so this PR updates the script.
[jira] [Assigned] (SPARK-41390) Update the script used to generate register function in UDFRegistration
[ https://issues.apache.org/jira/browse/SPARK-41390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41390: Assignee: Yang Jie > Update the script used to generate register function in UDFRegistration > > > Key: SPARK-41390 > URL: https://issues.apache.org/jira/browse/SPARK-41390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > SPARK-35065 uses {{QueryCompilationErrors.invalidFunctionArgumentsError}} > instead of {{throw new AnalysisException(...)}} for the {{register}} function in > {{{}UDFRegistration{}}}, but the script used to generate xx has not been > updated, so this PR updates the script.
[jira] [Updated] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data
[ https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-41395: -- Description: The following returns the wrong answer: {noformat} set spark.sql.codegen.wholeStage=false; set spark.sql.codegen.factoryMode=NO_CODEGEN; select max(col1), max(col2) from values (cast(null as decimal(27,2)), cast(null as decimal(27,2))), (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2))) as data(col1, col2); +-+-+ |max(col1)|max(col2)| +-+-+ |null |239.88 | +-+-+ {noformat} This is because {{InterpretedMutableProjection}} inappropriately uses {{InternalRow#setNullAt}} to set null for decimal types with precision > {{Decimal.MAX_LONG_DIGITS}}. The path to corruption goes like this: Unsafe buffer at start: {noformat} offset/len for offset/len for 1st decimal 2nd decimal offset: 0816 (0x10)24 (0x18)32 (0x20) data: 0300 1800 2800 {noformat} When processing the first incoming row ([null, null]), {{InterpretedMutableProjection}} calls {{setNullAt}} for the decimal types. As a result, the pointers to the storage areas for the two decimals in the variable length region get zeroed out. Buffer after projecting first row (null, null): {noformat} offset/len for offset/len for 1st decimal 2nd decimal offset: 0816 (0x10)24 (0x18)32 (0x20) data: 0300 {noformat} When it's time to project the second row into the buffer, UnsafeRow#setDecimal uses the zero offsets, which causes {{UnsafeRow#setDecimal}} to overwrite the null-tracking bit set with decimal data: {noformat} null-tracking bit area offset: 0816 (0x10)24 (0x18)32 (0x20) data: 5db4 0200 {noformat} The null-tracking bit set is overwritten with 239.88 (0x5db4) rather than 245.00 (0x5fb4) because setDecimal indirectly calls setNotNullAt(1), which turns off the null-tracking bit associated with the field at index 1. In addition, the decimal at field index 0 is now null because of the corruption of the null-tracking bit set. 
When a decimal type with precision > {{Decimal.MAX_LONG_DIGITS}} is null, {{InterpretedMutableProjection}} should write a null {{Decimal}} value rather than call {{setNullAt}} (see.) This bug could get exercised during codegen fallback. Take for example this case where I forcibly made codegen fail for the {{Greatest}} expression: {noformat} spark-sql> select max(col1), max(col2) from values (cast(null as decimal(27,2)), cast(null as decimal(27,2))), (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2))) as data(col1, col2); 22/12/05 08:18:54 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 58, Column 1: ';' expected instead of 'if' org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 58, Column 1: ';' expected instead of 'if' at org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362) at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:149) at org.codehaus.janino.Parser.read(Parser.java:3787) ... 22/12/05 08:18:56 WARN MutableProjection: Expr codegen error and falling back to interpreter mode java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: ';' expected instead of 'boolean' at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1583) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1580) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) ... 36 more ... 
NULL 239.88 <== incorrect result, should be (77.77, 245.00) Time taken: 6.132 seconds, Fetched 1 row(s) spark-sql> {noformat}
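The corruption path described in this ticket can be illustrated with a toy model (plain Python, not Spark's actual UnsafeRow code; the word layout and values are simplified for illustration): `setNullAt` zeroes the fixed-width (offset, length) slot for a large decimal, and a later `setDecimal` writes through that stale zero offset, landing on the null-tracking word at offset 0.

```python
# Toy simulation of the bug mechanism described above (NOT Spark's code).
# The row is modeled as a list of "words": word 0 is the null-tracking
# bit set, words 1-2 hold (offset, length) pointers for two large
# decimals, and words 3-4 are the variable-length region.
class ToyRow:
    def __init__(self):
        self.words = [0b00, (3, 1), (4, 1), 0, 0]

    def set_null_at(self, i):
        # Mimics InternalRow#setNullAt: sets the null bit AND zeroes the
        # fixed-width slot -- for a large decimal that slot is the
        # (offset, length) pointer, which is the root of the bug.
        self.words[0] |= 1 << i
        self.words[1 + i] = (0, 0)

    def set_decimal(self, i, value):
        # Mimics UnsafeRow#setDecimal: clears the null bit for field i,
        # then writes through the STORED offset -- which is now 0, i.e.
        # the null-tracking word itself gets overwritten with decimal data.
        self.words[0] &= ~(1 << i)
        offset, _ = self.words[1 + i]
        self.words[offset] = value

row = ToyRow()
row.set_null_at(0)          # project (null, null): pointers get zeroed
row.set_null_at(1)
row.set_decimal(0, 7777)    # project the second row (77.77, 245.00)
row.set_decimal(1, 24500)   # ... through the stale zero offsets
```

After the second projection, the null-tracking word contains decimal data instead of null bits, and the decimal bytes never reach the variable-length region, matching the corrupted buffer shown in the description.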
[jira] [Created] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data
Bruce Robbins created SPARK-41395: - Summary: InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data Key: SPARK-41395 URL: https://issues.apache.org/jira/browse/SPARK-41395 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Bruce Robbins The following returns the wrong answer: {noformat} set spark.sql.codegen.wholeStage=false; set spark.sql.codegen.factoryMode=NO_CODEGEN; select max(col1), max(col2) from values (cast(null as decimal(27,2)), cast(null as decimal(27,2))), (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2))) as data(col1, col2); +-+-+ |max(col1)|max(col2)| +-+-+ |null |239.88 | +-+-+ {noformat} This is because {{InterpretedMutableProjection}} inappropriately uses {{InternalRow#setNullAt}} to set null for decimal types with precision > {{Decimal.MAX_LONG_DIGITS}}. The path to corruption goes like this: Unsafe buffer at start: {noformat} offset/len for offset/len for 1st decimal 2nd decimal offset: 0816 (0x10)24 (0x18)32 (0x20) data: 0300 1800 2800 {noformat} When processing the first incoming row ([null, null]), {{InterpretedMutableProjection}} calls {{setNullAt}} for the decimal types. As a result, the pointers to the storage areas for the two decimals in the variable length region get zeroed out. Buffer after projecting first row (null, null): {noformat} offset/len for offset/len for 1st decimal 2nd decimal offset: 0816 (0x10)24 (0x18)32 (0x20) data: 0300 {noformat} When it's time to project the second row into the buffer, UnsafeRow#setDecimal uses the zero offsets, which causes {{UnsafeRow#setDecimal}} to overwrite the null-tracking bit set with decimal data: {noformat} null-tracking bit area offset: 0816 (0x10)24 (0x18)32 (0x20) data: 5db4 0200 {noformat} The null-tracking bit set is overwritten with 239.88 (0x5db4) rather than 245.00 (0x5fb4) because setDecimal indirectly calls setNotNullAt(1), which turns off the null-tracking bit associated with the field at index 1. 
In addition, the decimal at field index 0 is now null because of the corruption of the null-tracking bit set. When a decimal type with precision > {{Decimal.MAX_LONG_DIGITS}} is null, {{InterpretedMutableProjection}} should write a null {{Decimal}} value rather than call {{setNullAt}} (see.) This bug could get exercised during codegen fallback. Take for example this case where I forcibly made codegen fail for the {{Greatest}} expression: {noformat} spark-sql> select max(col1), max(col2) from values (cast(null as decimal(27,2)), cast(null as decimal(27,2))), (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2))) as data(col1, col2); 22/12/05 08:18:54 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 58, Column 1: ';' expected instead of 'if' org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 58, Column 1: ';' expected instead of 'if' at org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362) at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:149) at org.codehaus.janino.Parser.read(Parser.java:3787) ... 
22/12/05 08:18:56 WARN MutableProjection: Expr codegen error and falling back to interpreter mode java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: ';' expected instead of 'boolean' at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1583) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1580) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at com.goog
[jira] [Comment Edited] (SPARK-18502) Spark does not handle columns that contain backquote (`)
[ https://issues.apache.org/jira/browse/SPARK-18502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643524#comment-17643524 ] Bjørn Jørgensen edited comment on SPARK-18502 at 12/5/22 7:36 PM: -- I just answered this problem in u...@spark.org df = spark.createDataFrame( [("china", "asia"), ("colombia", "south america`")], ["country", "continent`"] ) df.show() {code:java} // ++--+ | country|continent`| ++--+ | china| asia| |colombia|south america`| ++--+ {code} df.select("continent`").show(1) (...) AnalysisException: Syntax error in attribute name: continent`. clean_df = df.toDF(*(c.replace('`', '_') for c in df.columns)) clean_df.show() {code:java} // ++--+ | country|continent_| ++--+ | china| asia| |colombia|south america`| ++--+ {code} clean_df.select("continent_").show(2) {code:java} // +--+ |continent_| +--+ | asia| |south america`| +--+ {code} Examples are from [MungingData Avoiding Dots / Periods in PySpark Column Names|https://mungingdata.com/pyspark/avoid-dots-periods-column-names/] was (Author: bjornjorgensen): I just answered this problem in u...@spark.org df = spark.createDataFrame( [("china", "asia"), ("colombia", "south america`")], ["country", "continent`"] ) df.show() ++--+ | country| continent`| ++--+ | china| asia| |colombia|south america`| ++--+ df.select("continent`").show(1) (...)AnalysisException: Syntax error in attribute name: continent`. 
clean_df = df.toDF(*(c.replace('`', '_') for c in df.columns)) clean_df.show() ++--+ | country| continent_| ++--+ | china| asia| |colombia|south america`| ++--+ clean_df.select("continent_").show(2) +--+ | continent_| +--+ | asia| |south america`| +--+ Examples are from [MungingData Avoiding Dots / Periods in PySpark Column Names|https://mungingdata.com/pyspark/avoid-dots-periods-column-names/] > Spark does not handle columns that contain backquote (`) > > > Key: SPARK-18502 > URL: https://issues.apache.org/jira/browse/SPARK-18502 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Barry Becker >Priority: Minor > Labels: bulk-closed > > I know that if a column contains dots or hyphens we can put > backquotes/backticks around it, but what if the column contains a backtick > (`)? Can the back tick be escaped by some means? > Here is an example of the sort of error I see > {code} > org.apache.spark.sql.AnalysisException: syntax error in attribute name: > `Invoice`Date`;org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99) > > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:109) > > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.quotedString(unresolved.scala:90) > org.apache.spark.sql.Column.(Column.scala:113) > org.apache.spark.sql.Column$.apply(Column.scala:36) > org.apache.spark.sql.functions$.min(functions.scala:407) > com.mineset.spark.vizagg.vizbin.strategies.DateBinStrategy.getDateExtent(DateBinStrategy.scala:158) > > {code}
[jira] [Commented] (SPARK-18502) Spark does not handle columns that contain backquote (`)
[ https://issues.apache.org/jira/browse/SPARK-18502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643524#comment-17643524 ] Bjørn Jørgensen commented on SPARK-18502: - I just answered this problem in u...@spark.org df = spark.createDataFrame( [("china", "asia"), ("colombia", "south america`")], ["country", "continent`"] ) df.show() ++--+ | country| continent`| ++--+ | china| asia| |colombia|south america`| ++--+ df.select("continent`").show(1) (...)AnalysisException: Syntax error in attribute name: continent`. clean_df = df.toDF(*(c.replace('`', '_') for c in df.columns)) clean_df.show() ++--+ | country| continent_| ++--+ | china| asia| |colombia|south america`| ++--+ clean_df.select("continent_").show(2) +--+ | continent_| +--+ | asia| |south america`| +--+ Examples are from [MungingData Avoiding Dots / Periods in PySpark Column Names|https://mungingdata.com/pyspark/avoid-dots-periods-column-names/] > Spark does not handle columns that contain backquote (`) > > > Key: SPARK-18502 > URL: https://issues.apache.org/jira/browse/SPARK-18502 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Barry Becker >Priority: Minor > Labels: bulk-closed > > I know that if a column contains dots or hyphens we can put > backquotes/backticks around it, but what if the column contains a backtick > (`)? Can the back tick be escaped by some means? 
> Here is an example of the sort of error I see > {code} > org.apache.spark.sql.AnalysisException: syntax error in attribute name: > `Invoice`Date`;org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99) > > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:109) > > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.quotedString(unresolved.scala:90) > org.apache.spark.sql.Column.(Column.scala:113) > org.apache.spark.sql.Column$.apply(Column.scala:36) > org.apache.spark.sql.functions$.min(functions.scala:407) > com.mineset.spark.vizagg.vizbin.strategies.DateBinStrategy.getDateExtent(DateBinStrategy.scala:158) > > {code}
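The workaround in the comments above boils down to rewriting every column name before selecting. The same rename logic in plain Python (Spark-free sketch; `sanitize_columns` is an illustrative helper, applied in PySpark as `df.toDF(*sanitize_columns(df.columns))`):

```python
# Spark cannot parse a backtick inside an attribute name, so the
# workaround is to rename the offending columns up front. This is the
# rename step by itself, with no Spark dependency.
def sanitize_columns(columns, bad="`", replacement="_"):
    """Replace a character Spark cannot parse in attribute names."""
    return [c.replace(bad, replacement) for c in columns]

cleaned = sanitize_columns(["country", "continent`"])
```

Note this only fixes the column *names*; cell values such as "south america`" are untouched, which matches the behavior shown in the comment above.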
[jira] [Commented] (SPARK-41394) Skip MemoryProfilerTests when pandas is not installed
[ https://issues.apache.org/jira/browse/SPARK-41394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643510#comment-17643510 ] Apache Spark commented on SPARK-41394: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/38920 > Skip MemoryProfilerTests when pandas is not installed > - > > Key: SPARK-41394 > URL: https://issues.apache.org/jira/browse/SPARK-41394 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor >
[jira] [Assigned] (SPARK-41394) Skip MemoryProfilerTests when pandas is not installed
[ https://issues.apache.org/jira/browse/SPARK-41394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41394: Assignee: (was: Apache Spark) > Skip MemoryProfilerTests when pandas is not installed > - > > Key: SPARK-41394 > URL: https://issues.apache.org/jira/browse/SPARK-41394 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor >
[jira] [Commented] (SPARK-41394) Skip MemoryProfilerTests when pandas is not installed
[ https://issues.apache.org/jira/browse/SPARK-41394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643509#comment-17643509 ] Apache Spark commented on SPARK-41394: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/38920 > Skip MemoryProfilerTests when pandas is not installed > - > > Key: SPARK-41394 > URL: https://issues.apache.org/jira/browse/SPARK-41394 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor >
[jira] [Assigned] (SPARK-41394) Skip MemoryProfilerTests when pandas is not installed
[ https://issues.apache.org/jira/browse/SPARK-41394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41394: Assignee: Apache Spark > Skip MemoryProfilerTests when pandas is not installed > - > > Key: SPARK-41394 > URL: https://issues.apache.org/jira/browse/SPARK-41394 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor >
[jira] [Created] (SPARK-41394) Skip MemoryProfilerTests when pandas is not installed
Dongjoon Hyun created SPARK-41394: - Summary: Skip MemoryProfilerTests when pandas is not installed Key: SPARK-41394 URL: https://issues.apache.org/jira/browse/SPARK-41394 Project: Spark Issue Type: Sub-task Components: PySpark, Tests Affects Versions: 3.4.0 Reporter: Dongjoon Hyun
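A common way to implement the skip this ticket asks for is a module-level import probe combined with unittest's skip decorators. A generic sketch of that pattern (the class and test names are illustrative, not necessarily PySpark's actual test code):

```python
# Make a test class tolerate an optional dependency: probe for the
# package at import time, then skip the whole class if it is missing.
import unittest

try:
    import pandas  # noqa: F401
    have_pandas = True
except ImportError:
    have_pandas = False

@unittest.skipUnless(have_pandas, "pandas is not installed")
class MemoryProfilerTests(unittest.TestCase):
    def test_profile(self):
        # Placeholder body; the real tests would exercise the profiler.
        self.assertTrue(True)
```

When pandas is absent, the whole class is reported as skipped rather than erroring out at import time, which is exactly the failure mode this ticket addresses.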
[jira] [Resolved] (SPARK-39257) use spark.read.jdbc() to read data from SQL database into dataframe, it fails silently, when the session is killed from SQL server side
[ https://issues.apache.org/jira/browse/SPARK-39257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta resolved SPARK-39257. --- Resolution: Not A Problem The issue is caused by the mssql-jdbc driver and is fixed in version 12.1.0 by PR [1942|https://github.com/microsoft/mssql-jdbc/pull/1942] > use spark.read.jdbc() to read data from SQL database into dataframe, it fails > silently, when the session is killed from SQL server side > -- > > Key: SPARK-39257 > URL: https://issues.apache.org/jira/browse/SPARK-39257 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.2, 3.2.1 > Environment: {*}Spark version{*}: spark 3.0.1/3.1.2/3.2.1 > *Microsoft JDBC Driver* *for SQL server:* > mssql-jdbc-8.2.1.jre8/mssql-jdbc-10.2.1.jre8.jar >Reporter: Xinran Tao >Priority: Major > > I'm using *spark.read.jdbc()* to read from a SQL database into a dataframe, > which utilizes the *Microsoft JDBC Driver* *for SQL server* to get data from the > SQL server. 
> *codes:* > > {code:java} > %scala > val token = "xxx" > val jdbcHostname = "xinrandatabseserver.database.windows.net" > val jdbcDatabase = "xinranSQLDatabase" > val jdbcPort = 1433 > val jdbcUrl = > "jdbc:sqlserver://%s:%s;databaseName=%s;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net".format(jdbcHostname, > jdbcPort, jdbcDatabase)+ ";accessToken=" > import java.util.Properties > val connectionProperties = new Properties() > val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver" > connectionProperties.setProperty("Driver", driverClass) > connectionProperties.setProperty("accesstoken", token) > val sql_pushdown = "(select UNITS from payment_balance_new) emp_alias" > val df_stripe_dispute = spark.read.option("connectRetryCount", > 200).option("numPartitions",1).jdbc(url=jdbcUrl, table=sql_pushdown, > properties=connectionProperties) > df_stripe_dispute.count() > {code} > > > The session was accidentally killed by some automatic scripts from SQL server > side, but no errors show up from the Spark side, and no failure was observed. > But from the count() result, the records are far fewer than they should be. > > If I'm directly using the *Microsoft JDBC Driver* *for SQL server* to run the > query and print the data out, which doesn't involve spark, there would be a > connection reset error thrown out. 
> *codes:* > > {code:java} > %scala > import java.sql.DriverManager > import java.sql.Connection > import java.util.Properties; > val jdbcHostname = "xinrandatabseserver.database.windows.net" > val jdbcDatabase = "xinranSQLDatabase" > val jdbcPort = "1433" > val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver" > val token = "" > val jdbcUrl = > "jdbc:sqlserver://%s:%s;databaseName=%s;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net".format(jdbcHostname, > jdbcPort, jdbcDatabase)+ ";accessToken="+token > > var connection:Connection = null > val info:Properties = new Properties(); > info.setProperty("accesstoken", token); > > // make the connection > Class.forName(driver) > connection = DriverManager.getConnection(jdbcUrl,info ) > // create the statement, and run the select query > val statement = connection.createStatement() > val resultSet = statement.executeQuery("select UNITS from > payment_balance_new") > while ( resultSet.next() ) { > println("__"+resultSet.getString(1)) > } > {code} > > *errors:* > > {code:java} > com.microsoft.sqlserver.jdbc.SQLServerException: Connection reset > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:2998) > at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2034) at > com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6446) at > com.microsoft.sqlserver.jdbc.TDSReader.nextPacket(IOBuffer.java:6396) at > com.microsoft.sqlserver.jdbc.TDSReader.ensurePayload(IOBuffer.java:6374) at > com.microsoft.sqlserver.jdbc.TDSReader.readBytes(IOBuffer.java:6675) at > com.microsoft.sqlserver.jdbc.TDSReader.readWrappedBytes(IOBuffer.java:6696) > at com.microsoft.sqlserver.jdbc.TDSReader.readInt(IOBuffer.java:6645) at > com.microsoft.sqlserver.jdbc.TDSReader.readUnsignedInt(IOBuffer.java:6659) at > com.microsoft.sqlserver.jdbc.PLPInputStream.readBytesInternal(PLPInputStream.java:309) > at > 
com.microsoft.sqlserver.jdbc.PLPInputStream.getBytes(PLPInputStream.java:105) > at com.microsoft.sqlserver.jdbc.DDC.convertStreamToObject(DDC.java:757) at > com.microsoft.sqlserver.jdbc.ServerDTVImpl.getValue(dtv.java:3748) at > com.microsoft.sqlserver.jdbc.DTV.getValue(dtv.java:247) at > com.microsoft.sqlserver.jdbc.Column.ge
[jira] [Commented] (SPARK-39257) use spark.read.jdbc() to read data from SQL database into dataframe, it fails silently when the session is killed from SQL server side
[ https://issues.apache.org/jira/browse/SPARK-39257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643501#comment-17643501 ] Sandeep Katta commented on SPARK-39257: --- Closing this jira as this is fixed by *mssql-jdbc [1942|https://github.com/microsoft/mssql-jdbc/pull/1942]* > use spark.read.jdbc() to read data from SQL database into dataframe, it fails > silently when the session is killed from SQL server side > -- > > Key: SPARK-39257 > URL: https://issues.apache.org/jira/browse/SPARK-39257 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.2, 3.2.1 > Environment: {*}Spark version{*}: spark 3.0.1/3.1.2/3.2.1 > *Microsoft JDBC Driver* *for SQL server:* > mssql-jdbc-8.2.1.jre8/mssql-jdbc-10.2.1.jre8.jar >Reporter: Xinran Tao >Priority: Major > > I'm using *spark.read.jdbc()* to read from a SQL database into a dataframe, > which utilizes the *Microsoft JDBC Driver* *for SQL server* to get data from the > SQL server. > *codes:* > > {code:java} > %scala > val token = "xxx" > val jdbcHostname = "xinrandatabseserver.database.windows.net" > val jdbcDatabase = "xinranSQLDatabase" > val jdbcPort = 1433 > val jdbcUrl = > "jdbc:sqlserver://%s:%s;databaseName=%s;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net".format(jdbcHostname, > jdbcPort, jdbcDatabase)+ ";accessToken=" > import java.util.Properties > val connectionProperties = new Properties() > val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver" > connectionProperties.setProperty("Driver", driverClass) > connectionProperties.setProperty("accesstoken", token) > val sql_pushdown = "(select UNITS from payment_balance_new) emp_alias" > val df_stripe_dispute = spark.read.option("connectRetryCount", > 200).option("numPartitions",1).jdbc(url=jdbcUrl, table=sql_pushdown, > properties=connectionProperties) > df_stripe_dispute.count() > {code} > > > The session was accidentally killed by some automatic scripts
from SQL server > side, but no errors show up on the Spark side and no failure was observed. > But judging by the count() result, the records returned are far fewer than > they should be. > > If I use the *Microsoft JDBC Driver* *for SQL Server* directly to run the > query and print the data out, without involving Spark, a > connection reset error is thrown. > *codes:* > > {code:java} > %scala > import java.sql.DriverManager > import java.sql.Connection > import java.util.Properties; > val jdbcHostname = "xinrandatabseserver.database.windows.net" > val jdbcDatabase = "xinranSQLDatabase" > val jdbcPort = "1433" > val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver" > val token = "" > val jdbcUrl = > "jdbc:sqlserver://%s:%s;databaseName=%s;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net".format(jdbcHostname, > jdbcPort, jdbcDatabase)+ ";accessToken="+token > > var connection:Connection = null > val info:Properties = new Properties(); > info.setProperty("accesstoken", token); > > // make the connection > Class.forName(driver) > connection = DriverManager.getConnection(jdbcUrl,info ) > // create the statement, and run the select query > val statement = connection.createStatement() > val resultSet = statement.executeQuery("select UNITS from > payment_balance_new") > while ( resultSet.next() ) { > println("__"+resultSet.getString(1)) > } > {code} > > *errors:* > > {code:java} > com.microsoft.sqlserver.jdbc.SQLServerException: Connection reset > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:2998) > at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2034) at > com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6446) at > com.microsoft.sqlserver.jdbc.TDSReader.nextPacket(IOBuffer.java:6396) at > com.microsoft.sqlserver.jdbc.TDSReader.ensurePayload(IOBuffer.java:6374) at > com.microsoft.sqlserver.jdbc.TDSReader.readBytes(IOBuffer.java:6675) at >
com.microsoft.sqlserver.jdbc.TDSReader.readWrappedBytes(IOBuffer.java:6696) > at com.microsoft.sqlserver.jdbc.TDSReader.readInt(IOBuffer.java:6645) at > com.microsoft.sqlserver.jdbc.TDSReader.readUnsignedInt(IOBuffer.java:6659) at > com.microsoft.sqlserver.jdbc.PLPInputStream.readBytesInternal(PLPInputStream.java:309) > at > com.microsoft.sqlserver.jdbc.PLPInputStream.getBytes(PLPInputStream.java:105) > at com.microsoft.sqlserver.jdbc.DDC.convertStreamToObject(DDC.java:757) at > com.microsoft.sqlserver.jdbc.ServerDTVImpl.getValue(dtv.java:3748) at > com.microsoft.sqlserver.jdbc.DTV.getValue(dtv.java:247) at > com.microsoft.sqlserver.jdbc.Column.getValu
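The failure mode the reporter describes — a result set that ends early without surfacing an error — can be illustrated with a minimal, hypothetical sketch. This is not Spark's or the driver's actual code path; the names and row counts below are made up for illustration. A consumer that simply counts whatever the iterator yields reports a smaller total with no exception when the error is swallowed, but fails loudly when it is surfaced:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class SilentTruncation {

    // Pretends to serve `total` rows, but the underlying "connection" dies
    // once `dieAt` rows have been served. Depending on `surfaceError`, the
    // failure either raises an exception or silently ends the stream.
    static Iterator<Integer> flakySource(int total, int dieAt, boolean surfaceError) {
        return new Iterator<Integer>() {
            int served = 0;
            public boolean hasNext() {
                if (served >= dieAt && served < total) {
                    if (surfaceError) throw new RuntimeException("Connection reset");
                    return false; // silently end the stream early
                }
                return served < total;
            }
            public Integer next() {
                if (!hasNext()) throw new NoSuchElementException();
                return served++;
            }
        };
    }

    static long count(Iterator<Integer> rows) {
        long n = 0;
        while (rows.hasNext()) { rows.next(); n++; }
        return n;
    }

    public static void main(String[] args) {
        // Silent failure: 100 rows expected, only 40 counted, no error raised.
        System.out.println(count(flakySource(100, 40, false)));
        // Surfaced failure: the same dead connection throws instead.
        try {
            count(flakySource(100, 40, true));
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The mssql-jdbc fix referenced above moves the behaviour from the first case to the second, so the consumer (Spark) sees the failure instead of an undercount.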
[jira] [Commented] (SPARK-41392) spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin
[ https://issues.apache.org/jira/browse/SPARK-41392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643492#comment-17643492 ] Steve Loughran commented on SPARK-41392: MBP m1 with {code} uname -a Darwin stevel-MBP16 21.6.0 Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:56 PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T6000 arm64 {code} java 8 {code} java -version openjdk version "1.8.0_322" OpenJDK Runtime Environment (Zulu 8.60.0.21-CA-macos-aarch64) (build 1.8.0_322-b06) OpenJDK 64-Bit Server VM (Zulu 8.60.0.21-CA-macos-aarch64) (build 25.322-b06, mixed mode) {code} build/mvn invokes homebrew maven which I run at -T 1 as sometimes the build hangs (maven bug, presumably) {code} build/mvn -v Using `mvn` from path: /opt/homebrew/bin/mvn Apache Maven 3.8.6 (84538c9988a25aec085021c365c560670ad80f63) Maven home: /opt/homebrew/Cellar/maven/3.8.6/libexec Java version: 1.8.0_322, vendor: Azul Systems, Inc., runtime: /Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home/jre Default locale: en_GB, platform encoding: UTF-8 OS name: "mac os x", version: "12.6.1", arch: "aarch64", family: "mac" {code} this setup works with older hadoop releases (inc the forthcoming 3.3.5), somehow the plugin can't cope with the trunk release > spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin > --- > > Key: SPARK-41392 > URL: https://issues.apache.org/jira/browse/SPARK-41392 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Steve Loughran >Priority: Minor > > on hadoop trunk (but not the 3.3.x line), spark builds fail with a CNFE > {code} > net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: > org/bouncycastle/jce/provider/BouncyCastleProvider > {code} > full stack > {code} > [ERROR] Failed to execute goal > net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile > (scala-test-compile-first) on project spark-sql_2.12: Execution > scala-test-compile-first of goal > 
net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile failed: A required > class was missing while executing > net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: > org/bouncycastle/jce/provider/BouncyCastleProvider > [ERROR] - > [ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.7.2 > [ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy > [ERROR] urls[0] = > file:/Users/stevel/.m2/repository/net/alchim31/maven/scala-maven-plugin/4.7.2/scala-maven-plugin-4.7.2.jar > [ERROR] urls[1] = > file:/Users/stevel/.m2/repository/org/apache/maven/shared/maven-dependency-tree/3.2.0/maven-dependency-tree-3.2.0.jar > [ERROR] urls[2] = > file:/Users/stevel/.m2/repository/org/eclipse/aether/aether-util/1.0.0.v20140518/aether-util-1.0.0.v20140518.jar > [ERROR] urls[3] = > file:/Users/stevel/.m2/repository/org/apache/maven/reporting/maven-reporting-api/3.1.1/maven-reporting-api-3.1.1.jar > [ERROR] urls[4] = > file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-sink-api/1.11.1/doxia-sink-api-1.11.1.jar > [ERROR] urls[5] = > file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-logging-api/1.11.1/doxia-logging-api-1.11.1.jar > [ERROR] urls[6] = > file:/Users/stevel/.m2/repository/org/apache/maven/maven-archiver/3.6.0/maven-archiver-3.6.0.jar > [ERROR] urls[7] = > file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-io/3.4.0/plexus-io-3.4.0.jar > [ERROR] urls[8] = > file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-interpolation/1.26/plexus-interpolation-1.26.jar > [ERROR] urls[9] = > file:/Users/stevel/.m2/repository/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar > [ERROR] urls[10] = > file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-utils/3.4.2/plexus-utils-3.4.2.jar > [ERROR] urls[11] = > file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-archiver/4.5.0/plexus-archiver-4.5.0.jar > [ERROR] urls[12] = > 
file:/Users/stevel/.m2/repository/commons-io/commons-io/2.11.0/commons-io-2.11.0.jar > [ERROR] urls[13] = > file:/Users/stevel/.m2/repository/org/apache/commons/commons-compress/1.21/commons-compress-1.21.jar > [ERROR] urls[14] = > file:/Users/stevel/.m2/repository/org/iq80/snappy/snappy/0.4/snappy-0.4.jar > [ERROR] urls[15] = > file:/Users/stevel/.m2/repository/org/tukaani/xz/1.9/xz-1.9.jar > [ERROR] urls[16] = > file:/Users/stevel/.m2/repository/com/github/luben/zstd-jni/1.5.2-4/zstd-jni-1.5.2-4.jar > [ERROR] urls[17] = > file:/Users/stevel/.m2/repository/org/scala-sbt/zinc_2.13/1.7.1/zinc_2.13-1.7.1.jar > [ERROR] urls[18] = > file:/Users/stevel/.m2/repository/org/scala-lang/scala-library/2.13.8/scala-library-2.13.8.jar > [ERROR] urls[19] = > file:/Users/st
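A common workaround when a Maven plugin's classrealm is missing a class (the CNFE above names `org/bouncycastle/jce/provider/BouncyCastleProvider` inside the scala-maven-plugin realm) is to declare the missing artifact as a plugin-level dependency in the affected module's pom. A hypothetical sketch — the BouncyCastle coordinates and version here are assumptions, not a confirmed fix for this build:

```xml
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <version>4.7.2</version>
  <dependencies>
    <!-- Supply the class the plugin's SelfFirstStrategy realm cannot find. -->
    <dependency>
      <groupId>org.bouncycastle</groupId>
      <artifactId>bcprov-jdk15on</artifactId>
      <version>1.70</version>
    </dependency>
  </dependencies>
</plugin>
```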
[jira] [Commented] (SPARK-41372) Support DataFrame TempView
[ https://issues.apache.org/jira/browse/SPARK-41372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643473#comment-17643473 ] Xinrong Meng commented on SPARK-41372: -- Resolved by https://github.com/apache/spark/pull/38891. > Support DataFrame TempView > -- > > Key: SPARK-41372 > URL: https://issues.apache.org/jira/browse/SPARK-41372 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41372) Support DataFrame TempView
[ https://issues.apache.org/jira/browse/SPARK-41372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-41372. -- Resolution: Resolved > Support DataFrame TempView > -- > > Key: SPARK-41372 > URL: https://issues.apache.org/jira/browse/SPARK-41372 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
[ https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643461#comment-17643461 ] Apache Spark commented on SPARK-40419: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/38919 > Integrate Grouped Aggregate Pandas UDFs into *.sql test cases > - > > Key: SPARK-40419 > URL: https://issues.apache.org/jira/browse/SPARK-40419 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > We ported Python UDF, Scala UDF and Scalar Pandas UDF into SQL test cases > from SPARK-27921, but Grouped Aggregate Pandas UDF is not tested from SQL at > all. > We should also leverage this to test pandas aggregate UDFs too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41392) spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin
[ https://issues.apache.org/jira/browse/SPARK-41392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643414#comment-17643414 ] Yang Jie commented on SPARK-41392: -- Can you give the complete compilation command and the compilation tools used? For example, java version and maven version > spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin > --- > > Key: SPARK-41392 > URL: https://issues.apache.org/jira/browse/SPARK-41392 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Steve Loughran >Priority: Minor > > on hadoop trunk (but not the 3.3.x line), spark builds fail with a CNFE > {code} > net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: > org/bouncycastle/jce/provider/BouncyCastleProvider > {code} > full stack > {code} > [ERROR] Failed to execute goal > net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile > (scala-test-compile-first) on project spark-sql_2.12: Execution > scala-test-compile-first of goal > net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile failed: A required > class was missing while executing > net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: > org/bouncycastle/jce/provider/BouncyCastleProvider > [ERROR] - > [ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.7.2 > [ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy > [ERROR] urls[0] = > file:/Users/stevel/.m2/repository/net/alchim31/maven/scala-maven-plugin/4.7.2/scala-maven-plugin-4.7.2.jar > [ERROR] urls[1] = > file:/Users/stevel/.m2/repository/org/apache/maven/shared/maven-dependency-tree/3.2.0/maven-dependency-tree-3.2.0.jar > [ERROR] urls[2] = > file:/Users/stevel/.m2/repository/org/eclipse/aether/aether-util/1.0.0.v20140518/aether-util-1.0.0.v20140518.jar > [ERROR] urls[3] = > file:/Users/stevel/.m2/repository/org/apache/maven/reporting/maven-reporting-api/3.1.1/maven-reporting-api-3.1.1.jar > [ERROR] urls[4] = > 
file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-sink-api/1.11.1/doxia-sink-api-1.11.1.jar > [ERROR] urls[5] = > file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-logging-api/1.11.1/doxia-logging-api-1.11.1.jar > [ERROR] urls[6] = > file:/Users/stevel/.m2/repository/org/apache/maven/maven-archiver/3.6.0/maven-archiver-3.6.0.jar > [ERROR] urls[7] = > file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-io/3.4.0/plexus-io-3.4.0.jar > [ERROR] urls[8] = > file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-interpolation/1.26/plexus-interpolation-1.26.jar > [ERROR] urls[9] = > file:/Users/stevel/.m2/repository/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar > [ERROR] urls[10] = > file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-utils/3.4.2/plexus-utils-3.4.2.jar > [ERROR] urls[11] = > file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-archiver/4.5.0/plexus-archiver-4.5.0.jar > [ERROR] urls[12] = > file:/Users/stevel/.m2/repository/commons-io/commons-io/2.11.0/commons-io-2.11.0.jar > [ERROR] urls[13] = > file:/Users/stevel/.m2/repository/org/apache/commons/commons-compress/1.21/commons-compress-1.21.jar > [ERROR] urls[14] = > file:/Users/stevel/.m2/repository/org/iq80/snappy/snappy/0.4/snappy-0.4.jar > [ERROR] urls[15] = > file:/Users/stevel/.m2/repository/org/tukaani/xz/1.9/xz-1.9.jar > [ERROR] urls[16] = > file:/Users/stevel/.m2/repository/com/github/luben/zstd-jni/1.5.2-4/zstd-jni-1.5.2-4.jar > [ERROR] urls[17] = > file:/Users/stevel/.m2/repository/org/scala-sbt/zinc_2.13/1.7.1/zinc_2.13-1.7.1.jar > [ERROR] urls[18] = > file:/Users/stevel/.m2/repository/org/scala-lang/scala-library/2.13.8/scala-library-2.13.8.jar > [ERROR] urls[19] = > file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-core_2.13/1.7.1/zinc-core_2.13-1.7.1.jar > [ERROR] urls[20] = > file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-apiinfo_2.13/1.7.1/zinc-apiinfo_2.13-1.7.1.jar > [ERROR] urls[21] = > 
file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-bridge_2.13/1.7.1/compiler-bridge_2.13-1.7.1.jar > [ERROR] urls[22] = > file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-classpath_2.13/1.7.1/zinc-classpath_2.13-1.7.1.jar > [ERROR] urls[23] = > file:/Users/stevel/.m2/repository/org/scala-lang/scala-compiler/2.13.8/scala-compiler-2.13.8.jar > [ERROR] urls[24] = > file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-interface/1.7.1/compiler-interface-1.7.1.jar > [ERROR] urls[25] = > file:/Users/stevel/.m2/repository/org/scala-sbt/util-interface/1.7.0/util-interface-1.7.0.jar > [ERROR] urls[26] = > file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-persist-core-assembly/1.7.1/zinc-persist-core-assembly-1.7.1.jar > [E
[jira] [Assigned] (SPARK-41393) Upgrade slf4j to 2.0.5
[ https://issues.apache.org/jira/browse/SPARK-41393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41393: Assignee: Apache Spark > Upgrade slf4j to 2.0.5 > -- > > Key: SPARK-41393 > URL: https://issues.apache.org/jira/browse/SPARK-41393 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > https://www.slf4j.org/news.html#2.0.5 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41393) Upgrade slf4j to 2.0.5
[ https://issues.apache.org/jira/browse/SPARK-41393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41393: Assignee: (was: Apache Spark) > Upgrade slf4j to 2.0.5 > -- > > Key: SPARK-41393 > URL: https://issues.apache.org/jira/browse/SPARK-41393 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://www.slf4j.org/news.html#2.0.5 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41393) Upgrade slf4j to 2.0.5
[ https://issues.apache.org/jira/browse/SPARK-41393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643411#comment-17643411 ] Apache Spark commented on SPARK-41393: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38918 > Upgrade slf4j to 2.0.5 > -- > > Key: SPARK-41393 > URL: https://issues.apache.org/jira/browse/SPARK-41393 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://www.slf4j.org/news.html#2.0.5 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41393) Upgrade slf4j to 2.0.5
Yang Jie created SPARK-41393: Summary: Upgrade slf4j to 2.0.5 Key: SPARK-41393 URL: https://issues.apache.org/jira/browse/SPARK-41393 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.4.0 Reporter: Yang Jie https://www.slf4j.org/news.html#2.0.5 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41389) Reuse `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1044`
[ https://issues.apache.org/jira/browse/SPARK-41389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41389. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38913 [https://github.com/apache/spark/pull/38913] > Reuse `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1044` > --- > > Key: SPARK-41389 > URL: https://issues.apache.org/jira/browse/SPARK-41389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41389) Reuse `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1044`
[ https://issues.apache.org/jira/browse/SPARK-41389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41389: Assignee: Yang Jie > Reuse `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1044` > --- > > Key: SPARK-41389 > URL: https://issues.apache.org/jira/browse/SPARK-41389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40642) wrong doc on memory tuning regarding String object memory size, changed since version>=9
[ https://issues.apache.org/jira/browse/SPARK-40642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40642. -- Resolution: Won't Fix > wrong doc on memory tuning regarding String object memory size, changed since > version>=9 > > > Key: SPARK-40642 > URL: https://issues.apache.org/jira/browse/SPARK-40642 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2 >Reporter: Arnaud Nauwynck >Priority: Trivial > > The documentation is wrong regarding memory consumption of java.lang.String > https://spark.apache.org/docs/latest/tuning.html#memory-tuning > internally, the source for this doc section is written here: > https://github.com/apache/spark/blob/master/docs/tuning.md?plain=1#L100 > {noformat} > * Java `String`s have about 40 bytes of overhead over the raw string data > (since they store it in an > array of `Char`s and keep extra data such as the length), and store each > character > as *two* bytes due to `String`'s internal usage of UTF-16 encoding. Thus a > 10-character string can > easily consume 60 bytes. > {noformat} > reason: since java version >= 9 ... Java has optimized the problem described > in the doc. > It used to be 16 bytes of header + using internally char coded as UTF-16 > Notice that before jdk 9 (since jdk 6, there was also an internal flag for > HotSpot JVM : -XX:+UseCompressedStrings , but it was not enabled by default > ) > Since OpenJdk >= 9... with the implementation of JEP 254 ( > https://openjdk.org/jeps/254 ), Strings are now internally encoded in UTF8 > when they are simple Latin1 text, otherwise as before. There is now an extra > byte field in class java.lang.String to say if the "coder" is optimized for > Latin1. 
> This field is described here in OpenJdk source code: > https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L170 > The computation for the memory size of String was "40+2*charCount" ... it > is now "44+1*charCount" when it is Latin1 text, else "44+2*charCount" when it > is not Latin1 text > the object overhead is 44 because of alignment... not 40+1 for adding one > "byte" field -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
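The before/after formulas quoted in the comment can be sketched as a small estimator. This is illustrative only: the 44-byte overhead constant and the per-character costs are the comment's figures (object headers, fields, array overhead, alignment on a typical 64-bit JVM with compact strings), not measured values, and the Latin-1 check mirrors JEP 254's coder distinction:

```java
public class StringSizeEstimate {
    // Overhead figure quoted in the comment for JDK 9+ (headers + fields +
    // byte[] overhead, rounded up for alignment).
    static final int OVERHEAD_BYTES = 44;

    // JEP 254 stores a String compactly when every char fits in Latin-1
    // (code points U+0000..U+00FF); otherwise it keeps 2 bytes per char.
    static boolean isLatin1(String s) {
        return s.chars().allMatch(c -> c <= 0xFF);
    }

    static int estimatedBytes(String s) {
        int perChar = isLatin1(s) ? 1 : 2;
        return OVERHEAD_BYTES + perChar * s.length();
    }

    public static void main(String[] args) {
        // A 10-character ASCII string: 44 + 1*10 = 54 bytes under the JDK 9+
        // formula, versus 40 + 2*10 = 60 bytes under the pre-JDK-9 formula
        // currently given in the tuning doc.
        System.out.println(estimatedBytes("aaaaaaaaaa"));
        // Non-Latin-1 text keeps the 2-bytes-per-char cost: 44 + 2*3 = 50.
        System.out.println(estimatedBytes("日本語"));
    }
}
```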
[jira] [Updated] (SPARK-40141) Task listener overloads no longer needed with JDK 8+
[ https://issues.apache.org/jira/browse/SPARK-40141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40141: - Priority: Minor (was: Major) > Task listener overloads no longer needed with JDK 8+ > > > Key: SPARK-40141 > URL: https://issues.apache.org/jira/browse/SPARK-40141 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Ryan Johnson >Priority: Minor > > TaskContext defines methods for registering completion and failure listeners, > and the respective listener types qualify as functional interfaces in JDK 8+. > This leads to awkward ambiguous overload errors with the overload of each > function, that takes a function directly instead of a listener. Now that JDK > 8 is the minimum allowed, we can remove the unnecessary overloads, which not > only simplifies the code, but also removes a source of frustration since it > can be nearly impossible to predict when an ambiguous overload might be > triggered. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39948) exclude velocity 1.5 jar
[ https://issues.apache.org/jira/browse/SPARK-39948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-39948. -- Resolution: Not A Problem > exclude velocity 1.5 jar > > > Key: SPARK-39948 > URL: https://issues.apache.org/jira/browse/SPARK-39948 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: melin >Priority: Major > > hive-exec depends on importing velocity. Velocity has an older version and > has many security issues > https://issues.apache.org/jira/browse/HIVE-25726 > > !image-2022-08-02-14-05-55-756.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40141) Task listener overloads no longer needed with JDK 8+
[ https://issues.apache.org/jira/browse/SPARK-40141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40141. -- Resolution: Won't Fix > Task listener overloads no longer needed with JDK 8+ > > > Key: SPARK-40141 > URL: https://issues.apache.org/jira/browse/SPARK-40141 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Ryan Johnson >Priority: Major > > TaskContext defines methods for registering completion and failure listeners, > and the respective listener types qualify as functional interfaces in JDK 8+. > This leads to awkward ambiguous overload errors with the overload of each > function, that takes a function directly instead of a listener. Now that JDK > 8 is the minimum allowed, we can remove the unnecessary overloads, which not > only simplifies the code, but also removes a source of frustration since it > can be nearly impossible to predict when an ambiguous overload might be > triggered. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40284) spark concurrent overwrite mode writes data to files in HDFS format, all request data write success
[ https://issues.apache.org/jira/browse/SPARK-40284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40284. -- Resolution: Not A Problem > spark concurrent overwrite mode writes data to files in HDFS format, all > request data write success > > > Key: SPARK-40284 > URL: https://issues.apache.org/jira/browse/SPARK-40284 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.1 >Reporter: Liu >Priority: Major > > We use Spark as a service. The same Spark service needs to handle multiple > requests, but I have a problem with this: > when multiple requests overwrite the same directory at the same time, the > results of both overwrite requests may be written successfully. I think this > does not meet the definition of an overwrite. > First I ran Write SQL1, then I ran Write SQL2, and I found that both sets of data > were written in the end, which I thought was unreasonable. > {code:java} > sparkSession.udf.register("sleep", (time: Long) => Thread.sleep(time)) > -- write sql1 > sparkSession.sql("select 1 as id, sleep(4) as > time").write.mode(SaveMode.Overwrite).parquet("path") > -- write sql2 > sparkSession.sql("select 2 as id, 1 as > time").write.mode(SaveMode.Overwrite).parquet("path") {code} > Reading the Spark source, I saw that all of this logic is in > the InsertIntoHadoopFsRelationCommand class. > > When the target directory already exists, Spark directly deletes the target > directory and writes to the _temporary directory that it requests. However, > when multiple requests are written, the data will all be appended; for example, > with the Write SQL above, this procedure occurs: > 1. execute write sql1: spark creates the _temporary directory for SQL1 and > continues > 2. execute write sql2: spark deletes the entire target directory and > creates its own > _temporary > 3. sql2 writes its data > 4. 
sql1 completes the calculation; the corresponding _temporary/0/attempt_id > directory does not exist, so the request fails. The task is > retried, but the _temporary directory is not cleaned up when the task is > retried. Therefore, the execution result of sql1 is appended to the > target directory. > > Based on the above write process, could spark do a directory > check before the write task, or avoid this kind of problem in some other way? > > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
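The four-step interleaving the reporter lists can be re-enacted deterministically on a local filesystem. This is a hypothetical sketch: directory and file names are made up, and the HDFS/Spark commit-protocol details (staging under `_temporary`, task attempts) are elided down to "clear target, then commit a file":

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OverwriteRace {

    // "Commit": create the target directory if needed and write a result file.
    static void commit(Path target, String name, String data) {
        try {
            Files.createDirectories(target);
            Files.writeString(target.resolve(name), data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The overwrite step: delete the target directory and everything under it.
    static void deleteRecursively(Path p) {
        try {
            if (!Files.exists(p)) return;
            try (Stream<Path> walk = Files.walk(p)) {
                List<Path> files = walk.sorted(Comparator.reverseOrder())
                                       .collect(Collectors.toList());
                for (Path f : files) Files.delete(f);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Replays steps 1-4 from the report and returns the shared target directory. */
    static Path run() {
        try {
            Path target = Files.createTempDirectory("overwrite-race").resolve("out");
            // 1. job 1 starts an overwrite: it clears the target and begins computing
            deleteRecursively(target);
            // 2 + 3. job 2 starts its own overwrite: clears the target again and commits
            deleteRecursively(target);
            commit(target, "part-job2", "id=2");
            // 4. job 1's task fails, is retried, and commits without re-clearing
            commit(target, "part-job1", "id=1");
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path target = run();
        // Both outputs survive, although each job requested overwrite semantics.
        System.out.println(Files.exists(target.resolve("part-job1")));
        System.out.println(Files.exists(target.resolve("part-job2")));
    }
}
```

The re-enactment shows why both results coexist: each job's "overwrite" only clears the target once, at the start, so a late retry lands in a directory another job has already repopulated.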
[jira] [Resolved] (SPARK-40253) Data read exception in orc format
[ https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40253. -- Resolution: Won't Fix > Data read exception in orc format > -- > > Key: SPARK-40253 > URL: https://issues.apache.org/jira/browse/SPARK-40253 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: os centos7 > spark 2.4.3 > hive 1.2.1 > hadoop 2.7.2 >Reporter: yihangqiao >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > Caused by: java.io.EOFException: Read past end of RLE integer from compressed > stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 > offset: 0 limit: 0 > When running batches using spark-sql and using the create table xxx as select > syntax, the select query part uses a static value as the default value (0.00 > as column_name) and does not specify the data type of the default value. In > this usage scenario, because the data type is not explicitly specified, the > metadata information of the field in the written ORC file is missing (the > writing is successful), but when reading, as long as the query column > contains this field, it will not be able to parse the ORC file, and the > following error occurs: > > {code:java} > create table testgg as select 0.00 as gg;select * from testgg;Caused by: > java.io.IOException: Error reading file: > viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000 > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown > Source) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at > org.apache.spark.scheduler.Task.run(Task.scala:121) at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)Caused by: > java.io.EOFException: Read past end of RLE 
integer from compressed stream > Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 > limit: 0 at > org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61) > at > org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323) > at > org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398) > at > org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1205) > at > org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1279)
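The diagnosis above suggests a workaround consistent with the report: give the literal an explicit type so the ORC writer records complete column metadata. A sketch, reusing the reproduction's table and column names; the DECIMAL precision/scale is an assumption, not taken from the report:

```sql
-- Sketch: casting the literal pins the column type, so the ORC file is
-- written with full type metadata instead of relying on type inference.
-- DECIMAL(10, 2) is an assumed precision/scale.
CREATE TABLE testgg AS SELECT CAST(0.00 AS DECIMAL(10, 2)) AS gg;
SELECT * FROM testgg;
```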
[jira] [Resolved] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40286. -- Resolution: Not A Problem > Load Data from S3 deletes data source file > -- > > Key: SPARK-40286 > URL: https://issues.apache.org/jira/browse/SPARK-40286 > Project: Spark > Issue Type: Question > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Drew >Priority: Major > > Hello, > I'm using Spark to [load > data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into > a Hive table through PySpark, and when I load data from a path in Amazon S3, > the original file is wiped from the directory. The file is found, and > it populates the table with data. I also tried adding the `LOCAL` clause, but > that throws an error when looking for the file. The > documentation doesn't explicitly state that this is the intended behavior. > Thanks in advance! > {code:java} > spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile") > spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE > src"){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
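The "Not A Problem" resolution reflects Hive DML semantics: `LOAD DATA INPATH` moves the source file into the table's storage location rather than copying it, so the original object disappearing is expected. A sketch of one way to keep the source intact, assuming a tab-separated key/value file (the delimiter and staging location are assumptions):

```sql
-- Copy rows instead of moving files: expose the S3 path as an external
-- table and INSERT ... SELECT from it. The source file is left in place.
CREATE EXTERNAL TABLE src_staging (key INT, value STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION 's3://bucket/staging/';
INSERT OVERWRITE TABLE src SELECT key, value FROM src_staging;
```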
[jira] [Resolved] (SPARK-36853) Code failing on checkstyle
[ https://issues.apache.org/jira/browse/SPARK-36853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-36853. -- Resolution: Won't Fix > Code failing on checkstyle > -- > > Key: SPARK-36853 > URL: https://issues.apache.org/jira/browse/SPARK-36853 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.0 >Reporter: Abhinav Kumar >Priority: Trivial > Attachments: image-2021-10-18-01-57-00-714.png, > spark_mvn_clean_install_skip_tests_in_windows.log > > > There are more - just pasting sample > > [INFO] There are 32 errors reported by Checkstyle 8.43 with > dev/checkstyle.xml ruleset. > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF11.java:[29] (sizes) > LineLength: Line is longer than 100 characters (found 107). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF12.java:[29] (sizes) > LineLength: Line is longer than 100 characters (found 116). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[28] (sizes) > LineLength: Line is longer than 100 characters (found 104). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[29] (sizes) > LineLength: Line is longer than 100 characters (found 125). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[28] (sizes) > LineLength: Line is longer than 100 characters (found 109). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[29] (sizes) > LineLength: Line is longer than 100 characters (found 134). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[28] (sizes) > LineLength: Line is longer than 100 characters (found 114). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[29] (sizes) > LineLength: Line is longer than 100 characters (found 143). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[28] (sizes) > LineLength: Line is longer than 100 characters (found 119). 
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[29] (sizes) > LineLength: Line is longer than 100 characters (found 152). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[28] (sizes) > LineLength: Line is longer than 100 characters (found 124). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[29] (sizes) > LineLength: Line is longer than 100 characters (found 161). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[28] (sizes) > LineLength: Line is longer than 100 characters (found 129). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[29] (sizes) > LineLength: Line is longer than 100 characters (found 170). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[28] (sizes) > LineLength: Line is longer than 100 characters (found 134). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[29] (sizes) > LineLength: Line is longer than 100 characters (found 179). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[28] (sizes) > LineLength: Line is longer than 100 characters (found 139). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[29] (sizes) > LineLength: Line is longer than 100 characters (found 188). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[28] (sizes) > LineLength: Line is longer than 100 characters (found 144). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[29] (sizes) > LineLength: Line is longer than 100 characters (found 197). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[28] (sizes) > LineLength: Line is longer than 100 characters (found 149). > [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[29] (sizes) > LineLength: Line is longer than 100 characters (found 206). 
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[44,25] > (naming) MethodName: Method name 'ProcessingTime' must match pattern > '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. > [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[60,25] > (naming) MethodName: Method name 'ProcessingTime' must match pattern > '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. > [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[75,25] > (naming) MethodName: Method name 'ProcessingTime' must match pattern > '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. > [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[88,25] > (naming) MethodName: Method name 'ProcessingTime' must match pattern > '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. > [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[100,25] > (naming) MethodName: Method name 'Once' must match pattern > '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. > [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[110,25] > (naming) MethodName: Method name 'AvailableNow' must match pattern > '^[a-z][a-z0
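For reference, violations like the ones listed (generated UDF interfaces exceeding the line limit, intentionally capitalized `Trigger` factory methods) are usually silenced with a Checkstyle suppressions filter rather than code changes. A sketch only; the file name and patterns are assumptions, not Spark's actual build configuration:

```xml
<!-- checkstyle-suppressions.xml: exempt the generated UDF interfaces and
     the deliberately capitalized Trigger factory methods. -->
<!DOCTYPE suppressions PUBLIC
    "-//Checkstyle//DTD SuppressionFilter Configuration 1.2//EN"
    "https://checkstyle.org/dtds/suppressions_1_2.dtd">
<suppressions>
  <suppress checks="LineLength" files="UDF\d+\.java"/>
  <suppress checks="MethodName" files="Trigger\.java"/>
</suppressions>
```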
[jira] [Commented] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643374#comment-17643374 ] Ahmed Mahran commented on SPARK-41008: -- Thanks, I'll manage to have a PR in a couple of days. > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Minor > > > {code:python} > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regression import IsotonicRegression as > IsotonicRegression_pyspark > # The P(positives | model_score): > # 0.6 -> 0.5 (1 out of the 2 labels is positive) > # 0.333 -> 0.333 (1 out of the 3 labels is positive) > # 0.20 -> 0.25 (1 out of the 4 labels is positive) > tc_pd = pd.DataFrame({ > "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], > > "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], > "weight": 1, } > ) > # The fraction of positives for each of the distinct model_scores would be > the best fit. > # Resulting in the following expected calibrated model_scores: > # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, > 0.25] > # The sklearn implementation of Isotonic Regression. > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > tc_regressor_sklearn = > IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], > sample_weight=tc_pd['weight']) > print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) > # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] > # The pyspark implementation of Isotonic Regression. 
> from pyspark.sql import functions as F > tc_df = spark.createDataFrame(tc_pd) > tc_df = tc_df.withColumn('model_score', > F.col('model_score').cast(DoubleType())) > isotonic_regressor_pyspark = > IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', > weightCol='weight') > tc_model = isotonic_regressor_pyspark.fit(tc_df) > tc_pd = tc_model.transform(tc_df).toPandas() > print("pyspark:", tc_pd['prediction'].values) > # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] > # The result from the pyspark implementation is unexpected. Similar small toy > examples lead to similarly unexpected results from the pyspark implementation. > # Strangely enough, for 'large' datasets, the difference between calibrated > model_scores generated by the two implementations disappears. > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643373#comment-17643373 ] Sean R. Owen commented on SPARK-41008: -- No need for an option, this seems like a bug fix. Yes if you can propose a pull request that fixes it, by all means. > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Minor > > > {code:python} > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regression import IsotonicRegression as > IsotonicRegression_pyspark > # The P(positives | model_score): > # 0.6 -> 0.5 (1 out of the 2 labels is positive) > # 0.333 -> 0.333 (1 out of the 3 labels is positive) > # 0.20 -> 0.25 (1 out of the 4 labels is positive) > tc_pd = pd.DataFrame({ > "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], > > "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], > "weight": 1, } > ) > # The fraction of positives for each of the distinct model_scores would be > the best fit. > # Resulting in the following expected calibrated model_scores: > # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, > 0.25] > # The sklearn implementation of Isotonic Regression. > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > tc_regressor_sklearn = > IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], > sample_weight=tc_pd['weight']) > print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) > # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] > # The pyspark implementation of Isotonic Regression. 
> from pyspark.sql import functions as F > tc_df = spark.createDataFrame(tc_pd) > tc_df = tc_df.withColumn('model_score', > F.col('model_score').cast(DoubleType())) > isotonic_regressor_pyspark = > IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', > weightCol='weight') > tc_model = isotonic_regressor_pyspark.fit(tc_df) > tc_pd = tc_model.transform(tc_df).toPandas() > print("pyspark:", tc_pd['prediction'].values) > # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] > # The result from the pyspark implementation is unexpected. Similar small toy > examples lead to similarly unexpected results from the pyspark implementation. > # Strangely enough, for 'large' datasets, the difference between calibrated > model_scores generated by the two implementations disappears. > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
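For context on where the expected numbers come from: isotonic regression with tied feature values first pools each distinct `model_score` into one weighted point, then applies Pool Adjacent Violators (PAV). A minimal pure-Python sketch of that procedure (an illustration, not Spark's or sklearn's actual implementation) reproduces the per-score fraction of positives expected in the report:

```python
from itertools import groupby

def isotonic_fit(x, y, w):
    """Weighted non-decreasing isotonic fit via Pool Adjacent Violators.

    Inputs must be sorted by x ascending. Tied x values are pooled into a
    single weighted point first; skipping this tie-pooling step is one way
    to end up with the unexpected numbers shown for the toy data above.
    """
    # 1. Pool ties: one [weighted_sum, total_weight, row_count] block
    #    per distinct x value.
    blocks = []
    for _, grp in groupby(zip(x, y, w), key=lambda t: t[0]):
        rows = list(grp)
        weighted_sum = sum(yi * wi for _, yi, wi in rows)
        total_weight = sum(wi for _, _, wi in rows)
        blocks.append([weighted_sum, total_weight, len(rows)])
    # 2. PAV: merge adjacent blocks while their means decrease.
    #    Cross-multiplied comparison avoids division: s1/w1 > s2/w2.
    merged = []
    for block in blocks:
        merged.append(block)
        while len(merged) > 1 and merged[-2][0] * merged[-1][1] > merged[-1][0] * merged[-2][1]:
            s, tw, c = merged.pop()
            merged[-1][0] += s
            merged[-1][1] += tw
            merged[-1][2] += c
    # 3. Expand each block's mean back to one fitted value per input row.
    fitted = []
    for s, tw, c in merged:
        fitted.extend([s / tw] * c)
    return fitted
```

On the toy data sorted by score, the groups (0.20, 0.333, 0.6) have means 0.25, 1/3, and 0.5, which are already monotone, so no pooling across groups occurs and those means are returned directly.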
[jira] [Resolved] (SPARK-41167) Optimize LikeSimplification rule to improve multi like performance
[ https://issues.apache.org/jira/browse/SPARK-41167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-41167. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38682 [https://github.com/apache/spark/pull/38682] > Optimize LikeSimplification rule to improve multi like performance > -- > > Key: SPARK-41167 > URL: https://issues.apache.org/jira/browse/SPARK-41167 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Assignee: Wan Kun >Priority: Major > Fix For: 3.4.0 > > > We can improve multi-LIKE performance by reordering the match expressions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41167) Optimize LikeSimplification rule to improve multi like performance
[ https://issues.apache.org/jira/browse/SPARK-41167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-41167: --- Assignee: Wan Kun > Optimize LikeSimplification rule to improve multi like performance > -- > > Key: SPARK-41167 > URL: https://issues.apache.org/jira/browse/SPARK-41167 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Assignee: Wan Kun >Priority: Major > > We can improve multi-LIKE performance by reordering the match expressions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643344#comment-17643344 ] Maziyar PANAHI commented on SPARK-32530: Not sure if this matters, but as a Scala developer myself primarily building Scala applications to use Apache Spark natively, I highly support the decision to make this an official part of the ASF. I also agree there is a maintenance cost; however, unlike .NET, it's much easier for any of us from the Java/Scala world to contribute to Kotlin. I think it's a price worth paying for the sake of longevity. It is clear that Java and Scala are not going anywhere, but they are not the first choice for newcomers either. More native JVM languages like Kotlin can really help bring more users and contributors to the Spark ecosystem in the long term. > SPIP: Kotlin support for Apache Spark > - > > Key: SPARK-32530 > URL: https://issues.apache.org/jira/browse/SPARK-32530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Pasha Finkeshteyn >Priority: Major > > h2. Background and motivation > Kotlin is a cross-platform, statically typed, general-purpose JVM language. > In the last year more than 5 million developers have used Kotlin in mobile, > backend, frontend and scientific development. The number of Kotlin developers > grows rapidly every year. > * [According to > redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: > "Kotlin, the second fastest growing language we’ve seen outside of Swift, > made a big splash a year ago at this time when it vaulted eight full spots up > the list." > * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], > Kotlin is the second most popular language on the JVM > * [According to > StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share > increased by 7.8% in 2020. 
> We notice the increasing usage of Kotlin in data analysis ([6% of users in > 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to > 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in > 2019), and we expect these numbers to continue to grow. > We, authors of this SPIP, strongly believe that making Kotlin API officially > available to developers can bring new users to Apache Spark and help some of > the existing users. > h2. Goals > The goal of this project is to bring first-class support for Kotlin language > into the Apache Spark project. We’re going to achieve this by adding one more > module to the current Apache Spark distribution. > h2. Non-goals > There is no goal to replace any existing language support or to change any > existing Apache Spark API. > At this time, there is no goal to support non-core APIs of Apache Spark like > Spark ML and Spark structured streaming. This may change in the future based > on community feedback. > There is no goal to provide CLI for Kotlin for Apache Spark, this will be a > separate SPIP. > There is no goal to provide support for Apache Spark < 3.0.0. > h2. Current implementation > A working prototype is available at > [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside > JetBrains and by early adopters. > h2. What are the risks? > There is always a risk that this product won’t get enough popularity and will > bring more costs than benefits. It can be mitigated by the fact that we don't > need to change any existing API and support can be potentially dropped at any > time. > We also believe that existing API is rather low maintenance. It does not > bring anything more complex than already exists in the Spark codebase. > Furthermore, the implementation is compact - less than 2000 lines of code. > We are committed to maintaining, improving and evolving the API based on > feedback from both Spark and Kotlin communities. 
As the Kotlin data community > continues to grow, we see Kotlin API for Apache Spark as an important part of > the evolving Kotlin ecosystem, and intend to fully support it. > h2. How long will it take? > A working implementation is already available, and if the community proposes > changes to improve this implementation, they can be implemented quickly — in weeks if not days. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41392) spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin
Steve Loughran created SPARK-41392: -- Summary: spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin Key: SPARK-41392 URL: https://issues.apache.org/jira/browse/SPARK-41392 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.4.0 Reporter: Steve Loughran on hadoop trunk (but not the 3.3.x line), spark builds fail with a CNFE {code} net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: org/bouncycastle/jce/provider/BouncyCastleProvider {code} full stack {code} [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile (scala-test-compile-first) on project spark-sql_2.12: Execution scala-test-compile-first of goal net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile failed: A required class was missing while executing net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: org/bouncycastle/jce/provider/BouncyCastleProvider [ERROR] - [ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.7.2 [ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy [ERROR] urls[0] = file:/Users/stevel/.m2/repository/net/alchim31/maven/scala-maven-plugin/4.7.2/scala-maven-plugin-4.7.2.jar [ERROR] urls[1] = file:/Users/stevel/.m2/repository/org/apache/maven/shared/maven-dependency-tree/3.2.0/maven-dependency-tree-3.2.0.jar [ERROR] urls[2] = file:/Users/stevel/.m2/repository/org/eclipse/aether/aether-util/1.0.0.v20140518/aether-util-1.0.0.v20140518.jar [ERROR] urls[3] = file:/Users/stevel/.m2/repository/org/apache/maven/reporting/maven-reporting-api/3.1.1/maven-reporting-api-3.1.1.jar [ERROR] urls[4] = file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-sink-api/1.11.1/doxia-sink-api-1.11.1.jar [ERROR] urls[5] = file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-logging-api/1.11.1/doxia-logging-api-1.11.1.jar [ERROR] urls[6] = file:/Users/stevel/.m2/repository/org/apache/maven/maven-archiver/3.6.0/maven-archiver-3.6.0.jar [ERROR] urls[7] = 
file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-io/3.4.0/plexus-io-3.4.0.jar [ERROR] urls[8] = file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-interpolation/1.26/plexus-interpolation-1.26.jar [ERROR] urls[9] = file:/Users/stevel/.m2/repository/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar [ERROR] urls[10] = file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-utils/3.4.2/plexus-utils-3.4.2.jar [ERROR] urls[11] = file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-archiver/4.5.0/plexus-archiver-4.5.0.jar [ERROR] urls[12] = file:/Users/stevel/.m2/repository/commons-io/commons-io/2.11.0/commons-io-2.11.0.jar [ERROR] urls[13] = file:/Users/stevel/.m2/repository/org/apache/commons/commons-compress/1.21/commons-compress-1.21.jar [ERROR] urls[14] = file:/Users/stevel/.m2/repository/org/iq80/snappy/snappy/0.4/snappy-0.4.jar [ERROR] urls[15] = file:/Users/stevel/.m2/repository/org/tukaani/xz/1.9/xz-1.9.jar [ERROR] urls[16] = file:/Users/stevel/.m2/repository/com/github/luben/zstd-jni/1.5.2-4/zstd-jni-1.5.2-4.jar [ERROR] urls[17] = file:/Users/stevel/.m2/repository/org/scala-sbt/zinc_2.13/1.7.1/zinc_2.13-1.7.1.jar [ERROR] urls[18] = file:/Users/stevel/.m2/repository/org/scala-lang/scala-library/2.13.8/scala-library-2.13.8.jar [ERROR] urls[19] = file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-core_2.13/1.7.1/zinc-core_2.13-1.7.1.jar [ERROR] urls[20] = file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-apiinfo_2.13/1.7.1/zinc-apiinfo_2.13-1.7.1.jar [ERROR] urls[21] = file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-bridge_2.13/1.7.1/compiler-bridge_2.13-1.7.1.jar [ERROR] urls[22] = file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-classpath_2.13/1.7.1/zinc-classpath_2.13-1.7.1.jar [ERROR] urls[23] = file:/Users/stevel/.m2/repository/org/scala-lang/scala-compiler/2.13.8/scala-compiler-2.13.8.jar [ERROR] urls[24] = 
file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-interface/1.7.1/compiler-interface-1.7.1.jar [ERROR] urls[25] = file:/Users/stevel/.m2/repository/org/scala-sbt/util-interface/1.7.0/util-interface-1.7.0.jar [ERROR] urls[26] = file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-persist-core-assembly/1.7.1/zinc-persist-core-assembly-1.7.1.jar [ERROR] urls[27] = file:/Users/stevel/.m2/repository/org/scala-lang/modules/scala-parallel-collections_2.13/0.2.0/scala-parallel-collections_2.13-0.2.0.jar [ERROR] urls[28] = file:/Users/stevel/.m2/repository/org/scala-sbt/io_2.13/1.7.0/io_2.13-1.7.0.jar [ERROR] urls[29] = file:/Users/stevel/.m2/repository/com/swoval/file-tree-views/2.1.9/file-tree-views-2.1.9.jar [ERROR] urls[30] = file:/Users/stevel/.m2/repository/net/java/dev/jna/jna/5.12.0/jna-5.12.0.jar [ERROR] urls[31] = file:/Users/stevel/.m2/repository/net/java/dev/jna/
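This CNFE pattern (a class missing from the Maven plugin realm, with the realm's URL list dumped as above) is typically worked around by declaring the missing artifact as a dependency of the plugin itself, so it lands on the plugin's classloader. A sketch for the pom; the bcprov artifact coordinates and version are assumptions, not a confirmed fix for this build:

```xml
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <version>4.7.2</version>
  <dependencies>
    <!-- Supplies org.bouncycastle.jce.provider.BouncyCastleProvider
         to the plugin's classloader. -->
    <dependency>
      <groupId>org.bouncycastle</groupId>
      <artifactId>bcprov-jdk15on</artifactId>
      <version>1.70</version>
    </dependency>
  </dependencies>
</plugin>
```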
[jira] [Commented] (SPARK-41391) The output column name of `groupBy.agg(count_distinct)` is incorrect
[ https://issues.apache.org/jira/browse/SPARK-41391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643297#comment-17643297 ] Apache Spark commented on SPARK-41391: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38917 > The output column name of `groupBy.agg(count_distinct)` is incorrect > > > Key: SPARK-41391 > URL: https://issues.apache.org/jira/browse/SPARK-41391 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41391) The output column name of `groupBy.agg(count_distinct)` is incorrect
[ https://issues.apache.org/jira/browse/SPARK-41391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643296#comment-17643296 ] Apache Spark commented on SPARK-41391: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38917 > The output column name of `groupBy.agg(count_distinct)` is incorrect > > > Key: SPARK-41391 > URL: https://issues.apache.org/jira/browse/SPARK-41391 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41391) The output column name of `groupBy.agg(count_distinct)` is incorrect
[ https://issues.apache.org/jira/browse/SPARK-41391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41391: Assignee: (was: Apache Spark) > The output column name of `groupBy.agg(count_distinct)` is incorrect > > > Key: SPARK-41391 > URL: https://issues.apache.org/jira/browse/SPARK-41391 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41391) The output column name of `groupBy.agg(count_distinct)` is incorrect
[ https://issues.apache.org/jira/browse/SPARK-41391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41391: Assignee: Apache Spark > The output column name of `groupBy.agg(count_distinct)` is incorrect > > > Key: SPARK-41391 > URL: https://issues.apache.org/jira/browse/SPARK-41391 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41391) The output column name of `groupBy.agg(count_distinct)` is incorrect
[ https://issues.apache.org/jira/browse/SPARK-41391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-41391: -- Summary: The output column name of `groupBy.agg(count_distinct)` is incorrect (was: The output column name of `groupby.agg(count_distinct)` is incorrect) > The output column name of `groupBy.agg(count_distinct)` is incorrect > > > Key: SPARK-41391 > URL: https://issues.apache.org/jira/browse/SPARK-41391 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41391) The output column name of `groupby.agg(count_distinct)` is incorrect
Ruifeng Zheng created SPARK-41391: - Summary: The output column name of `groupby.agg(count_distinct)` is incorrect Key: SPARK-41391 URL: https://issues.apache.org/jira/browse/SPARK-41391 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0, 3.2.0, 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41390) Update the script used to generate register function in UDFRegistration
[ https://issues.apache.org/jira/browse/SPARK-41390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41390: Assignee: Apache Spark > Update the script used to generate register function in UDFRegistration > > > Key: SPARK-41390 > URL: https://issues.apache.org/jira/browse/SPARK-41390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > SPARK-35065 used {{QueryCompilationErrors.invalidFunctionArgumentsError}} > instead of {{throw new AnalysisException(...)}} for the {{register}} function in > {{{}UDFRegistration{}}}, but the script used to generate xx has not been > updated, so this PR updates the script. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41390) Update the script used to generate register function in UDFRegistration
[ https://issues.apache.org/jira/browse/SPARK-41390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41390: Assignee: (was: Apache Spark) > Update the script used to generate register function in UDFRegistration > > > Key: SPARK-41390 > URL: https://issues.apache.org/jira/browse/SPARK-41390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > SPARK-35065 used {{QueryCompilationErrors.invalidFunctionArgumentsError}} > instead of {{throw new AnalysisException(...)}} for the {{register}} function in > {{{}UDFRegistration{}}}, but the script used to generate xx has not been > updated, so this PR updates the script. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41390) Update the script used to generate register function in UDFRegistration
[ https://issues.apache.org/jira/browse/SPARK-41390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643279#comment-17643279 ] Apache Spark commented on SPARK-41390: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38916 > Update the script used to generate register function in UDFRegistration > > > Key: SPARK-41390 > URL: https://issues.apache.org/jira/browse/SPARK-41390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > SPARK-35065 used {{QueryCompilationErrors.invalidFunctionArgumentsError}} > instead of {{throw new AnalysisException(...)}} for the {{register}} function in > {{{}UDFRegistration{}}}, but the script used to generate xx has not been > updated, so this PR updates the script. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org