[jira] [Resolved] (SPARK-40777) Use error classes for Protobuf exceptions
[ https://issues.apache.org/jira/browse/SPARK-40777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-40777. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38344 [https://github.com/apache/spark/pull/38344] > Use error classes for Protobuf exceptions > - > > Key: SPARK-40777 > URL: https://issues.apache.org/jira/browse/SPARK-40777 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Assignee: Sandish Kumar HN >Priority: Major > Fix For: 3.4.0 > > > We should use error classes for all the exceptions. > A follow-up from the Protobuf PR [https://github.com/apache/spark/pull/37972] > > cc: [~sanysand...@gmail.com] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40777) Use error classes for Protobuf exceptions
[ https://issues.apache.org/jira/browse/SPARK-40777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-40777: Assignee: Sandish Kumar HN > Use error classes for Protobuf exceptions > - > > Key: SPARK-40777 > URL: https://issues.apache.org/jira/browse/SPARK-40777 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Assignee: Sandish Kumar HN >Priority: Major > > We should use error classes for all the exceptions. > A follow-up from the Protobuf PR [https://github.com/apache/spark/pull/37972] > > cc: [~sanysand...@gmail.com] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
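For context on what the migration buys callers: once an exception is raised with an error class, clients can branch on a stable identifier instead of parsing message text. A minimal sketch, assuming the getErrorClass() accessor that recent PySpark versions expose on captured exceptions (the triggering query here is illustrative, not taken from this ticket):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()

try:
    # Any analysis error will do for illustration; error-class-based
    # exceptions carry a machine-readable identifier alongside the message.
    spark.sql("SELECT no_such_column FROM range(1)").collect()
except AnalysisException as e:
    # A stable name usable for programmatic handling, rather than a
    # free-form string to pattern-match.
    print(e.getErrorClass())
{code}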
[jira] [Assigned] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41012: Assignee: (was: Apache Spark) > Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE > --- > > Key: SPARK-41012 > URL: https://issues.apache.org/jira/browse/SPARK-41012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Rename _LEGACY_ERROR_TEMP_1022 to a proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41012: Assignee: Apache Spark > Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE > --- > > Key: SPARK-41012 > URL: https://issues.apache.org/jira/browse/SPARK-41012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > Rename _LEGACY_ERROR_TEMP_1022 to a proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628675#comment-17628675 ] Apache Spark commented on SPARK-41012: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/38508 > Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE > --- > > Key: SPARK-41012 > URL: https://issues.apache.org/jira/browse/SPARK-41012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Rename _LEGACY_ERROR_TEMP_1022 to a proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628673#comment-17628673 ] Haejoon Lee commented on SPARK-41012: - I'm working on it > Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE > --- > > Key: SPARK-41012 > URL: https://issues.apache.org/jira/browse/SPARK-41012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Rename _LEGACY_ERROR_TEMP_1022 to a proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
Haejoon Lee created SPARK-41012: --- Summary: Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE Key: SPARK-41012 URL: https://issues.apache.org/jira/browse/SPARK-41012 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Haejoon Lee Rename _LEGACY_ERROR_TEMP_1022 to a proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
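The condition behind this error class is an ORDER BY ordinal that falls outside the select list. A minimal repro sketch (the query is an assumed illustration, not taken from the ticket):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# range(3) exposes a single column `id`, so ordinal 2 is out of range and
# analysis fails with the error this ticket renames.
spark.sql("SELECT id FROM range(3) ORDER BY 2").show()
{code}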
[jira] [Created] (SPARK-41011) Refine Sequence#checkInputDataTypes related DataTypeMismatch
Yang Jie created SPARK-41011: Summary: Refine Sequence#checkInputDataTypes related DataTypeMismatch Key: SPARK-41011 URL: https://issues.apache.org/jira/browse/SPARK-41011 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
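The ticket carries no description; as an assumed illustration, the check it refines rejects sequence() calls whose endpoints disagree on type:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sequence() requires start and stop of the same type (e.g. both integral,
# or both dates with an interval step); mixing an integer with a date
# fails Sequence#checkInputDataTypes with a data type mismatch.
spark.sql("SELECT sequence(1, DATE'2022-01-01')").show()
{code}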
[jira] [Assigned] (SPARK-40372) Migrate failures of array type checks onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40372: Assignee: Apache Spark > Migrate failures of array type checks onto error classes > > > Key: SPARK-40372 > URL: https://issues.apache.org/jira/browse/SPARK-40372 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in collection > expressions: > 1. SortArray (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035 > 2. ArrayContains (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264 > 3. ArrayPosition (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035 > 4. ElementAt (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187 > 5. Concat (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388 > 6. Flatten (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595 > 7. Sequence (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773 > 8. ArrayRemove (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447 > 9. ArrayDistinct (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40372) Migrate failures of array type checks onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40372: Assignee: (was: Apache Spark) > Migrate failures of array type checks onto error classes > > > Key: SPARK-40372 > URL: https://issues.apache.org/jira/browse/SPARK-40372 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in collection > expressions: > 1. SortArray (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035 > 2. ArrayContains (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264 > 3. ArrayPosition (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035 > 4. ElementAt (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187 > 5. Concat (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388 > 6. Flatten (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595 > 7. Sequence (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773 > 8. ArrayRemove (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447 > 9. ArrayDistinct (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
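These checks all fire at analysis time when an argument type cannot be reconciled; after the migration the failure is reported under an error class rather than a bare TypeCheckFailure. One assumed way to trigger the ArrayContains check from the list above:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The value's type (array<int>) cannot be coerced to the element type of
# array(1, 2, 3) (int), so ArrayContains#checkInputDataTypes rejects the
# call with a data type mismatch error.
spark.sql("SELECT array_contains(array(1, 2, 3), array(1))").show()
{code}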
[jira] [Assigned] (SPARK-41001) Connection string support for Python client
[ https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41001: Assignee: Martin Grund > Connection string support for Python client > --- > > Key: SPARK-41001 > URL: https://issues.apache.org/jira/browse/SPARK-41001 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41001) Connection string support for Python client
[ https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41001. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38501 [https://github.com/apache/spark/pull/38501] > Connection string support for Python client > --- > > Key: SPARK-41001 > URL: https://issues.apache.org/jira/browse/SPARK-41001 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
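A hedged sketch of what the feature looks like from user code, using the sc:// scheme of Spark Connect; the endpoint and parameters below are placeholders rather than values from this ticket:

{code:python}
from pyspark.sql import SparkSession

# builder.remote(...) accepts a connection string of the form
# sc://host:port/;key=value;... instead of a local master URL.
spark = (
    SparkSession.builder
    .remote("sc://localhost:15002/;use_ssl=false;user_id=example_user")
    .getOrCreate()
)
{code}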
[jira] [Resolved] (SPARK-40976) Upgrade sbt to 1.7.3
[ https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40976. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38502 [https://github.com/apache/spark/pull/38502] > Upgrade sbt to 1.7.3 > > > Key: SPARK-40976 > URL: https://issues.apache.org/jira/browse/SPARK-40976 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > https://github.com/sbt/sbt/releases/tag/v1.7.3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40976) Upgrade sbt to 1.7.3
[ https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40976: Assignee: Yang Jie > Upgrade sbt to 1.7.3 > > > Key: SPARK-40976 > URL: https://issues.apache.org/jira/browse/SPARK-40976 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > https://github.com/sbt/sbt/releases/tag/v1.7.3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41010) Complete Support for Except and Intersect in Python client
[ https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628559#comment-17628559 ] Apache Spark commented on SPARK-41010: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38506 > Complete Support for Except and Intersect in Python client > -- > > Key: SPARK-41010 > URL: https://issues.apache.org/jira/browse/SPARK-41010 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41010) Complete Support for Except and Intersect in Python client
[ https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628557#comment-17628557 ] Apache Spark commented on SPARK-41010: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38506 > Complete Support for Except and Intersect in Python client > -- > > Key: SPARK-41010 > URL: https://issues.apache.org/jira/browse/SPARK-41010 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41010) Complete Support for Except and Intersect in Python client
[ https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41010: Assignee: Apache Spark > Complete Support for Except and Intersect in Python client > -- > > Key: SPARK-41010 > URL: https://issues.apache.org/jira/browse/SPARK-41010 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41010) Complete Support for Except and Intersect in Python client
[ https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41010: Assignee: (was: Apache Spark) > Complete Support for Except and Intersect in Python client > -- > > Key: SPARK-41010 > URL: https://issues.apache.org/jira/browse/SPARK-41010 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
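For reference, the semantics the Connect client is matching are those of the long-standing PySpark DataFrame API:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1,), (2,), (2,), (3,)], ["id"])
df2 = spark.createDataFrame([(2,), (4,)], ["id"])

df1.intersect(df2).show()   # distinct rows present in both: 2
df1.exceptAll(df2).show()   # multiset difference, keeping duplicates: 1, 2, 3
{code}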
[jira] [Commented] (SPARK-40622) Result of a single task in collect() must fit in 2GB
[ https://issues.apache.org/jira/browse/SPARK-40622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628553#comment-17628553 ] Apache Spark commented on SPARK-40622: -- User 'liuzqt' has created a pull request for this issue: https://github.com/apache/spark/pull/38505 > Result of a single task in collect() must fit in 2GB > > > Key: SPARK-40622 > URL: https://issues.apache.org/jira/browse/SPARK-40622 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Ziqi Liu >Priority: Major > > When collecting results, data from a single partition/task is serialized > through a byte array or ByteBuffer (which is backed by a byte array as well), > therefore it's subject to the Java array max size limit (for a byte array, > it's 2GB). > > Constructing a single partition larger than 2GB and collecting it can easily > reproduce the issue > {code:java} > // create data of size ~3GB in single partition, which exceeds the byte array > limit > // random gen to make sure it's poorly compressed > val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) > as data") > withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") { > withSQLConf("spark.sql.useChunkedBuffer" -> "true") { > df.queryExecution.executedPlan.executeCollect() > } > } {code} > will get an OOM error from > [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125] > > Consider using ChunkedByteBuffer to replace the byte array in order to bypass > this limit -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40622) Result of a single task in collect() must fit in 2GB
[ https://issues.apache.org/jira/browse/SPARK-40622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628552#comment-17628552 ] Apache Spark commented on SPARK-40622: -- User 'liuzqt' has created a pull request for this issue: https://github.com/apache/spark/pull/38505 > Result of a single task in collect() must fit in 2GB > > > Key: SPARK-40622 > URL: https://issues.apache.org/jira/browse/SPARK-40622 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Ziqi Liu >Priority: Major > > When collecting results, data from a single partition/task is serialized > through a byte array or ByteBuffer (which is backed by a byte array as well), > therefore it's subject to the Java array max size limit (for a byte array, > it's 2GB). > > Constructing a single partition larger than 2GB and collecting it can easily > reproduce the issue > {code:java} > // create data of size ~3GB in single partition, which exceeds the byte array > limit > // random gen to make sure it's poorly compressed > val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) > as data") > withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") { > withSQLConf("spark.sql.useChunkedBuffer" -> "true") { > df.queryExecution.executedPlan.executeCollect() > } > } {code} > will get an OOM error from > [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125] > > Consider using ChunkedByteBuffer to replace the byte array in order to bypass > this limit -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40681) Update gson transitive dependency to 2.8.9 or later
[ https://issues.apache.org/jira/browse/SPARK-40681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628549#comment-17628549 ] Michael deLeon commented on SPARK-40681: Is there any update on when we might see this in a Spark release? > Update gson transitive dependency to 2.8.9 or later > --- > > Key: SPARK-40681 > URL: https://issues.apache.org/jira/browse/SPARK-40681 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Andrew Kyle Purtell >Priority: Minor > > Spark 3.3 currently ships with GSON 2.8.6 and this should be managed up to > 2.8.9 or later. > Versions of GSON prior to 2.8.9 are subject to > [gson#1991|https://github.com/google/gson/pull/1991], detected and reported > by several flavors of static vulnerability assessment tools, at a fairly high > score because it is a deserialization of untrusted data problem. > This issue is not meant to imply any particular security problem in Spark > itself. > {noformat} > [INFO] org.apache.spark:spark-network-common_2.12:jar:3.3.2-SNAPSHOT > [INFO] +- com.google.crypto.tink:tink:jar:1.6.1:compile > [INFO] | \- com.google.code.gson:gson:jar:2.8.6:compile > {noformat} > {noformat} > [INFO] org.apache.spark:spark-hive_2.12:jar:3.3.2-SNAPSHOT > [INFO] +- org.apache.hive:hive-exec:jar:core:2.3.9:compile > [INFO] | +- com.google.code.gson:gson:jar:2.2.4:compile > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40815) SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits
[ https://issues.apache.org/jira/browse/SPARK-40815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628545#comment-17628545 ] Apache Spark commented on SPARK-40815: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/38504 > SymlinkTextInputFormat returns incorrect result due to enabled > spark.hadoopRDD.ignoreEmptySplits > > > Key: SPARK-40815 > URL: https://issues.apache.org/jira/browse/SPARK-40815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40815) SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits
[ https://issues.apache.org/jira/browse/SPARK-40815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628546#comment-17628546 ] Apache Spark commented on SPARK-40815: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/38504 > SymlinkTextInputFormat returns incorrect result due to enabled > spark.hadoopRDD.ignoreEmptySplits > > > Key: SPARK-40815 > URL: https://issues.apache.org/jira/browse/SPARK-40815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
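A hedged workaround sketch (the real fix is in the linked PR): on affected releases, disabling the optimization named in the title should sidestep the incorrect results, at the cost of scheduling empty splits again:

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Assumed mitigation, not the PR's fix: turn off the empty-split
    # pruning that the symlink input format trips over.
    .config("spark.hadoopRDD.ignoreEmptySplits", "false")
    .getOrCreate()
)
{code}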
[jira] [Created] (SPARK-41010) Complete Support for Except and Intersect in Python client
Rui Wang created SPARK-41010: Summary: Complete Support for Except and Intersect in Python client Key: SPARK-41010 URL: https://issues.apache.org/jira/browse/SPARK-41010 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40801) Upgrade Apache Commons Text to 1.10
[ https://issues.apache.org/jira/browse/SPARK-40801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40801: - Fix Version/s: 3.2.3 > Upgrade Apache Commons Text to 1.10 > --- > > Key: SPARK-40801 > URL: https://issues.apache.org/jira/browse/SPARK-40801 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > Fix For: 3.4.0, 3.2.3, 3.3.2 > > > [CVE-2022-42889|https://nvd.nist.gov/vuln/detail/CVE-2022-42889] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators
[ https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40940: Assignee: (was: Apache Spark) > Fix the unsupported ops checker to allow chaining of stateful operators > --- > > Key: SPARK-40940 > URL: https://issues.apache.org/jira/browse/SPARK-40940 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Priority: Major > > This is a follow-up ticket to https://issues.apache.org/jira/browse/SPARK-40925 > - once we allow chaining of stateful operators in Spark SS, we need to fix > the unsupported ops checker to allow these (currently they are blocked and > require setting spark.sql.streaming.unsupportedOperationCheck to false). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators
[ https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40940: Assignee: Apache Spark > Fix the unsupported ops checker to allow chaining of stateful operators > --- > > Key: SPARK-40940 > URL: https://issues.apache.org/jira/browse/SPARK-40940 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Assignee: Apache Spark >Priority: Major > > This is a follow-up ticket to https://issues.apache.org/jira/browse/SPARK-40925 > - once we allow chaining of stateful operators in Spark SS, we need to fix > the unsupported ops checker to allow these (currently they are blocked and > require setting spark.sql.streaming.unsupportedOperationCheck to false). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators
[ https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628472#comment-17628472 ] Wei Liu commented on SPARK-40940: - PR in: https://github.com/apache/spark/pull/38503 > Fix the unsupported ops checker to allow chaining of stateful operators > --- > > Key: SPARK-40940 > URL: https://issues.apache.org/jira/browse/SPARK-40940 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Priority: Major > > This is a follow-up ticket to https://issues.apache.org/jira/browse/SPARK-40925 > - once we allow chaining of stateful operators in Spark SS, we need to fix > the unsupported ops checker to allow these (currently they are blocked and > require setting spark.sql.streaming.unsupportedOperationCheck to false). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators
[ https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628474#comment-17628474 ] Apache Spark commented on SPARK-40940: -- User 'WweiL' has created a pull request for this issue: https://github.com/apache/spark/pull/38503 > Fix the unsupported ops checker to allow chaining of stateful operators > --- > > Key: SPARK-40940 > URL: https://issues.apache.org/jira/browse/SPARK-40940 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Priority: Major > > This is a follow-up ticket to https://issues.apache.org/jira/browse/SPARK-40925 > - once we allow chaining of stateful operators in Spark SS, we need to fix > the unsupported ops checker to allow these (currently they are blocked and > require setting spark.sql.streaming.unsupportedOperationCheck to false). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
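The escape hatch the description refers to looks like the sketch below; it disables every unsupported-operation check, which is exactly why the ticket wants the checker taught about valid stateful chains instead:

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Blunt instrument: silences all unsupported-operation checks for
    # streaming queries, not just the stateful-chaining one.
    .config("spark.sql.streaming.unsupportedOperationCheck", "false")
    .getOrCreate()
)
{code}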
[jira] [Resolved] (SPARK-40869) KubernetesConf.getResourceNamePrefix creates invalid name prefixes
[ https://issues.apache.org/jira/browse/SPARK-40869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40869. --- Fix Version/s: 3.3.2 3.2.3 3.4.0 Resolution: Fixed Issue resolved by pull request 38331 [https://github.com/apache/spark/pull/38331] > KubernetesConf.getResourceNamePrefix creates invalid name prefixes > -- > > Key: SPARK-40869 > URL: https://issues.apache.org/jira/browse/SPARK-40869 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Tobias Stadler >Assignee: Tobias Stadler >Priority: Major > Fix For: 3.3.2, 3.2.3, 3.4.0 > > > If `KubernetesConf.getResourceNamePrefix` is called with e.g. `_name_`, it > generates an invalid name prefix, e.g. `-name-0123456789abcdef`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40869) KubernetesConf.getResourceNamePrefix creates invalid name prefixes
[ https://issues.apache.org/jira/browse/SPARK-40869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-40869: - Assignee: Tobias Stadler > KubernetesConf.getResourceNamePrefix creates invalid name prefixes > -- > > Key: SPARK-40869 > URL: https://issues.apache.org/jira/browse/SPARK-40869 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Tobias Stadler >Assignee: Tobias Stadler >Priority: Major > > If `KubernetesConf.getResourceNamePrefix` is called with e.g. `_name_`, it > generates an invalid name prefix, e.g. `-name-0123456789abcdef`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
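Kubernetes resource names must be DNS-1123-style labels: lowercase alphanumerics and '-', starting and ending with an alphanumeric character. An illustrative sanitizer (not Spark's actual implementation) showing why `_name_` needs more than an underscore-to-dash substitution:

{code:python}
import re

def sanitize_resource_name_prefix(app_name: str) -> str:
    # Map disallowed characters to '-', then trim leading/trailing '-'
    # so the prefix starts and ends with an alphanumeric character.
    prefix = re.sub(r"[^a-z0-9-]", "-", app_name.lower())
    return prefix.strip("-") or "spark"

print(sanitize_resource_name_prefix("_name_"))  # "name", not "-name-"
{code}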
[jira] [Commented] (SPARK-40976) Upgrade sbt to 1.7.3
[ https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628438#comment-17628438 ] Apache Spark commented on SPARK-40976: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38502 > Upgrade sbt to 1.7.3 > > > Key: SPARK-40976 > URL: https://issues.apache.org/jira/browse/SPARK-40976 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://github.com/sbt/sbt/releases/tag/v1.7.3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41001) Connection string support for Python client
[ https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628429#comment-17628429 ] Apache Spark commented on SPARK-41001: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/38501 > Connection string support for Python client > --- > > Key: SPARK-41001 > URL: https://issues.apache.org/jira/browse/SPARK-41001 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41002) Compatible `take`, `head` and `first` API in Python client
[ https://issues.apache.org/jira/browse/SPARK-41002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-41002: - Summary: Compatible `take`, `head` and `first` API in Python client (was: Compatible `take` and `head` API in Python client ) > Compatible `take`, `head` and `first` API in Python client > --- > > Key: SPARK-41002 > URL: https://issues.apache.org/jira/browse/SPARK-41002 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
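For reference, the behavior being made compatible, in the classic PySpark API: take(n) and head(n) return lists of Rows, while head() and first() return a single Row.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)

assert df.take(2) == df.head(2)  # both return [Row(id=0), Row(id=1)]
assert df.first() == df.head()   # both return the single Row(id=0)
{code}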
[jira] [Commented] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
[ https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628415#comment-17628415 ] Apache Spark commented on SPARK-41009: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/38490 > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 > --- > > Key: SPARK-41009 > URL: https://issues.apache.org/jira/browse/SPARK-41009 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
[ https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41009: Assignee: Apache Spark (was: Max Gekk) > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 > --- > > Key: SPARK-41009 > URL: https://issues.apache.org/jira/browse/SPARK-41009 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
[ https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628414#comment-17628414 ] Apache Spark commented on SPARK-41009: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/38490 > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 > --- > > Key: SPARK-41009 > URL: https://issues.apache.org/jira/browse/SPARK-41009 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
[ https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41009: Assignee: Max Gekk (was: Apache Spark) > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 > --- > > Key: SPARK-41009 > URL: https://issues.apache.org/jira/browse/SPARK-41009 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
Max Gekk created SPARK-41009: Summary: Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 Key: SPARK-41009 URL: https://issues.apache.org/jira/browse/SPARK-41009 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Fix For: 3.4.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame({ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similarly unexpected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations disappears. {code} was: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame({ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similarly unexpected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations disappears. {code} > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > > {code:python} > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regression import
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame({ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similarly unexpected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations disappears. {code} was: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similarly unexpected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations disappears. # {code} > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > > {code:python} > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similarly unexpected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations disappears. # {code} was: ``` import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similarly unexpected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations disappears. # ``` > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > {code:python} > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: ``` import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similarly unexpected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations disappears. # ``` was: {{```}} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similarly unexpected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations disappears. # {{```}} > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > ``` > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regression import
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: {{```}} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( { "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations disappears. # {{```}} was: import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( { "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations disappears. > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > > {{```}} > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regression import IsotonicRegression as
[jira] [Created] (SPARK-41008) Isotonic regression result differs from sklearn implementation
Arne Koopman created SPARK-41008: Summary: Isotonic regression result differs from sklearn implementation Key: SPARK-41008 URL: https://issues.apache.org/jira/browse/SPARK-41008 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 3.3.1 Reporter: Arne Koopman

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType
from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark

# The P(positives | model_score):
# 0.6 -> 0.5 (1 out of the 2 labels is positive)
# 0.333 -> 0.333 (1 out of the 3 labels is positive)
# 0.20 -> 0.25 (1 out of the 4 labels is positive)
tc_pd = pd.DataFrame(
    {
        "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
        "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
        "weight": 1,
    }
)

# The fraction of positives for each of the distinct model_scores would be the best fit,
# resulting in the following expected calibrated model_scores:
# "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]

# The sklearn implementation of Isotonic Regression.
tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
# >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]

# The pyspark implementation of Isotonic Regression.
tc_df = spark.createDataFrame(tc_pd)
tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))
isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight')
tc_model = isotonic_regressor_pyspark.fit(tc_df)
tc_pd = tc_model.transform(tc_df).toPandas()
print("pyspark:", tc_pd['prediction'].values)
# >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]

# The result from the pyspark implementation does not match this expectation.
# Similar small toy examples lead to similarly unexpected results from the pyspark implementation.
# Strangely enough, for 'large' datasets, the difference between the calibrated model_scores generated by the two implementations disappears.

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
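For readers who want to check the JVM side of this report directly, here is a minimal Scala sketch of the same toy case. It assumes a spark-shell session (so `spark` is predefined); the expected and observed outputs in the comments are taken from the report above, not re-verified here.

{code:scala}
// Minimal spark-shell sketch of the toy case from SPARK-41008.
// Spark ML's IsotonicRegression accepts a DoubleType features column,
// so no vector assembly is needed.
import org.apache.spark.ml.regression.IsotonicRegression

val tc = spark.createDataFrame(Seq(
  (1.0, 0.6, 1.0), (0.0, 0.6, 1.0),
  (0.0, 0.333, 1.0), (1.0, 0.333, 1.0), (0.0, 0.333, 1.0),
  (1.0, 0.20, 1.0), (0.0, 0.20, 1.0), (0.0, 0.20, 1.0), (0.0, 0.20, 1.0)
)).toDF("label", "model_score", "weight")

val model = new IsotonicRegression()
  .setFeaturesCol("model_score")
  .setLabelCol("label")
  .setWeightCol("weight")
  .fit(tc)

// Expected isotonic fit per distinct score: 0.5, 0.333..., 0.25.
// The report observes 0.5 for the 0.6 bucket and 0.0 everywhere else.
model.transform(tc).select("model_score", "prediction").distinct().show()
{code}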
[jira] [Assigned] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41007: Assignee: Apache Spark > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Assignee: Apache Spark >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}}, the > dataset will fail to serialize correctly. When trying to serialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41007: Assignee: (was: Apache Spark) > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}}, the > dataset will fail to serialize correctly. When trying to serialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628384#comment-17628384 ] Apache Spark commented on SPARK-41007: -- User 'dfit99' has created a pull request for this issue: https://github.com/apache/spark/pull/38500 > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}}, the > dataset will fail to serialize correctly. When trying to serialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Fiterma updated SPARK-41007: --- Description: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}}, the dataset will fail to serialize correctly. When trying to serialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function was: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}}, the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}}, the > dataset will fail to serialize correctly. When trying to serialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628372#comment-17628372 ] Daniel Fiterma commented on SPARK-41007: FYI: Have a fix for this already, going to push out a merge request soon. > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}}, the > dataset will fail to serialize correctly. When trying to deserialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Fiterma updated SPARK-41007: --- Description: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}}, the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function was: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}}, the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}}, the > dataset will fail to serialize correctly. When trying to deserialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Fiterma updated SPARK-41007: --- Description: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}}, the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function was: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}}, the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function # > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}}, the > dataset will fail to serialize correctly. When trying to deserialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
Daniel Fiterma created SPARK-41007: -- Summary: BigInteger Serialization doesn't work with JavaBean Encoder Key: SPARK-41007 URL: https://issues.apache.org/jira/browse/SPARK-41007 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 3.3.1 Reporter: Daniel Fiterma When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}}, the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function # -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
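To make the reproduction concrete, a minimal Scala sketch follows; the bean class and field names are illustrative, and an existing SparkSession named `spark` (for example, in spark-shell) is assumed. Per the report, the createDataset call fails at analysis time rather than producing a usable dataset.

{code:scala}
import java.math.BigInteger
import org.apache.spark.sql.Encoders

// A JavaBean-style class with a java.math.BigInteger field
// (class and field names are illustrative, not from the report).
class BigIntBean extends Serializable {
  private var bigInteger: BigInteger = BigInteger.ZERO
  def getBigInteger: BigInteger = bigInteger
  def setBigInteger(v: BigInteger): Unit = { bigInteger = v }
}

val bean = new BigIntBean
bean.setBigInteger(new BigInteger("12345678901234567890"))

// Per the report, this throws:
//   org.apache.spark.sql.AnalysisException:
//   Cannot up cast `bigInteger` from struct<> to decimal(38,18).
val ds = spark.createDataset(Seq(bean))(Encoders.bean(classOf[BigIntBean]))
ds.show()
{code}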
[jira] [Updated] (SPARK-40996) Upgrade `sbt-checkstyle-plugin` to 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-40996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40996: - Priority: Minor (was: Major) > Upgrade `sbt-checkstyle-plugin` to 4.0.0 > > > Key: SPARK-40996 > URL: https://issues.apache.org/jira/browse/SPARK-40996 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > This is a precondition for upgrading sbt to 1.7.3 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40996) Upgrade `sbt-checkstyle-plugin` to 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-40996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40996. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38481 [https://github.com/apache/spark/pull/38481] > Upgrade `sbt-checkstyle-plugin` to 4.0.0 > > > Key: SPARK-40996 > URL: https://issues.apache.org/jira/browse/SPARK-40996 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > > This is a precondition for upgrading sbt to 1.7.3 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40996) Upgrade `sbt-checkstyle-plugin` to 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-40996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-40996: Assignee: Yang Jie > Upgrade `sbt-checkstyle-plugin` to 4.0.0 > > > Key: SPARK-40996 > URL: https://issues.apache.org/jira/browse/SPARK-40996 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > This is a precondition for upgrading sbt to 1.7.3 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
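For reference, the upgrade itself amounts to a one-line change in project/plugins.sbt. The sketch below shows only the shape of that change; the 4.x group id is an assumption (the plugin moved off the original com.etsy coordinates) and should be checked against the merged PR rather than copied blindly.

{code:scala}
// project/plugins.sbt (sketch; both sets of coordinates below are assumed,
// not copied from the merged PR)
// before (approximate): addSbtPlugin("com.etsy" % "sbt-checkstyle-plugin" % "3.x")
addSbtPlugin("software.purpledragon" % "sbt-checkstyle-plugin" % "4.0.0")
{code}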
[jira] [Assigned] (SPARK-40834) Use SparkListenerSQLExecutionEnd to track final SQL status in UI
[ https://issues.apache.org/jira/browse/SPARK-40834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40834: --- Assignee: XiDuo You > Use SparkListenerSQLExecutionEnd to track final SQL status in UI > > > Key: SPARK-40834 > URL: https://issues.apache.org/jira/browse/SPARK-40834 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > > A SQL query may succeed even though some of its jobs failed. For example, with an inner join that has one > empty side and one large side, the plan would finish while the large side is > still running. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40834) Use SparkListenerSQLExecutionEnd to track final SQL status in UI
[ https://issues.apache.org/jira/browse/SPARK-40834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40834. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38302 [https://github.com/apache/spark/pull/38302] > Use SparkListenerSQLExecutionEnd to track final SQL status in UI > > > Key: SPARK-40834 > URL: https://issues.apache.org/jira/browse/SPARK-40834 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > Fix For: 3.4.0 > > > A SQL query may succeed even though some of its jobs failed. For example, with an inner join that has one > empty side and one large side, the plan would finish while the large side is > still running. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
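As background, SparkListenerSQLExecutionEnd is posted once per SQL execution, which is why it is a better anchor for the final UI status than per-job outcomes: as the example above notes, a query can complete successfully while a no-longer-needed job fails or is still running. A minimal listener sketch follows; it is illustrative only (not the UI code itself) and assumes an existing `spark` session, for example in spark-shell.

{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionEnd

// Observes the end of each SQL execution, the event this ticket keys the
// final status in the UI off of.
class SqlExecutionEndListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLExecutionEnd =>
      println(s"SQL execution ${e.executionId} ended at t=${e.time}")
    case _ => // ignore other events
  }
}

spark.sparkContext.addSparkListener(new SqlExecutionEndListener)
{code}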
[jira] [Updated] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace
[ https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric updated SPARK-41006: - Description: If we use the Spark Launcher to launch our spark apps in k8s: {code:java} val sparkLauncher = new InProcessLauncher() .setMaster(k8sMaster) .setDeployMode(deployMode) .setAppName(appName) .setVerbose(true) sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code} We have an issue when we launch another spark driver in the same namespace where other spark app was running: {code:java} kp -n audit-exporter-eee5073aac -w NAME READY STATUS RESTARTS AGE audit-exporter-71489e843d8085c0-driver 1/1 Running 0 9m54s audit-exporter-7e6b8b843d80b9e6-exec-1 1/1 Running 0 9m40s data-io-120204843d899567-driver 0/1 Terminating 0 1s data-io-120204843d899567-driver 0/1 Terminating 0 2s data-io-120204843d899567-driver 0/1 Terminating 0 3s data-io-120204843d899567-driver 0/1 Terminating 0 3s{code} The error is: {code:java} {"time":"2022-11-03T12:49:45.626Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-38: 'data-io'","msg":"Application failed with exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kubernetes.default/api/v1/namespaces/audit-exporter-eee5073aac/configmaps/spark-drv-d19c37843d80350c-conf-map. Message: ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field is immutable when `immutable` is set, reason=FieldValueForbidden, additionalProperties={})], group=null, kind=ConfigMap, name=spark-drv-d19c37843d80350c-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5360/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$4618/00.apply(Unknown Source)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5012/00.apply(Unknown Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source)\n\tat java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Unknown
[jira] [Updated] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace
[ https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric updated SPARK-41006: - Description: If we use the Spark Launcher to launch our spark apps in k8s: {code:java} val sparkLauncher = new InProcessLauncher() .setMaster(k8sMaster) .setDeployMode(deployMode) .setAppName(appName) .setVerbose(true) sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code} We have an issue when we launch another spark driver in the same namespace where other spark app was running: {code:java} kp -n audit-exporter-eee5073aac -w NAME READY STATUS RESTARTS AGE audit-exporter-71489e843d8085c0-driver 1/1 Running 0 9m54s audit-exporter-7e6b8b843d80b9e6-exec-1 1/1 Running 0 9m40s data-io-120204843d899567-driver 0/1 Terminating 0 1s data-io-120204843d899567-driver 0/1 Terminating 0 2s data-io-120204843d899567-driver 0/1 Terminating 0 3s data-io-120204843d899567-driver 0/1 Terminating 0 3s{code} The error is: {code:java} {"time":"2022-11-03T12:49:45.626Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-38: 'data-io'","msg":"Application failed with exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kubernetes.default/api/v1/namespaces/audit-exporter-eee5073aac/configmaps/spark-drv-d19c37843d80350c-conf-map. Message: ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field is immutable when `immutable` is set, reason=FieldValueForbidden, additionalProperties={})], group=null, kind=ConfigMap, name=spark-drv-d19c37843d80350c-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5360/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$4618/00.apply(Unknown Source)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5012/00.apply(Unknown Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source)\n\tat java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Unknown
[jira] [Resolved] (SPARK-27339) Decimal up cast to higher scale fails while reading parquet to Dataset
[ https://issues.apache.org/jira/browse/SPARK-27339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-27339. -- Resolution: Duplicate I can't reproduce this in the latest Spark, and think it might have been resolved by https://issues.apache.org/jira/browse/SPARK-31750 > Decimal up cast to higher scale fails while reading parquet to Dataset > -- > > Key: SPARK-27339 > URL: https://issues.apache.org/jira/browse/SPARK-27339 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.4.0 >Reporter: Bill Schneider >Priority: Major > > Given a parquet file with a decimal (38,4) field. One can read it into a > dataframe but fails to read/cast it to a dataset using a case class with > BigDecimal field. > {code:java} > import org.apache.spark.sql.{SaveMode, SparkSession} > object ReproduceSparkDecimalBug extends App{ > case class SimpleDecimal(value: BigDecimal) > val path = "/tmp/sparkTest" > val spark = SparkSession.builder().master("local").getOrCreate() > import spark.implicits._ > spark > .sql("SELECT CAST(10.12345 AS DECIMAL(38,4)) AS value ") > .write > .mode(SaveMode.Overwrite) > .parquet(path) > // works fine and the dataframe will have a decimal(38,4) > val df = spark.read.parquet(path) > df.printSchema() > df.show(1) > // will fail -> org.apache.spark.sql.AnalysisException: Cannot up cast > `value` from decimal(38,4) to decimal(38,18) as it may truncate > // 1. Why Spark sees scala BigDecimal as fixed (38,18)? > // 2. Up casting to higher scale should be allowed anyway > val ds = df.as[SimpleDecimal] > ds.printSchema() > spark.close() > } > {code} > {code:java} > org.apache.spark.sql.AnalysisException: Cannot up cast `value` from > decimal(38,4) to decimal(38,18) as it may truncate > The type path of the target object is: > - field (class: "scala.math.BigDecimal", name: "value") > - root class: "ReproduceSparkDecimalBug.SimpleDecimal" > You can either add an explicit cast to the input data or choose a higher > precision type of the field in the target object; > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2366) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$35$$anonfun$applyOrElse$15.applyOrElse(Analyzer.scala:2382) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$35$$anonfun$applyOrElse$15.applyOrElse(Analyzer.scala:2377) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:335) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at >
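For anyone still on an affected version, the error text quoted above already names the workaround: add an explicit cast to the input data before converting to the typed Dataset. A sketch, reusing the path and schema from the reproduction above (the case class must be defined at top level so the product encoder can be derived):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

case class SimpleDecimal(value: BigDecimal)

object Workaround extends App {
  val spark = SparkSession.builder().master("local").getOrCreate()
  import spark.implicits._

  // Cast the stored decimal(38,4) up to the decimal(38,18) that Spark maps
  // scala.math.BigDecimal to, then convert to the typed Dataset.
  val ds = spark.read.parquet("/tmp/sparkTest")
    .withColumn("value", col("value").cast("decimal(38,18)"))
    .as[SimpleDecimal]
  ds.show()
  spark.close()
}
{code}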
[jira] [Updated] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace
[ https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric updated SPARK-41006: - Description: If we use the Spark Launcher to launch our spark apps in k8s: {code:java} val sparkLauncher = new InProcessLauncher() .setMaster(k8sMaster) .setDeployMode(deployMode) .setAppName(appName) .setVerbose(true) sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code} We have an issue when we launch another spark driver in the same namespace where other spark app was running: {code:java} kp -n qa-topfive-python-spark-2-15d42ac3b9 NAME READY STATUS RESTARTS AGE data-io-c590a7843d47e206-driver 1/1 Terminating 0 2s qa-top-five-python-1667475391655-exec-1 1/1 Running 0 94s qa-topfive-python-spark-2-462c5d843d46e38b-driver 1/1 Running 0 119s {code} The error is: {code:java} {"time":"2022-10-24T15:08:50.239Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-44: 'data-io'","msg":"Application failed with exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kubernetes.default/api/v1/namespaces/qa-topfive-python-spark-2-edf723f942/configmaps/spark-drv-34c4e3840a0466c2-conf-map. Message: ConfigMap \"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field is immutable when `immutable` is set, reason=FieldValueForbidden, additionalProperties={})], group=null, kind=ConfigMap, name=spark-drv-34c4e3840a0466c2-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap \"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5663/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$5183/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5578/00.apply(Unknown Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source)\n\tat java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Unknown Source)\n\tat java.base/java.util.stream.AbstractPipeline.copyInto(Unknown Source)\n\tat java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(Unknown Source)\n\tat
[jira] [Created] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace
Eric created SPARK-41006: Summary: ConfigMap has the same name when launching two pods on the same namespace Key: SPARK-41006 URL: https://issues.apache.org/jira/browse/SPARK-41006 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.3.0, 3.2.0, 3.1.0 Reporter: Eric If we use the Spark Launcher to launch our spark apps in k8s: {code:java} val sparkLauncher = new InProcessLauncher() .setMaster(k8sMaster) .setDeployMode(deployMode) .setAppName(appName) .setVerbose(true) sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code} We have an issue when we launch another spark driver in the same namespace where other spark app was running: {code:java} kp -n qa-topfive-python-spark-2-15d42ac3b9 NAME READY STATUS RESTARTS AGE data-io-c590a7843d47e206-driver 1/1 Terminating 0 2s qa-top-five-python-1667475391655-exec-1 1/1 Running 0 94s qa-topfive-python-spark-2-462c5d843d46e38b-driver 1/1 Running 0 119s {code} The error is: {code:java} {"time":"2022-10-24T15:08:50.239Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-44: 'data-io'","msg":"Application failed with exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kubernetes.default/api/v1/namespaces/qa-topfive-python-spark-2-edf723f942/configmaps/spark-drv-34c4e3840a0466c2-conf-map. Message: ConfigMap \"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field is immutable when `immutable` is set, reason=FieldValueForbidden, additionalProperties={})], group=null, kind=ConfigMap, name=spark-drv-34c4e3840a0466c2-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap \"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5663/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$5183/00.apply(Unknown Source)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5578/00.apply(Unknown Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source)\n\tat
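The launcher snippet in the report is truncated; the sketch below fills it out only to show the call path (the placeholder values and the listener body are illustrative, not taken from the report). Each startApplication call submits a driver whose spark-drv-...-conf-map ConfigMap is created in the target namespace, and per the report two concurrent in-process launches can end up PUTting the same immutable ConfigMap name, producing the failure above.

{code:scala}
import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

// Placeholder values standing in for those used by the report.
val k8sMaster = "k8s://https://kubernetes.default"
val deployMode = "cluster"
val appName = "data-io"

val handle = new InProcessLauncher()
  .setMaster(k8sMaster)
  .setDeployMode(deployMode)
  .setAppName(appName)
  .setVerbose(true)
  .startApplication(new SparkAppHandle.Listener {
    override def stateChanged(h: SparkAppHandle): Unit =
      println(s"app ${h.getAppId}: ${h.getState}")
    override def infoChanged(h: SparkAppHandle): Unit = ()
  })
{code}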
[jira] [Assigned] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40769: Assignee: Apache Spark > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628265#comment-17628265 ] Apache Spark commented on SPARK-40769: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38498 > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628264#comment-17628264 ] Apache Spark commented on SPARK-40769: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38498 > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40769: Assignee: Apache Spark > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40769: Assignee: (was: Apache Spark) > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
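A minimal sketch of the shape of this migration, assuming Spark 3.4's TypeCheckResult.DataTypeMismatch case class; the error subclass name and parameter key below are illustrative placeholders, not the exact values used in the Spark source:

{code:scala}
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{DataTypeMismatch, TypeCheckFailure}
import org.apache.spark.sql.types.{BooleanType, DataType}

// Before: a free-form failure string that the error-class framework cannot index.
def checkLegacy(dt: DataType): TypeCheckResult =
  if (dt == BooleanType) TypeCheckFailure("boolean input is not supported")
  else TypeCheckResult.TypeCheckSuccess

// After: a structured mismatch keyed by an error subclass, so the message is
// rendered from the error-class definitions and can be asserted on in tests.
def checkMigrated(dt: DataType): TypeCheckResult =
  if (dt == BooleanType) {
    DataTypeMismatch(
      errorSubClass = "UNSUPPORTED_INPUT_TYPE",        // illustrative subclass
      messageParameters = Map("inputType" -> dt.sql))  // illustrative key
  } else {
    TypeCheckResult.TypeCheckSuccess
  }
{code}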
[jira] [Commented] (SPARK-41005) Arrow based collect
[ https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628261#comment-17628261 ] Apache Spark commented on SPARK-41005: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38468 > Arrow based collect > --- > > Key: SPARK-41005 > URL: https://issues.apache.org/jira/browse/SPARK-41005 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41005) Arrow based collect
[ https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41005: Assignee: Apache Spark > Arrow based collect > --- > > Key: SPARK-41005 > URL: https://issues.apache.org/jira/browse/SPARK-41005 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41005) Arrow based collect
[ https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41005: Assignee: (was: Apache Spark) > Arrow based collect > --- > > Key: SPARK-41005 > URL: https://issues.apache.org/jira/browse/SPARK-41005 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41005) Arrow based collect
Ruifeng Zheng created SPARK-41005: - Summary: Arrow based collect Key: SPARK-41005 URL: https://issues.apache.org/jira/browse/SPARK-41005 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628248#comment-17628248 ] Apache Spark commented on SPARK-40999: -- User 'fred-db' has created a pull request for this issue: https://github.com/apache/spark/pull/38497 > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > > Currently, if a user tries to specify a query like the following, the hints > on the subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens as hints are removed from the plan and pulled into joins in the > beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > To resolve this, we add a hint field to SubqueryExpression that any hints > inside a subquery's plan can be pulled into during EliminateResolvedHint, and > then pass this hint on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40999: Assignee: (was: Apache Spark) > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > > Currently, if a user tries to specify a query like the following, the hints > on the subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens as hints are removed from the plan and pulled into joins in the > beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > To resolve this, we add a hint field to SubqueryExpression that any hints > inside a subquery's plan can be pulled into during EliminateResolvedHint, and > then pass this hint on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40999: Assignee: Apache Spark > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Assignee: Apache Spark >Priority: Major > > Currently, if a user tries to specify a query like the following, the hints > on the subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens as hints are removed from the plan and pulled into joins in the > beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > To resolve this, we add a hint field to SubqueryExpression that any hints > inside a subquery's plan can be pulled into during EliminateResolvedHint, and > then pass this hint on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628245#comment-17628245 ] Apache Spark commented on SPARK-40999: -- User 'fred-db' has created a pull request for this issue: https://github.com/apache/spark/pull/38497 > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > > Currently, if a user tries to specify a query like the following, the hints > on the subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens as hints are removed from the plan and pulled into joins in the > beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > To resolve this, we add a hint field to SubqueryExpression that any hints > inside a subquery's plan can be pulled into during EliminateResolvedHint, and > then pass this hint on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
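The mechanics of the fix described above are easier to see in a stripped-down model. The following toy is self-contained Scala with stand-in types, not Spark's actual SubqueryExpression or Join classes:

{code:scala}
// Stand-in for Spark's hint metadata.
sealed trait HintInfo
case object Broadcast extends HintInfo

// The fix adds a hint slot to the subquery expression, giving
// EliminateResolvedHint a place to park hints found inside the subquery's
// plan before the optimizer strips them.
final case class ExistsSubquery(plan: String, hint: Option[HintInfo] = None)

final case class SemiJoin(left: String, right: String, hint: Option[HintInfo])

// When the subquery is later rewritten into a join, the parked hint is
// forwarded instead of being silently dropped.
def rewriteToJoin(outerPlan: String, sub: ExistsSubquery): SemiJoin =
  SemiJoin(outerPlan, sub.plan, sub.hint)

// rewriteToJoin("target", ExistsSubquery("source", Some(Broadcast)))
// => SemiJoin("target", "source", Some(Broadcast))
{code}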
[jira] [Commented] (SPARK-40819) Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of automatically converting to LongType
[ https://issues.apache.org/jira/browse/SPARK-40819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628239#comment-17628239 ] Nikhil Sharma commented on SPARK-40819: --- > Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type > instead of automatically converting to LongType > > > Key: SPARK-40819 > URL: https://issues.apache.org/jira/browse/SPARK-40819 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1, 3.2.3, 3.3.2 >Reporter: Alfred Davidson >Priority: Critical > > Since 3.2 parquet files containing attributes with type "INT64 > (TIMESTAMP(NANOS, true))" are no longer readable and attempting to read > throws: > > {code:java} > Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: > INT64 (TIMESTAMP(NANOS,true)) > at > org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:105) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:174) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:90) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:72) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:66) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:63) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:548) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:548) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:528) > at scala.collection.immutable.Stream.map(Stream.scala:418) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:528) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:521) > at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76) > {code} >
Prior to 3.2, Spark read the same parquet successfully, automatically converting the column to > LongType. > I believe work that was part of https://issues.apache.org/jira/browse/SPARK-34661 > introduced the change in behaviour, more specifically here: > [https://github.com/apache/spark/pull/31776/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R154] > which throws QueryCompilationErrors.illegalParquetTypeError -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
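A repro sketch for the report above, assuming a spark-shell session (where spark is predefined) and an input file whose column was written with Parquet's TIMESTAMP(NANOS, true) logical type; Spark does not write that type itself, so such files typically come from other writers such as Arrow or parquet-mr:

{code:scala}
// On 3.2+ this throws AnalysisException:
// "Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))".
val df = spark.read.parquet("/path/to/nanos_timestamps.parquet")

// On versions before 3.2 the same read succeeded, and the column surfaced as
// LongType holding raw nanoseconds since the epoch.
{code}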
[jira] [Commented] (SPARK-40708) Auto update table statistics based on write metrics
[ https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628234#comment-17628234 ] Apache Spark commented on SPARK-40708: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/38496 > Auto update table statistics based on write metrics > --- > > Key: SPARK-40708 > URL: https://issues.apache.org/jira/browse/SPARK-40708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > // Get write statistics > def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): > Option[WriteStats] = { > val numBytes = > metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_)) > val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_)) > numBytes.map(WriteStats(mode, _, numRows)) > } > // Update table statistics > val stat = wroteStats.get > stat.mode match { > case SaveMode.Overwrite | SaveMode.ErrorIfExists => > catalog.alterTableStats(table.identifier, > Some(CatalogStatistics(stat.numBytes, stat.numRows))) > case _ if table.stats.nonEmpty => // SaveMode.Append > catalog.alterTableStats(table.identifier, None) > case _ => // SaveMode.Ignore Do nothing > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40708) Auto update table statistics based on write metrics
[ https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40708: Assignee: (was: Apache Spark) > Auto update table statistics based on write metrics > --- > > Key: SPARK-40708 > URL: https://issues.apache.org/jira/browse/SPARK-40708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > // Get write statistics > def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): > Option[WriteStats] = { > val numBytes = > metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_)) > val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_)) > numBytes.map(WriteStats(mode, _, numRows)) > } > // Update table statistics > val stat = wroteStats.get > stat.mode match { > case SaveMode.Overwrite | SaveMode.ErrorIfExists => > catalog.alterTableStats(table.identifier, > Some(CatalogStatistics(stat.numBytes, stat.numRows))) > case _ if table.stats.nonEmpty => // SaveMode.Append > catalog.alterTableStats(table.identifier, None) > case _ => // SaveMode.Ignore Do nothing > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40708) Auto update table statistics based on write metrics
[ https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40708: Assignee: Apache Spark > Auto update table statistics based on write metrics > --- > > Key: SPARK-40708 > URL: https://issues.apache.org/jira/browse/SPARK-40708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > {code:scala} > // Get write statistics > def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): > Option[WriteStats] = { > val numBytes = > metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_)) > val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_)) > numBytes.map(WriteStats(mode, _, numRows)) > } > // Update table statistics > val stat = wroteStats.get > stat.mode match { > case SaveMode.Overwrite | SaveMode.ErrorIfExists => > catalog.alterTableStats(table.identifier, > Some(CatalogStatistics(stat.numBytes, stat.numRows))) > case _ if table.stats.nonEmpty => // SaveMode.Append > catalog.alterTableStats(table.identifier, None) > case _ => // SaveMode.Ignore Do nothing > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
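A hedged illustration of the intended effect of the snippet quoted above (the proposed behavior, not something current releases are guaranteed to do), again assuming a spark-shell session:

{code:scala}
// Under the proposal, an overwrite would refresh the table's statistics
// directly from the write's NUM_OUTPUT_BYTES / NUM_OUTPUT_ROWS metrics,
// with no explicit ANALYZE TABLE step.
spark.range(1000).write.mode("overwrite").saveAsTable("t")

// The Statistics entry (sizeInBytes, rowCount) in this output would then be
// populated immediately after the write.
spark.sql("DESCRIBE TABLE EXTENDED t").show(truncate = false)
{code}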
[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628225#comment-17628225 ] Apache Spark commented on SPARK-35531: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/38495 > Can not insert into hive bucket table if create table with upper case schema > > > Key: SPARK-35531 > URL: https://issues.apache.org/jira/browse/SPARK-35531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.1, 3.2.0 >Reporter: Hongyi Zhang >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0, 3.1.4 > > > > > create table TEST1( > V1 BIGINT, > S1 INT) > partitioned by (PK BIGINT) > clustered by (V1) > sorted by (S1) > into 200 buckets > STORED AS PARQUET; > > insert into test1 > select > * from values(1,1,1); > > > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
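An untested workaround sketch for affected versions, inferred from the error above: declare the bucket and sort columns in lower case so they match the lower-cased FieldSchema names that Hive stores.

{code:scala}
spark.sql("""
  CREATE TABLE test1 (v1 BIGINT, s1 INT)
  PARTITIONED BY (pk BIGINT)
  CLUSTERED BY (v1) SORTED BY (s1) INTO 200 BUCKETS
  STORED AS PARQUET
""")

spark.sql("INSERT INTO test1 SELECT * FROM VALUES (1, 1, 1)")
{code}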
[jira] [Commented] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628220#comment-17628220 ] Apache Spark commented on SPARK-41004: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/38494 > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41004: Assignee: Apache Spark > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41004: Assignee: (was: Apache Spark) > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41004) Check error classes in InterceptorRegistrySuite
BingKun Pan created SPARK-41004: --- Summary: Check error classes in InterceptorRegistrySuite Key: SPARK-41004 URL: https://issues.apache.org/jira/browse/SPARK-41004 Project: Spark Issue Type: Sub-task Components: Connect, Tests Affects Versions: 3.4.0 Reporter: BingKun Pan - CONNECT.INTERCEPTOR_CTOR_MISSING - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
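A sketch of the kind of assertion this ticket asks for, assuming the checkError helper from Spark's shared test base (so this would live inside a suite extending SparkFunSuite); the intercepted call and the parameter key are hypothetical placeholders for whatever InterceptorRegistrySuite actually exercises:

{code:scala}
import org.apache.spark.SparkException

// Hypothetical: trigger interceptor construction for a class that lacks the
// expected constructor, then assert on the error class rather than on the
// raw message text.
val e = intercept[SparkException] {
  createConfiguredInterceptors(conf)  // placeholder for the suite's real call
}
checkError(
  exception = e,
  errorClass = "CONNECT.INTERCEPTOR_CTOR_MISSING",
  parameters = Map("cls" -> "com.example.NoCtorInterceptor"))
{code}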
[jira] [Commented] (SPARK-38270) SQL CLI AM should keep same exitcode with client
[ https://issues.apache.org/jira/browse/SPARK-38270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628179#comment-17628179 ] Apache Spark commented on SPARK-38270: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/38492 > SQL CLI AM should keep same exitcode with client > > > Key: SPARK-38270 > URL: https://issues.apache.org/jira/browse/SPARK-38270 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.1 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.4.0 > > > Currently the SQL CLI always uses a shutdown hook to stop the SparkContext > {code:java} > // Clean up after we exit > ShutdownHookManager.addShutdownHook { () => SparkSQLEnv.stop() } > {code} > This causes the YARN AM to always report success, even when the client exits with a non-zero code. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
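A sketch of the idea behind the ticket, with hypothetical names throughout: record the client's exit status so the shutdown hook can hand the same status to the YARN AM, instead of unconditionally stopping as a success. Both the stand-in driver object and the stop(Int) overload are illustrative, not Spark's actual API:

{code:scala}
import org.apache.spark.util.ShutdownHookManager

object SqlCliDriverSketch {  // hypothetical stand-in for SparkSQLCLIDriver
  @volatile private var exitCode: Int = 0

  def main(args: Array[String]): Unit = {
    // The quoted snippet's no-arg SparkSQLEnv.stop() is what makes the AM
    // report success regardless of how the client exited; passing the
    // recorded status through (hypothetical stop(Int) overload) keeps the
    // AM status and the client exit code in sync.
    ShutdownHookManager.addShutdownHook { () => SparkSQLEnv.stop(exitCode) }

    val ret = runQueries(args)  // hypothetical: the CLI's processing loop
    exitCode = ret
    System.exit(ret)
  }
}
{code}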
[jira] [Updated] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40999: Fix Version/s: (was: 3.4.0) > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > > Currently, if a user tries to specify a query like the following, the hints > on the subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens as hints are removed from the plan and pulled into joins in the > beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > To resolve this, we add a hint field to SubqueryExpression that any hints > inside a subquery's plan can be pulled into during EliminateResolvedHint, and > then pass this hint on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org