[jira] [Resolved] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41004. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38494 [https://github.com/apache/spark/pull/38494] > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41004: Assignee: BingKun Pan > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33349) ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed
[ https://issues.apache.org/jira/browse/SPARK-33349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628700#comment-17628700 ] Yilun Fan commented on SPARK-33349: --- I also met this problem in Spark 3.2.1, kubernetes-client 5.4.1. {code:java} ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.) io.fabric8.kubernetes.client.WatcherException: too old resource version: 63993943 (64057995) at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103){code} I think we have to add some retry logic in ExecutorPodsWatchSnapshotSource. Especially when spark.kubernetes.executor.enableApiPolling is disabled, only this watcher can receive executor pod status. Just like what Spark has done in the submit client. [https://github.com/apache/spark/pull/29533/files] > ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed > -- > > Key: SPARK-33349 > URL: https://issues.apache.org/jira/browse/SPARK-33349 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1, 3.0.2, 3.1.0 >Reporter: Nicola Bova >Priority: Critical > > I launch my spark application with the > [spark-on-kubernetes-operator|https://github.com/GoogleCloudPlatform/spark-on-k8s-operator] > with the following yaml file: > {code:yaml} > apiVersion: sparkoperator.k8s.io/v1beta2 > kind: SparkApplication > metadata: > name: spark-kafka-streamer-test > namespace: kafka2hdfs > spec: > type: Scala > mode: cluster > image: /spark:3.0.2-SNAPSHOT-2.12-0.1.0 > imagePullPolicy: Always > timeToLiveSeconds: 259200 > mainClass: path.to.my.class.KafkaStreamer > mainApplicationFile: spark-kafka-streamer_2.12-spark300-assembly.jar > sparkVersion: 3.0.1 > restartPolicy: > type: Always > sparkConf: > "spark.kafka.consumer.cache.capacity": "8192" > "spark.kubernetes.memoryOverheadFactor": "0.3" > deps: > jars: > - my > - jar > - list > hadoopConfigMap: hdfs-config > driver: > cores: 4 > memory: 12g > labels: > version: 3.0.1 > serviceAccount: default > javaOptions: > "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties" > executor: > instances: 4 > cores: 4 > memory: 16g > labels: > version: 3.0.1 > javaOptions: > "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties" > {code} > I have tried with both Spark `3.0.1` and `3.0.2-SNAPSHOT` with the ["Restart > the watcher when we receive a version changed from > k8s"|https://github.com/apache/spark/pull/29533] patch. > This is the driver log: > {code} > 20/11/04 12:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > ... // my app log, it's a structured streaming app reading from kafka and > writing to hdfs > 20/11/04 13:12:12 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has > been closed (this is expected if the application is shutting down.) 
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource > version: 1574101276 (1574213896) > at > io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) > at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) > at > okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) > at > okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) > at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown > Source) > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown > Source) > at java.base/java.lang.Thread.run(Unknown Source) > {code} > The error above appears after roughly 50 minutes. > After the exception above, no more logs are produced and the app hangs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
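The retry suggested in the comment above can be sketched as follows. This is a hedged illustration only, not Spark's actual ExecutorPodsWatchSnapshotSource: the class and method names (RestartingExecutorPodsWatcher, resubscribe) are made up, and the only assumption is the fabric8 5.x Watcher/WatcherException API shown in the stack trace, where a "too old resource version" close surfaces as HTTP 410 (Gone).

{code:scala}
// Hedged sketch, not the real Spark implementation. A production version would also
// resync the executor pod snapshot after resubscribing, since events may have been lost.
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClient, Watch, Watcher, WatcherException}

class RestartingExecutorPodsWatcher(
    client: KubernetesClient,
    appId: String,
    onPodUpdate: Pod => Unit) extends Watcher[Pod] {

  @volatile private var watch: Watch = _

  // (Re)open the watch; called once at startup and again after a recoverable close.
  def resubscribe(): Unit = {
    watch = client.pods()
      .withLabel("spark-app-selector", appId)
      .watch(this)
  }

  override def eventReceived(action: Watcher.Action, pod: Pod): Unit =
    onPodUpdate(pod)

  override def onClose(cause: WatcherException): Unit = {
    // "too old resource version" is reported as HTTP 410 (Gone); instead of giving up,
    // re-establish the watch so executor pod status keeps flowing to the driver.
    if (cause != null && cause.isHttpGone) {
      resubscribe()
    }
  }

  // Stop watching, e.g. on application shutdown.
  def stop(): Unit = {
    if (watch != null) watch.close()
  }
}
{code}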
[jira] [Resolved] (SPARK-40777) Use error classes for Protobuf exceptions
[ https://issues.apache.org/jira/browse/SPARK-40777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-40777. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38344 [https://github.com/apache/spark/pull/38344] > Use error classes for Protobuf exceptions > - > > Key: SPARK-40777 > URL: https://issues.apache.org/jira/browse/SPARK-40777 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Assignee: Sandish Kumar HN >Priority: Major > Fix For: 3.4.0 > > > We should use error classes for all the exceptions. > A follow up from Protobuf PR [https://github.com/apache/spark/pull/37972] > > cc: [~sanysand...@gmail.com] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40777) Use error classes for Protobuf exceptions
[ https://issues.apache.org/jira/browse/SPARK-40777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-40777: Assignee: Sandish Kumar HN > Use error classes for Protobuf exceptions > - > > Key: SPARK-40777 > URL: https://issues.apache.org/jira/browse/SPARK-40777 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Assignee: Sandish Kumar HN >Priority: Major > > We should use error classes for all the exceptions. > A follow up from Protobuf PR [https://github.com/apache/spark/pull/37972] > > cc: [~sanysand...@gmail.com] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41012: Assignee: (was: Apache Spark) > Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE > --- > > Key: SPARK-41012 > URL: https://issues.apache.org/jira/browse/SPARK-41012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Rename the _LEGACY_ERROR_TEMP_1022 to proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41012: Assignee: Apache Spark > Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE > --- > > Key: SPARK-41012 > URL: https://issues.apache.org/jira/browse/SPARK-41012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > Rename the _LEGACY_ERROR_TEMP_1022 to proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628675#comment-17628675 ] Apache Spark commented on SPARK-41012: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/38508 > Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE > --- > > Key: SPARK-41012 > URL: https://issues.apache.org/jira/browse/SPARK-41012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Rename the _LEGACY_ERROR_TEMP_1022 to proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628673#comment-17628673 ] Haejoon Lee commented on SPARK-41012: - I'm working on it > Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE > --- > > Key: SPARK-41012 > URL: https://issues.apache.org/jira/browse/SPARK-41012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Rename the _LEGACY_ERROR_TEMP_1022 to proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
Haejoon Lee created SPARK-41012: --- Summary: Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE Key: SPARK-41012 URL: https://issues.apache.org/jira/browse/SPARK-41012 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Haejoon Lee Rename the _LEGACY_ERROR_TEMP_1022 to proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
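For context, the error this class covers is the ORDER BY ordinal check. A minimal way to hit it, assuming nothing beyond a plain SparkSession named `spark`:

{code:scala}
// ORDER BY refers to select-list position 2, but the query only projects one column,
// so analysis fails with the out-of-range ordinal error that this ticket renames.
spark.sql("SELECT id FROM VALUES (1), (2) AS t(id) ORDER BY 2").show()
{code}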
[jira] [Created] (SPARK-41011) Refine Sequence#checkInputDataTypes related DataTypeMismatch
Yang Jie created SPARK-41011: Summary: Refine Sequence#checkInputDataTypes related DataTypeMismatch Key: SPARK-41011 URL: https://issues.apache.org/jira/browse/SPARK-41011 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40372) Migrate failures of array type checks onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40372: Assignee: Apache Spark > Migrate failures of array type checks onto error classes > > > Key: SPARK-40372 > URL: https://issues.apache.org/jira/browse/SPARK-40372 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in collection > expressions: > 1. SortArray (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035 > 2. ArrayContains (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264 > 3. ArrayPosition (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035 > 4. ElementAt (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187 > 5. Concat (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388 > 6. Flatten (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595 > 7. Sequence (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773 > 8. ArrayRemove (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447 > 9. ArrayDistinct (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40372) Migrate failures of array type checks onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40372: Assignee: (was: Apache Spark) > Migrate failures of array type checks onto error classes > > > Key: SPARK-40372 > URL: https://issues.apache.org/jira/browse/SPARK-40372 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in collection > expressions: > 1. SortArray (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035 > 2. ArrayContains (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264 > 3. ArrayPosition (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035 > 4. ElementAt (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187 > 5. Concat (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388 > 6. Flatten (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595 > 7. Sequence (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773 > 8. ArrayRemove (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447 > 9. ArrayDistinct (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41001) Connection string support for Python client
[ https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41001: Assignee: Martin Grund > Connection string support for Python client > --- > > Key: SPARK-41001 > URL: https://issues.apache.org/jira/browse/SPARK-41001 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41001) Connection string support for Python client
[ https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41001. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38501 [https://github.com/apache/spark/pull/38501] > Connection string support for Python client > --- > > Key: SPARK-41001 > URL: https://issues.apache.org/jira/browse/SPARK-41001 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40976) Upgrade sbt to 1.7.3
[ https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40976. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38502 [https://github.com/apache/spark/pull/38502] > Upgrade sbt to 1.7.3 > > > Key: SPARK-40976 > URL: https://issues.apache.org/jira/browse/SPARK-40976 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > https://github.com/sbt/sbt/releases/tag/v1.7.3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40976) Upgrade sbt to 1.7.3
[ https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40976: Assignee: Yang Jie > Upgrade sbt to 1.7.3 > > > Key: SPARK-40976 > URL: https://issues.apache.org/jira/browse/SPARK-40976 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > https://github.com/sbt/sbt/releases/tag/v1.7.3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41010) Complete Support for Except and Intersect in Python client
[ https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628559#comment-17628559 ] Apache Spark commented on SPARK-41010: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38506 > Complete Support for Except and Intersect in Python client > -- > > Key: SPARK-41010 > URL: https://issues.apache.org/jira/browse/SPARK-41010 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41010) Complete Support for Except and Intersect in Python client
[ https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628557#comment-17628557 ] Apache Spark commented on SPARK-41010: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38506 > Complete Support for Except and Intersect in Python client > -- > > Key: SPARK-41010 > URL: https://issues.apache.org/jira/browse/SPARK-41010 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41010) Complete Support for Except and Intersect in Python client
[ https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41010: Assignee: Apache Spark > Complete Support for Except and Intersect in Python client > -- > > Key: SPARK-41010 > URL: https://issues.apache.org/jira/browse/SPARK-41010 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41010) Complete Support for Except and Intersect in Python client
[ https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41010: Assignee: (was: Apache Spark) > Complete Support for Except and Intersect in Python client > -- > > Key: SPARK-41010 > URL: https://issues.apache.org/jira/browse/SPARK-41010 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40622) Result of a single task in collect() must fit in 2GB
[ https://issues.apache.org/jira/browse/SPARK-40622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628553#comment-17628553 ] Apache Spark commented on SPARK-40622: -- User 'liuzqt' has created a pull request for this issue: https://github.com/apache/spark/pull/38505 > Result of a single task in collect() must fit in 2GB > > > Key: SPARK-40622 > URL: https://issues.apache.org/jira/browse/SPARK-40622 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Ziqi Liu >Priority: Major > > When collecting results, data from a single partition/task is serialized > through a byte array or ByteBuffer (which is backed by a byte array as well), > so it is subject to the Java array max size limit (for a byte array, > it's 2GB). > > Constructing a single partition larger than 2GB and collecting it can easily > reproduce the issue > {code:java} > // create data of size ~3GB in single partition, which exceeds the byte array > limit > // random gen to make sure it's poorly compressed > val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) > as data") > withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") { > withSQLConf("spark.sql.useChunkedBuffer" -> "true") { > df.queryExecution.executedPlan.executeCollect() > } > } {code} > will get an OOM error from > [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125] > > Consider using ChunkedByteBuffer to replace the byte array in order to bypass > this limit -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40622) Result of a single task in collect() must fit in 2GB
[ https://issues.apache.org/jira/browse/SPARK-40622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628552#comment-17628552 ] Apache Spark commented on SPARK-40622: -- User 'liuzqt' has created a pull request for this issue: https://github.com/apache/spark/pull/38505 > Result of a single task in collect() must fit in 2GB > > > Key: SPARK-40622 > URL: https://issues.apache.org/jira/browse/SPARK-40622 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Ziqi Liu >Priority: Major > > When collecting results, data from a single partition/task is serialized > through a byte array or ByteBuffer (which is backed by a byte array as well), > so it is subject to the Java array max size limit (for a byte array, > it's 2GB). > > Constructing a single partition larger than 2GB and collecting it can easily > reproduce the issue > {code:java} > // create data of size ~3GB in single partition, which exceeds the byte array > limit > // random gen to make sure it's poorly compressed > val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) > as data") > withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") { > withSQLConf("spark.sql.useChunkedBuffer" -> "true") { > df.queryExecution.executedPlan.executeCollect() > } > } {code} > will get an OOM error from > [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125] > > Consider using ChunkedByteBuffer to replace the byte array in order to bypass > this limit -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
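To make the proposed direction concrete, here is a minimal, self-contained sketch of the chunking idea: instead of one contiguous byte[], which the JVM caps at roughly 2GB, serialized bytes are appended to a list of fixed-size chunks. This is an illustration only; ChunkedOutputStream is an invented name and not Spark's ChunkedByteBuffer API.

{code:scala}
// Minimal sketch of chunked buffering: total size can exceed Int.MaxValue because
// the data is spread over many small arrays instead of one giant byte[].
import java.io.OutputStream
import scala.collection.mutable.ArrayBuffer

class ChunkedOutputStream(chunkSize: Int) extends OutputStream {
  private val chunks = ArrayBuffer[Array[Byte]]()
  private var current: Array[Byte] = _
  private var pos = 0

  override def write(b: Int): Unit = {
    // Start a new chunk when there is none yet or the current one is full.
    if (current == null || pos == chunkSize) {
      current = new Array[Byte](chunkSize)
      chunks += current
      pos = 0
    }
    current(pos) = b.toByte
    pos += 1
  }

  // All chunks except the last are full; the last holds `pos` bytes.
  def totalSize: Long = {
    if (chunks.isEmpty) 0L
    else (chunks.size - 1).toLong * chunkSize + pos
  }
}
{code}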
[jira] [Commented] (SPARK-40681) Update gson transitive dependency to 2.8.9 or later
[ https://issues.apache.org/jira/browse/SPARK-40681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628549#comment-17628549 ] Michael deLeon commented on SPARK-40681: Is there any update on when we might see this in a Spark release? > Update gson transitive dependency to 2.8.9 or later > --- > > Key: SPARK-40681 > URL: https://issues.apache.org/jira/browse/SPARK-40681 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Andrew Kyle Purtell >Priority: Minor > > Spark 3.3 currently ships with GSON 2.8.6 and this should be managed up to > 2.8.9 or later. > Versions of GSON prior to 2.8.9 are subject to > [gson#1991|https://github.com/google/gson/pull/1991] , detected and reported > by several flavors of static vulnerability assessment tools, at a fairly high > score because it is a deserialization of untrusted data problem. > This issue is not meant to imply any particular security problem in Spark > itself. > {noformat} > [INFO] org.apache.spark:spark-network-common_2.12:jar:3.3.2-SNAPSHOT > [INFO] +- com.google.crypto.tink:tink:jar:1.6.1:compile > [INFO] | \- com.google.code.gson:gson:jar:2.8.6:compile > {noformat} > {noformat} > [INFO] org.apache.spark:spark-hive_2.12:jar:3.3.2-SNAPSHOT > [INFO] +- org.apache.hive:hive-exec:jar:core:2.3.9:compile > [INFO] | +- com.google.code.gson:gson:jar:2.2.4:compile > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
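Until a Spark release picks this up, a downstream application can pin the transitive version in its own build. A hedged example for an sbt-based project (the override below is an application-side workaround, not a change in Spark itself):

{code:scala}
// build.sbt: force the gson pulled in transitively (via tink / hive-exec) up to a
// version that contains the fix for gson#1991.
dependencyOverrides += "com.google.code.gson" % "gson" % "2.8.9"
{code}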
[jira] [Commented] (SPARK-40815) SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits
[ https://issues.apache.org/jira/browse/SPARK-40815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628545#comment-17628545 ] Apache Spark commented on SPARK-40815: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/38504 > SymlinkTextInputFormat returns incorrect result due to enabled > spark.hadoopRDD.ignoreEmptySplits > > > Key: SPARK-40815 > URL: https://issues.apache.org/jira/browse/SPARK-40815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40815) SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits
[ https://issues.apache.org/jira/browse/SPARK-40815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628546#comment-17628546 ] Apache Spark commented on SPARK-40815: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/38504 > SymlinkTextInputFormat returns incorrect result due to enabled > spark.hadoopRDD.ignoreEmptySplits > > > Key: SPARK-40815 > URL: https://issues.apache.org/jira/browse/SPARK-40815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41010) Complete Support for Except and Intersect in Python client
Rui Wang created SPARK-41010: Summary: Complete Support for Except and Intersect in Python client Key: SPARK-41010 URL: https://issues.apache.org/jira/browse/SPARK-41010 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40801) Upgrade Apache Commons Text to 1.10
[ https://issues.apache.org/jira/browse/SPARK-40801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40801: - Fix Version/s: 3.2.3 > Upgrade Apache Commons Text to 1.10 > --- > > Key: SPARK-40801 > URL: https://issues.apache.org/jira/browse/SPARK-40801 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > Fix For: 3.4.0, 3.2.3, 3.3.2 > > > [CVE-2022-42889|https://nvd.nist.gov/vuln/detail/CVE-2022-42889] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators
[ https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40940: Assignee: (was: Apache Spark) > Fix the unsupported ops checker to allow chaining of stateful operators > --- > > Key: SPARK-40940 > URL: https://issues.apache.org/jira/browse/SPARK-40940 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Priority: Major > > This is follow up ticket on https://issues.apache.org/jira/browse/SPARK-40925 > - once we allow chaining of stateful operators in Spark SS, we need to fix > the unsupported ops checker to allow these (currently they are blocked and > require setting spark.sql.streaming.unsupportedOperationCheck to false -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators
[ https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40940: Assignee: Apache Spark > Fix the unsupported ops checker to allow chaining of stateful operators > --- > > Key: SPARK-40940 > URL: https://issues.apache.org/jira/browse/SPARK-40940 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Assignee: Apache Spark >Priority: Major > > This is follow up ticket on https://issues.apache.org/jira/browse/SPARK-40925 > - once we allow chaining of stateful operators in Spark SS, we need to fix > the unsupported ops checker to allow these (currently they are blocked and > require setting spark.sql.streaming.unsupportedOperationCheck to false -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators
[ https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628472#comment-17628472 ] Wei Liu commented on SPARK-40940: - PR in: https://github.com/apache/spark/pull/38503 > Fix the unsupported ops checker to allow chaining of stateful operators > --- > > Key: SPARK-40940 > URL: https://issues.apache.org/jira/browse/SPARK-40940 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Priority: Major > > This is follow up ticket on https://issues.apache.org/jira/browse/SPARK-40925 > - once we allow chaining of stateful operators in Spark SS, we need to fix > the unsupported ops checker to allow these (currently they are blocked and > require setting spark.sql.streaming.unsupportedOperationCheck to false -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators
[ https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628474#comment-17628474 ] Apache Spark commented on SPARK-40940: -- User 'WweiL' has created a pull request for this issue: https://github.com/apache/spark/pull/38503 > Fix the unsupported ops checker to allow chaining of stateful operators > --- > > Key: SPARK-40940 > URL: https://issues.apache.org/jira/browse/SPARK-40940 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Priority: Major > > This is follow up ticket on https://issues.apache.org/jira/browse/SPARK-40925 > - once we allow chaining of stateful operators in Spark SS, we need to fix > the unsupported ops checker to allow these (currently they are blocked and > require setting spark.sql.streaming.unsupportedOperationCheck to false -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
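Until the checker is fixed, the workaround named in the description is the unsupportedOperationCheck flag. A hedged sketch of what a chained stateful query might look like with that check disabled; the query itself is illustrative, not taken from the ticket:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("chained-stateful-sketch").getOrCreate()
// Workaround mentioned above: disable the unsupported-operation check.
spark.conf.set("spark.sql.streaming.unsupportedOperationCheck", "false")

val events = spark.readStream.format("rate").load()  // columns: timestamp, value

// First stateful operator: watermarked windowed count.
val perWindow = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()

// Second stateful operator chained on top of the first: re-aggregate the windows.
val totals = perWindow.groupBy(col("window.end")).agg(sum("count").as("total"))
{code}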
[jira] [Resolved] (SPARK-40869) KubernetesConf.getResourceNamePrefix creates invalid name prefixes
[ https://issues.apache.org/jira/browse/SPARK-40869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40869. --- Fix Version/s: 3.3.2 3.2.3 3.4.0 Resolution: Fixed Issue resolved by pull request 38331 [https://github.com/apache/spark/pull/38331] > KubernetesConf.getResourceNamePrefix creates invalid name prefixes > -- > > Key: SPARK-40869 > URL: https://issues.apache.org/jira/browse/SPARK-40869 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Tobias Stadler >Assignee: Tobias Stadler >Priority: Major > Fix For: 3.3.2, 3.2.3, 3.4.0 > > > If `KubernetesConf.getResourceNamePrefix` is called with e.g. `_name_`, it > generates an invalid name prefix, e.g. `-name-0123456789abcdef`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40869) KubernetesConf.getResourceNamePrefix creates invalid name prefixes
[ https://issues.apache.org/jira/browse/SPARK-40869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-40869: - Assignee: Tobias Stadler > KubernetesConf.getResourceNamePrefix creates invalid name prefixes > -- > > Key: SPARK-40869 > URL: https://issues.apache.org/jira/browse/SPARK-40869 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Tobias Stadler >Assignee: Tobias Stadler >Priority: Major > > If `KubernetesConf.getResourceNamePrefix` is called with e.g. `_name_`, it > generates an invalid name prefix, e.g. `-name-0123456789abcdef`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
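For context, the invalid prefix comes from Kubernetes naming rules: object names must consist of lowercase alphanumerics and '-', and must start and end with an alphanumeric character. A hedged sketch of the kind of sanitization needed (illustrative only, not the actual patch in PR 38331):

{code:scala}
// Illustrative only: map an app name like "_name_" to a legal prefix ("name"),
// rather than the invalid "-name-..." described above.
def sanitizeResourceNamePrefix(appName: String): String = {
  val cleaned = appName.toLowerCase
    .replaceAll("[^a-z0-9-]", "-")   // replace characters Kubernetes names cannot contain
    .replaceAll("^-+|-+$", "")       // names must start and end with an alphanumeric
  if (cleaned.isEmpty) "spark" else cleaned  // fall back when nothing usable remains
}

sanitizeResourceNamePrefix("_name_")  // returns "name"
{code}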
[jira] [Commented] (SPARK-40976) Upgrade sbt to 1.7.3
[ https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628438#comment-17628438 ] Apache Spark commented on SPARK-40976: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38502 > Upgrade sbt to 1.7.3 > > > Key: SPARK-40976 > URL: https://issues.apache.org/jira/browse/SPARK-40976 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://github.com/sbt/sbt/releases/tag/v1.7.3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41001) Connection string support for Python client
[ https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628429#comment-17628429 ] Apache Spark commented on SPARK-41001: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/38501 > Connection string support for Python client > --- > > Key: SPARK-41001 > URL: https://issues.apache.org/jira/browse/SPARK-41001 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41002) Compatible `take`, `head` and `first` API in Python client
[ https://issues.apache.org/jira/browse/SPARK-41002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-41002: - Summary: Compatible `take`, `head` and `first` API in Python client (was: Compatible `take` and `head` API in Python client ) > Compatible `take`, `head` and `first` API in Python client > --- > > Key: SPARK-41002 > URL: https://issues.apache.org/jira/browse/SPARK-41002 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
[ https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628415#comment-17628415 ] Apache Spark commented on SPARK-41009: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/38490 > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 > --- > > Key: SPARK-41009 > URL: https://issues.apache.org/jira/browse/SPARK-41009 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
[ https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41009: Assignee: Apache Spark (was: Max Gekk) > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 > --- > > Key: SPARK-41009 > URL: https://issues.apache.org/jira/browse/SPARK-41009 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
[ https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628414#comment-17628414 ] Apache Spark commented on SPARK-41009: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/38490 > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 > --- > > Key: SPARK-41009 > URL: https://issues.apache.org/jira/browse/SPARK-41009 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
[ https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41009: Assignee: Max Gekk (was: Apache Spark) > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 > --- > > Key: SPARK-41009 > URL: https://issues.apache.org/jira/browse/SPARK-41009 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
Max Gekk created SPARK-41009: Summary: Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 Key: SPARK-41009 URL: https://issues.apache.org/jira/browse/SPARK-41009 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Fix For: 3.4.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame({ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. {code} was: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame({ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. 
tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. {code} > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > > {code:python} > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regression import
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame({ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. {code} was: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. 
tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. # {code} > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > > {code:python} > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regre
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. # {code} was: ``` import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. 
tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. # ``` > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > {code:python} > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regression
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: ``` import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. # ``` was: {{```}} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. 
tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. # {{```}} > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > ``` > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regression import Isot
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: {{```}} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. # {{```}} was: import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( { "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. 
tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > > {{```}} > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regression import IsotonicRegression as
[jira] [Created] (SPARK-41008) Isotonic regression result differs from sklearn implementation
Arne Koopman created SPARK-41008: Summary: Isotonic regression result differs from sklearn implementation Key: SPARK-41008 URL: https://issues.apache.org/jira/browse/SPARK-41008 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 3.3.1 Reporter: Arne Koopman import pandas as pd from pyspark.sql.types import DoubleType from pyspark.sql import functions as F from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( { "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation is not the expected one. Similar small toy examples lead to similarly unexpected results from the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations disappears. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
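To help triage whether the discrepancy above comes from MLlib itself rather than from the Python wrapper, the following is a minimal sketch of the same toy example against the Scala IsotonicRegression API. It is illustrative only: the SparkSession setup and object names are assumptions, not part of the original report, and the expected values simply restate the fraction-of-positives reasoning from the description.
{code:scala}
// Minimal sketch (assumption: a local SparkSession is used; names are illustrative).
// Fits Spark ML IsotonicRegression on the same toy data as the report and prints
// the fitted boundaries/predictions so they can be compared with the sklearn output.
import org.apache.spark.ml.regression.IsotonicRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ir-toy-example").getOrCreate()
import spark.implicits._

val tc = Seq(
  (0.6, 1.0, 1.0), (0.6, 0.0, 1.0),
  (0.333, 0.0, 1.0), (0.333, 1.0, 1.0), (0.333, 0.0, 1.0),
  (0.20, 1.0, 1.0), (0.20, 0.0, 1.0), (0.20, 0.0, 1.0), (0.20, 0.0, 1.0)
).toDF("model_score", "label", "weight")

val model = new IsotonicRegression()
  .setFeaturesCol("model_score")
  .setLabelCol("label")
  .setWeightCol("weight")
  .fit(tc)

// Per the reasoning in the description, the piecewise-constant fit should be
// consistent with the per-score fractions of positives [0.25, 0.333, 0.5].
println(model.boundaries)
println(model.predictions)
model.transform(tc).show(false)
{code}
Comparing model.boundaries and model.predictions with the sklearn values narrows the problem down to either the MLlib fit itself or the handling of the single-feature column in the Python wrapper.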
[jira] [Assigned] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41007: Assignee: Apache Spark > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Assignee: Apache Spark >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}} the > dataset will fail to serialize correctly. When trying to serialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41007: Assignee: (was: Apache Spark) > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}} the > dataset will fail to serialize correctly. When trying to serialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628384#comment-17628384 ] Apache Spark commented on SPARK-41007: -- User 'dfit99' has created a pull request for this issue: https://github.com/apache/spark/pull/38500 > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}} the > dataset will fail to serialize correctly. When trying to serialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Fiterma updated SPARK-41007: --- Description: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}} the dataset will fail to serialize correctly. When trying to serialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function was: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}} the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}} the > dataset will fail to serialize correctly. When trying to serialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628372#comment-17628372 ] Daniel Fiterma commented on SPARK-41007: FYI: Have a fix for this already, going to push out a merge request soon. > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}} the > dataset will fail to serialize correctly. When trying to deserialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Fiterma updated SPARK-41007: --- Description: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}} the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function was: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}} the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}} the > dataset will fail to serialize correctly. When trying to deserialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Fiterma updated SPARK-41007: --- Description: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}} the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function was: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}} the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function # > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}} the > dataset will fail to serialize correctly. When trying to deserialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
Daniel Fiterma created SPARK-41007: -- Summary: BigInteger Serialization doesn't work with JavaBean Encoder Key: SPARK-41007 URL: https://issues.apache.org/jira/browse/SPARK-41007 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 3.3.1 Reporter: Daniel Fiterma When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}} the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function # -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
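A compact way to try the reproduction steps from Scala, using the same bean encoder the Java API relies on, is sketched below. The bean class, field name, and sample value are hypothetical, and the expected failure is the AnalysisException quoted in the report.
{code:scala}
// Hedged reproduction sketch (Scala approximation of the Java steps above).
// The bean class and field names are illustrative, not taken from the report.
import java.math.BigInteger
import scala.beans.BeanProperty

import org.apache.spark.sql.{Encoders, SparkSession}

class BigIntegerBean {
  @BeanProperty var bigInteger: BigInteger = BigInteger.ZERO
}

val spark = SparkSession.builder().master("local[*]").appName("bean-encoder-repro").getOrCreate()

val bean = new BigIntegerBean
bean.setBigInteger(new BigInteger("12345678901234567890"))

// Expected (per the report): analysis fails with
// "Cannot up cast `bigInteger` from struct<> to decimal(38,18)".
val ds = spark.createDataset(java.util.Collections.singletonList(bean))(Encoders.bean(classOf[BigIntegerBean]))
ds.show(false)
{code}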
[jira] [Updated] (SPARK-40996) Upgrade `sbt-checkstyle-plugin` to 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-40996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40996: - Priority: Minor (was: Major) > Upgrade `sbt-checkstyle-plugin` to 4.0.0 > > > Key: SPARK-40996 > URL: https://issues.apache.org/jira/browse/SPARK-40996 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > This is a precondition for upgrading to sbt 1.7.3 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40996) Upgrade `sbt-checkstyle-plugin` to 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-40996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40996. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38481 [https://github.com/apache/spark/pull/38481] > Upgrade `sbt-checkstyle-plugin` to 4.0.0 > > > Key: SPARK-40996 > URL: https://issues.apache.org/jira/browse/SPARK-40996 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > > This is a precondition for upgrading to sbt 1.7.3 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40996) Upgrade `sbt-checkstyle-plugin` to 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-40996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-40996: Assignee: Yang Jie > Upgrade `sbt-checkstyle-plugin` to 4.0.0 > > > Key: SPARK-40996 > URL: https://issues.apache.org/jira/browse/SPARK-40996 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > This is a precondition for upgrading to sbt 1.7.3 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40834) Use SparkListenerSQLExecutionEnd to track final SQL status in UI
[ https://issues.apache.org/jira/browse/SPARK-40834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40834: --- Assignee: XiDuo You > Use SparkListenerSQLExecutionEnd to track final SQL status in UI > > > Key: SPARK-40834 > URL: https://issues.apache.org/jira/browse/SPARK-40834 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > > The SQL may succeed with some failed jobs. For example, with an inner join that > has one empty side and one large side, the plan can finish while the large side > is still running. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40834) Use SparkListenerSQLExecutionEnd to track final SQL status in UI
[ https://issues.apache.org/jira/browse/SPARK-40834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40834. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38302 [https://github.com/apache/spark/pull/38302] > Use SparkListenerSQLExecutionEnd to track final SQL status in UI > > > Key: SPARK-40834 > URL: https://issues.apache.org/jira/browse/SPARK-40834 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > Fix For: 3.4.0 > > > The SQL may succeed with some failed jobs. For example, with an inner join that > has one empty side and one large side, the plan can finish while the large side > is still running. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
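For readers unfamiliar with the event used here, the sketch below shows how SparkListenerSQLExecutionEnd can be observed from user code. It only illustrates the event the UI now tracks, not the change made in the pull request, and the session setup and names are assumptions.
{code:scala}
// Hedged sketch: observing SQL execution completion via SparkListenerSQLExecutionEnd.
// This illustrates the event used for final-status tracking; it is not the UI patch itself.
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionEnd

val spark = SparkSession.builder().master("local[*]").appName("sql-end-listener").getOrCreate()

spark.sparkContext.addSparkListener(new SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLExecutionEnd =>
      // Fired once per SQL execution, even when some of its jobs were skipped or failed.
      println(s"SQL execution ${e.executionId} finished at ${e.time}")
    case _ => // ignore other events
  }
})

spark.range(10).selectExpr("sum(id)").collect()
{code}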
[jira] [Updated] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace
[ https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric updated SPARK-41006: - Description: If we use the Spark Launcher to launch our spark apps in k8s: {code:java} val sparkLauncher = new InProcessLauncher() .setMaster(k8sMaster) .setDeployMode(deployMode) .setAppName(appName) .setVerbose(true) sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code} We have an issue when we launch another spark driver in the same namespace where other spark app was running: {code:java} kp -n audit-exporter-eee5073aac -w NAME READY STATUS RESTARTS AGE audit-exporter-71489e843d8085c0-driver 1/1 Running 0 9m54s audit-exporter-7e6b8b843d80b9e6-exec-1 1/1 Running 0 9m40s data-io-120204843d899567-driver 0/1 Terminating 0 1s data-io-120204843d899567-driver 0/1 Terminating 0 2s data-io-120204843d899567-driver 0/1 Terminating 0 3s data-io-120204843d899567-driver 0/1 Terminating 0 3s{code} The error is: {code:java} {"time":"2022-11-03T12:49:45.626Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-38: 'data-io'","msg":"Application failed with exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kubernetes.default/api/v1/namespaces/audit-exporter-eee5073aac/configmaps/spark-drv-d19c37843d80350c-conf-map. Message: ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field is immutable when `immutable` is set, reason=FieldValueForbidden, additionalProperties={})], group=null, kind=ConfigMap, name=spark-drv-d19c37843d80350c-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5360/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$4618/00.apply(Unknown Source)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5012/00.apply(Unknown Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source)\n\tat java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Unknown S
[jira] [Updated] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace
[ https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric updated SPARK-41006: - Description: If we use the Spark Launcher to launch our spark apps in k8s: {code:java} val sparkLauncher = new InProcessLauncher() .setMaster(k8sMaster) .setDeployMode(deployMode) .setAppName(appName) .setVerbose(true) sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code} We have an issue when we launch another spark driver in the same namespace where other spark app was running: {code:java} kp -n audit-exporter-eee5073aac -w NAME READY STATUS RESTARTS AGE audit-exporter-71489e843d8085c0-driver 1/1 Running 0 9m54s audit-exporter-7e6b8b843d80b9e6-exec-1 1/1 Running 0 9m40s data-io-120204843d899567-driver 0/1 Terminating 0 1s data-io-120204843d899567-driver 0/1 Terminating 0 2s data-io-120204843d899567-driver 0/1 Terminating 0 3s data-io-120204843d899567-driver 0/1 Terminating 0 3s{code} The error is: {code:java} {"time":"2022-11-03T12:49:45.626Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-38: 'data-io'","msg":"Application failed with exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kubernetes.default/api/v1/namespaces/audit-exporter-eee5073aac/configmaps/spark-drv-d19c37843d80350c-conf-map. Message: ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field is immutable when `immutable` is set, reason=FieldValueForbidden, additionalProperties={})], group=null, kind=ConfigMap, name=spark-drv-d19c37843d80350c-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5360/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$4618/00.apply(Unknown Source)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5012/00.apply(Unknown Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source)\n\tat java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Unknown S
[jira] [Resolved] (SPARK-27339) Decimal up cast to higher scale fails while reading parquet to Dataset
[ https://issues.apache.org/jira/browse/SPARK-27339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-27339. -- Resolution: Duplicate I can't reproduce this in the latest Spark, and think it might have been resolved by https://issues.apache.org/jira/browse/SPARK-31750 > Decimal up cast to higher scale fails while reading parquet to Dataset > -- > > Key: SPARK-27339 > URL: https://issues.apache.org/jira/browse/SPARK-27339 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.4.0 >Reporter: Bill Schneider >Priority: Major > > Given a parquet file with a decimal (38,4) field. One can read it into a > dataframe but fails to read/cast it to a dataset using a case class with > BigDecimal field. > {code:java} > import org.apache.spark.sql.{SaveMode, SparkSession} > object ReproduceSparkDecimalBug extends App{ > case class SimpleDecimal(value: BigDecimal) > val path = "/tmp/sparkTest" > val spark = SparkSession.builder().master("local").getOrCreate() > import spark.implicits._ > spark > .sql("SELECT CAST(10.12345 AS DECIMAL(38,4)) AS value ") > .write > .mode(SaveMode.Overwrite) > .parquet(path) > // works fine and the dataframe will have a decimal(38,4) > val df = spark.read.parquet(path) > df.printSchema() > df.show(1) > // will fail -> org.apache.spark.sql.AnalysisException: Cannot up cast > `value` from decimal(38,4) to decimal(38,18) as it may truncate > // 1. Why Spark sees scala BigDecimal as fixed (38,18)? > // 2. Up casting to higher scale should be allowed anyway > val ds = df.as[SimpleDecimal] > ds.printSchema() > spark.close() > } > {code} > {code:java} > org.apache.spark.sql.AnalysisException: Cannot up cast `value` from > decimal(38,4) to decimal(38,18) as it may truncate > The type path of the target object is: > - field (class: "scala.math.BigDecimal", name: "value") > - root class: "ReproduceSparkDecimalBug.SimpleDecimal" > You can either add an explicit cast to the input data or choose a higher > precision type of the field in the target object; > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2366) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$35$$anonfun$applyOrElse$15.applyOrElse(Analyzer.scala:2382) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$35$$anonfun$applyOrElse$15.applyOrElse(Analyzer.scala:2377) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:335) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:
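The AnalysisException above already hints at the usual workaround: add an explicit cast before converting to the Dataset. A minimal sketch follows, assuming the same parquet path and schema as the reproduction in the description.
{code:scala}
// Hedged workaround sketch for the reproduction above: cast the parquet
// decimal(38,4) column up to decimal(38,18) before calling .as[...], as the
// AnalysisException message suggests. Path and names mirror the description.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

case class SimpleDecimal(value: BigDecimal)

val spark = SparkSession.builder().master("local[*]").appName("decimal-upcast-workaround").getOrCreate()
import spark.implicits._

val ds = spark.read.parquet("/tmp/sparkTest")
  .withColumn("value", col("value").cast("decimal(38,18)"))
  .as[SimpleDecimal]

ds.printSchema()
ds.show(1)
{code}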
[jira] [Updated] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace
[ https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric updated SPARK-41006: - Description: If we use the Spark Launcher to launch our spark apps in k8s: {code:java} val sparkLauncher = new InProcessLauncher() .setMaster(k8sMaster) .setDeployMode(deployMode) .setAppName(appName) .setVerbose(true) sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code} We have an issue when we launch another spark driver in the same namespace where other spark app was running: {code:java} kp -n qa-topfive-python-spark-2-15d42ac3b9 NAME READY STATUS RESTARTS AGE data-io-c590a7843d47e206-driver 1/1 Terminating 0 2s qa-top-five-python-1667475391655-exec-1 1/1 Running 0 94s qa-topfive-python-spark-2-462c5d843d46e38b-driver 1/1 Running 0 119s {code} The error is: {code:java} {"time":"2022-10-24T15:08:50.239Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-44: 'data-io'","msg":"Application failed with exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kubernetes.default/api/v1/namespaces/qa-topfive-python-spark-2-edf723f942/configmaps/spark-drv-34c4e3840a0466c2-conf-map. Message: ConfigMap \"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field is immutable when `immutable` is set, reason=FieldValueForbidden, additionalProperties={})], group=null, kind=ConfigMap, name=spark-drv-34c4e3840a0466c2-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap \"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5663/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$5183/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5578/00.apply(Unknown Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source)\n\tat java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Unknown Source)\n\tat java.base/java.util.stream.AbstractPipeline.copyInto(Unknown Source)\n\tat java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(Unknown Source)\n\tat java.
[jira] [Created] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace
Eric created SPARK-41006: Summary: ConfigMap has the same name when launching two pods on the same namespace Key: SPARK-41006 URL: https://issues.apache.org/jira/browse/SPARK-41006 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.3.0, 3.2.0, 3.1.0 Reporter: Eric If we use the Spark Launcher to launch our spark apps in k8s: {code:java} val sparkLauncher = new InProcessLauncher() .setMaster(k8sMaster) .setDeployMode(deployMode) .setAppName(appName) .setVerbose(true) sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code} We have an issue when we launch another spark driver in the same namespace where other spark app was running: {code:java} kp -n qa-topfive-python-spark-2-15d42ac3b9 NAME READY STATUS RESTARTS AGE data-io-c590a7843d47e206-driver 1/1 Terminating 0 2s qa-top-five-python-1667475391655-exec-1 1/1 Running 0 94s qa-topfive-python-spark-2-462c5d843d46e38b-driver 1/1 Running 0 119s {code} The error is: {code:java} {"time":"2022-10-24T15:08:50.239Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-44: 'data-io'","msg":"Application failed with exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kubernetes.default/api/v1/namespaces/qa-topfive-python-spark-2-edf723f942/configmaps/spark-drv-34c4e3840a0466c2-conf-map. Message: ConfigMap \"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field is immutable when `immutable` is set, reason=FieldValueForbidden, additionalProperties={})], group=null, kind=ConfigMap, name=spark-drv-34c4e3840a0466c2-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap \"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5663/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$5183/00.apply(Unknown Source)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5578/00.apply(Unknown Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source)\n\tat java.base/java.util.ArrayList$ArrayListS
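The scenario that triggers the clash can be condensed to two in-process launches into the same namespace, as sketched below. The master URL, deploy mode, and application names are placeholders rather than values from the report.
{code:scala}
// Hedged sketch of the triggering scenario described above: two drivers launched
// from the same JVM into the same namespace. All literal values are placeholders.
import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

def launch(appName: String): SparkAppHandle = {
  new InProcessLauncher()
    .setMaster("k8s://https://kubernetes.default")
    .setDeployMode("cluster")
    .setAppName(appName)
    .setVerbose(true)
    .startApplication()
}

// With both applications sharing the namespace, the report observes the second
// driver failing because its spark-drv-*-conf-map name collides with an
// existing ConfigMap that is marked immutable.
val first = launch("audit-exporter")
val second = launch("data-io")
{code}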
[jira] [Assigned] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40769: Assignee: Apache Spark > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628265#comment-17628265 ] Apache Spark commented on SPARK-40769: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38498 > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40769: Assignee: Apache Spark > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628264#comment-17628264 ] Apache Spark commented on SPARK-40769: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38498 > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40769: Assignee: (was: Apache Spark) > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41005) Arrow based collect
[ https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628261#comment-17628261 ] Apache Spark commented on SPARK-41005: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38468 > Arrow based collect > --- > > Key: SPARK-41005 > URL: https://issues.apache.org/jira/browse/SPARK-41005 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41005) Arrow based collect
[ https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41005: Assignee: Apache Spark > Arrow based collect > --- > > Key: SPARK-41005 > URL: https://issues.apache.org/jira/browse/SPARK-41005 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41005) Arrow based collect
[ https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628260#comment-17628260 ] Apache Spark commented on SPARK-41005: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38468 > Arrow based collect > --- > > Key: SPARK-41005 > URL: https://issues.apache.org/jira/browse/SPARK-41005 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41005) Arrow based collect
[ https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41005: Assignee: (was: Apache Spark) > Arrow based collect > --- > > Key: SPARK-41005 > URL: https://issues.apache.org/jira/browse/SPARK-41005 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41005) Arrow based collect
Ruifeng Zheng created SPARK-41005: - Summary: Arrow based collect Key: SPARK-41005 URL: https://issues.apache.org/jira/browse/SPARK-41005 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628248#comment-17628248 ] Apache Spark commented on SPARK-40999: -- User 'fred-db' has created a pull request for this issue: https://github.com/apache/spark/pull/38497 > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > > Currently, if a user tries to specify a query like the following, the hints > on the subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens as hints are removed from the plan and pulled into joins in the > beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > To resolve this, we add a hint field to SubqueryExpression that any hints > inside a subquery's plan can be pulled into during EliminateResolvedHint, and > then pass this hint on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
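A minimal sketch of the proposed mechanism, using simplified stand-in classes rather than Spark's actual SubqueryExpression and Join nodes (all names below are made up for illustration):

{code:scala}
object HintPropagationSketch {
  case class HintInfo(strategy: String)                     // e.g. "BROADCAST"
  case class Subquery(plan: String, hint: Option[HintInfo]) // hint kept when hint nodes are eliminated
  case class Join(left: String, right: String, rightHint: Option[HintInfo])

  // When the EXISTS subquery is rewritten into a join during optimization,
  // the preserved hint is handed to the join's right side.
  def rewriteExistsToJoin(outerPlan: String, sub: Subquery): Join =
    Join(outerPlan, sub.plan, sub.hint)

  def main(args: Array[String]): Unit = {
    val join = rewriteExistsToJoin("target t", Subquery("source s", Some(HintInfo("BROADCAST"))))
    println(join.rightHint) // Some(HintInfo(BROADCAST)) -- the hint survives the rewrite
  }
}
{code}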
[jira] [Assigned] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40999: Assignee: (was: Apache Spark) > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > > Currently, if a user tries to specify a query like the following, the hints > on the subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens as hints are removed from the plan and pulled into joins in the > beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > To resolve this, we add a hint field to SubqueryExpression that any hints > inside a subquery's plan can be pulled into during EliminateResolvedHint, and > then pass this hint on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40999: Assignee: Apache Spark > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Assignee: Apache Spark >Priority: Major > > Currently, if a user tries to specify a query like the following, the hints > on the subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens as hints are removed from the plan and pulled into joins in the > beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > To resolve this, we add a hint field to SubqueryExpression that any hints > inside a subquery's plan can be pulled into during EliminateResolvedHint, and > then pass this hint on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628245#comment-17628245 ] Apache Spark commented on SPARK-40999: -- User 'fred-db' has created a pull request for this issue: https://github.com/apache/spark/pull/38497 > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > > Currently, if a user tries to specify a query like the following, the hints > on the subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens as hints are removed from the plan and pulled into joins in the > beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > To resolve this, we add a hint field to SubqueryExpression that any hints > inside a subquery's plan can be pulled into during EliminateResolvedHint, and > then pass this hint on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40819) Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of automatically converting to LongType
[ https://issues.apache.org/jira/browse/SPARK-40819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628239#comment-17628239 ] Nikhil Sharma commented on SPARK-40819: --- Thank you for sharing such good information. Very informative and effective post. [https://www.igmguru.com/digital-marketing-programming/react-native-training/] > Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type > instead of automatically converting to LongType > > > Key: SPARK-40819 > URL: https://issues.apache.org/jira/browse/SPARK-40819 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1, 3.2.3, 3.3.2 >Reporter: Alfred Davidson >Priority: Critical > > Since 3.2 parquet files containing attributes with type "INT64 > (TIMESTAMP(NANOS, true))" are no longer readable and attempting to read > throws: > > {code:java} > Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: > INT64 (TIMESTAMP(NANOS,true)) > at > org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:105) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:174) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:90) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:72) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:66) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:63) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:548) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:548) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:528) > at scala.collection.immutable.Stream.map(Stream.scala:418) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:528) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:521) > at > 
org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76) > {code} > Prior to 3.2 successfully reads the parquet automatically converting to a > LongType. > I believe work part of https://issues.apache.org/jira/browse/SPARK-34661 > introduced the change in behaviour, more specifically here: > [https://github.com/apache/spark/pull/31776/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R154] > which throws the QueryCompilationErrors.illegalParquetTypeError -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
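A minimal reproduction sketch, assuming a Parquet file produced elsewhere (e.g. by another engine) with an INT64 (TIMESTAMP(NANOS,true)) column; the path is a placeholder and the snippet is meant for spark-shell:

{code:scala}
// Placeholder path: a file whose schema contains INT64 (TIMESTAMP(NANOS,true)).
val df = spark.read.parquet("/data/events_with_nanos_timestamps.parquet")

// Spark <= 3.1: the NANOS column was silently read back as LongType.
// Spark >= 3.2: schema conversion fails with
//   AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
df.printSchema()
{code}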
[jira] [Commented] (SPARK-40708) Auto update table statistics based on write metrics
[ https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628234#comment-17628234 ] Apache Spark commented on SPARK-40708: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/38496 > Auto update table statistics based on write metrics > --- > > Key: SPARK-40708 > URL: https://issues.apache.org/jira/browse/SPARK-40708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > // Get write statistics > def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): > Option[WriteStats] = { > val numBytes = > metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_)) > val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_)) > numBytes.map(WriteStats(mode, _, numRows)) > } > // Update table statistics > val stat = wroteStats.get > stat.mode match { > case SaveMode.Overwrite | SaveMode.ErrorIfExists => > catalog.alterTableStats(table.identifier, > Some(CatalogStatistics(stat.numBytes, stat.numRows))) > case _ if table.stats.nonEmpty => // SaveMode.Append > catalog.alterTableStats(table.identifier, None) > case _ => // SaveMode.Ignore Do nothing > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
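To make the proposed decision table explicit, here is a self-contained sketch with simplified stand-in types (these are not Spark's actual SaveMode, CatalogStatistics or metric classes):

{code:scala}
object StatsUpdateSketch {
  sealed trait SaveMode
  case object Overwrite extends SaveMode
  case object ErrorIfExists extends SaveMode
  case object Append extends SaveMode
  case object Ignore extends SaveMode

  case class WriteStats(numBytes: BigInt, numRows: Option[BigInt])

  sealed trait CatalogAction
  case class SetStats(stats: WriteStats) extends CatalogAction // overwrite-style: write metrics describe the whole table
  case object ClearStats extends CatalogAction                 // append: previously stored stats are now stale
  case object NoOp extends CatalogAction

  // The decision table sketched in the issue description.
  def actionFor(mode: SaveMode, stats: WriteStats, tableHasStats: Boolean): CatalogAction =
    mode match {
      case Overwrite | ErrorIfExists => SetStats(stats)
      case Append if tableHasStats   => ClearStats
      case _                         => NoOp
    }
}
{code}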
[jira] [Commented] (SPARK-40708) Auto update table statistics based on write metrics
[ https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628235#comment-17628235 ] Apache Spark commented on SPARK-40708: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/38496 > Auto update table statistics based on write metrics > --- > > Key: SPARK-40708 > URL: https://issues.apache.org/jira/browse/SPARK-40708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > // Get write statistics > def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): > Option[WriteStats] = { > val numBytes = > metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_)) > val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_)) > numBytes.map(WriteStats(mode, _, numRows)) > } > // Update table statistics > val stat = wroteStats.get > stat.mode match { > case SaveMode.Overwrite | SaveMode.ErrorIfExists => > catalog.alterTableStats(table.identifier, > Some(CatalogStatistics(stat.numBytes, stat.numRows))) > case _ if table.stats.nonEmpty => // SaveMode.Append > catalog.alterTableStats(table.identifier, None) > case _ => // SaveMode.Ignore Do nothing > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40708) Auto update table statistics based on write metrics
[ https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40708: Assignee: (was: Apache Spark) > Auto update table statistics based on write metrics > --- > > Key: SPARK-40708 > URL: https://issues.apache.org/jira/browse/SPARK-40708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > // Get write statistics > def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): > Option[WriteStats] = { > val numBytes = > metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_)) > val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_)) > numBytes.map(WriteStats(mode, _, numRows)) > } > // Update table statistics > val stat = wroteStats.get > stat.mode match { > case SaveMode.Overwrite | SaveMode.ErrorIfExists => > catalog.alterTableStats(table.identifier, > Some(CatalogStatistics(stat.numBytes, stat.numRows))) > case _ if table.stats.nonEmpty => // SaveMode.Append > catalog.alterTableStats(table.identifier, None) > case _ => // SaveMode.Ignore Do nothing > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40708) Auto update table statistics based on write metrics
[ https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40708: Assignee: Apache Spark > Auto update table statistics based on write metrics > --- > > Key: SPARK-40708 > URL: https://issues.apache.org/jira/browse/SPARK-40708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > {code:scala} > // Get write statistics > def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): > Option[WriteStats] = { > val numBytes = > metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_)) > val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_)) > numBytes.map(WriteStats(mode, _, numRows)) > } > // Update table statistics > val stat = wroteStats.get > stat.mode match { > case SaveMode.Overwrite | SaveMode.ErrorIfExists => > catalog.alterTableStats(table.identifier, > Some(CatalogStatistics(stat.numBytes, stat.numRows))) > case _ if table.stats.nonEmpty => // SaveMode.Append > catalog.alterTableStats(table.identifier, None) > case _ => // SaveMode.Ignore Do nothing > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628225#comment-17628225 ] Apache Spark commented on SPARK-35531: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/38495 > Can not insert into hive bucket table if create table with upper case schema > > > Key: SPARK-35531 > URL: https://issues.apache.org/jira/browse/SPARK-35531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.1, 3.2.0 >Reporter: Hongyi Zhang >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0, 3.1.4 > > > > > create table TEST1( > V1 BIGINT, > S1 INT) > partitioned by (PK BIGINT) > clustered by (V1) > sorted by (S1) > into 200 buckets > STORED AS PARQUET; > > insert into test1 > select > * from values(1,1,1); > > > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628220#comment-17628220 ] Apache Spark commented on SPARK-41004: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/38494 > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
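For context, "checking error classes" means asserting on the structured error class and its message parameters rather than on raw message text. The sketch below models that locally; ConnectError, assertErrorClass and the "cls" parameter are illustrative assumptions, not the Connect module's real classes or the suite's actual helper.

{code:scala}
object ErrorClassCheckSketch {
  // Local model of an exception that carries an error class and parameters.
  case class ConnectError(errorClass: String, messageParameters: Map[String, String])
    extends RuntimeException(s"[$errorClass] " + messageParameters.mkString(", "))

  // Hypothetical assertion helper: checks the class and parameters, not the text.
  def assertErrorClass(e: ConnectError, expectedClass: String, expectedParams: Map[String, String]): Unit = {
    assert(e.errorClass == expectedClass, s"expected $expectedClass but got ${e.errorClass}")
    assert(e.messageParameters == expectedParams)
  }

  def main(args: Array[String]): Unit = {
    val e = ConnectError("CONNECT.INTERCEPTOR_CTOR_MISSING", Map("cls" -> "org.example.MyInterceptor"))
    assertErrorClass(e, "CONNECT.INTERCEPTOR_CTOR_MISSING", Map("cls" -> "org.example.MyInterceptor"))
    println("error class and parameters match")
  }
}
{code}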
[jira] [Assigned] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41004: Assignee: Apache Spark > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41004: Assignee: (was: Apache Spark) > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41004) Check error classes in InterceptorRegistrySuite
BingKun Pan created SPARK-41004: --- Summary: Check error classes in InterceptorRegistrySuite Key: SPARK-41004 URL: https://issues.apache.org/jira/browse/SPARK-41004 Project: Spark Issue Type: Sub-task Components: Connect, Tests Affects Versions: 3.4.0 Reporter: BingKun Pan - CONNECT.INTERCEPTOR_CTOR_MISSING - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38270) SQL CLI AM should keep same exitcode with client
[ https://issues.apache.org/jira/browse/SPARK-38270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628179#comment-17628179 ] Apache Spark commented on SPARK-38270: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/38492 > SQL CLI AM should keep same exitcode with client > > > Key: SPARK-38270 > URL: https://issues.apache.org/jira/browse/SPARK-38270 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.1 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.4.0 > > > Currently, for the SQL CLI, we always use a shutdown hook to stop the SparkContext: > {code:scala} > // Clean up after we exit > ShutdownHookManager.addShutdownHook { () => SparkSQLEnv.stop() } > {code} > This causes the YARN AM to always report success, even when the client exits with a non-zero code. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
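One way to get the behaviour the issue asks for is to compute the client's exit status explicitly and pass it to sys.exit after stopping the environment, instead of relying only on the shutdown hook. The sketch below is illustrative; runCli and stopSparkContext are placeholders, not the actual SPARK-38270 patch.

{code:scala}
object SqlCliExitSketch {
  // Placeholders: runCli stands for the CLI driver loop, stopSparkContext for SparkSQLEnv.stop().
  private def runCli(args: Array[String]): Int = 0
  private def stopSparkContext(): Unit = ()

  def main(args: Array[String]): Unit = {
    val exitCode =
      try runCli(args)                 // returns non-zero when a statement fails
      catch { case _: Throwable => 1 } // any unexpected failure maps to a non-zero status
      finally stopSparkContext()       // the SparkContext is still stopped on every path

    // Exiting with the real status lets YARN record the AM as failed when the CLI failed.
    sys.exit(exitCode)
  }
}
{code}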