[jira] [Resolved] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41004. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38494 [https://github.com/apache/spark/pull/38494] > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41004: Assignee: BingKun Pan > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33349) ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed
[ https://issues.apache.org/jira/browse/SPARK-33349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628700#comment-17628700 ] Yilun Fan commented on SPARK-33349: --- I also met this problem in Spark 3.2.1, kubernetes-client 5.4.1. {code:java} ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.) io.fabric8.kubernetes.client.WatcherException: too old resource version: 63993943 (64057995) at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103){code} I think we have to add some retry logic in ExecutorPodsWatchSnapshotSource. Especially when spark.kubernetes.executor.enableApiPolling is disabled, only this watcher can receive executor pod status. Just like what Spark has done in the submit client. [https://github.com/apache/spark/pull/29533/files] > ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed > -- > > Key: SPARK-33349 > URL: https://issues.apache.org/jira/browse/SPARK-33349 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1, 3.0.2, 3.1.0 >Reporter: Nicola Bova >Priority: Critical > > I launch my spark application with the > [spark-on-kubernetes-operator|https://github.com/GoogleCloudPlatform/spark-on-k8s-operator] > with the following yaml file: > {code:yaml} > apiVersion: sparkoperator.k8s.io/v1beta2 > kind: SparkApplication > metadata: > name: spark-kafka-streamer-test > namespace: kafka2hdfs > spec: > type: Scala > mode: cluster > image: /spark:3.0.2-SNAPSHOT-2.12-0.1.0 > imagePullPolicy: Always > timeToLiveSeconds: 259200 > mainClass: path.to.my.class.KafkaStreamer > mainApplicationFile: spark-kafka-streamer_2.12-spark300-assembly.jar > sparkVersion: 3.0.1 > restartPolicy: > type: Always > sparkConf: > "spark.kafka.consumer.cache.capacity": "8192" > "spark.kubernetes.memoryOverheadFactor": "0.3" > deps: > jars: > - my > - jar > - list > hadoopConfigMap: hdfs-config > driver: > cores: 4 > memory: 12g > labels: > version: 3.0.1 > serviceAccount: default > javaOptions: > "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties" > executor: > instances: 4 > cores: 4 > memory: 16g > labels: > version: 3.0.1 > javaOptions: > "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties" > {code} > I have tried with both Spark `3.0.1` and `3.0.2-SNAPSHOT` with the ["Restart > the watcher when we receive a version changed from > k8s"|https://github.com/apache/spark/pull/29533] patch. > This is the driver log: > {code} > 20/11/04 12:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > ... // my app log, it's a structured streaming app reading from kafka and > writing to hdfs > 20/11/04 13:12:12 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has > been closed (this is expected if the application is shutting down.) 
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource > version: 1574101276 (1574213896) > at > io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) > at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) > at > okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) > at > okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) > at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown > Source) > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown > Source) > at java.base/java.lang.Thread.run(Unknown Source) > {code} > The error above appears after roughly 50 minutes. > After the exception above, no more logs are produced and the app hangs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
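The retry suggested in the comment above can be sketched as follows. This is a hedged illustration only, not Spark's actual ExecutorPodsWatchSnapshotSource: the class and method names (RestartingExecutorPodsWatcher, resubscribe) are made up, and the only assumption is the fabric8 5.x Watcher/WatcherException API shown in the stack trace, where a "too old resource version" close surfaces as HTTP 410 (Gone).

{code:scala}
// Hedged sketch, not the real Spark implementation. A production version would also
// resync the executor pod snapshot after resubscribing, since events may have been lost.
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClient, Watch, Watcher, WatcherException}

class RestartingExecutorPodsWatcher(
    client: KubernetesClient,
    appId: String,
    onPodUpdate: Pod => Unit) extends Watcher[Pod] {

  @volatile private var watch: Watch = _

  // (Re)open the watch; called once at startup and again after a recoverable close.
  def resubscribe(): Unit = {
    watch = client.pods()
      .withLabel("spark-app-selector", appId)
      .watch(this)
  }

  override def eventReceived(action: Watcher.Action, pod: Pod): Unit =
    onPodUpdate(pod)

  override def onClose(cause: WatcherException): Unit = {
    // "too old resource version" is reported as HTTP 410 (Gone); instead of giving up,
    // re-establish the watch so executor pod status keeps flowing to the driver.
    if (cause != null && cause.isHttpGone) {
      resubscribe()
    }
  }

  // Stop watching, e.g. on application shutdown.
  def stop(): Unit = {
    if (watch != null) watch.close()
  }
}
{code}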
[jira] [Resolved] (SPARK-40777) Use error classes for Protobuf exceptions
[ https://issues.apache.org/jira/browse/SPARK-40777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-40777. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38344 [https://github.com/apache/spark/pull/38344] > Use error classes for Protobuf exceptions > - > > Key: SPARK-40777 > URL: https://issues.apache.org/jira/browse/SPARK-40777 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Assignee: Sandish Kumar HN >Priority: Major > Fix For: 3.4.0 > > > We should use error classes for all the exceptions. > A follow up from Protobuf PR [https://github.com/apache/spark/pull/37972] > > cc: [~sanysand...@gmail.com] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40777) Use error classes for Protobuf exceptions
[ https://issues.apache.org/jira/browse/SPARK-40777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-40777: Assignee: Sandish Kumar HN > Use error classes for Protobuf exceptions > - > > Key: SPARK-40777 > URL: https://issues.apache.org/jira/browse/SPARK-40777 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Assignee: Sandish Kumar HN >Priority: Major > > We should use error classes for all the exceptions. > A follow up from Protobuf PR [https://github.com/apache/spark/pull/37972] > > cc: [~sanysand...@gmail.com] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41012: Assignee: (was: Apache Spark) > Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE > --- > > Key: SPARK-41012 > URL: https://issues.apache.org/jira/browse/SPARK-41012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Rename the _LEGACY_ERROR_TEMP_1022 to proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41012: Assignee: Apache Spark > Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE > --- > > Key: SPARK-41012 > URL: https://issues.apache.org/jira/browse/SPARK-41012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > Rename the _LEGACY_ERROR_TEMP_1022 to proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628675#comment-17628675 ] Apache Spark commented on SPARK-41012: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/38508 > Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE > --- > > Key: SPARK-41012 > URL: https://issues.apache.org/jira/browse/SPARK-41012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Rename the _LEGACY_ERROR_TEMP_1022 to proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628673#comment-17628673 ] Haejoon Lee commented on SPARK-41012: - I'm working on it > Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE > --- > > Key: SPARK-41012 > URL: https://issues.apache.org/jira/browse/SPARK-41012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Rename the _LEGACY_ERROR_TEMP_1022 to proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
Haejoon Lee created SPARK-41012: --- Summary: Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE Key: SPARK-41012 URL: https://issues.apache.org/jira/browse/SPARK-41012 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Haejoon Lee Rename the _LEGACY_ERROR_TEMP_1022 to proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
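For context, the error this class covers is the ORDER BY ordinal check. A minimal way to hit it, assuming nothing beyond a plain SparkSession named `spark`:

{code:scala}
// ORDER BY refers to select-list position 2, but the query only projects one column,
// so analysis fails with the out-of-range ordinal error that this ticket renames.
spark.sql("SELECT id FROM VALUES (1), (2) AS t(id) ORDER BY 2").show()
{code}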
[jira] [Created] (SPARK-41011) Refine Sequence#checkInputDataTypes related DataTypeMismatch
Yang Jie created SPARK-41011: Summary: Refine Sequence#checkInputDataTypes related DataTypeMismatch Key: SPARK-41011 URL: https://issues.apache.org/jira/browse/SPARK-41011 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40372) Migrate failures of array type checks onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40372: Assignee: Apache Spark > Migrate failures of array type checks onto error classes > > > Key: SPARK-40372 > URL: https://issues.apache.org/jira/browse/SPARK-40372 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in collection > expressions: > 1. SortArray (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035 > 2. ArrayContains (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264 > 3. ArrayPosition (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035 > 4. ElementAt (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187 > 5. Concat (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388 > 6. Flatten (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595 > 7. Sequence (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773 > 8. ArrayRemove (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447 > 9. ArrayDistinct (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40372) Migrate failures of array type checks onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40372: Assignee: (was: Apache Spark) > Migrate failures of array type checks onto error classes > > > Key: SPARK-40372 > URL: https://issues.apache.org/jira/browse/SPARK-40372 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in collection > expressions: > 1. SortArray (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035 > 2. ArrayContains (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264 > 3. ArrayPosition (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035 > 4. ElementAt (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187 > 5. Concat (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388 > 6. Flatten (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595 > 7. Sequence (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773 > 8. ArrayRemove (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447 > 9. ArrayDistinct (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41001) Connection string support for Python client
[ https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41001: Assignee: Martin Grund > Connection string support for Python client > --- > > Key: SPARK-41001 > URL: https://issues.apache.org/jira/browse/SPARK-41001 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41001) Connection string support for Python client
[ https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41001. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38501 [https://github.com/apache/spark/pull/38501] > Connection string support for Python client > --- > > Key: SPARK-41001 > URL: https://issues.apache.org/jira/browse/SPARK-41001 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40976) Upgrade sbt to 1.7.3
[ https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40976. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38502 [https://github.com/apache/spark/pull/38502] > Upgrade sbt to 1.7.3 > > > Key: SPARK-40976 > URL: https://issues.apache.org/jira/browse/SPARK-40976 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > https://github.com/sbt/sbt/releases/tag/v1.7.3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40976) Upgrade sbt to 1.7.3
[ https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40976: Assignee: Yang Jie > Upgrade sbt to 1.7.3 > > > Key: SPARK-40976 > URL: https://issues.apache.org/jira/browse/SPARK-40976 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > https://github.com/sbt/sbt/releases/tag/v1.7.3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41010) Complete Support for Except and Intersect in Python client
[ https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628559#comment-17628559 ] Apache Spark commented on SPARK-41010: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38506 > Complete Support for Except and Intersect in Python client > -- > > Key: SPARK-41010 > URL: https://issues.apache.org/jira/browse/SPARK-41010 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41010) Complete Support for Except and Intersect in Python client
[ https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628557#comment-17628557 ] Apache Spark commented on SPARK-41010: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38506 > Complete Support for Except and Intersect in Python client > -- > > Key: SPARK-41010 > URL: https://issues.apache.org/jira/browse/SPARK-41010 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41010) Complete Support for Except and Intersect in Python client
[ https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41010: Assignee: Apache Spark > Complete Support for Except and Intersect in Python client > -- > > Key: SPARK-41010 > URL: https://issues.apache.org/jira/browse/SPARK-41010 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41010) Complete Support for Except and Intersect in Python client
[ https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41010: Assignee: (was: Apache Spark) > Complete Support for Except and Intersect in Python client > -- > > Key: SPARK-41010 > URL: https://issues.apache.org/jira/browse/SPARK-41010 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40622) Result of a single task in collect() must fit in 2GB
[ https://issues.apache.org/jira/browse/SPARK-40622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628553#comment-17628553 ] Apache Spark commented on SPARK-40622: -- User 'liuzqt' has created a pull request for this issue: https://github.com/apache/spark/pull/38505 > Result of a single task in collect() must fit in 2GB > > > Key: SPARK-40622 > URL: https://issues.apache.org/jira/browse/SPARK-40622 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Ziqi Liu >Priority: Major > > When collecting results, data from a single partition/task is serialized > through a byte array or ByteBuffer (which is backed by a byte array as well), > so it is subject to the Java array max size limit (for a byte array, > it's 2GB). > > Constructing a single partition larger than 2GB and collecting it can easily > reproduce the issue > {code:java} > // create data of size ~3GB in single partition, which exceeds the byte array > limit > // random gen to make sure it's poorly compressed > val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) > as data") > withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") { > withSQLConf("spark.sql.useChunkedBuffer" -> "true") { > df.queryExecution.executedPlan.executeCollect() > } > } {code} > will get an OOM error from > [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125] > > Consider using ChunkedByteBuffer to replace the byte array in order to bypass > this limit -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40622) Result of a single task in collect() must fit in 2GB
[ https://issues.apache.org/jira/browse/SPARK-40622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628552#comment-17628552 ] Apache Spark commented on SPARK-40622: -- User 'liuzqt' has created a pull request for this issue: https://github.com/apache/spark/pull/38505 > Result of a single task in collect() must fit in 2GB > > > Key: SPARK-40622 > URL: https://issues.apache.org/jira/browse/SPARK-40622 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Ziqi Liu >Priority: Major > > When collecting results, data from a single partition/task is serialized > through a byte array or ByteBuffer (which is backed by a byte array as well), > so it is subject to the Java array max size limit (for a byte array, > it's 2GB). > > Constructing a single partition larger than 2GB and collecting it can easily > reproduce the issue > {code:java} > // create data of size ~3GB in single partition, which exceeds the byte array > limit > // random gen to make sure it's poorly compressed > val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) > as data") > withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") { > withSQLConf("spark.sql.useChunkedBuffer" -> "true") { > df.queryExecution.executedPlan.executeCollect() > } > } {code} > will get an OOM error from > [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125] > > Consider using ChunkedByteBuffer to replace the byte array in order to bypass > this limit -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
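To make the proposed direction concrete, here is a minimal, self-contained sketch of the chunking idea: instead of one contiguous byte[], which the JVM caps at roughly 2GB, serialized bytes are appended to a list of fixed-size chunks. This is an illustration only; ChunkedOutputStream is an invented name and not Spark's ChunkedByteBuffer API.

{code:scala}
// Minimal sketch of chunked buffering: total size can exceed Int.MaxValue because
// the data is spread over many small arrays instead of one giant byte[].
import java.io.OutputStream
import scala.collection.mutable.ArrayBuffer

class ChunkedOutputStream(chunkSize: Int) extends OutputStream {
  private val chunks = ArrayBuffer[Array[Byte]]()
  private var current: Array[Byte] = _
  private var pos = 0

  override def write(b: Int): Unit = {
    // Start a new chunk when there is none yet or the current one is full.
    if (current == null || pos == chunkSize) {
      current = new Array[Byte](chunkSize)
      chunks += current
      pos = 0
    }
    current(pos) = b.toByte
    pos += 1
  }

  // All chunks except the last are full; the last holds `pos` bytes.
  def totalSize: Long = {
    if (chunks.isEmpty) 0L
    else (chunks.size - 1).toLong * chunkSize + pos
  }
}
{code}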
[jira] [Commented] (SPARK-40681) Update gson transitive dependency to 2.8.9 or later
[ https://issues.apache.org/jira/browse/SPARK-40681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628549#comment-17628549 ] Michael deLeon commented on SPARK-40681: Is there any update on when we might see this in a Spark release? > Update gson transitive dependency to 2.8.9 or later > --- > > Key: SPARK-40681 > URL: https://issues.apache.org/jira/browse/SPARK-40681 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Andrew Kyle Purtell >Priority: Minor > > Spark 3.3 currently ships with GSON 2.8.6 and this should be managed up to > 2.8.9 or later. > Versions of GSON prior to 2.8.9 are subject to > [gson#1991|https://github.com/google/gson/pull/1991] , detected and reported > by several flavors of static vulnerability assessment tools, at a fairly high > score because it is a deserialization of untrusted data problem. > This issue is not meant to imply any particular security problem in Spark > itself. > {noformat} > [INFO] org.apache.spark:spark-network-common_2.12:jar:3.3.2-SNAPSHOT > [INFO] +- com.google.crypto.tink:tink:jar:1.6.1:compile > [INFO] | \- com.google.code.gson:gson:jar:2.8.6:compile > {noformat} > {noformat} > [INFO] org.apache.spark:spark-hive_2.12:jar:3.3.2-SNAPSHOT > [INFO] +- org.apache.hive:hive-exec:jar:core:2.3.9:compile > [INFO] | +- com.google.code.gson:gson:jar:2.2.4:compile > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
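Until a Spark release picks this up, a downstream application can pin the transitive version in its own build. A hedged example for an sbt-based project (the override below is an application-side workaround, not a change in Spark itself):

{code:scala}
// build.sbt: force the gson pulled in transitively (via tink / hive-exec) up to a
// version that contains the fix for gson#1991.
dependencyOverrides += "com.google.code.gson" % "gson" % "2.8.9"
{code}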
[jira] [Commented] (SPARK-40815) SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits
[ https://issues.apache.org/jira/browse/SPARK-40815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628545#comment-17628545 ] Apache Spark commented on SPARK-40815: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/38504 > SymlinkTextInputFormat returns incorrect result due to enabled > spark.hadoopRDD.ignoreEmptySplits > > > Key: SPARK-40815 > URL: https://issues.apache.org/jira/browse/SPARK-40815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40815) SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits
[ https://issues.apache.org/jira/browse/SPARK-40815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628546#comment-17628546 ] Apache Spark commented on SPARK-40815: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/38504 > SymlinkTextInputFormat returns incorrect result due to enabled > spark.hadoopRDD.ignoreEmptySplits > > > Key: SPARK-40815 > URL: https://issues.apache.org/jira/browse/SPARK-40815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41010) Complete Support for Except and Intersect in Python client
Rui Wang created SPARK-41010: Summary: Complete Support for Except and Intersect in Python client Key: SPARK-41010 URL: https://issues.apache.org/jira/browse/SPARK-41010 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40801) Upgrade Apache Commons Text to 1.10
[ https://issues.apache.org/jira/browse/SPARK-40801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40801: - Fix Version/s: 3.2.3 > Upgrade Apache Commons Text to 1.10 > --- > > Key: SPARK-40801 > URL: https://issues.apache.org/jira/browse/SPARK-40801 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > Fix For: 3.4.0, 3.2.3, 3.3.2 > > > [CVE-2022-42889|https://nvd.nist.gov/vuln/detail/CVE-2022-42889] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators
[ https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40940: Assignee: (was: Apache Spark) > Fix the unsupported ops checker to allow chaining of stateful operators > --- > > Key: SPARK-40940 > URL: https://issues.apache.org/jira/browse/SPARK-40940 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Priority: Major > > This is follow up ticket on https://issues.apache.org/jira/browse/SPARK-40925 > - once we allow chaining of stateful operators in Spark SS, we need to fix > the unsupported ops checker to allow these (currently they are blocked and > require setting spark.sql.streaming.unsupportedOperationCheck to false -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators
[ https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40940: Assignee: Apache Spark > Fix the unsupported ops checker to allow chaining of stateful operators > --- > > Key: SPARK-40940 > URL: https://issues.apache.org/jira/browse/SPARK-40940 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Assignee: Apache Spark >Priority: Major > > This is follow up ticket on https://issues.apache.org/jira/browse/SPARK-40925 > - once we allow chaining of stateful operators in Spark SS, we need to fix > the unsupported ops checker to allow these (currently they are blocked and > require setting spark.sql.streaming.unsupportedOperationCheck to false -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators
[ https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628472#comment-17628472 ] Wei Liu commented on SPARK-40940: - PR in: https://github.com/apache/spark/pull/38503 > Fix the unsupported ops checker to allow chaining of stateful operators > --- > > Key: SPARK-40940 > URL: https://issues.apache.org/jira/browse/SPARK-40940 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Priority: Major > > This is follow up ticket on https://issues.apache.org/jira/browse/SPARK-40925 > - once we allow chaining of stateful operators in Spark SS, we need to fix > the unsupported ops checker to allow these (currently they are blocked and > require setting spark.sql.streaming.unsupportedOperationCheck to false -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators
[ https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628474#comment-17628474 ] Apache Spark commented on SPARK-40940: -- User 'WweiL' has created a pull request for this issue: https://github.com/apache/spark/pull/38503 > Fix the unsupported ops checker to allow chaining of stateful operators > --- > > Key: SPARK-40940 > URL: https://issues.apache.org/jira/browse/SPARK-40940 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Priority: Major > > This is follow up ticket on https://issues.apache.org/jira/browse/SPARK-40925 > - once we allow chaining of stateful operators in Spark SS, we need to fix > the unsupported ops checker to allow these (currently they are blocked and > require setting spark.sql.streaming.unsupportedOperationCheck to false -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
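Until the checker is fixed, the workaround named in the description is the unsupportedOperationCheck flag. A hedged sketch of what a chained stateful query might look like with that check disabled; the query itself is illustrative, not taken from the ticket:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("chained-stateful-sketch").getOrCreate()
// Workaround mentioned above: disable the unsupported-operation check.
spark.conf.set("spark.sql.streaming.unsupportedOperationCheck", "false")

val events = spark.readStream.format("rate").load()  // columns: timestamp, value

// First stateful operator: watermarked windowed count.
val perWindow = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()

// Second stateful operator chained on top of the first: re-aggregate the windows.
val totals = perWindow.groupBy(col("window.end")).agg(sum("count").as("total"))
{code}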
[jira] [Resolved] (SPARK-40869) KubernetesConf.getResourceNamePrefix creates invalid name prefixes
[ https://issues.apache.org/jira/browse/SPARK-40869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40869. --- Fix Version/s: 3.3.2 3.2.3 3.4.0 Resolution: Fixed Issue resolved by pull request 38331 [https://github.com/apache/spark/pull/38331] > KubernetesConf.getResourceNamePrefix creates invalid name prefixes > -- > > Key: SPARK-40869 > URL: https://issues.apache.org/jira/browse/SPARK-40869 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Tobias Stadler >Assignee: Tobias Stadler >Priority: Major > Fix For: 3.3.2, 3.2.3, 3.4.0 > > > If `KubernetesConf.getResourceNamePrefix` is called with e.g. `_name_`, it > generates an invalid name prefix, e.g. `-name-0123456789abcdef`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40869) KubernetesConf.getResourceNamePrefix creates invalid name prefixes
[ https://issues.apache.org/jira/browse/SPARK-40869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-40869: - Assignee: Tobias Stadler > KubernetesConf.getResourceNamePrefix creates invalid name prefixes > -- > > Key: SPARK-40869 > URL: https://issues.apache.org/jira/browse/SPARK-40869 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Tobias Stadler >Assignee: Tobias Stadler >Priority: Major > > If `KubernetesConf.getResourceNamePrefix` is called with e.g. `_name_`, it > generates an invalid name prefix, e.g. `-name-0123456789abcdef`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
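For context, the invalid prefix comes from Kubernetes naming rules: object names must consist of lowercase alphanumerics and '-', and must start and end with an alphanumeric character. A hedged sketch of the kind of sanitization needed (illustrative only, not the actual patch in PR 38331):

{code:scala}
// Illustrative only: map an app name like "_name_" to a legal prefix ("name"),
// rather than the invalid "-name-..." described above.
def sanitizeResourceNamePrefix(appName: String): String = {
  val cleaned = appName.toLowerCase
    .replaceAll("[^a-z0-9-]", "-")   // replace characters Kubernetes names cannot contain
    .replaceAll("^-+|-+$", "")       // names must start and end with an alphanumeric
  if (cleaned.isEmpty) "spark" else cleaned  // fall back when nothing usable remains
}

sanitizeResourceNamePrefix("_name_")  // returns "name"
{code}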
[jira] [Commented] (SPARK-40976) Upgrade sbt to 1.7.3
[ https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628438#comment-17628438 ] Apache Spark commented on SPARK-40976: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38502 > Upgrade sbt to 1.7.3 > > > Key: SPARK-40976 > URL: https://issues.apache.org/jira/browse/SPARK-40976 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://github.com/sbt/sbt/releases/tag/v1.7.3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41001) Connection string support for Python client
[ https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628429#comment-17628429 ] Apache Spark commented on SPARK-41001: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/38501 > Connection string support for Python client > --- > > Key: SPARK-41001 > URL: https://issues.apache.org/jira/browse/SPARK-41001 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41002) Compatible `take`, `head` and `first` API in Python client
[ https://issues.apache.org/jira/browse/SPARK-41002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-41002: - Summary: Compatible `take`, `head` and `first` API in Python client (was: Compatible `take` and `head` API in Python client ) > Compatible `take`, `head` and `first` API in Python client > --- > > Key: SPARK-41002 > URL: https://issues.apache.org/jira/browse/SPARK-41002 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
[ https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628415#comment-17628415 ] Apache Spark commented on SPARK-41009: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/38490 > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 > --- > > Key: SPARK-41009 > URL: https://issues.apache.org/jira/browse/SPARK-41009 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
[ https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41009: Assignee: Apache Spark (was: Max Gekk) > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 > --- > > Key: SPARK-41009 > URL: https://issues.apache.org/jira/browse/SPARK-41009 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
[ https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628414#comment-17628414 ] Apache Spark commented on SPARK-41009: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/38490 > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 > --- > > Key: SPARK-41009 > URL: https://issues.apache.org/jira/browse/SPARK-41009 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
[ https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41009: Assignee: Max Gekk (was: Apache Spark) > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 > --- > > Key: SPARK-41009 > URL: https://issues.apache.org/jira/browse/SPARK-41009 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
Max Gekk created SPARK-41009: Summary: Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070 Key: SPARK-41009 URL: https://issues.apache.org/jira/browse/SPARK-41009 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Fix For: 3.4.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame({ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. {code} was: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame({ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. 
tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. {code} > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > > {code:python} > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regression import
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame({ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. {code} was: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. 
tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. # {code} > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > > {code:python} > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regre
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: {code:python} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. # {code} was: ``` import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. 
tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. # ``` > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > {code:python} > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regression
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: ``` import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. # ``` was: {{```}} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. 
tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. # {{```}} > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > ``` > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regression import Isot
[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation
[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arne Koopman updated SPARK-41008: - Description: {{```}} import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( \{ "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. # {{```}} was: import pandas as pd from pyspark.sql.types import DoubleType from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( { "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. 
tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation seems unclear. Similar small toy examples lead to similar non-expected results for the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations dissapears. > Isotonic regression result differs from sklearn implementation > -- > > Key: SPARK-41008 > URL: https://issues.apache.org/jira/browse/SPARK-41008 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.3.1 >Reporter: Arne Koopman >Priority: Major > > > {{```}} > import pandas as pd > from pyspark.sql.types import DoubleType > from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn > from pyspark.ml.regression import IsotonicRegression as
[jira] [Created] (SPARK-41008) Isotonic regression result differs from sklearn implementation
Arne Koopman created SPARK-41008: Summary: Isotonic regression result differs from sklearn implementation Key: SPARK-41008 URL: https://issues.apache.org/jira/browse/SPARK-41008 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 3.3.1 Reporter: Arne Koopman import pandas as pd from pyspark.sql.types import DoubleType from pyspark.sql import functions as F from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark # The P(positives | model_score): # 0.6 -> 0.5 (1 out of the 2 labels is positive) # 0.333 -> 0.333 (1 out of the 3 labels is positive) # 0.20 -> 0.25 (1 out of the 4 labels is positive) tc_pd = pd.DataFrame( { "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], "label": [1, 0, 0, 1, 0, 1, 0, 0, 0], "weight": 1, } ) # The fraction of positives for each of the distinct model_scores would be the best fit. # Resulting in the following expected calibrated model_scores: # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25] # The sklearn implementation of Isotonic Regression. tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight']) print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score'])) # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ] # The pyspark implementation of Isotonic Regression. tc_df = spark.createDataFrame(tc_pd) tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType())) isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight') tc_model = isotonic_regressor_pyspark.fit(tc_df) tc_pd = tc_model.transform(tc_df).toPandas() print("pyspark:", tc_pd['prediction'].values) # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ] # The result from the pyspark implementation is not the expected one. Similar small toy examples lead to similarly unexpected results from the pyspark implementation. # Strangely enough, for 'large' datasets, the difference between calibrated model_scores generated by both implementations disappears. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
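To help triage whether the discrepancy above comes from MLlib itself rather than from the Python wrapper, the following is a minimal sketch of the same toy example against the Scala IsotonicRegression API. It is illustrative only: the SparkSession setup and object names are assumptions, not part of the original report, and the expected values simply restate the fraction-of-positives reasoning from the description.
{code:scala}
// Minimal sketch (assumption: a local SparkSession is used; names are illustrative).
// Fits Spark ML IsotonicRegression on the same toy data as the report and prints
// the fitted boundaries/predictions so they can be compared with the sklearn output.
import org.apache.spark.ml.regression.IsotonicRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ir-toy-example").getOrCreate()
import spark.implicits._

val tc = Seq(
  (0.6, 1.0, 1.0), (0.6, 0.0, 1.0),
  (0.333, 0.0, 1.0), (0.333, 1.0, 1.0), (0.333, 0.0, 1.0),
  (0.20, 1.0, 1.0), (0.20, 0.0, 1.0), (0.20, 0.0, 1.0), (0.20, 0.0, 1.0)
).toDF("model_score", "label", "weight")

val model = new IsotonicRegression()
  .setFeaturesCol("model_score")
  .setLabelCol("label")
  .setWeightCol("weight")
  .fit(tc)

// Per the reasoning in the description, the piecewise-constant fit should be
// consistent with the per-score fractions of positives [0.25, 0.333, 0.5].
println(model.boundaries)
println(model.predictions)
model.transform(tc).show(false)
{code}
Comparing model.boundaries and model.predictions with the sklearn values narrows the problem down to either the MLlib fit itself or the handling of the single-feature column in the Python wrapper.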
[jira] [Assigned] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41007: Assignee: Apache Spark > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Assignee: Apache Spark >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}} the > dataset will fail to serialize correctly. When trying to serialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41007: Assignee: (was: Apache Spark) > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}} the > dataset will fail to serialize correctly. When trying to serialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628384#comment-17628384 ] Apache Spark commented on SPARK-41007: -- User 'dfit99' has created a pull request for this issue: https://github.com/apache/spark/pull/38500 > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}} the > dataset will fail to serialize correctly. When trying to serialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Fiterma updated SPARK-41007: --- Description: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}} the dataset will fail to serialize correctly. When trying to serialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function was: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}} the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}} the > dataset will fail to serialize correctly. When trying to serialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628372#comment-17628372 ] Daniel Fiterma commented on SPARK-41007: FYI: Have a fix for this already, going to push out a merge request soon. > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}} the > dataset will fail to serialize correctly. When trying to deserialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Fiterma updated SPARK-41007: --- Description: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}} the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function was: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}} the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}} the > dataset will fail to serialize correctly. When trying to deserialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
[ https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Fiterma updated SPARK-41007: --- Description: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}} the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function was: When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}} the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function # > BigInteger Serialization doesn't work with JavaBean Encoder > --- > > Key: SPARK-41007 > URL: https://issues.apache.org/jira/browse/SPARK-41007 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.1 >Reporter: Daniel Fiterma >Priority: Minor > > When creating a dataset using the [Java Bean > Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] > with a bean that contains a field which is a {{java.math.BigInteger}} the > dataset will fail to serialize correctly. When trying to deserialize the > dataset, Spark throws the following error: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `bigInteger` from struct<> to decimal(38,18). > {code} > > > Reproduction steps: > Using the Java Dataset API: > # Create a Bean with a {{java.math.BigInteger}} field > # Pass said Bean into the Java SparkSession {{createDataset}} function > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder
Daniel Fiterma created SPARK-41007: -- Summary: BigInteger Serialization doesn't work with JavaBean Encoder Key: SPARK-41007 URL: https://issues.apache.org/jira/browse/SPARK-41007 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 3.3.1 Reporter: Daniel Fiterma When creating a dataset using the [Java Bean Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-] with a bean that contains a field which is a {{java.math.BigInteger}} the dataset will fail to serialize correctly. When trying to deserialize the dataset, Spark throws the following error: {code:java} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18). {code} Reproduction steps: Using the Java Dataset API: # Create a Bean with a {{java.math.BigInteger}} field # Pass said Bean into the Java SparkSession {{createDataset}} function # -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
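A compact way to try the reproduction steps from Scala, using the same bean encoder the Java API relies on, is sketched below. The bean class, field name, and sample value are hypothetical, and the expected failure is the AnalysisException quoted in the report.
{code:scala}
// Hedged reproduction sketch (Scala approximation of the Java steps above).
// The bean class and field names are illustrative, not taken from the report.
import java.math.BigInteger
import scala.beans.BeanProperty

import org.apache.spark.sql.{Encoders, SparkSession}

class BigIntegerBean {
  @BeanProperty var bigInteger: BigInteger = BigInteger.ZERO
}

val spark = SparkSession.builder().master("local[*]").appName("bean-encoder-repro").getOrCreate()

val bean = new BigIntegerBean
bean.setBigInteger(new BigInteger("12345678901234567890"))

// Expected (per the report): analysis fails with
// "Cannot up cast `bigInteger` from struct<> to decimal(38,18)".
val ds = spark.createDataset(java.util.Collections.singletonList(bean))(Encoders.bean(classOf[BigIntegerBean]))
ds.show(false)
{code}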
[jira] [Updated] (SPARK-40996) Upgrade `sbt-checkstyle-plugin` to 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-40996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40996: - Priority: Minor (was: Major) > Upgrade `sbt-checkstyle-plugin` to 4.0.0 > > > Key: SPARK-40996 > URL: https://issues.apache.org/jira/browse/SPARK-40996 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > This is a precondition for upgrading to sbt 1.7.3 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40996) Upgrade `sbt-checkstyle-plugin` to 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-40996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40996. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38481 [https://github.com/apache/spark/pull/38481] > Upgrade `sbt-checkstyle-plugin` to 4.0.0 > > > Key: SPARK-40996 > URL: https://issues.apache.org/jira/browse/SPARK-40996 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > > This is a precondition for upgrading to sbt 1.7.3 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40996) Upgrade `sbt-checkstyle-plugin` to 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-40996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-40996: Assignee: Yang Jie > Upgrade `sbt-checkstyle-plugin` to 4.0.0 > > > Key: SPARK-40996 > URL: https://issues.apache.org/jira/browse/SPARK-40996 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > This is a precondition for upgrading to sbt 1.7.3 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40834) Use SparkListenerSQLExecutionEnd to track final SQL status in UI
[ https://issues.apache.org/jira/browse/SPARK-40834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40834: --- Assignee: XiDuo You > Use SparkListenerSQLExecutionEnd to track final SQL status in UI > > > Key: SPARK-40834 > URL: https://issues.apache.org/jira/browse/SPARK-40834 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > > The SQL may succeed with some failed jobs. For example, with an inner join that > has one empty side and one large side, the plan can finish while the large side > is still running. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40834) Use SparkListenerSQLExecutionEnd to track final SQL status in UI
[ https://issues.apache.org/jira/browse/SPARK-40834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40834. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38302 [https://github.com/apache/spark/pull/38302] > Use SparkListenerSQLExecutionEnd to track final SQL status in UI > > > Key: SPARK-40834 > URL: https://issues.apache.org/jira/browse/SPARK-40834 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > Fix For: 3.4.0 > > > The SQL may succeed with some failed jobs. For example, with an inner join that > has one empty side and one large side, the plan can finish while the large side > is still running. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
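For readers unfamiliar with the event used here, the sketch below shows how SparkListenerSQLExecutionEnd can be observed from user code. It only illustrates the event the UI now tracks, not the change made in the pull request, and the session setup and names are assumptions.
{code:scala}
// Hedged sketch: observing SQL execution completion via SparkListenerSQLExecutionEnd.
// This illustrates the event used for final-status tracking; it is not the UI patch itself.
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionEnd

val spark = SparkSession.builder().master("local[*]").appName("sql-end-listener").getOrCreate()

spark.sparkContext.addSparkListener(new SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLExecutionEnd =>
      // Fired once per SQL execution, even when some of its jobs were skipped or failed.
      println(s"SQL execution ${e.executionId} finished at ${e.time}")
    case _ => // ignore other events
  }
})

spark.range(10).selectExpr("sum(id)").collect()
{code}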
[jira] [Updated] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace
[ https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric updated SPARK-41006: - Description: If we use the Spark Launcher to launch our spark apps in k8s: {code:java} val sparkLauncher = new InProcessLauncher() .setMaster(k8sMaster) .setDeployMode(deployMode) .setAppName(appName) .setVerbose(true) sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code} We have an issue when we launch another spark driver in the same namespace where other spark app was running: {code:java} kp -n audit-exporter-eee5073aac -w NAME READY STATUS RESTARTS AGE audit-exporter-71489e843d8085c0-driver 1/1 Running 0 9m54s audit-exporter-7e6b8b843d80b9e6-exec-1 1/1 Running 0 9m40s data-io-120204843d899567-driver 0/1 Terminating 0 1s data-io-120204843d899567-driver 0/1 Terminating 0 2s data-io-120204843d899567-driver 0/1 Terminating 0 3s data-io-120204843d899567-driver 0/1 Terminating 0 3s{code} The error is: {code:java} {"time":"2022-11-03T12:49:45.626Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-38: 'data-io'","msg":"Application failed with exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kubernetes.default/api/v1/namespaces/audit-exporter-eee5073aac/configmaps/spark-drv-d19c37843d80350c-conf-map. Message: ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field is immutable when `immutable` is set, reason=FieldValueForbidden, additionalProperties={})], group=null, kind=ConfigMap, name=spark-drv-d19c37843d80350c-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5360/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$4618/00.apply(Unknown Source)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5012/00.apply(Unknown Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source)\n\tat java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Unknown S
[jira] [Updated] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace
[ https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric updated SPARK-41006: - Description: If we use the Spark Launcher to launch our spark apps in k8s: {code:java} val sparkLauncher = new InProcessLauncher() .setMaster(k8sMaster) .setDeployMode(deployMode) .setAppName(appName) .setVerbose(true) sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code} We have an issue when we launch another spark driver in the same namespace where other spark app was running: {code:java} kp -n audit-exporter-eee5073aac -w NAME READY STATUS RESTARTS AGE audit-exporter-71489e843d8085c0-driver 1/1 Running 0 9m54s audit-exporter-7e6b8b843d80b9e6-exec-1 1/1 Running 0 9m40s data-io-120204843d899567-driver 0/1 Terminating 0 1s data-io-120204843d899567-driver 0/1 Terminating 0 2s data-io-120204843d899567-driver 0/1 Terminating 0 3s data-io-120204843d899567-driver 0/1 Terminating 0 3s{code} The error is: {code:java} {"time":"2022-11-03T12:49:45.626Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-38: 'data-io'","msg":"Application failed with exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kubernetes.default/api/v1/namespaces/audit-exporter-eee5073aac/configmaps/spark-drv-d19c37843d80350c-conf-map. Message: ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field is immutable when `immutable` is set, reason=FieldValueForbidden, additionalProperties={})], group=null, kind=ConfigMap, name=spark-drv-d19c37843d80350c-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5360/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$4618/00.apply(Unknown Source)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5012/00.apply(Unknown Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source)\n\tat java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Unknown S
[jira] [Resolved] (SPARK-27339) Decimal up cast to higher scale fails while reading parquet to Dataset
[ https://issues.apache.org/jira/browse/SPARK-27339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-27339. -- Resolution: Duplicate I can't reproduce this in the latest Spark, and think it might have been resolved by https://issues.apache.org/jira/browse/SPARK-31750 > Decimal up cast to higher scale fails while reading parquet to Dataset > -- > > Key: SPARK-27339 > URL: https://issues.apache.org/jira/browse/SPARK-27339 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.4.0 >Reporter: Bill Schneider >Priority: Major > > Given a parquet file with a decimal (38,4) field. One can read it into a > dataframe but fails to read/cast it to a dataset using a case class with > BigDecimal field. > {code:java} > import org.apache.spark.sql.{SaveMode, SparkSession} > object ReproduceSparkDecimalBug extends App{ > case class SimpleDecimal(value: BigDecimal) > val path = "/tmp/sparkTest" > val spark = SparkSession.builder().master("local").getOrCreate() > import spark.implicits._ > spark > .sql("SELECT CAST(10.12345 AS DECIMAL(38,4)) AS value ") > .write > .mode(SaveMode.Overwrite) > .parquet(path) > // works fine and the dataframe will have a decimal(38,4) > val df = spark.read.parquet(path) > df.printSchema() > df.show(1) > // will fail -> org.apache.spark.sql.AnalysisException: Cannot up cast > `value` from decimal(38,4) to decimal(38,18) as it may truncate > // 1. Why Spark sees scala BigDecimal as fixed (38,18)? > // 2. Up casting to higher scale should be allowed anyway > val ds = df.as[SimpleDecimal] > ds.printSchema() > spark.close() > } > {code} > {code:java} > org.apache.spark.sql.AnalysisException: Cannot up cast `value` from > decimal(38,4) to decimal(38,18) as it may truncate > The type path of the target object is: > - field (class: "scala.math.BigDecimal", name: "value") > - root class: "ReproduceSparkDecimalBug.SimpleDecimal" > You can either add an explicit cast to the input data or choose a higher > precision type of the field in the target object; > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2366) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$35$$anonfun$applyOrElse$15.applyOrElse(Analyzer.scala:2382) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$35$$anonfun$applyOrElse$15.applyOrElse(Analyzer.scala:2377) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:335) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:
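The AnalysisException above already hints at the usual workaround: add an explicit cast before converting to the Dataset. A minimal sketch follows, assuming the same parquet path and schema as the reproduction in the description.
{code:scala}
// Hedged workaround sketch for the reproduction above: cast the parquet
// decimal(38,4) column up to decimal(38,18) before calling .as[...], as the
// AnalysisException message suggests. Path and names mirror the description.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

case class SimpleDecimal(value: BigDecimal)

val spark = SparkSession.builder().master("local[*]").appName("decimal-upcast-workaround").getOrCreate()
import spark.implicits._

val ds = spark.read.parquet("/tmp/sparkTest")
  .withColumn("value", col("value").cast("decimal(38,18)"))
  .as[SimpleDecimal]

ds.printSchema()
ds.show(1)
{code}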
[jira] [Updated] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace
[ https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric updated SPARK-41006: - Description: If we use the Spark Launcher to launch our spark apps in k8s: {code:java} val sparkLauncher = new InProcessLauncher() .setMaster(k8sMaster) .setDeployMode(deployMode) .setAppName(appName) .setVerbose(true) sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code} We have an issue when we launch another spark driver in the same namespace where other spark app was running: {code:java} kp -n qa-topfive-python-spark-2-15d42ac3b9 NAME READY STATUS RESTARTS AGE data-io-c590a7843d47e206-driver 1/1 Terminating 0 2s qa-top-five-python-1667475391655-exec-1 1/1 Running 0 94s qa-topfive-python-spark-2-462c5d843d46e38b-driver 1/1 Running 0 119s {code} The error is: {code:java} {"time":"2022-10-24T15:08:50.239Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-44: 'data-io'","msg":"Application failed with exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kubernetes.default/api/v1/namespaces/qa-topfive-python-spark-2-edf723f942/configmaps/spark-drv-34c4e3840a0466c2-conf-map. Message: ConfigMap \"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field is immutable when `immutable` is set, reason=FieldValueForbidden, additionalProperties={})], group=null, kind=ConfigMap, name=spark-drv-34c4e3840a0466c2-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap \"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5663/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$5183/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5578/00.apply(Unknown Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source)\n\tat java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Unknown Source)\n\tat java.base/java.util.stream.AbstractPipeline.copyInto(Unknown Source)\n\tat java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(Unknown Source)\n\tat java.
[jira] [Created] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace
Eric created SPARK-41006: Summary: ConfigMap has the same name when launching two pods on the same namespace Key: SPARK-41006 URL: https://issues.apache.org/jira/browse/SPARK-41006 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.3.0, 3.2.0, 3.1.0 Reporter: Eric If we use the Spark Launcher to launch our spark apps in k8s: {code:java} val sparkLauncher = new InProcessLauncher() .setMaster(k8sMaster) .setDeployMode(deployMode) .setAppName(appName) .setVerbose(true) sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code} We have an issue when we launch another spark driver in the same namespace where other spark app was running: {code:java} kp -n qa-topfive-python-spark-2-15d42ac3b9 NAME READY STATUS RESTARTS AGE data-io-c590a7843d47e206-driver 1/1 Terminating 0 2s qa-top-five-python-1667475391655-exec-1 1/1 Running 0 94s qa-topfive-python-spark-2-462c5d843d46e38b-driver 1/1 Running 0 119s {code} The error is: {code:java} {"time":"2022-10-24T15:08:50.239Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-44: 'data-io'","msg":"Application failed with exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kubernetes.default/api/v1/namespaces/qa-topfive-python-spark-2-edf723f942/configmaps/spark-drv-34c4e3840a0466c2-conf-map. Message: ConfigMap \"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field is immutable when `immutable` is set, reason=FieldValueForbidden, additionalProperties={})], group=null, kind=ConfigMap, name=spark-drv-34c4e3840a0466c2-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap \"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: Forbidden: field is immutable when `immutable` is set, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5663/00.apply(Unknown Source)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$5183/00.apply(Unknown Source)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5578/00.apply(Unknown Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source)\n\tat java.base/java.util.ArrayList$ArrayListS
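The scenario that triggers the clash can be condensed to two in-process launches into the same namespace, as sketched below. The master URL, deploy mode, and application names are placeholders rather than values from the report.
{code:scala}
// Hedged sketch of the triggering scenario described above: two drivers launched
// from the same JVM into the same namespace. All literal values are placeholders.
import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

def launch(appName: String): SparkAppHandle = {
  new InProcessLauncher()
    .setMaster("k8s://https://kubernetes.default")
    .setDeployMode("cluster")
    .setAppName(appName)
    .setVerbose(true)
    .startApplication()
}

// With both applications sharing the namespace, the report observes the second
// driver failing because its spark-drv-*-conf-map name collides with an
// existing ConfigMap that is marked immutable.
val first = launch("audit-exporter")
val second = launch("data-io")
{code}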
[jira] [Assigned] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40769: Assignee: Apache Spark > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628265#comment-17628265 ] Apache Spark commented on SPARK-40769: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38498 > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40769: Assignee: Apache Spark > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628264#comment-17628264 ] Apache Spark commented on SPARK-40769: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38498 > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40769: Assignee: (was: Apache Spark) > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41005) Arrow based collect
[ https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628261#comment-17628261 ] Apache Spark commented on SPARK-41005: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38468 > Arrow based collect > --- > > Key: SPARK-41005 > URL: https://issues.apache.org/jira/browse/SPARK-41005 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41005) Arrow based collect
[ https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41005: Assignee: Apache Spark > Arrow based collect > --- > > Key: SPARK-41005 > URL: https://issues.apache.org/jira/browse/SPARK-41005 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41005) Arrow based collect
[ https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628260#comment-17628260 ] Apache Spark commented on SPARK-41005: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38468 > Arrow based collect > --- > > Key: SPARK-41005 > URL: https://issues.apache.org/jira/browse/SPARK-41005 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41005) Arrow based collect
[ https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41005: Assignee: (was: Apache Spark) > Arrow based collect > --- > > Key: SPARK-41005 > URL: https://issues.apache.org/jira/browse/SPARK-41005 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41005) Arrow based collect
Ruifeng Zheng created SPARK-41005: - Summary: Arrow based collect Key: SPARK-41005 URL: https://issues.apache.org/jira/browse/SPARK-41005 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628248#comment-17628248 ] Apache Spark commented on SPARK-40999: -- User 'fred-db' has created a pull request for this issue: https://github.com/apache/spark/pull/38497 > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > > Currently, if a user tries to specify a query like the following, the hints > on the subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens as hints are removed from the plan and pulled into joins in the > beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > To resolve this, we add a hint field to SubqueryExpression that any hints > inside a subquery's plan can be pulled into during EliminateResolvedHint, and > then pass this hint on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
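A minimal sketch of the proposed mechanism, using simplified stand-in classes rather than Spark's actual SubqueryExpression and Join nodes (all names below are made up for illustration):

{code:scala}
object HintPropagationSketch {
  case class HintInfo(strategy: String)                     // e.g. "BROADCAST"
  case class Subquery(plan: String, hint: Option[HintInfo]) // hint kept when hint nodes are eliminated
  case class Join(left: String, right: String, rightHint: Option[HintInfo])

  // When the EXISTS subquery is rewritten into a join during optimization,
  // the preserved hint is handed to the join's right side.
  def rewriteExistsToJoin(outerPlan: String, sub: Subquery): Join =
    Join(outerPlan, sub.plan, sub.hint)

  def main(args: Array[String]): Unit = {
    val join = rewriteExistsToJoin("target t", Subquery("source s", Some(HintInfo("BROADCAST"))))
    println(join.rightHint) // Some(HintInfo(BROADCAST)) -- the hint survives the rewrite
  }
}
{code}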
[jira] [Assigned] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40999: Assignee: (was: Apache Spark) > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > > Currently, if a user tries to specify a query like the following, the hints > on the subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens as hints are removed from the plan and pulled into joins in the > beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > To resolve this, we add a hint field to SubqueryExpression that any hints > inside a subquery's plan can be pulled into during EliminateResolvedHint, and > then pass this hint on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40999: Assignee: Apache Spark > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Assignee: Apache Spark >Priority: Major > > Currently, if a user tries to specify a query like the following, the hints > on the subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens as hints are removed from the plan and pulled into joins in the > beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > To resolve this, we add a hint field to SubqueryExpression that any hints > inside a subquery's plan can be pulled into during EliminateResolvedHint, and > then pass this hint on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40999) Hints on subqueries are not properly propagated
[ https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628245#comment-17628245 ] Apache Spark commented on SPARK-40999: -- User 'fred-db' has created a pull request for this issue: https://github.com/apache/spark/pull/38497 > Hints on subqueries are not properly propagated > --- > > Key: SPARK-40999 > URL: https://issues.apache.org/jira/browse/SPARK-40999 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1 >Reporter: Fredrik Klauß >Priority: Major > > Currently, if a user tries to specify a query like the following, the hints > on the subquery will be lost. > {code:java} > SELECT * FROM target t WHERE EXISTS > (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code} > This happens as hints are removed from the plan and pulled into joins in the > beginning of the optimization stage, but subqueries are only turned into > joins during optimization. As we remove any hints that are not below a join, > we end up removing hints that are below a subquery. > > To resolve this, we add a hint field to SubqueryExpression that any hints > inside a subquery's plan can be pulled into during EliminateResolvedHint, and > then pass this hint on when the subquery is turned into a join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40819) Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of automatically converting to LongType
[ https://issues.apache.org/jira/browse/SPARK-40819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628239#comment-17628239 ] Nikhil Sharma commented on SPARK-40819: --- Thank you for sharing such good information. Very informative and effective post. [https://www.igmguru.com/digital-marketing-programming/react-native-training/] > Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type > instead of automatically converting to LongType > > > Key: SPARK-40819 > URL: https://issues.apache.org/jira/browse/SPARK-40819 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1, 3.2.3, 3.3.2 >Reporter: Alfred Davidson >Priority: Critical > > Since 3.2 parquet files containing attributes with type "INT64 > (TIMESTAMP(NANOS, true))" are no longer readable and attempting to read > throws: > > {code:java} > Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: > INT64 (TIMESTAMP(NANOS,true)) > at > org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:105) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:174) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:90) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:72) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:66) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:63) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:548) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:548) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:528) > at scala.collection.immutable.Stream.map(Stream.scala:418) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:528) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:521) > at > 
org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76) > {code} > Prior to 3.2 successfully reads the parquet automatically converting to a > LongType. > I believe work part of https://issues.apache.org/jira/browse/SPARK-34661 > introduced the change in behaviour, more specifically here: > [https://github.com/apache/spark/pull/31776/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R154] > which throws the QueryCompilationErrors.illegalParquetTypeError -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
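A minimal reproduction sketch, assuming a Parquet file produced elsewhere (e.g. by another engine) with an INT64 (TIMESTAMP(NANOS,true)) column; the path is a placeholder and the snippet is meant for spark-shell:

{code:scala}
// Placeholder path: a file whose schema contains INT64 (TIMESTAMP(NANOS,true)).
val df = spark.read.parquet("/data/events_with_nanos_timestamps.parquet")

// Spark <= 3.1: the NANOS column was silently read back as LongType.
// Spark >= 3.2: schema conversion fails with
//   AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
df.printSchema()
{code}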
[jira] [Commented] (SPARK-40708) Auto update table statistics based on write metrics
[ https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628234#comment-17628234 ] Apache Spark commented on SPARK-40708: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/38496 > Auto update table statistics based on write metrics > --- > > Key: SPARK-40708 > URL: https://issues.apache.org/jira/browse/SPARK-40708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > // Get write statistics > def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): > Option[WriteStats] = { > val numBytes = > metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_)) > val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_)) > numBytes.map(WriteStats(mode, _, numRows)) > } > // Update table statistics > val stat = wroteStats.get > stat.mode match { > case SaveMode.Overwrite | SaveMode.ErrorIfExists => > catalog.alterTableStats(table.identifier, > Some(CatalogStatistics(stat.numBytes, stat.numRows))) > case _ if table.stats.nonEmpty => // SaveMode.Append > catalog.alterTableStats(table.identifier, None) > case _ => // SaveMode.Ignore Do nothing > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
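To make the proposed decision table explicit, here is a self-contained sketch with simplified stand-in types (these are not Spark's actual SaveMode, CatalogStatistics or metric classes):

{code:scala}
object StatsUpdateSketch {
  sealed trait SaveMode
  case object Overwrite extends SaveMode
  case object ErrorIfExists extends SaveMode
  case object Append extends SaveMode
  case object Ignore extends SaveMode

  case class WriteStats(numBytes: BigInt, numRows: Option[BigInt])

  sealed trait CatalogAction
  case class SetStats(stats: WriteStats) extends CatalogAction // overwrite-style: write metrics describe the whole table
  case object ClearStats extends CatalogAction                 // append: previously stored stats are now stale
  case object NoOp extends CatalogAction

  // The decision table sketched in the issue description.
  def actionFor(mode: SaveMode, stats: WriteStats, tableHasStats: Boolean): CatalogAction =
    mode match {
      case Overwrite | ErrorIfExists => SetStats(stats)
      case Append if tableHasStats   => ClearStats
      case _                         => NoOp
    }
}
{code}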
[jira] [Commented] (SPARK-40708) Auto update table statistics based on write metrics
[ https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628235#comment-17628235 ] Apache Spark commented on SPARK-40708: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/38496 > Auto update table statistics based on write metrics > --- > > Key: SPARK-40708 > URL: https://issues.apache.org/jira/browse/SPARK-40708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > // Get write statistics > def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): > Option[WriteStats] = { > val numBytes = > metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_)) > val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_)) > numBytes.map(WriteStats(mode, _, numRows)) > } > // Update table statistics > val stat = wroteStats.get > stat.mode match { > case SaveMode.Overwrite | SaveMode.ErrorIfExists => > catalog.alterTableStats(table.identifier, > Some(CatalogStatistics(stat.numBytes, stat.numRows))) > case _ if table.stats.nonEmpty => // SaveMode.Append > catalog.alterTableStats(table.identifier, None) > case _ => // SaveMode.Ignore Do nothing > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40708) Auto update table statistics based on write metrics
[ https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40708: Assignee: (was: Apache Spark) > Auto update table statistics based on write metrics > --- > > Key: SPARK-40708 > URL: https://issues.apache.org/jira/browse/SPARK-40708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > // Get write statistics > def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): > Option[WriteStats] = { > val numBytes = > metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_)) > val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_)) > numBytes.map(WriteStats(mode, _, numRows)) > } > // Update table statistics > val stat = wroteStats.get > stat.mode match { > case SaveMode.Overwrite | SaveMode.ErrorIfExists => > catalog.alterTableStats(table.identifier, > Some(CatalogStatistics(stat.numBytes, stat.numRows))) > case _ if table.stats.nonEmpty => // SaveMode.Append > catalog.alterTableStats(table.identifier, None) > case _ => // SaveMode.Ignore Do nothing > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40708) Auto update table statistics based on write metrics
[ https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40708: Assignee: Apache Spark > Auto update table statistics based on write metrics > --- > > Key: SPARK-40708 > URL: https://issues.apache.org/jira/browse/SPARK-40708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > {code:scala} > // Get write statistics > def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): > Option[WriteStats] = { > val numBytes = > metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_)) > val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_)) > numBytes.map(WriteStats(mode, _, numRows)) > } > // Update table statistics > val stat = wroteStats.get > stat.mode match { > case SaveMode.Overwrite | SaveMode.ErrorIfExists => > catalog.alterTableStats(table.identifier, > Some(CatalogStatistics(stat.numBytes, stat.numRows))) > case _ if table.stats.nonEmpty => // SaveMode.Append > catalog.alterTableStats(table.identifier, None) > case _ => // SaveMode.Ignore Do nothing > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628225#comment-17628225 ] Apache Spark commented on SPARK-35531: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/38495 > Can not insert into hive bucket table if create table with upper case schema > > > Key: SPARK-35531 > URL: https://issues.apache.org/jira/browse/SPARK-35531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.1, 3.2.0 >Reporter: Hongyi Zhang >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0, 3.1.4 > > > > > create table TEST1( > V1 BIGINT, > S1 INT) > partitioned by (PK BIGINT) > clustered by (V1) > sorted by (S1) > into 200 buckets > STORED AS PARQUET; > > insert into test1 > select > * from values(1,1,1); > > > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628220#comment-17628220 ] Apache Spark commented on SPARK-41004: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/38494 > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
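For context, "checking error classes" means asserting on the structured error class and its message parameters rather than on raw message text. The sketch below models that locally; ConnectError, assertErrorClass and the "cls" parameter are illustrative assumptions, not the Connect module's real classes or the suite's actual helper.

{code:scala}
object ErrorClassCheckSketch {
  // Local model of an exception that carries an error class and parameters.
  case class ConnectError(errorClass: String, messageParameters: Map[String, String])
    extends RuntimeException(s"[$errorClass] " + messageParameters.mkString(", "))

  // Hypothetical assertion helper: checks the class and parameters, not the text.
  def assertErrorClass(e: ConnectError, expectedClass: String, expectedParams: Map[String, String]): Unit = {
    assert(e.errorClass == expectedClass, s"expected $expectedClass but got ${e.errorClass}")
    assert(e.messageParameters == expectedParams)
  }

  def main(args: Array[String]): Unit = {
    val e = ConnectError("CONNECT.INTERCEPTOR_CTOR_MISSING", Map("cls" -> "org.example.MyInterceptor"))
    assertErrorClass(e, "CONNECT.INTERCEPTOR_CTOR_MISSING", Map("cls" -> "org.example.MyInterceptor"))
    println("error class and parameters match")
  }
}
{code}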
[jira] [Assigned] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41004: Assignee: Apache Spark > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41004: Assignee: (was: Apache Spark) > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41004) Check error classes in InterceptorRegistrySuite
BingKun Pan created SPARK-41004: --- Summary: Check error classes in InterceptorRegistrySuite Key: SPARK-41004 URL: https://issues.apache.org/jira/browse/SPARK-41004 Project: Spark Issue Type: Sub-task Components: Connect, Tests Affects Versions: 3.4.0 Reporter: BingKun Pan - CONNECT.INTERCEPTOR_CTOR_MISSING - CONNECT.INTERCEPTOR_RUNTIME_ERROR -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38270) SQL CLI AM should keep same exitcode with client
[ https://issues.apache.org/jira/browse/SPARK-38270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628179#comment-17628179 ] Apache Spark commented on SPARK-38270: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/38492 > SQL CLI AM should keep same exitcode with client > > > Key: SPARK-38270 > URL: https://issues.apache.org/jira/browse/SPARK-38270 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.1 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.4.0 > > > Currently, for the SQL CLI, we always use a shutdown hook to stop the SparkContext: > {code:scala} > // Clean up after we exit > ShutdownHookManager.addShutdownHook { () => SparkSQLEnv.stop() } > {code} > This causes the YARN AM to always report success, even when the client exits with a non-zero code. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
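One way to get the behaviour the issue asks for is to compute the client's exit status explicitly and pass it to sys.exit after stopping the environment, instead of relying only on the shutdown hook. The sketch below is illustrative; runCli and stopSparkContext are placeholders, not the actual SPARK-38270 patch.

{code:scala}
object SqlCliExitSketch {
  // Placeholders: runCli stands for the CLI driver loop, stopSparkContext for SparkSQLEnv.stop().
  private def runCli(args: Array[String]): Int = 0
  private def stopSparkContext(): Unit = ()

  def main(args: Array[String]): Unit = {
    val exitCode =
      try runCli(args)                 // returns non-zero when a statement fails
      catch { case _: Throwable => 1 } // any unexpected failure maps to a non-zero status
      finally stopSparkContext()       // the SparkContext is still stopped on every path

    // Exiting with the real status lets YARN record the AM as failed when the CLI failed.
    sys.exit(exitCode)
  }
}
{code}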