[jira] [Resolved] (SPARK-40777) Use error classes for Protobuf exceptions

2022-11-03 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-40777.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38344
[https://github.com/apache/spark/pull/38344]

> Use error classes for Protobuf exceptions
> -
>
> Key: SPARK-40777
> URL: https://issues.apache.org/jira/browse/SPARK-40777
> Project: Spark
>  Issue Type: Improvement
>  Components: Protobuf, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Raghu Angadi
>Assignee: Sandish Kumar HN
>Priority: Major
> Fix For: 3.4.0
>
>
> We should use error classes for all the exceptions.
> A follow-up to the Protobuf PR [https://github.com/apache/spark/pull/37972]
>  
> cc: [~sanysand...@gmail.com] 

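For context, a minimal sketch of the error-class pattern this ticket asks for, assuming the AnalysisException(errorClass, messageParameters) constructor available in Spark 3.4; the error-class name below is an assumption for illustration, not necessarily the one added by PR 38344:

{code:scala}
// Before: an ad-hoc exception with a free-form message.
val messageName = "SearchRequest"  // illustrative value
throw new IllegalArgumentException(
  s"Unable to locate message '$messageName' in the descriptor")

// After: a stable error class plus named message parameters, so the error
// is documented in error-classes.json, testable, and translatable.
// "PROTOBUF_MESSAGE_NOT_FOUND" is an assumed name for illustration.
throw new org.apache.spark.sql.AnalysisException(
  errorClass = "PROTOBUF_MESSAGE_NOT_FOUND",
  messageParameters = Map("messageName" -> messageName))
{code}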





[jira] [Assigned] (SPARK-40777) Use error classes for Protobuf exceptions

2022-11-03 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-40777:


Assignee: Sandish Kumar HN

> Use error classes for Protobuf exceptions
> -
>
> Key: SPARK-40777
> URL: https://issues.apache.org/jira/browse/SPARK-40777
> Project: Spark
>  Issue Type: Improvement
>  Components: Protobuf, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Raghu Angadi
>Assignee: Sandish Kumar HN
>Priority: Major
>
> We should use error classes for all the exceptions.
> A follow-up to the Protobuf PR [https://github.com/apache/spark/pull/37972]
>  
> cc: [~sanysand...@gmail.com] 






[jira] [Assigned] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41012:


Assignee: (was: Apache Spark)

> Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
> ---
>
> Key: SPARK-41012
> URL: https://issues.apache.org/jira/browse/SPARK-41012
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Rename _LEGACY_ERROR_TEMP_1022 to a proper name.

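For reference, the condition behind the proposed name is an ORDER BY ordinal that falls outside the select list. A sketch of how to trigger it (assuming a SparkSession `spark` in scope; the exact message text is whatever error-classes.json defines):

{code:scala}
// Only one column in the select list, but ORDER BY refers to position 2,
// so the analyzer raises the error currently keyed _LEGACY_ERROR_TEMP_1022.
spark.sql("SELECT id FROM VALUES (1), (2) AS t(id) ORDER BY 2").collect()
// After the rename this should surface as ORDER_BY_POS_OUT_OF_RANGE.
{code}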





[jira] [Assigned] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41012:


Assignee: Apache Spark

> Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
> ---
>
> Key: SPARK-41012
> URL: https://issues.apache.org/jira/browse/SPARK-41012
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Rename _LEGACY_ERROR_TEMP_1022 to a proper name.






[jira] [Commented] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628675#comment-17628675
 ] 

Apache Spark commented on SPARK-41012:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/38508

> Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
> ---
>
> Key: SPARK-41012
> URL: https://issues.apache.org/jira/browse/SPARK-41012
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Rename _LEGACY_ERROR_TEMP_1022 to a proper name.






[jira] [Commented] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE

2022-11-03 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628673#comment-17628673
 ] 

Haejoon Lee commented on SPARK-41012:
-

I'm working on it

> Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
> ---
>
> Key: SPARK-41012
> URL: https://issues.apache.org/jira/browse/SPARK-41012
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Rename _LEGACY_ERROR_TEMP_1022 to a proper name.






[jira] [Created] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE

2022-11-03 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-41012:
---

 Summary: Rename _LEGACY_ERROR_TEMP_1022 to 
ORDER_BY_POS_OUT_OF_RANGE
 Key: SPARK-41012
 URL: https://issues.apache.org/jira/browse/SPARK-41012
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Haejoon Lee


Rename _LEGACY_ERROR_TEMP_1022 to a proper name.






[jira] [Created] (SPARK-41011) Refine Sequence#checkInputDataTypes related DataTypeMismatch

2022-11-03 Thread Yang Jie (Jira)
Yang Jie created SPARK-41011:


 Summary: Refine Sequence#checkInputDataTypes related 
DataTypeMismatch
 Key: SPARK-41011
 URL: https://issues.apache.org/jira/browse/SPARK-41011
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yang Jie









[jira] [Assigned] (SPARK-40372) Migrate failures of array type checks onto error classes

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40372:


Assignee: Apache Spark

> Migrate failures of array type checks onto error classes
> 
>
> Key: SPARK-40372
> URL: https://issues.apache.org/jira/browse/SPARK-40372
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in collection 
> expressions:
> 1. SortArray (3): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035
> 2. ArrayContains (2): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264
> 3. ArrayPosition (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035
> 4. ElementAt (3): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187
> 5. Concat (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388
> 6. Flatten (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595
> 7. Sequence (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773
> 8. ArrayRemove (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447
> 9. ArrayDistinct (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642

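Each site in the list above currently returns a TypeCheckFailure. A hedged sketch of the target shape, based on the TypeCheckResult.DataTypeMismatch API in Spark 3.4; the sub-class and parameter values below are illustrative, not copied from the actual migration:

{code:scala}
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.DataTypeMismatch

// Before (free-form string, no error class):
//   TypeCheckFailure(s"argument 1 requires array type, got ${child.dataType}")

// After: a structured DATATYPE_MISMATCH error with named parameters.
DataTypeMismatch(
  errorSubClass = "UNEXPECTED_INPUT_TYPE",
  messageParameters = Map(
    "paramIndex" -> "1",
    "requiredType" -> "\"ARRAY\"",
    "inputSql" -> "\"arr\"",    // toSQLExpr(child) in the real code
    "inputType" -> "\"INT\""))  // toSQLType(child.dataType) in the real code
{code}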





[jira] [Assigned] (SPARK-40372) Migrate failures of array type checks onto error classes

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40372:


Assignee: (was: Apache Spark)

> Migrate failures of array type checks onto error classes
> 
>
> Key: SPARK-40372
> URL: https://issues.apache.org/jira/browse/SPARK-40372
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in collection 
> expressions:
> 1. SortArray (3): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1027-L1035
> 2. ArrayContains (2): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1259-L1264
> 3. ArrayPosition (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2035
> 4. ElementAt (3): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2177-L2187
> 5. Concat (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2385-L2388
> 6. Flatten (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2593-L2595
> 7. Sequence (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L2773
> 8. ArrayRemove (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3445-L3447
> 9. ArrayDistinct (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3642






[jira] [Assigned] (SPARK-41001) Connection string support for Python client

2022-11-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41001:


Assignee: Martin Grund

> Connection string support for Python client
> ---
>
> Key: SPARK-41001
> URL: https://issues.apache.org/jira/browse/SPARK-41001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
>







[jira] [Resolved] (SPARK-41001) Connection string support for Python client

2022-11-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41001.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38501
[https://github.com/apache/spark/pull/38501]

> Connection string support for Python client
> ---
>
> Key: SPARK-41001
> URL: https://issues.apache.org/jira/browse/SPARK-41001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Resolved] (SPARK-40976) Upgrade sbt to 1.7.3

2022-11-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40976.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38502
[https://github.com/apache/spark/pull/38502]

> Upgrade sbt to 1.7.3
> 
>
> Key: SPARK-40976
> URL: https://issues.apache.org/jira/browse/SPARK-40976
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> https://github.com/sbt/sbt/releases/tag/v1.7.3






[jira] [Assigned] (SPARK-40976) Upgrade sbt to 1.7.3

2022-11-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40976:


Assignee: Yang Jie

> Upgrade sbt to 1.7.3
> 
>
> Key: SPARK-40976
> URL: https://issues.apache.org/jira/browse/SPARK-40976
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> https://github.com/sbt/sbt/releases/tag/v1.7.3






[jira] [Commented] (SPARK-41010) Complete Support for Except and Intersect in Python client

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628559#comment-17628559
 ] 

Apache Spark commented on SPARK-41010:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38506

> Complete Support for Except and Intersect in Python client
> --
>
> Key: SPARK-41010
> URL: https://issues.apache.org/jira/browse/SPARK-41010
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>







[jira] [Commented] (SPARK-41010) Complete Support for Except and Intersect in Python client

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628557#comment-17628557
 ] 

Apache Spark commented on SPARK-41010:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38506

> Complete Support for Except and Intersect in Python client
> --
>
> Key: SPARK-41010
> URL: https://issues.apache.org/jira/browse/SPARK-41010
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-41010) Complete Support for Except and Intersect in Python client

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41010:


Assignee: Apache Spark

> Complete Support for Except and Intersect in Python client
> --
>
> Key: SPARK-41010
> URL: https://issues.apache.org/jira/browse/SPARK-41010
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-41010) Complete Support for Except and Intersect in Python client

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41010:


Assignee: (was: Apache Spark)

> Complete Support for Except and Intersect in Python client
> --
>
> Key: SPARK-41010
> URL: https://issues.apache.org/jira/browse/SPARK-41010
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>







[jira] [Commented] (SPARK-40622) Result of a single task in collect() must fit in 2GB

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628553#comment-17628553
 ] 

Apache Spark commented on SPARK-40622:
--

User 'liuzqt' has created a pull request for this issue:
https://github.com/apache/spark/pull/38505

> Result of a single task in collect() must fit in 2GB
> 
>
> Key: SPARK-40622
> URL: https://issues.apache.org/jira/browse/SPARK-40622
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Ziqi Liu
>Priority: Major
>
> When collecting results, data from a single partition/task is serialized 
> through a byte array or ByteBuffer (which is backed by a byte array as well), 
> so it is subject to the Java max array size limit (for a byte array, 2GB).
>  
> Constructing a single partition larger than 2GB and collecting it easily 
> reproduces the issue:
> {code:scala}
> // create data of size ~3GB in a single partition, which exceeds the byte array limit
> // random gen to make sure it's poorly compressed
> val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) as data")
> withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") {
>   withSQLConf("spark.sql.useChunkedBuffer" -> "true") {
>     df.queryExecution.executedPlan.executeCollect()
>   }
> } {code}
> This will get an OOM error from 
> [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125]
>  
> Consider using ChunkedByteBuffer instead of a byte array in order to bypass 
> this limit.

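A rough sketch of the proposed direction, assuming Spark's internal ChunkedByteBufferOutputStream (in org.apache.spark.util.io) keeps its current (chunkSize, allocator) shape; serializedBytes is a placeholder, not real serializer output:

{code:scala}
import java.nio.ByteBuffer
import org.apache.spark.util.io.ChunkedByteBufferOutputStream

val serializedBytes: Array[Byte] = ???  // placeholder for serialized task output

// Accumulate output in 4 MiB chunks instead of one contiguous byte array,
// so no single JVM array has to hold the whole (possibly >2GB) result.
val out = new ChunkedByteBufferOutputStream(4 * 1024 * 1024, ByteBuffer.allocate)
out.write(serializedBytes)
out.close()
val result = out.toChunkedByteBuffer  // readable without materializing one array
{code}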





[jira] [Commented] (SPARK-40622) Result of a single task in collect() must fit in 2GB

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628552#comment-17628552
 ] 

Apache Spark commented on SPARK-40622:
--

User 'liuzqt' has created a pull request for this issue:
https://github.com/apache/spark/pull/38505

> Result of a single task in collect() must fit in 2GB
> 
>
> Key: SPARK-40622
> URL: https://issues.apache.org/jira/browse/SPARK-40622
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Ziqi Liu
>Priority: Major
>
> When collecting results, data from a single partition/task is serialized 
> through a byte array or ByteBuffer (which is backed by a byte array as well), 
> so it is subject to the Java max array size limit (for a byte array, 2GB).
>  
> Constructing a single partition larger than 2GB and collecting it easily 
> reproduces the issue:
> {code:scala}
> // create data of size ~3GB in a single partition, which exceeds the byte array limit
> // random gen to make sure it's poorly compressed
> val df = spark.range(0, 3000, 1, 1).selectExpr("id", s"genData(id, 100) as data")
> withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") {
>   withSQLConf("spark.sql.useChunkedBuffer" -> "true") {
>     df.queryExecution.executedPlan.executeCollect()
>   }
> } {code}
> This will get an OOM error from 
> [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125]
>  
> Consider using ChunkedByteBuffer instead of a byte array in order to bypass 
> this limit.






[jira] [Commented] (SPARK-40681) Update gson transitive dependency to 2.8.9 or later

2022-11-03 Thread Michael deLeon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628549#comment-17628549
 ] 

Michael deLeon commented on SPARK-40681:


Is there any update on when we might see this in a Spark release?

> Update gson transitive dependency to 2.8.9 or later
> ---
>
> Key: SPARK-40681
> URL: https://issues.apache.org/jira/browse/SPARK-40681
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Andrew Kyle Purtell
>Priority: Minor
>
> Spark 3.3 currently ships with GSON 2.8.6 and this should be managed up to 
> 2.8.9 or later.
> Versions of GSON prior to 2.8.9 are subject to 
> [gson#1991|https://github.com/google/gson/pull/1991] , detected and reported 
> by several flavors of static vulnerability assessment tools, at a fairly high 
> score because it is a deserialization of untrusted data problem.
> This issue is not meant to imply any particular security problem in Spark 
> itself.
> {noformat}
> [INFO] org.apache.spark:spark-network-common_2.12:jar:3.3.2-SNAPSHOT
> [INFO] +- com.google.crypto.tink:tink:jar:1.6.1:compile
> [INFO] |  \- com.google.code.gson:gson:jar:2.8.6:compile
> {noformat}
> {noformat}
> [INFO] org.apache.spark:spark-hive_2.12:jar:3.3.2-SNAPSHOT
> [INFO] +- org.apache.hive:hive-exec:jar:core:2.3.9:compile
> [INFO] |  +- com.google.code.gson:gson:jar:2.2.4:compile
> {noformat}






[jira] [Commented] (SPARK-40815) SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628545#comment-17628545
 ] 

Apache Spark commented on SPARK-40815:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/38504

> SymlinkTextInputFormat returns incorrect result due to enabled 
> spark.hadoopRDD.ignoreEmptySplits
> 
>
> Key: SPARK-40815
> URL: https://issues.apache.org/jira/browse/SPARK-40815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2, 3.4.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-40815) SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628546#comment-17628546
 ] 

Apache Spark commented on SPARK-40815:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/38504

> SymlinkTextInputFormat returns incorrect result due to enabled 
> spark.hadoopRDD.ignoreEmptySplits
> 
>
> Key: SPARK-40815
> URL: https://issues.apache.org/jira/browse/SPARK-40815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.2, 3.4.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Created] (SPARK-41010) Complete Support for Except and Intersect in Python client

2022-11-03 Thread Rui Wang (Jira)
Rui Wang created SPARK-41010:


 Summary: Complete Support for Except and Intersect in Python client
 Key: SPARK-41010
 URL: https://issues.apache.org/jira/browse/SPARK-41010
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Rui Wang









[jira] [Updated] (SPARK-40801) Upgrade Apache Commons Text to 1.10

2022-11-03 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40801:
-
Fix Version/s: 3.2.3

> Upgrade Apache Commons Text to 1.10
> ---
>
> Key: SPARK-40801
> URL: https://issues.apache.org/jira/browse/SPARK-40801
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Minor
> Fix For: 3.4.0, 3.2.3, 3.3.2
>
>
> [CVE-2022-42889|https://nvd.nist.gov/vuln/detail/CVE-2022-42889]






[jira] [Assigned] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40940:


Assignee: (was: Apache Spark)

> Fix the unsupported ops checker to allow chaining of stateful operators
> ---
>
> Key: SPARK-40940
> URL: https://issues.apache.org/jira/browse/SPARK-40940
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Alex Balikov
>Priority: Major
>
> This is a follow-up ticket to https://issues.apache.org/jira/browse/SPARK-40925: 
> once we allow chaining of stateful operators in Spark Structured Streaming, we 
> need to fix the unsupported ops checker to allow them (currently they are 
> blocked and require setting spark.sql.streaming.unsupportedOperationCheck to 
> false).

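For illustration, a sketch of a query with two chained stateful operators of the kind the checker currently rejects, together with the workaround config named above; it assumes a SparkSession `spark` and uses only the built-in rate source (columns timestamp, value):

{code:scala}
import org.apache.spark.sql.functions._

// Current workaround: disable the analysis-time check globally.
spark.conf.set("spark.sql.streaming.unsupportedOperationCheck", "false")

val counts = spark.readStream.format("rate").load()
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()                            // stateful operator #1

val rollup = counts
  .groupBy(col("window.end"))
  .agg(sum("count").as("total"))      // stateful operator #2, chained on #1
{code}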





[jira] [Assigned] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40940:


Assignee: Apache Spark

> Fix the unsupported ops checker to allow chaining of stateful operators
> ---
>
> Key: SPARK-40940
> URL: https://issues.apache.org/jira/browse/SPARK-40940
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Alex Balikov
>Assignee: Apache Spark
>Priority: Major
>
> This is a follow-up ticket to https://issues.apache.org/jira/browse/SPARK-40925: 
> once we allow chaining of stateful operators in Spark Structured Streaming, we 
> need to fix the unsupported ops checker to allow them (currently they are 
> blocked and require setting spark.sql.streaming.unsupportedOperationCheck to 
> false).






[jira] [Commented] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators

2022-11-03 Thread Wei Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628472#comment-17628472
 ] 

Wei Liu commented on SPARK-40940:
-

PR in: 

https://github.com/apache/spark/pull/38503

> Fix the unsupported ops checker to allow chaining of stateful operators
> ---
>
> Key: SPARK-40940
> URL: https://issues.apache.org/jira/browse/SPARK-40940
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Alex Balikov
>Priority: Major
>
> This is a follow-up ticket to https://issues.apache.org/jira/browse/SPARK-40925: 
> once we allow chaining of stateful operators in Spark Structured Streaming, we 
> need to fix the unsupported ops checker to allow them (currently they are 
> blocked and require setting spark.sql.streaming.unsupportedOperationCheck to 
> false).






[jira] [Commented] (SPARK-40940) Fix the unsupported ops checker to allow chaining of stateful operators

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628474#comment-17628474
 ] 

Apache Spark commented on SPARK-40940:
--

User 'WweiL' has created a pull request for this issue:
https://github.com/apache/spark/pull/38503

> Fix the unsupported ops checker to allow chaining of stateful operators
> ---
>
> Key: SPARK-40940
> URL: https://issues.apache.org/jira/browse/SPARK-40940
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Alex Balikov
>Priority: Major
>
> This is a follow-up ticket to https://issues.apache.org/jira/browse/SPARK-40925: 
> once we allow chaining of stateful operators in Spark Structured Streaming, we 
> need to fix the unsupported ops checker to allow them (currently they are 
> blocked and require setting spark.sql.streaming.unsupportedOperationCheck to 
> false).






[jira] [Resolved] (SPARK-40869) KubernetesConf.getResourceNamePrefix creates invalid name prefixes

2022-11-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40869.
---
Fix Version/s: 3.3.2
   3.2.3
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 38331
[https://github.com/apache/spark/pull/38331]

> KubernetesConf.getResourceNamePrefix creates invalid name prefixes
> --
>
> Key: SPARK-40869
> URL: https://issues.apache.org/jira/browse/SPARK-40869
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Tobias Stadler
>Assignee: Tobias Stadler
>Priority: Major
> Fix For: 3.3.2, 3.2.3, 3.4.0
>
>
> If `KubernetesConf.getResourceNamePrefix` is called with e.g. `_name_`, it 
> generates an invalid name prefix, e.g. `-name-0123456789abcdef`.

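The prefix is invalid because Kubernetes resource names must be DNS-1035 labels (lowercase alphanumerics and '-', starting with a letter), so a prefix like `-name-...` with a leading '-' is rejected. A small illustrative check; the regex is the standard DNS-1035 pattern, not Spark code:

{code:scala}
// DNS-1035 label: starts with a lowercase letter, ends alphanumeric.
def isValidK8sNamePrefix(s: String): Boolean =
  s.matches("[a-z]([-a-z0-9]*[a-z0-9])?")

isValidK8sNamePrefix("-name-0123456789abcdef")  // false: leading '-'
isValidK8sNamePrefix("name-0123456789abcdef")   // true
{code}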





[jira] [Assigned] (SPARK-40869) KubernetesConf.getResourceNamePrefix creates invalid name prefixes

2022-11-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-40869:
-

Assignee: Tobias Stadler

> KubernetesConf.getResourceNamePrefix creates invalid name prefixes
> --
>
> Key: SPARK-40869
> URL: https://issues.apache.org/jira/browse/SPARK-40869
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Tobias Stadler
>Assignee: Tobias Stadler
>Priority: Major
>
> If `KubernetesConf.getResourceNamePrefix` is called with e.g. `_name_`, it 
> generates an invalid name prefix, e.g. `-name-0123456789abcdef`.






[jira] [Commented] (SPARK-40976) Upgrade sbt to 1.7.3

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628438#comment-17628438
 ] 

Apache Spark commented on SPARK-40976:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38502

> Upgrade sbt to 1.7.3
> 
>
> Key: SPARK-40976
> URL: https://issues.apache.org/jira/browse/SPARK-40976
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://github.com/sbt/sbt/releases/tag/v1.7.3






[jira] [Commented] (SPARK-41001) Connection string support for Python client

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628429#comment-17628429
 ] 

Apache Spark commented on SPARK-41001:
--

User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/38501

> Connection string support for Python client
> ---
>
> Key: SPARK-41001
> URL: https://issues.apache.org/jira/browse/SPARK-41001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Major
>







[jira] [Updated] (SPARK-41002) Compatible `take`, `head` and `first` API in Python client

2022-11-03 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-41002:
-
Summary: Compatible `take`, `head` and `first` API in Python client   (was: 
Compatible `take` and `head` API in Python client )

> Compatible `take`, `head` and `first` API in Python client 
> ---
>
> Key: SPARK-41002
> URL: https://issues.apache.org/jira/browse/SPARK-41002
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>







[jira] [Commented] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628415#comment-17628415
 ] 

Apache Spark commented on SPARK-41009:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/38490

> Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
> ---
>
> Key: SPARK-41009
> URL: https://issues.apache.org/jira/browse/SPARK-41009
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41009:


Assignee: Apache Spark  (was: Max Gekk)

> Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
> ---
>
> Key: SPARK-41009
> URL: https://issues.apache.org/jira/browse/SPARK-41009
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628414#comment-17628414
 ] 

Apache Spark commented on SPARK-41009:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/38490

> Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
> ---
>
> Key: SPARK-41009
> URL: https://issues.apache.org/jira/browse/SPARK-41009
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41009:


Assignee: Max Gekk  (was: Apache Spark)

> Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
> ---
>
> Key: SPARK-41009
> URL: https://issues.apache.org/jira/browse/SPARK-41009
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Created] (SPARK-41009) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070

2022-11-03 Thread Max Gekk (Jira)
Max Gekk created SPARK-41009:


 Summary: Assign a name to the legacy error class 
_LEGACY_ERROR_TEMP_1070
 Key: SPARK-41009
 URL: https://issues.apache.org/jira/browse/SPARK-41009
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk
 Fix For: 3.4.0









[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation

2022-11-03 Thread Arne Koopman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arne Koopman updated SPARK-41008:
-
Description: 
 
{code:python}
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark

# The P(positives | model_score):
# 0.6 -> 0.5 (1 out of the 2 labels is positive)
# 0.333 -> 0.333 (1 out of the 3 labels is positive)
# 0.20 -> 0.25 (1 out of the 4 labels is positive)
tc_pd = pd.DataFrame({
    "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
    "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
    "weight": 1,
})

# The fraction of positives for each of the distinct model_scores would be the
# best fit, resulting in the following expected calibrated model_scores:
# "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]

# The sklearn implementation of Isotonic Regression.
tc_regressor_sklearn = IsotonicRegression_sklearn().fit(
    X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
# >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]

# The pyspark implementation of Isotonic Regression.
tc_df = spark.createDataFrame(tc_pd)
tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))

isotonic_regressor_pyspark = IsotonicRegression_pyspark(
    featuresCol='model_score', labelCol='label', weightCol='weight')
tc_model = isotonic_regressor_pyspark.fit(tc_df)
tc_pd = tc_model.transform(tc_df).toPandas()
print("pyspark:", tc_pd['prediction'].values)
# >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]

# The result from the pyspark implementation seems incorrect. Similar small toy
# examples lead to similarly unexpected results for the pyspark implementation.

# Strangely enough, for 'large' datasets, the difference between the calibrated
# model_scores generated by the two implementations disappears.
{code}
 

  was:
 
{code:python}
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark

# The P(positives | model_score):
# 0.6 -> 0.5 (1 out of the 2 labels is positive)
# 0.333 -> 0.333 (1 out of the 3 labels is positive)
# 0.20 -> 0.25 (1 out of the 4 labels is positive)
tc_pd = pd.DataFrame({
    "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
    "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
    "weight": 1,
})

# The fraction of positives for each of the distinct model_scores would be the
# best fit, resulting in the following expected calibrated model_scores:
# "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]

# The sklearn implementation of Isotonic Regression.
tc_regressor_sklearn = IsotonicRegression_sklearn().fit(
    X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
# >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]

# The pyspark implementation of Isotonic Regression.
tc_df = spark.createDataFrame(tc_pd)
tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))

isotonic_regressor_pyspark = IsotonicRegression_pyspark(
    featuresCol='model_score', labelCol='label', weightCol='weight')
tc_model = isotonic_regressor_pyspark.fit(tc_df)
tc_pd = tc_model.transform(tc_df).toPandas()
print("pyspark:", tc_pd['prediction'].values)
# >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]

# The result from the pyspark implementation seems incorrect. Similar small toy
# examples lead to similarly unexpected results for the pyspark implementation.

# Strangely enough, for 'large' datasets, the difference between the calibrated
# model_scores generated by the two implementations disappears.
{code}
 


> Isotonic regression result differs from sklearn implementation
> --
>
> Key: SPARK-41008
> URL: https://issues.apache.org/jira/browse/SPARK-41008
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.3.1
>Reporter: Arne Koopman
>Priority: Major
>
>  
> {code:python}
> import pandas as pd
> from pyspark.sql.types import DoubleType
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> from pyspark.ml.regression import 

[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation

2022-11-03 Thread Arne Koopman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arne Koopman updated SPARK-41008:
-
Description: 
 
{code:python}
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark

# The P(positives | model_score):
# 0.6 -> 0.5 (1 out of the 2 labels is positive)
# 0.333 -> 0.333 (1 out of the 3 labels is positive)
# 0.20 -> 0.25 (1 out of the 4 labels is positive)
tc_pd = pd.DataFrame({
    "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
    "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
    "weight": 1,
})

# The fraction of positives for each of the distinct model_scores would be the
# best fit, resulting in the following expected calibrated model_scores:
# "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]

# The sklearn implementation of Isotonic Regression.
tc_regressor_sklearn = IsotonicRegression_sklearn().fit(
    X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
# >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]

# The pyspark implementation of Isotonic Regression.
tc_df = spark.createDataFrame(tc_pd)
tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))

isotonic_regressor_pyspark = IsotonicRegression_pyspark(
    featuresCol='model_score', labelCol='label', weightCol='weight')
tc_model = isotonic_regressor_pyspark.fit(tc_df)
tc_pd = tc_model.transform(tc_df).toPandas()
print("pyspark:", tc_pd['prediction'].values)
# >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]

# The result from the pyspark implementation seems incorrect. Similar small toy
# examples lead to similarly unexpected results for the pyspark implementation.

# Strangely enough, for 'large' datasets, the difference between the calibrated
# model_scores generated by the two implementations disappears.
{code}
 

  was:
{code:python}
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark

# The P(positives | model_score):
# 0.6 -> 0.5 (1 out of the 2 labels is positive)
# 0.333 -> 0.333 (1 out of the 3 labels is positive)
# 0.20 -> 0.25 (1 out of the 4 labels is positive)
tc_pd = pd.DataFrame({
    "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
    "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
    "weight": 1,
})

# The fraction of positives for each of the distinct model_scores would be the
# best fit, resulting in the following expected calibrated model_scores:
# "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]

# The sklearn implementation of Isotonic Regression.
tc_regressor_sklearn = IsotonicRegression_sklearn().fit(
    X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
# >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]

# The pyspark implementation of Isotonic Regression.
tc_df = spark.createDataFrame(tc_pd)
tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))

isotonic_regressor_pyspark = IsotonicRegression_pyspark(
    featuresCol='model_score', labelCol='label', weightCol='weight')
tc_model = isotonic_regressor_pyspark.fit(tc_df)
tc_pd = tc_model.transform(tc_df).toPandas()
print("pyspark:", tc_pd['prediction'].values)
# >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]

# The result from the pyspark implementation seems incorrect. Similar small toy
# examples lead to similarly unexpected results for the pyspark implementation.

# Strangely enough, for 'large' datasets, the difference between the calibrated
# model_scores generated by the two implementations disappears.
{code}


> Isotonic regression result differs from sklearn implementation
> --
>
> Key: SPARK-41008
> URL: https://issues.apache.org/jira/browse/SPARK-41008
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.3.1
>Reporter: Arne Koopman
>Priority: Major
>
>  
> {code:python}
> import pandas as pd
> from pyspark.sql.types import DoubleType
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> from 

[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation

2022-11-03 Thread Arne Koopman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arne Koopman updated SPARK-41008:
-
Description: 
{code:python}
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark

# The P(positives | model_score):
# 0.6 -> 0.5 (1 out of the 2 labels is positive)
# 0.333 -> 0.333 (1 out of the 3 labels is positive)
# 0.20 -> 0.25 (1 out of the 4 labels is positive)
tc_pd = pd.DataFrame({
    "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
    "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
    "weight": 1,
})

# The fraction of positives for each of the distinct model_scores would be the
# best fit, resulting in the following expected calibrated model_scores:
# "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]

# The sklearn implementation of Isotonic Regression.
tc_regressor_sklearn = IsotonicRegression_sklearn().fit(
    X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
# >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]

# The pyspark implementation of Isotonic Regression.
tc_df = spark.createDataFrame(tc_pd)
tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))

isotonic_regressor_pyspark = IsotonicRegression_pyspark(
    featuresCol='model_score', labelCol='label', weightCol='weight')
tc_model = isotonic_regressor_pyspark.fit(tc_df)
tc_pd = tc_model.transform(tc_df).toPandas()
print("pyspark:", tc_pd['prediction'].values)
# >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]

# The result from the pyspark implementation seems incorrect. Similar small toy
# examples lead to similarly unexpected results for the pyspark implementation.

# Strangely enough, for 'large' datasets, the difference between the calibrated
# model_scores generated by the two implementations disappears.
{code}

  was:
```
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark

# The P(positives | model_score):
# 0.6 -> 0.5 (1 out of the 2 labels is positive)
# 0.333 -> 0.333 (1 out of the 3 labels is positive)
# 0.20 -> 0.25 (1 out of the 4 labels is positive)
tc_pd = pd.DataFrame({
    "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
    "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
    "weight": 1,
})

# The fraction of positives for each of the distinct model_scores would be the
# best fit, resulting in the following expected calibrated model_scores:
# "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]

# The sklearn implementation of Isotonic Regression.
tc_regressor_sklearn = IsotonicRegression_sklearn().fit(
    X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
# >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]

# The pyspark implementation of Isotonic Regression.
tc_df = spark.createDataFrame(tc_pd)
tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))

isotonic_regressor_pyspark = IsotonicRegression_pyspark(
    featuresCol='model_score', labelCol='label', weightCol='weight')
tc_model = isotonic_regressor_pyspark.fit(tc_df)
tc_pd = tc_model.transform(tc_df).toPandas()
print("pyspark:", tc_pd['prediction'].values)
# >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]

# The result from the pyspark implementation seems incorrect. Similar small toy
# examples lead to similarly unexpected results for the pyspark implementation.

# Strangely enough, for 'large' datasets, the difference between the calibrated
# model_scores generated by the two implementations disappears.
```


> Isotonic regression result differs from sklearn implementation
> --
>
> Key: SPARK-41008
> URL: https://issues.apache.org/jira/browse/SPARK-41008
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.3.1
>Reporter: Arne Koopman
>Priority: Major
>
> {code:python}
> import pandas as pd
> from pyspark.sql.types import DoubleType
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> from 

[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation

2022-11-03 Thread Arne Koopman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arne Koopman updated SPARK-41008:
-
Description: 
```
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark

# The P(positives | model_score):
# 0.6 -> 0.5 (1 out of the 2 labels is positive)
# 0.333 -> 0.333 (1 out of the 3 labels is positive)
# 0.20 -> 0.25 (1 out of the 4 labels is positive)
tc_pd = pd.DataFrame({
    "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
    "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
    "weight": 1,
})

# The fraction of positives for each of the distinct model_scores would be the
# best fit, resulting in the following expected calibrated model_scores:
# "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]

# The sklearn implementation of Isotonic Regression.
tc_regressor_sklearn = IsotonicRegression_sklearn().fit(
    X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
# >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]

# The pyspark implementation of Isotonic Regression.
tc_df = spark.createDataFrame(tc_pd)
tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))

isotonic_regressor_pyspark = IsotonicRegression_pyspark(
    featuresCol='model_score', labelCol='label', weightCol='weight')
tc_model = isotonic_regressor_pyspark.fit(tc_df)
tc_pd = tc_model.transform(tc_df).toPandas()
print("pyspark:", tc_pd['prediction'].values)
# >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]

# The result from the pyspark implementation seems incorrect. Similar small toy
# examples lead to similarly unexpected results for the pyspark implementation.

# Strangely enough, for 'large' datasets, the difference between the calibrated
# model_scores generated by the two implementations disappears.
```

  was:
 

{{```}}

import pandas as pd
from pyspark.sql import functions as F  # needed for F.col() below
from pyspark.sql.types import DoubleType
from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark

# The P(positives | model_score):
# 0.6 -> 0.5 (1 out of the 2 labels is positive)
# 0.333 -> 0.333 (1 out of the 3 labels is positive)
# 0.20 -> 0.25 (1 out of the 4 labels is positive)
tc_pd = pd.DataFrame(
    {
        "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
        "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
        "weight": 1,
    }
)

# The fraction of positives for each of the distinct model_scores would be the best fit,
# resulting in the following expected calibrated model_scores:
# "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]

# The sklearn implementation of Isotonic Regression.
tc_regressor_sklearn = IsotonicRegression_sklearn().fit(
    X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
# >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]

# The pyspark implementation of Isotonic Regression
# (assumes an active SparkSession bound to `spark`, e.g. the pyspark shell).
tc_df = spark.createDataFrame(tc_pd)
tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))

isotonic_regressor_pyspark = IsotonicRegression_pyspark(
    featuresCol='model_score', labelCol='label', weightCol='weight')
tc_model = isotonic_regressor_pyspark.fit(tc_df)
tc_pd = tc_model.transform(tc_df).toPandas()
print("pyspark:", tc_pd['prediction'].values)
# >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]

# The result from the pyspark implementation does not match the expected calibrated
# values, and similar small toy examples give similarly unexpected results.
# Strangely enough, for 'large' datasets, the difference between the calibrated
# model_scores generated by the two implementations disappears.
{{```}}


> Isotonic regression result differs from sklearn implementation
> --
>
> Key: SPARK-41008
> URL: https://issues.apache.org/jira/browse/SPARK-41008
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.3.1
>Reporter: Arne Koopman
>Priority: Major
>
> ```
> import pandas as pd
> from pyspark.sql.types import DoubleType
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> from pyspark.ml.regression import 

[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation

2022-11-03 Thread Arne Koopman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arne Koopman updated SPARK-41008:
-
Description: 
 

{{```}}

import pandas as pd
from pyspark.sql import functions as F  # needed for F.col() below
from pyspark.sql.types import DoubleType
from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark

# The P(positives | model_score):
# 0.6 -> 0.5 (1 out of the 2 labels is positive)
# 0.333 -> 0.333 (1 out of the 3 labels is positive)
# 0.20 -> 0.25 (1 out of the 4 labels is positive)
tc_pd = pd.DataFrame(
    {
        "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
        "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
        "weight": 1,
    }
)

# The fraction of positives for each of the distinct model_scores would be the best fit,
# resulting in the following expected calibrated model_scores:
# "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]

# The sklearn implementation of Isotonic Regression.
tc_regressor_sklearn = IsotonicRegression_sklearn().fit(
    X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
# >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]

# The pyspark implementation of Isotonic Regression
# (assumes an active SparkSession bound to `spark`, e.g. the pyspark shell).
tc_df = spark.createDataFrame(tc_pd)
tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))

isotonic_regressor_pyspark = IsotonicRegression_pyspark(
    featuresCol='model_score', labelCol='label', weightCol='weight')
tc_model = isotonic_regressor_pyspark.fit(tc_df)
tc_pd = tc_model.transform(tc_df).toPandas()
print("pyspark:", tc_pd['prediction'].values)
# >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]

# The result from the pyspark implementation does not match the expected calibrated
# values, and similar small toy examples give similarly unexpected results.
# Strangely enough, for 'large' datasets, the difference between the calibrated
# model_scores generated by the two implementations disappears.
{{```}}

  was:
import pandas as pd
from pyspark.sql import functions as F  # needed for F.col() below
from pyspark.sql.types import DoubleType
from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark

# The P(positives | model_score):
# 0.6 -> 0.5 (1 out of the 2 labels is positive)
# 0.333 -> 0.333 (1 out of the 3 labels is positive)
# 0.20 -> 0.25 (1 out of the 4 labels is positive)
tc_pd = pd.DataFrame(
    {
        "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
        "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
        "weight": 1,
    }
)

# The fraction of positives for each of the distinct model_scores would be the best fit,
# resulting in the following expected calibrated model_scores:
# "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]

# The sklearn implementation of Isotonic Regression.
tc_regressor_sklearn = IsotonicRegression_sklearn().fit(
    X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
# >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]

# The pyspark implementation of Isotonic Regression
# (assumes an active SparkSession bound to `spark`, e.g. the pyspark shell).
tc_df = spark.createDataFrame(tc_pd)
tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))

isotonic_regressor_pyspark = IsotonicRegression_pyspark(
    featuresCol='model_score', labelCol='label', weightCol='weight')
tc_model = isotonic_regressor_pyspark.fit(tc_df)
tc_pd = tc_model.transform(tc_df).toPandas()
print("pyspark:", tc_pd['prediction'].values)
# >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]

# The result from the pyspark implementation does not match the expected calibrated
# values, and similar small toy examples give similarly unexpected results.
# Strangely enough, for 'large' datasets, the difference between the calibrated
# model_scores generated by the two implementations disappears.


> Isotonic regression result differs from sklearn implementation
> --
>
> Key: SPARK-41008
> URL: https://issues.apache.org/jira/browse/SPARK-41008
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.3.1
>Reporter: Arne Koopman
>Priority: Major
>
>  
> {{```}}
> import pandas as pd
> from pyspark.sql.types import DoubleType
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> from pyspark.ml.regression import IsotonicRegression as 

[jira] [Created] (SPARK-41008) Isotonic regression result differs from sklearn implementation

2022-11-03 Thread Arne Koopman (Jira)
Arne Koopman created SPARK-41008:


 Summary: Isotonic regression result differs from sklearn 
implementation
 Key: SPARK-41008
 URL: https://issues.apache.org/jira/browse/SPARK-41008
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 3.3.1
Reporter: Arne Koopman


import pandas as pd
from pyspark.sql import functions as F  # needed for F.col() below
from pyspark.sql.types import DoubleType
from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark

# The P(positives | model_score):
# 0.6 -> 0.5 (1 out of the 2 labels is positive)
# 0.333 -> 0.333 (1 out of the 3 labels is positive)
# 0.20 -> 0.25 (1 out of the 4 labels is positive)
tc_pd = pd.DataFrame(
    {
        "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
        "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
        "weight": 1,
    }
)

# The fraction of positives for each of the distinct model_scores would be the best fit,
# resulting in the following expected calibrated model_scores:
# "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]

# The sklearn implementation of Isotonic Regression.
tc_regressor_sklearn = IsotonicRegression_sklearn().fit(
    X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
# >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]

# The pyspark implementation of Isotonic Regression
# (assumes an active SparkSession bound to `spark`, e.g. the pyspark shell).
tc_df = spark.createDataFrame(tc_pd)
tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))

isotonic_regressor_pyspark = IsotonicRegression_pyspark(
    featuresCol='model_score', labelCol='label', weightCol='weight')
tc_model = isotonic_regressor_pyspark.fit(tc_df)
tc_pd = tc_model.transform(tc_df).toPandas()
print("pyspark:", tc_pd['prediction'].values)
# >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]

# The result from the pyspark implementation does not match the expected calibrated
# values, and similar small toy examples give similarly unexpected results.
# Strangely enough, for 'large' datasets, the difference between the calibrated
# model_scores generated by the two implementations disappears.
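
For reference, the expected values above are what a plain weighted pool-adjacent-violators (PAV) pass produces once tied model_scores are pooled into their weighted mean label. A minimal Scala sketch of that reference computation (illustrative only, not Spark's implementation):

{code:scala}
// Toy data from the report: (model_score, label) pairs with unit weights.
val pts = Seq(
  0.6 -> 1.0, 0.6 -> 0.0,
  0.333 -> 0.0, 0.333 -> 1.0, 0.333 -> 0.0,
  0.20 -> 1.0, 0.20 -> 0.0, 0.20 -> 0.0, 0.20 -> 0.0)

// A pooled block: the distinct scores it covers, its label sum and total weight.
case class Block(xs: List[Double], sum: Double, w: Double) { def mean: Double = sum / w }

// Step 1: pool exact ties on model_score into one weighted point per distinct score.
val pooled = pts.groupBy(_._1).toList.sortBy(_._1).map { case (x, g) =>
  Block(List(x), g.map(_._2).sum, g.size.toDouble)
}

// Step 2: pool adjacent violators, merging neighbouring blocks while the fit decreases.
var stack = List.empty[Block]
for (b <- pooled) {
  stack = b :: stack
  while (stack.size > 1 && stack(1).mean > stack.head.mean) {
    val merged = Block(stack(1).xs ++ stack.head.xs,
                       stack(1).sum + stack.head.sum,
                       stack(1).w + stack.head.w)
    stack = merged :: stack.drop(2)
  }
}
val fitted = stack.flatMap(b => b.xs.map(_ -> b.mean)).toMap

println(pts.map { case (x, _) => f"${fitted(x)}%.3f" })
// -> List(0.500, 0.500, 0.333, 0.333, 0.333, 0.250, 0.250, 0.250, 0.250)
{code}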



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41007:


Assignee: Apache Spark

> BigInteger Serialization doesn't work with JavaBean Encoder
> ---
>
> Key: SPARK-41007
> URL: https://issues.apache.org/jira/browse/SPARK-41007
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.1
>Reporter: Daniel Fiterma
>Assignee: Apache Spark
>Priority: Minor
>
> When creating a dataset using the [Java Bean 
> Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-]
>  with a bean that contains a field which is a {{java.math.BigInteger}}, the 
> dataset will fail to serialize correctly. When trying to serialize the 
> dataset, Spark throws the following error:
>  
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
> cast `bigInteger` from struct<> to decimal(38,18).
>  {code}
>  
> Reproduction steps:
> Using the Java Dataset API:
>  # Create a Bean with a  {{java.math.BigInteger}} field
>  # Pass said Bean into the Java SparkSession {{createDataset}} function
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41007:


Assignee: (was: Apache Spark)

> BigInteger Serialization doesn't work with JavaBean Encoder
> ---
>
> Key: SPARK-41007
> URL: https://issues.apache.org/jira/browse/SPARK-41007
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.1
>Reporter: Daniel Fiterma
>Priority: Minor
>
> When creating a dataset using the [Java Bean 
> Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-]
>  with a bean that contains a field which is a {{java.math.BigInteger}}, the 
> dataset will fail to serialize correctly. When trying to serialize the 
> dataset, Spark throws the following error:
>  
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
> cast `bigInteger` from struct<> to decimal(38,18).
>  {code}
>  
> Reproduction steps:
> Using the Java Dataset API:
>  # Create a Bean with a  {{java.math.BigInteger}} field
>  # Pass said Bean into the Java SparkSession {{createDataset}} function
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628384#comment-17628384
 ] 

Apache Spark commented on SPARK-41007:
--

User 'dfit99' has created a pull request for this issue:
https://github.com/apache/spark/pull/38500

> BigInteger Serialization doesn't work with JavaBean Encoder
> ---
>
> Key: SPARK-41007
> URL: https://issues.apache.org/jira/browse/SPARK-41007
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.1
>Reporter: Daniel Fiterma
>Priority: Minor
>
> When creating a dataset using the [Java Bean 
> Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-]
>  with a bean that contains a field which is a {{java.math.BigInteger}}, the 
> dataset will fail to serialize correctly. When trying to serialize the 
> dataset, Spark throws the following error:
>  
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
> cast `bigInteger` from struct<> to decimal(38,18).
>  {code}
>  
> Reproduction steps:
> Using the Java Dataset API:
>  # Create a Bean with a  {{java.math.BigInteger}} field
>  # Pass said Bean into the Java SparkSession {{createDataset}} function
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder

2022-11-03 Thread Daniel Fiterma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Fiterma updated SPARK-41007:
---
Description: 
When creating a dataset using the [Java Bean 
Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-]
 with a bean that contains a field which is a {{java.math.BigInteger}}, the 
dataset will fail to serialize correctly. When trying to serialize the dataset, 
Spark throws the following error:

 
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
cast `bigInteger` from struct<> to decimal(38,18).
 {code}
 

Reproduction steps:

Using the Java Dataset API:
 # Create a Bean with a  {{java.math.BigInteger}} field
 # Pass said Bean into the Java SparkSession {{createDataset}} function

 

  was:
When creating a dataset using the [Java Bean 
Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-]
 with a bean that contains a field which is a {{java.math.BigInteger}}, the 
dataset will fail to serialize correctly. When trying to deserialize the 
dataset, Spark throws the following error:

 
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
cast `bigInteger` from struct<> to decimal(38,18).
 {code}
 

Reproduction steps:

Using the Java Dataset API:
 # Create a Bean with a  {{java.math.BigInteger}} field
 # Pass said Bean into the Java SparkSession {{createDataset}} function

 


> BigInteger Serialization doesn't work with JavaBean Encoder
> ---
>
> Key: SPARK-41007
> URL: https://issues.apache.org/jira/browse/SPARK-41007
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.1
>Reporter: Daniel Fiterma
>Priority: Minor
>
> When creating a dataset using the [Java Bean 
> Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-]
>  with a bean that contains a field which is a {{java.math.BigInteger}}, the 
> dataset will fail to serialize correctly. When trying to serialize the 
> dataset, Spark throws the following error:
>  
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
> cast `bigInteger` from struct<> to decimal(38,18).
>  {code}
>  
> Reproduction steps:
> Using the Java Dataset API:
>  # Create a Bean with a  {{java.math.BigInteger}} field
>  # Pass said Bean into the Java SparkSession {{createDataset}} function
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder

2022-11-03 Thread Daniel Fiterma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628372#comment-17628372
 ] 

Daniel Fiterma commented on SPARK-41007:


FYI: I have a fix for this already; going to push out a merge request soon. 

> BigInteger Serialization doesn't work with JavaBean Encoder
> ---
>
> Key: SPARK-41007
> URL: https://issues.apache.org/jira/browse/SPARK-41007
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.1
>Reporter: Daniel Fiterma
>Priority: Minor
>
> When creating a dataset using the [Java Bean 
> Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-]
>  with a bean that contains a field which is a {{java.math.BigInteger}}, the 
> dataset will fail to serialize correctly. When trying to deserialize the 
> dataset, Spark throws the following error:
>  
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
> cast `bigInteger` from struct<> to decimal(38,18).
>  {code}
>  
>  
> Reproduction steps:
> Using the Java Dataset API:
>  # Create a Bean with a  {{java.math.BigInteger}} field
>  # Pass said Bean into the Java SparkSession {{createDataset}} function
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder

2022-11-03 Thread Daniel Fiterma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Fiterma updated SPARK-41007:
---
Description: 
When creating a dataset using the [Java Bean 
Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-]
 with a bean that contains a field which is a {{java.math.BigInteger}}, the 
dataset will fail to serialize correctly. When trying to deserialize the 
dataset, Spark throws the following error:

 
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
cast `bigInteger` from struct<> to decimal(38,18).
 {code}
 

Reproduction steps:

Using the Java Dataset API:
 # Create a Bean with a  {{java.math.BigInteger}} field
 # Pass said Bean into the Java SparkSession {{createDataset}} function

 

  was:
When creating a dataset using the [Java Bean 
Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-]
 with a bean that contains a field which is a {{java.math.BigInteger}}, the 
dataset will fail to serialize correctly. When trying to deserialize the 
dataset, Spark throws the following error:

 
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
cast `bigInteger` from struct<> to decimal(38,18).
 {code}
 

 

Reproduction steps:

Using the Java Dataset API:
 # Create a Bean with a  {{java.math.BigInteger}} field
 # Pass said Bean into the Java SparkSession {{createDataset}} function

 


> BigInteger Serialization doesn't work with JavaBean Encoder
> ---
>
> Key: SPARK-41007
> URL: https://issues.apache.org/jira/browse/SPARK-41007
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.1
>Reporter: Daniel Fiterma
>Priority: Minor
>
> When creating a dataset using the [Java Bean 
> Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-]
>  with a bean that contains a field which is a {{java.math.BigInteger}}, the 
> dataset will fail to serialize correctly. When trying to deserialize the 
> dataset, Spark throws the following error:
>  
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
> cast `bigInteger` from struct<> to decimal(38,18).
>  {code}
>  
> Reproduction steps:
> Using the Java Dataset API:
>  # Create a Bean with a  {{java.math.BigInteger}} field
>  # Pass said Bean into the Java SparkSession {{createDataset}} function
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder

2022-11-03 Thread Daniel Fiterma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Fiterma updated SPARK-41007:
---
Description: 
When creating a dataset using the [Java Bean 
Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-]
 with a bean that contains a field which is a {{java.math.BigInteger}}, the 
dataset will fail to serialize correctly. When trying to deserialize the 
dataset, Spark throws the following error:

 
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
cast `bigInteger` from struct<> to decimal(38,18).
 {code}
 

 

Reproduction steps:

Using the Java Dataset API:
 # Create a Bean with a  {{java.math.BigInteger}} field
 # Pass said Bean into the Java SparkSession {{createDataset}} function

 

  was:
When creating a dataset using the [Java Bean 
Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-]
 with a bean that contains a field which is a {{java.math.BigInteger}}, the 
dataset will fail to serialize correctly. When trying to deserialize the 
dataset, Spark throws the following error:

 
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
cast `bigInteger` from struct<> to decimal(38,18).
 {code}
 

 

Reproduction steps:

Using the Java Dataset API:
 # Create a Bean with a  {{java.math.BigInteger}} field
 # Pass said Bean into the Java SparkSession {{createDataset}} function
 #  

 


> BigInteger Serialization doesn't work with JavaBean Encoder
> ---
>
> Key: SPARK-41007
> URL: https://issues.apache.org/jira/browse/SPARK-41007
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.1
>Reporter: Daniel Fiterma
>Priority: Minor
>
> When creating a dataset using the [Java Bean 
> Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-]
>  with a bean that contains a field which is a {{java.math.BigInteger}}, the 
> dataset will fail to serialize correctly. When trying to deserialize the 
> dataset, Spark throws the following error:
>  
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
> cast `bigInteger` from struct<> to decimal(38,18).
>  {code}
>  
>  
> Reproduction steps:
> Using the Java Dataset API:
>  # Create a Bean with a  {{java.math.BigInteger}} field
>  # Pass said Bean into the Java SparkSession {{createDataset}} function
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41007) BigInteger Serialization doesn't work with JavaBean Encoder

2022-11-03 Thread Daniel Fiterma (Jira)
Daniel Fiterma created SPARK-41007:
--

 Summary: BigInteger Serialization doesn't work with JavaBean 
Encoder
 Key: SPARK-41007
 URL: https://issues.apache.org/jira/browse/SPARK-41007
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 3.3.1
Reporter: Daniel Fiterma


When creating a dataset using the [Java Bean 
Encoder|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/Encoders.html#bean-java.lang.Class-]
 with a bean that contains a field which is a {{java.math.BigInteger}}, the 
dataset will fail to serialize correctly. When trying to deserialize the 
dataset, Spark throws the following error:

 
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
cast `bigInteger` from struct<> to decimal(38,18).
 {code}
 

 

Reproduction steps:

Using the Java Dataset API:
 # Create a Bean with a  {{java.math.BigInteger}} field
 # Pass said Bean into the Java SparkSession {{createDataset}} function
 #  
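
A minimal repro sketch of the steps above, assuming a made-up {{Wallet}} bean (any JavaBean exposing a {{java.math.BigInteger}} getter/setter should hit the same analysis error on 3.3.1):

{code:scala}
import java.math.BigInteger

import scala.beans.BeanProperty

import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical JavaBean-style class; the BigInteger field is what triggers the bug.
class Wallet extends Serializable {
  @BeanProperty var bigInteger: BigInteger = BigInteger.ZERO
}

object BigIntegerBeanRepro extends App {
  val spark = SparkSession.builder().master("local").getOrCreate()

  val w = new Wallet
  w.setBigInteger(new BigInteger("12345678901234567890"))

  // On 3.3.1 this fails at analysis time with:
  //   AnalysisException: Cannot up cast `bigInteger` from struct<> to decimal(38,18).
  val ds = spark.createDataset(java.util.Arrays.asList(w))(Encoders.bean(classOf[Wallet]))
  ds.show()

  spark.stop()
}
{code}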

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40996) Upgrade `sbt-checkstyle-plugin` to 4.0.0

2022-11-03 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40996:
-
Priority: Minor  (was: Major)

> Upgrade `sbt-checkstyle-plugin` to 4.0.0
> 
>
> Key: SPARK-40996
> URL: https://issues.apache.org/jira/browse/SPARK-40996
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> This is a precondition for upgrading to sbt 1.7.3
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40996) Upgrade `sbt-checkstyle-plugin` to 4.0.0

2022-11-03 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40996.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38481
[https://github.com/apache/spark/pull/38481]

> Upgrade `sbt-checkstyle-plugin` to 4.0.0
> 
>
> Key: SPARK-40996
> URL: https://issues.apache.org/jira/browse/SPARK-40996
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.0
>
>
> This is a precondition for upgrading to sbt 1.7.3
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40996) Upgrade `sbt-checkstyle-plugin` to 4.0.0

2022-11-03 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40996:


Assignee: Yang Jie

> Upgrade `sbt-checkstyle-plugin` to 4.0.0
> 
>
> Key: SPARK-40996
> URL: https://issues.apache.org/jira/browse/SPARK-40996
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> This is a precondition for upgrading to sbt 1.7.3
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40834) Use SparkListenerSQLExecutionEnd to track final SQL status in UI

2022-11-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40834:
---

Assignee: XiDuo You

> Use SparkListenerSQLExecutionEnd to track final SQL status in UI
> 
>
> Key: SPARK-40834
> URL: https://issues.apache.org/jira/browse/SPARK-40834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
> The SQL may succeed with some failed jobs. For example, with an inner join 
> that has one empty side and one large side, the plan can finish while the 
> large side is still running.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40834) Use SparkListenerSQLExecutionEnd to track final SQL status in UI

2022-11-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40834.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38302
[https://github.com/apache/spark/pull/38302]

> Use SparkListenerSQLExecutionEnd to track final SQL status in UI
> 
>
> Key: SPARK-40834
> URL: https://issues.apache.org/jira/browse/SPARK-40834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
> The SQL may succeed with some failed jobs. For example, with an inner join 
> that has one empty side and one large side, the plan can finish while the 
> large side is still running.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace

2022-11-03 Thread Eric (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric updated SPARK-41006:
-
Description: 
If we use the Spark Launcher to launch our spark apps in k8s:
{code:java}
val sparkLauncher = new InProcessLauncher()
 .setMaster(k8sMaster)
 .setDeployMode(deployMode)
 .setAppName(appName)
 .setVerbose(true)

sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code}
We have an issue when we launch another spark driver in the same namespace 
where another spark app was already running:
{code:java}
kp -n audit-exporter-eee5073aac -w
NAME                                     READY   STATUS        RESTARTS   AGE
audit-exporter-71489e843d8085c0-driver   1/1     Running       0          9m54s
audit-exporter-7e6b8b843d80b9e6-exec-1   1/1     Running       0          9m40s
data-io-120204843d899567-driver          0/1     Terminating   0          1s
data-io-120204843d899567-driver          0/1     Terminating   0          2s
data-io-120204843d899567-driver          0/1     Terminating   0          3s
data-io-120204843d899567-driver          0/1     Terminating   0          3s{code}
The error is:
{code:java}
{"time":"2022-11-03T12:49:45.626Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-38:
 'data-io'","msg":"Application failed with 
exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException:
 Failure executing: PUT at: 
https://kubernetes.default/api/v1/namespaces/audit-exporter-eee5073aac/configmaps/spark-drv-d19c37843d80350c-conf-map.
 Message: ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: 
Forbidden: field is immutable when `immutable` is set. Received status: 
Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field 
is immutable when `immutable` is set, reason=FieldValueForbidden, 
additionalProperties={})], group=null, kind=ConfigMap, 
name=spark-drv-d19c37843d80350c-conf-map, retryAfterSeconds=null, uid=null, 
additionalProperties={}), kind=Status, message=ConfigMap 
\"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is 
immutable when `immutable` is set, metadata=ListMeta(_continue=null, 
remainingItemCount=null, resourceVersion=null, selfLink=null, 
additionalProperties={}), reason=Invalid, status=Failure, 
additionalProperties={}).\n\tat 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5360/00.apply(Unknown
 Source)\n\tat 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$4618/00.apply(Unknown
 Source)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat
 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat
 
io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat
 
io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat
 
io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5012/00.apply(Unknown
 Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown 
Source)\n\tat 
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Unknown 

[jira] [Updated] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace

2022-11-03 Thread Eric (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric updated SPARK-41006:
-
Description: 
If we use the Spark Launcher to launch our spark apps in k8s:
{code:java}
val sparkLauncher = new InProcessLauncher()
 .setMaster(k8sMaster)
 .setDeployMode(deployMode)
 .setAppName(appName)
 .setVerbose(true)

sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code}
We have an issue when we launch another spark driver in the same namespace 
where another spark app was already running:
{code:java}
kp -n audit-exporter-eee5073aac -w
NAME                                     READY   STATUS        RESTARTS   AGE
audit-exporter-71489e843d8085c0-driver   1/1     Running       0          9m54s
audit-exporter-7e6b8b843d80b9e6-exec-1   1/1     Running       0          9m40s
data-io-120204843d899567-driver          0/1     Terminating   0          1s
data-io-120204843d899567-driver          0/1     Terminating   0          2s
data-io-120204843d899567-driver          0/1     Terminating   0          3s
data-io-120204843d899567-driver          0/1     Terminating   0          3s{code}
The error is:
{code:java}
{"time":"2022-11-03T12:49:45.626Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-38:
 'data-io'","msg":"Application failed with 
exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException:
 Failure executing: PUT at: 
https://kubernetes.default/api/v1/namespaces/audit-exporter-eee5073aac/configmaps/spark-drv-d19c37843d80350c-conf-map.
 Message: ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: 
Forbidden: field is immutable when `immutable` is set. Received status: 
Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field 
is immutable when `immutable` is set, reason=FieldValueForbidden, 
additionalProperties={})], group=null, kind=ConfigMap, 
name=spark-drv-d19c37843d80350c-conf-map, retryAfterSeconds=null, uid=null, 
additionalProperties={}), kind=Status, message=ConfigMap 
\"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is 
immutable when `immutable` is set, metadata=ListMeta(_continue=null, 
remainingItemCount=null, resourceVersion=null, selfLink=null, 
additionalProperties={}), reason=Invalid, status=Failure, 
additionalProperties={}).\n\tat 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5360/00.apply(Unknown
 Source)\n\tat 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$4618/00.apply(Unknown
 Source)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat
 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat
 
io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat
 
io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat
 
io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5012/00.apply(Unknown
 Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown 
Source)\n\tat 
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Unknown 

[jira] [Resolved] (SPARK-27339) Decimal up cast to higher scale fails while reading parquet to Dataset

2022-11-03 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-27339.
--
Resolution: Duplicate

I can't reproduce this in the latest Spark, and think it might have been 
resolved by https://issues.apache.org/jira/browse/SPARK-31750
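
For anyone on branches that predate that fix, the usual workaround is the explicit cast that the error message itself suggests; roughly (a sketch reusing {{spark}}, {{path}} and {{SimpleDecimal}} from the reproduction quoted below, with {{spark.implicits._}} in scope):

{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Cast the column up to the decimal(38,18) that Spark infers for
// scala.math.BigDecimal before converting the dataframe to a Dataset.
val ds = spark.read.parquet(path)
  .withColumn("value", col("value").cast(DecimalType(38, 18)))
  .as[SimpleDecimal]
{code}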

> Decimal up cast to higher scale fails while reading parquet to Dataset
> --
>
> Key: SPARK-27339
> URL: https://issues.apache.org/jira/browse/SPARK-27339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.4.0
>Reporter: Bill Schneider
>Priority: Major
>
> Given a parquet file with a decimal(38,4) field, one can read it into a 
> dataframe, but reading/casting it to a dataset using a case class with a 
> BigDecimal field fails. 
> {code:java}
> import org.apache.spark.sql.{SaveMode, SparkSession}
> object ReproduceSparkDecimalBug extends App{
>   case class SimpleDecimal(value: BigDecimal)
>   val path = "/tmp/sparkTest"
>   val spark = SparkSession.builder().master("local").getOrCreate()
>   import spark.implicits._
>   spark
> .sql("SELECT CAST(10.12345 AS DECIMAL(38,4)) AS value ")
> .write
> .mode(SaveMode.Overwrite)
> .parquet(path)
>   // works fine and the dataframe will have a decimal(38,4)
>   val df = spark.read.parquet(path)
>   df.printSchema()
>   df.show(1)
>   // will fail -> org.apache.spark.sql.AnalysisException: Cannot up cast 
> `value` from decimal(38,4) to decimal(38,18) as it may truncate
>   // 1. Why does Spark see scala BigDecimal as fixed (38,18)?
>   // 2. Up casting to higher scale should be allowed anyway
>   val ds = df.as[SimpleDecimal]
>   ds.printSchema()
>   spark.close()
> }
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot up cast `value` from 
> decimal(38,4) to decimal(38,18) as it may truncate
> The type path of the target object is:
> - field (class: "scala.math.BigDecimal", name: "value")
> - root class: "ReproduceSparkDecimalBug.SimpleDecimal"
> You can either add an explicit cast to the input data or choose a higher 
> precision type of the field in the target object;
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2366)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$35$$anonfun$applyOrElse$15.applyOrElse(Analyzer.scala:2382)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$35$$anonfun$applyOrElse$15.applyOrElse(Analyzer.scala:2377)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:335)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> 

[jira] [Updated] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace

2022-11-03 Thread Eric (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric updated SPARK-41006:
-
Description: 
If we use the Spark Launcher to launch our spark apps in k8s:
{code:java}
val sparkLauncher = new InProcessLauncher()
 .setMaster(k8sMaster)
 .setDeployMode(deployMode)
 .setAppName(appName)
 .setVerbose(true)

sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code}
We have an issue when we launch another spark driver in the same namespace 
where another spark app was already running:
{code:java}
kp -n qa-topfive-python-spark-2-15d42ac3b9
NAME                                                READY   STATUS        RESTARTS   AGE
data-io-c590a7843d47e206-driver                     1/1     Terminating   0          2s
qa-top-five-python-1667475391655-exec-1             1/1     Running       0          94s
qa-topfive-python-spark-2-462c5d843d46e38b-driver   1/1     Running       0          119s {code}
The error is:
{code:java}
{"time":"2022-10-24T15:08:50.239Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-44:
 'data-io'","msg":"Application failed with 
exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException:
 Failure executing: PUT at: 
https://kubernetes.default/api/v1/namespaces/qa-topfive-python-spark-2-edf723f942/configmaps/spark-drv-34c4e3840a0466c2-conf-map.
 Message: ConfigMap \"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: 
Forbidden: field is immutable when `immutable` is set. Received status: 
Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field 
is immutable when `immutable` is set, reason=FieldValueForbidden, 
additionalProperties={})], group=null, kind=ConfigMap, 
name=spark-drv-34c4e3840a0466c2-conf-map, retryAfterSeconds=null, uid=null, 
additionalProperties={}), kind=Status, message=ConfigMap 
\"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: Forbidden: field is 
immutable when `immutable` is set, metadata=ListMeta(_continue=null, 
remainingItemCount=null, resourceVersion=null, selfLink=null, 
additionalProperties={}), reason=Invalid, status=Failure, 
additionalProperties={}).\n\tat 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5663/00.apply(Unknown
 Source)\n\tat 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$5183/00.apply(Unknown
 Source)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat
 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat
 
io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat
 
io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat
 
io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5578/00.apply(Unknown
 Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown 
Source)\n\tat 
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Unknown 
Source)\n\tat java.base/java.util.stream.AbstractPipeline.copyInto(Unknown 
Source)\n\tat 
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(Unknown 
Source)\n\tat 

[jira] [Created] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace

2022-11-03 Thread Eric (Jira)
Eric created SPARK-41006:


 Summary: ConfigMap has the same name when launching two pods on 
the same namespace
 Key: SPARK-41006
 URL: https://issues.apache.org/jira/browse/SPARK-41006
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.3.0, 3.2.0, 3.1.0
Reporter: Eric


If we use the Spark Launcher to launch our spark apps in k8s:
{code:java}
val sparkLauncher = new InProcessLauncher()
 .setMaster(k8sMaster)
 .setDeployMode(deployMode)
 .setAppName(appName)
 .setVerbose(true)

sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code}
We have an issue when we launch another spark driver in the same namespace 
where another spark app was already running:
{code:java}
kp -n qa-topfive-python-spark-2-15d42ac3b9
NAME                                                READY   STATUS        RESTARTS   AGE
data-io-c590a7843d47e206-driver                     1/1     Terminating   0          2s
qa-top-five-python-1667475391655-exec-1             1/1     Running       0          94s
qa-topfive-python-spark-2-462c5d843d46e38b-driver   1/1     Running       0          119s {code}
The error is:
{code:java}
{"time":"2022-10-24T15:08:50.239Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-44:
 'data-io'","msg":"Application failed with 
exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException:
 Failure executing: PUT at: 
https://kubernetes.default/api/v1/namespaces/qa-topfive-python-spark-2-edf723f942/configmaps/spark-drv-34c4e3840a0466c2-conf-map.
 Message: ConfigMap \"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: 
Forbidden: field is immutable when `immutable` is set. Received status: 
Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: field 
is immutable when `immutable` is set, reason=FieldValueForbidden, 
additionalProperties={})], group=null, kind=ConfigMap, 
name=spark-drv-34c4e3840a0466c2-conf-map, retryAfterSeconds=null, uid=null, 
additionalProperties={}), kind=Status, message=ConfigMap 
\"spark-drv-34c4e3840a0466c2-conf-map\" is invalid: data: Forbidden: field is 
immutable when `immutable` is set, metadata=ListMeta(_continue=null, 
remainingItemCount=null, resourceVersion=null, selfLink=null, 
additionalProperties={}), reason=Invalid, status=Failure, 
additionalProperties={}).\n\tat 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5663/00.apply(Unknown
 Source)\n\tat 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$5183/00.apply(Unknown
 Source)\n\tat 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat
 
io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat
 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat
 
io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.createOrReplace(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:105)\n\tat
 
io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.lambda$createOrReplace$7(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:174)\n\tat
 
io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl$$Lambda$5578/00.apply(Unknown
 Source)\n\tat java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown 
Source)\n\tat 

[jira] [Assigned] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40769:


Assignee: Apache Spark

> Migrate type check failures of aggregate expressions onto error classes
> ---
>
> Key: SPARK-40769
> URL: https://issues.apache.org/jira/browse/SPARK-40769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate 
> expressions:
> 1. Count (1):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59
> 2. CollectSet (1):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180
> 3. CountMinSketchAgg (4):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95
> 4. HistogramNumeric (3):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96
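For context, the migration pattern looks roughly like the sketch below. This is a hedged illustration, not the merged patch: the error sub-class and parameter names are assumptions, shown against CollectSet's map-type check.

{code:scala}
// Before: a free-form failure message.
override def checkInputDataTypes(): TypeCheckResult = {
  if (!child.dataType.existsRecursively(_.isInstanceOf[MapType])) {
    TypeCheckResult.TypeCheckSuccess
  } else {
    TypeCheckResult.TypeCheckFailure("collect_set() cannot have map type data")
  }
}

// After (sketch): a structured error class with named message parameters.
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.DataTypeMismatch

override def checkInputDataTypes(): TypeCheckResult = {
  if (!child.dataType.existsRecursively(_.isInstanceOf[MapType])) {
    TypeCheckResult.TypeCheckSuccess
  } else {
    DataTypeMismatch(
      errorSubClass = "UNSUPPORTED_INPUT_TYPE",   // assumed sub-class name
      messageParameters = Map(
        "functionName" -> toSQLId(prettyName),    // helpers from QueryErrorsBase
        "dataType" -> toSQLType(MapType)))
  }
}
{code}
The point of the migration is that the failure stays machine-readable (an error class plus parameters) instead of baking everything into an English string.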



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628265#comment-17628265
 ] 

Apache Spark commented on SPARK-40769:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38498

> Migrate type check failures of aggregate expressions onto error classes
> ---
>
> Key: SPARK-40769
> URL: https://issues.apache.org/jira/browse/SPARK-40769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate 
> expressions:
> 1. Count (1):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59
> 2. CollectSet (1):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180
> 3. CountMinSketchAgg (4):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95
> 4. HistogramNumeric (3):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628264#comment-17628264
 ] 

Apache Spark commented on SPARK-40769:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38498

> Migrate type check failures of aggregate expressions onto error classes
> ---
>
> Key: SPARK-40769
> URL: https://issues.apache.org/jira/browse/SPARK-40769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate 
> expressions:
> 1. Count (1):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59
> 2. CollectSet (1):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180
> 3. CountMinSketchAgg (4):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95
> 4. HistogramNumeric (3):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40769:


Assignee: (was: Apache Spark)

> Migrate type check failures of aggregate expressions onto error classes
> ---
>
> Key: SPARK-40769
> URL: https://issues.apache.org/jira/browse/SPARK-40769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate 
> expressions:
> 1. Count (1):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59
> 2. CollectSet (1):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180
> 3. CountMinSketchAgg (4):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95
> 4. HistogramNumeric (3):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41005) Arrow based collect

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628261#comment-17628261
 ] 

Apache Spark commented on SPARK-41005:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38468

> Arrow based collect
> ---
>
> Key: SPARK-41005
> URL: https://issues.apache.org/jira/browse/SPARK-41005
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41005) Arrow based collect

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41005:


Assignee: Apache Spark

> Arrow based collect
> ---
>
> Key: SPARK-41005
> URL: https://issues.apache.org/jira/browse/SPARK-41005
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41005) Arrow based collect

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628260#comment-17628260
 ] 

Apache Spark commented on SPARK-41005:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38468

> Arrow based collect
> ---
>
> Key: SPARK-41005
> URL: https://issues.apache.org/jira/browse/SPARK-41005
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41005) Arrow based collect

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41005:


Assignee: (was: Apache Spark)

> Arrow based collect
> ---
>
> Key: SPARK-41005
> URL: https://issues.apache.org/jira/browse/SPARK-41005
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41005) Arrow based collect

2022-11-03 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41005:
-

 Summary: Arrow based collect
 Key: SPARK-41005
 URL: https://issues.apache.org/jira/browse/SPARK-41005
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40999) Hints on subqueries are not properly propagated

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628248#comment-17628248
 ] 

Apache Spark commented on SPARK-40999:
--

User 'fred-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/38497

> Hints on subqueries are not properly propagated
> ---
>
> Key: SPARK-40999
> URL: https://issues.apache.org/jira/browse/SPARK-40999
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1
>Reporter: Fredrik Klauß
>Priority: Major
>
> Currently, if a user specifies a query like the following, the hints on the 
> subquery are lost:
> {code:java}
> SELECT * FROM target t WHERE EXISTS
> (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code}
> This happens because hints are removed from the plan and pulled into joins at 
> the beginning of the optimization stage, while subqueries are only turned into 
> joins during optimization. Since we remove any hints that are not below a 
> join, we end up removing hints that are below a subquery.
>  
> To resolve this, we add a hint field to SubqueryExpression that any hints 
> inside a subquery's plan can be pulled into during EliminateResolvedHint, and 
> then pass this hint on when the subquery is turned into a join.
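A quick way to observe the bug is to check the physical plan for the semi join. The sketch below is illustrative only; the table names and data are assumptions, not part of the original report.

{code:scala}
// Hedged sketch: does the subquery hint reach the planned join?
spark.range(1000).withColumnRenamed("id", "key").write.saveAsTable("target")
spark.range(10).withColumnRenamed("id", "key").write.saveAsTable("source")

spark.sql("""
  SELECT * FROM target t WHERE EXISTS
    (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key)
""").explain()
// If the hint propagates, the LeftSemi join is planned as a BroadcastHashJoin
// regardless of table sizes; if it is dropped, Spark falls back to its
// size-based strategy choice (e.g. SortMergeJoin for large inputs).
{code}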



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40999) Hints on subqueries are not properly propagated

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40999:


Assignee: (was: Apache Spark)

> Hints on subqueries are not properly propagated
> ---
>
> Key: SPARK-40999
> URL: https://issues.apache.org/jira/browse/SPARK-40999
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1
>Reporter: Fredrik Klauß
>Priority: Major
>
> Currently, if a user specifies a query like the following, the hints on the 
> subquery are lost:
> {code:java}
> SELECT * FROM target t WHERE EXISTS
> (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code}
> This happens because hints are removed from the plan and pulled into joins at 
> the beginning of the optimization stage, while subqueries are only turned into 
> joins during optimization. Since we remove any hints that are not below a 
> join, we end up removing hints that are below a subquery.
>  
> To resolve this, we add a hint field to SubqueryExpression that any hints 
> inside a subquery's plan can be pulled into during EliminateResolvedHint, and 
> then pass this hint on when the subquery is turned into a join.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40999) Hints on subqueries are not properly propagated

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40999:


Assignee: Apache Spark

> Hints on subqueries are not properly propagated
> ---
>
> Key: SPARK-40999
> URL: https://issues.apache.org/jira/browse/SPARK-40999
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1
>Reporter: Fredrik Klauß
>Assignee: Apache Spark
>Priority: Major
>
> Currently, if a user specifies a query like the following, the hints on the 
> subquery are lost:
> {code:java}
> SELECT * FROM target t WHERE EXISTS
> (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code}
> This happens because hints are removed from the plan and pulled into joins at 
> the beginning of the optimization stage, while subqueries are only turned into 
> joins during optimization. Since we remove any hints that are not below a 
> join, we end up removing hints that are below a subquery.
>  
> To resolve this, we add a hint field to SubqueryExpression that any hints 
> inside a subquery's plan can be pulled into during EliminateResolvedHint, and 
> then pass this hint on when the subquery is turned into a join.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40999) Hints on subqueries are not properly propagated

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628245#comment-17628245
 ] 

Apache Spark commented on SPARK-40999:
--

User 'fred-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/38497

> Hints on subqueries are not properly propagated
> ---
>
> Key: SPARK-40999
> URL: https://issues.apache.org/jira/browse/SPARK-40999
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1
>Reporter: Fredrik Klauß
>Priority: Major
>
> Currently, if a user specifies a query like the following, the hints on the 
> subquery are lost:
> {code:java}
> SELECT * FROM target t WHERE EXISTS
> (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code}
> This happens because hints are removed from the plan and pulled into joins at 
> the beginning of the optimization stage, while subqueries are only turned into 
> joins during optimization. Since we remove any hints that are not below a 
> join, we end up removing hints that are below a subquery.
>  
> To resolve this, we add a hint field to SubqueryExpression that any hints 
> inside a subquery's plan can be pulled into during EliminateResolvedHint, and 
> then pass this hint on when the subquery is turned into a join.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40819) Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of automatically converting to LongType

2022-11-03 Thread Nikhil Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628239#comment-17628239
 ] 

Nikhil Sharma commented on SPARK-40819:
---


> Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type 
> instead of automatically converting to LongType 
> 
>
> Key: SPARK-40819
> URL: https://issues.apache.org/jira/browse/SPARK-40819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1, 3.2.3, 3.3.2
>Reporter: Alfred Davidson
>Priority: Critical
>
> Since 3.2 parquet files containing attributes with type "INT64 
> (TIMESTAMP(NANOS, true))" are no longer readable and attempting to read 
> throws:
>  
> {code:java}
> Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: 
> INT64 (TIMESTAMP(NANOS,true))
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:174)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:90)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:72)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:66)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:548)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:548)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:528)
>   at scala.collection.immutable.Stream.map(Stream.scala:418)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:528)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:521)
>   at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76)
>  {code}
> Prior to 3.2, Spark read such parquet files successfully, automatically 
> converting the column to a LongType.
> I believe work that was part of https://issues.apache.org/jira/browse/SPARK-34661 
> introduced the change in behaviour, more specifically here: 
> [https://github.com/apache/spark/pull/31776/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R154]
>  which throws QueryCompilationErrors.illegalParquetTypeError.
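A minimal reproduction, sketched under assumptions: the path below is hypothetical, and the file must come from a writer such as parquet-mr or pyarrow that emits nanosecond timestamps, since Spark itself writes MICROS.

{code:scala}
// Hypothetical repro: the column below was written as INT64 (TIMESTAMP(NANOS,true)).
val df = spark.read.parquet("/tmp/nanos_timestamps.parquet")
df.printSchema()
// Spark 3.1.x and earlier: the read succeeds and the column is exposed as
// LongType holding the raw nanosecond values.
// Spark 3.2.0 and later: schema conversion fails up front with
//   org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
{code}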



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40708) Auto update table statistics based on write metrics

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628234#comment-17628234
 ] 

Apache Spark commented on SPARK-40708:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/38496

> Auto update table statistics based on write metrics
> ---
>
> Key: SPARK-40708
> URL: https://issues.apache.org/jira/browse/SPARK-40708
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> // Get write statistics
> def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): Option[WriteStats] = {
>   val numBytes = metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_))
>   val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_))
>   numBytes.map(WriteStats(mode, _, numRows))
> }
>
> // Update table statistics
> val stat = wroteStats.get
> stat.mode match {
>   case SaveMode.Overwrite | SaveMode.ErrorIfExists =>
>     catalog.alterTableStats(table.identifier,
>       Some(CatalogStatistics(stat.numBytes, stat.numRows)))
>   case _ if table.stats.nonEmpty => // SaveMode.Append
>     catalog.alterTableStats(table.identifier, None)
>   case _ => // SaveMode.Ignore: do nothing
> }
> {code}
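Note the asymmetry the snippet implies: Overwrite and ErrorIfExists writes can set exact table statistics from the write metrics, because the metrics describe the table's entire new contents, while an Append can only invalidate any existing statistics, since the metrics cover just the appended data and the old statistics no longer hold.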



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40708) Auto update table statistics based on write metrics

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628235#comment-17628235
 ] 

Apache Spark commented on SPARK-40708:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/38496

> Auto update table statistics based on write metrics
> ---
>
> Key: SPARK-40708
> URL: https://issues.apache.org/jira/browse/SPARK-40708
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> // Get write statistics
> def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): Option[WriteStats] = {
>   val numBytes = metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_))
>   val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_))
>   numBytes.map(WriteStats(mode, _, numRows))
> }
>
> // Update table statistics
> val stat = wroteStats.get
> stat.mode match {
>   case SaveMode.Overwrite | SaveMode.ErrorIfExists =>
>     catalog.alterTableStats(table.identifier,
>       Some(CatalogStatistics(stat.numBytes, stat.numRows)))
>   case _ if table.stats.nonEmpty => // SaveMode.Append
>     catalog.alterTableStats(table.identifier, None)
>   case _ => // SaveMode.Ignore: do nothing
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40708) Auto update table statistics based on write metrics

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40708:


Assignee: (was: Apache Spark)

> Auto update table statistics based on write metrics
> ---
>
> Key: SPARK-40708
> URL: https://issues.apache.org/jira/browse/SPARK-40708
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> // Get write statistics
> def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): Option[WriteStats] = {
>   val numBytes = metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_))
>   val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_))
>   numBytes.map(WriteStats(mode, _, numRows))
> }
>
> // Update table statistics
> val stat = wroteStats.get
> stat.mode match {
>   case SaveMode.Overwrite | SaveMode.ErrorIfExists =>
>     catalog.alterTableStats(table.identifier,
>       Some(CatalogStatistics(stat.numBytes, stat.numRows)))
>   case _ if table.stats.nonEmpty => // SaveMode.Append
>     catalog.alterTableStats(table.identifier, None)
>   case _ => // SaveMode.Ignore: do nothing
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40708) Auto update table statistics based on write metrics

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40708:


Assignee: Apache Spark

> Auto update table statistics based on write metrics
> ---
>
> Key: SPARK-40708
> URL: https://issues.apache.org/jira/browse/SPARK-40708
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> {code:scala}
> // Get write statistics
> def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): Option[WriteStats] = {
>   val numBytes = metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_))
>   val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_))
>   numBytes.map(WriteStats(mode, _, numRows))
> }
>
> // Update table statistics
> val stat = wroteStats.get
> stat.mode match {
>   case SaveMode.Overwrite | SaveMode.ErrorIfExists =>
>     catalog.alterTableStats(table.identifier,
>       Some(CatalogStatistics(stat.numBytes, stat.numRows)))
>   case _ if table.stats.nonEmpty => // SaveMode.Append
>     catalog.alterTableStats(table.identifier, None)
>   case _ => // SaveMode.Ignore: do nothing
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628225#comment-17628225
 ] 

Apache Spark commented on SPARK-35531:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/38495

> Can not insert into hive bucket table if create table with upper case schema
> 
>
> Key: SPARK-35531
> URL: https://issues.apache.org/jira/browse/SPARK-35531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.1, 3.2.0
>Reporter: Hongyi Zhang
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0, 3.1.4
>
>
> {code:sql}
> create table TEST1(
>   V1 BIGINT,
>   S1 INT)
> partitioned by (PK BIGINT)
> clustered by (V1)
> sorted by (S1)
> into 200 buckets
> STORED AS PARQUET;
>
> insert into test1
> select * from values(1,1,1);
> {code}
> The insert fails with:
> {code:java}
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not 
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), 
> FieldSchema(name:s1, type:int, comment:null)]
> {code}
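Until this is fixed, a hedged workaround sketch (the table name is an assumption): declaring the bucket and sort columns in lower case keeps them aligned with the lower-cased FieldSchema entries Hive stores, so the insert goes through.

{code:scala}
// Hypothetical workaround: lower-case column names match Hive's lower-cased schema.
spark.sql("""
  create table test1_lc(
    v1 bigint,
    s1 int)
  partitioned by (pk bigint)
  clustered by (v1)
  sorted by (s1)
  into 200 buckets
  stored as parquet
""")
spark.sql("insert into test1_lc select * from values(1, 1, 1)")
{code}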



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41004) Check error classes in InterceptorRegistrySuite

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628220#comment-17628220
 ] 

Apache Spark commented on SPARK-41004:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38494

> Check error classes in InterceptorRegistrySuite
> ---
>
> Key: SPARK-41004
> URL: https://issues.apache.org/jira/browse/SPARK-41004
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>
> - CONNECT.INTERCEPTOR_CTOR_MISSING
>  - CONNECT.INTERCEPTOR_RUNTIME_ERROR
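A hedged sketch of the kind of assertion this sub-task calls for, using the checkError helper from SparkFunSuite; the triggering call, interceptor class name, and parameter key are assumptions for illustration:

{code:scala}
// Illustrative only: assert on the error class and its parameters rather
// than on the rendered message text.
val e = intercept[SparkException] {
  SparkConnectInterceptorRegistry.createConfiguredInterceptors()  // assumed trigger
}
checkError(
  exception = e,
  errorClass = "CONNECT.INTERCEPTOR_CTOR_MISSING",
  parameters = Map("cls" -> "org.example.NoDefaultCtorInterceptor"))  // assumed key/value
{code}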



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41004) Check error classes in InterceptorRegistrySuite

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41004:


Assignee: Apache Spark

> Check error classes in InterceptorRegistrySuite
> ---
>
> Key: SPARK-41004
> URL: https://issues.apache.org/jira/browse/SPARK-41004
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>
> - CONNECT.INTERCEPTOR_CTOR_MISSING
>  - CONNECT.INTERCEPTOR_RUNTIME_ERROR



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41004) Check error classes in InterceptorRegistrySuite

2022-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41004:


Assignee: (was: Apache Spark)

> Check error classes in InterceptorRegistrySuite
> ---
>
> Key: SPARK-41004
> URL: https://issues.apache.org/jira/browse/SPARK-41004
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>
> - CONNECT.INTERCEPTOR_CTOR_MISSING
>  - CONNECT.INTERCEPTOR_RUNTIME_ERROR



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41004) Check error classes in InterceptorRegistrySuite

2022-11-03 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-41004:
---

 Summary: Check error classes in InterceptorRegistrySuite
 Key: SPARK-41004
 URL: https://issues.apache.org/jira/browse/SPARK-41004
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, Tests
Affects Versions: 3.4.0
Reporter: BingKun Pan


- CONNECT.INTERCEPTOR_CTOR_MISSING
 - CONNECT.INTERCEPTOR_RUNTIME_ERROR



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38270) SQL CLI AM should keep same exitcode with client

2022-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628179#comment-17628179
 ] 

Apache Spark commented on SPARK-38270:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/38492

> SQL CLI AM should keep same exitcode with client
> 
>
> Key: SPARK-38270
> URL: https://issues.apache.org/jira/browse/SPARK-38270
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently the SQL CLI always uses a shutdown hook to stop the SparkContext:
> {code:java}
> // Clean up after we exit
> ShutdownHookManager.addShutdownHook { () => SparkSQLEnv.stop() }
> {code}
> This causes the YARN AM to always report success, even when the client exits 
> with a non-zero code.
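A hedged sketch of the direction such a fix can take; the stop(exitCode) overload shown here is an assumption for illustration, not the merged change:

{code:scala}
// Record the CLI's exit status and hand it to the shutdown hook, so the AM
// can unregister as FAILED when the client failed.
@volatile var exitCode = 0
ShutdownHookManager.addShutdownHook { () =>
  SparkSQLEnv.stop(exitCode)  // hypothetical overload that forwards the code
}
// ... later, when the CLI finishes:
exitCode = ret   // `ret` = the status the driver is about to exit with
System.exit(exitCode)
{code}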



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40999) Hints on subqueries are not properly propagated

2022-11-03 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40999:

Fix Version/s: (was: 3.4.0)

> Hints on subqueries are not properly propagated
> ---
>
> Key: SPARK-40999
> URL: https://issues.apache.org/jira/browse/SPARK-40999
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1
>Reporter: Fredrik Klauß
>Priority: Major
>
> Currently, if a user specifies a query like the following, the hints on the 
> subquery are lost:
> {code:java}
> SELECT * FROM target t WHERE EXISTS
> (SELECT /*+ BROADCAST */ * FROM source s WHERE s.key = t.key){code}
> This happens because hints are removed from the plan and pulled into joins at 
> the beginning of the optimization stage, while subqueries are only turned into 
> joins during optimization. Since we remove any hints that are not below a 
> join, we end up removing hints that are below a subquery.
>  
> To resolve this, we add a hint field to SubqueryExpression that any hints 
> inside a subquery's plan can be pulled into during EliminateResolvedHint, and 
> then pass this hint on when the subquery is turned into a join.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org