[jira] [Updated] (SPARK-48212) Fully enable PandasUDFParityTests. test_udf_wrong_arg

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48212:
---
Labels: pull-request-available  (was: )

> Fully enable PandasUDFParityTests. test_udf_wrong_arg
> -
>
> Key: SPARK-48212
> URL: https://issues.apache.org/jira/browse/SPARK-48212
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48213) Do not push down filter if non-cheap expression exceed reused limit

2024-05-09 Thread Mingliang Zhu (Jira)
Mingliang Zhu created SPARK-48213:
-

 Summary: Do not push down filter if non-cheap expression exceed 
reused limit
 Key: SPARK-48213
 URL: https://issues.apache.org/jira/browse/SPARK-48213
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.1
Reporter: Mingliang Zhu









[jira] [Updated] (SPARK-48213) Do not push down predicate if non-cheap expression exceed reused limit

2024-05-09 Thread Mingliang Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingliang Zhu updated SPARK-48213:
--
Summary: Do not push down predicate if non-cheap expression exceed reused 
limit  (was: Do not push down filter if non-cheap expression exceed reused 
limit)

> Do not push down predicate if non-cheap expression exceed reused limit
> --
>
> Key: SPARK-48213
> URL: https://issues.apache.org/jira/browse/SPARK-48213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Mingliang Zhu
>Priority: Major
>







[jira] [Resolved] (SPARK-48186) Add support for AbstractMapType

2024-05-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48186.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46458
[https://github.com/apache/spark/pull/46458]

> Add support for AbstractMapType
> ---
>
> Key: SPARK-48186
> URL: https://issues.apache.org/jira/browse/SPARK-48186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48186) Add support for AbstractMapType

2024-05-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48186:
---

Assignee: Uroš Bojanić

> Add support for AbstractMapType
> ---
>
> Key: SPARK-48186
> URL: https://issues.apache.org/jira/browse/SPARK-48186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-48213) Do not push down predicate if non-cheap expression exceed reused limit

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48213:
---
Labels: pull-request-available  (was: )

> Do not push down predicate if non-cheap expression exceed reused limit
> --
>
> Key: SPARK-48213
> URL: https://issues.apache.org/jira/browse/SPARK-48213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Mingliang Zhu
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-47354) Variant expressions (all collations)

2024-05-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47354:
---

Assignee: Uroš Bojanić

> Variant expressions (all collations)
> 
>
> Key: SPARK-47354
> URL: https://issues.apache.org/jira/browse/SPARK-47354
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47354) Variant expressions (all collations)

2024-05-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47354.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46424
[https://github.com/apache/spark/pull/46424]

> Variant expressions (all collations)
> 
>
> Key: SPARK-47354
> URL: https://issues.apache.org/jira/browse/SPARK-47354
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48208) Skip reporting memory usage metrics if bounded memory usage is enabled

2024-05-09 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-48208:


Assignee: Anish Shrigondekar

> Skip reporting memory usage metrics if bounded memory usage is enabled
> --
>
> Key: SPARK-48208
> URL: https://issues.apache.org/jira/browse/SPARK-48208
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
>
> Skip reporting memory usage metrics if bounded memory usage is enabled
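For reference, a minimal sketch of turning on the RocksDB state store with bounded memory usage, the mode under which these metrics are skipped (the conf key names are assumptions based on the RocksDB state store documentation, so verify them against your Spark version):

{code:python}
# Sketch: enable the RocksDB state store with bounded (shared) memory usage.
# The "rocksdb.*" conf names below are assumptions from the docs, not verified.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.boundedMemoryUsage", "true")
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.maxMemoryUsageMB", "2048")
{code}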






[jira] [Resolved] (SPARK-48208) Skip reporting memory usage metrics if bounded memory usage is enabled

2024-05-09 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-48208.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46491
[https://github.com/apache/spark/pull/46491]

> Skip reporting memory usage metrics if bounded memory usage is enabled
> --
>
> Key: SPARK-48208
> URL: https://issues.apache.org/jira/browse/SPARK-48208
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Skip reporting memory usage metrics if bounded memory usage is enabled






[jira] [Assigned] (SPARK-47421) URL expressions (all collations)

2024-05-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47421:
---

Assignee: Uroš Bojanić

> URL expressions (all collations)
> 
>
> Key: SPARK-47421
> URL: https://issues.apache.org/jira/browse/SPARK-47421
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47421) URL expressions (all collations)

2024-05-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47421.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46460
[https://github.com/apache/spark/pull/46460]

> URL expressions (all collations)
> 
>
> Key: SPARK-47421
> URL: https://issues.apache.org/jira/browse/SPARK-47421
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-47986) [CONNECT][PYTHON] Unable to create a new session when the default session is closed by the server

2024-05-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47986.
--
Resolution: Fixed

Issue resolved by pull request 46435
[https://github.com/apache/spark/pull/46435]

> [CONNECT][PYTHON] Unable to create a new session when the default session is 
> closed by the server
> -
>
> Key: SPARK-47986
> URL: https://issues.apache.org/jira/browse/SPARK-47986
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Niranjan Jayakar
>Assignee: Niranjan Jayakar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When the server closes a session, usually after a cluster restart, the client 
> is unaware of this until it receives an error.
> Once it does so, there is no way for the client to create a new session since 
> the stale sessions are still recorded as default and active sessions.
> The only solution currently is to restart the Python interpreter on the 
> client, or to reach into the session builder and change the active or default 
> session.
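A minimal sketch of the recovery flow this ticket asks for, written against the Spark Connect Python client ({{sc://localhost}} is a placeholder address; the point is that {{builder.create()}} should yield a genuinely new session instead of the stale cached one):

{code:python}
from pyspark.sql import SparkSession

try:
    spark.sql("SELECT 1").collect()
except Exception:
    # The server closed the session; getOrCreate() would hand back the stale
    # default/active session, so ask the builder for a brand-new one instead.
    spark = SparkSession.builder.remote("sc://localhost").create()
{code}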






[jira] [Assigned] (SPARK-47365) Add toArrow() DataFrame method to PySpark

2024-05-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47365:


Assignee: Ian Cook

> Add toArrow() DataFrame method to PySpark
> -
>
> Key: SPARK-47365
> URL: https://issues.apache.org/jira/browse/SPARK-47365
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Input/Output, PySpark, SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Over in the Apache Arrow community, we hear from a lot of users who want to 
> return the contents of a PySpark DataFrame as a [PyArrow 
> Table|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html]. 
> Currently the only documented way to do this is:
> *PySpark DataFrame* --> *pandas DataFrame* --> *PyArrow Table*
> This adds significant overhead compared to going direct from PySpark 
> DataFrame to PyArrow Table. Since [PySpark already goes through PyArrow to 
> convert to 
> pandas|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html],
>  would it be possible to publicly expose a *toArrow()* method of the Spark 
> DataFrame class?
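A short sketch contrasting the two paths for a DataFrame {{df}} (the direct {{toArrow()}} call is the method proposed by this ticket, shown as proposed rather than as an existing API):

{code:python}
import pyarrow as pa

# Today's documented route: PySpark -> pandas -> PyArrow, which materializes
# an intermediate pandas DataFrame on the driver.
tbl_via_pandas = pa.Table.from_pandas(df.toPandas())

# The direct route this ticket proposes:
tbl = df.toArrow()
{code}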






[jira] [Resolved] (SPARK-47365) Add toArrow() DataFrame method to PySpark

2024-05-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47365.
--
Resolution: Fixed

Issue resolved by pull request 45481
[https://github.com/apache/spark/pull/45481]

> Add toArrow() DataFrame method to PySpark
> -
>
> Key: SPARK-47365
> URL: https://issues.apache.org/jira/browse/SPARK-47365
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Input/Output, PySpark, SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Over in the Apache Arrow community, we hear from a lot of users who want to 
> return the contents of a PySpark DataFrame as a [PyArrow 
> Table|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html]. 
> Currently the only documented way to do this is:
> *PySpark DataFrame* --> *pandas DataFrame* --> *PyArrow Table*
> This adds significant overhead compared to going direct from PySpark 
> DataFrame to PyArrow Table. Since [PySpark already goes through PyArrow to 
> convert to 
> pandas|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html],
>  would it be possible to publicly expose a *toArrow()* method of the Spark 
> DataFrame class?






[jira] [Resolved] (SPARK-48212) Fully enable PandasUDFParityTests. test_udf_wrong_arg

2024-05-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48212.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46498
[https://github.com/apache/spark/pull/46498]

> Fully enable PandasUDFParityTests. test_udf_wrong_arg
> -
>
> Key: SPARK-48212
> URL: https://issues.apache.org/jira/browse/SPARK-48212
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48175) Store collation information in metadata and not in type for SER/DE

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48175:
--

Assignee: Apache Spark

> Store collation information in metadata and not in type for SER/DE
> --
>
> Key: SPARK-48175
> URL: https://issues.apache.org/jira/browse/SPARK-48175
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Changing serialization and deserialization of collated strings so that the 
> collation information is put in the metadata of the enclosing struct field - 
> and then read back from there during parsing.
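As an illustration only, a sketch of what the proposed layout could look like from the Python side (the metadata key and the value format are assumptions, not the final SER/DE format):

{code:python}
from pyspark.sql.types import StringType, StructField, StructType

# Sketch: the field itself is serialized as a plain string type, while the
# collation travels in the enclosing StructField's metadata (key name assumed).
schema = StructType([
    StructField(
        "name",
        StringType(),
        metadata={"__COLLATIONS": {"name": "icu.UNICODE_CI"}},
    )
])
{code}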






[jira] [Assigned] (SPARK-48175) Store collation information in metadata and not in type for SER/DE

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48175:
--

Assignee: (was: Apache Spark)

> Store collation information in metadata and not in type for SER/DE
> --
>
> Key: SPARK-48175
> URL: https://issues.apache.org/jira/browse/SPARK-48175
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>
> Changing serialization and deserialization of collated strings so that the 
> collation information is put in the metadata of the enclosing struct field - 
> and then read back from there during parsing.






[jira] [Assigned] (SPARK-48213) Do not push down predicate if non-cheap expression exceed reused limit

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48213:
--

Assignee: (was: Apache Spark)

> Do not push down predicate if non-cheap expression exceed reused limit
> --
>
> Key: SPARK-48213
> URL: https://issues.apache.org/jira/browse/SPARK-48213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Mingliang Zhu
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48214) Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory`

2024-05-09 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48214:
---

 Summary: Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory`
 Key: SPARK-48214
 URL: https://issues.apache.org/jira/browse/SPARK-48214
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: BingKun Pan









[jira] [Assigned] (SPARK-48211) DB2: Read SMALLINT as ShortType

2024-05-09 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-48211:


Assignee: Kent Yao

> DB2: Read SMALLINT as ShortType
> ---
>
> Key: SPARK-48211
> URL: https://issues.apache.org/jira/browse/SPARK-48211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-48214) Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory`

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48214:
---
Labels: pull-request-available  (was: )

> Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory`
> -
>
> Key: SPARK-48214
> URL: https://issues.apache.org/jira/browse/SPARK-48214
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Critical
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48211) DB2: Read SMALLINT as ShortType

2024-05-09 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48211.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46497
[https://github.com/apache/spark/pull/46497]

> DB2: Read SMALLINT as ShortType
> ---
>
> Key: SPARK-48211
> URL: https://issues.apache.org/jira/browse/SPARK-48211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-47415) Levenshtein (all collations)

2024-05-09 Thread Uroš Bojanić (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47415:
-
Description: 
Enable collation support for the *Levenshtein* built-in string function in 
Spark. First confirm what the expected behaviour for this function is when 
given collated strings, and then move on to implementation and testing. 
Implement the corresponding unit tests and E2E SQL tests to reflect how this 
function should be used with collation in SparkSQL, and feel free to use your 
chosen Spark SQL Editor to experiment with the existing functions to learn more 
about how they work. In addition, look into the possible use-cases and 
implementation of similar functions within other open-source DBMSs, such 
as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *Levenshtein* function so it 
supports all collation types currently supported in Spark. To understand what 
changes were introduced in order to enable full collation support for other 
existing functions in Spark, take a look at the Spark PRs and Jira tickets for 
completed tasks in this parent (for example: Contains, StartsWith, EndsWith).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class. Also, refer to the Unicode Technical 
Standard for string [searching|https://www.unicode.org/reports/tr10/#Searching] 
and 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
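A sketch of the kind of E2E probe this ticket calls for, issued from Python (the COLLATE syntax follows Spark 4.0; whether *levenshtein* accepts every collation type or rejects some is exactly what needs to be confirmed first):

{code:python}
# Sketch: probe levenshtein's behaviour across collation types.
for collation in ["UTF8_BINARY", "UTF8_LCASE", "UNICODE"]:
    spark.sql(
        f"SELECT levenshtein('kitten' COLLATE {collation}, 'sitting') AS d"
    ).show()
{code}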

> Levenshtein (all collations)
> 
>
> Key: SPARK-47415
> URL: https://issues.apache.org/jira/browse/SPARK-47415
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *Levenshtein* built-in string function in 
> Spark. First confirm what the expected behaviour for this function is when 
> given collated strings, and then move on to implementation and testing. 
> Implement the corresponding unit tests and E2E SQL tests to reflect how this 
> function should be used with collation in SparkSQL, and feel free to use your 
> chosen Spark SQL Editor to experiment with the existing functions to learn 
> more about how they work. In addition, look into the possible use-cases and 
> implementation of similar functions within other open-source DBMSs, such 
> as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *Levenshtein* function so 
> it supports all collation types currently supported in Spark. To understand 
> what changes were introduced in order to enable full collation support for 
> other existing functions in Spark, take a look at the Spark PRs and Jira 
> tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].






[jira] [Commented] (SPARK-47415) Levenshtein (all collations)

2024-05-09 Thread Uroš Bojanić (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844937#comment-17844937
 ] 

Uroš Bojanić commented on SPARK-47415:
--

Update: [~nikolamand-db] has implemented this function in 
[https://github.com/apache/spark/pull/45963], so most of the implementation 
logic and tests should be there. However, we have recently done some 
refactoring in: https://issues.apache.org/jira/browse/SPARK-47410. Now we need 
to refactor Nikola's changes by following the guidelines outlined in that Jira 
ticket.

Nikola suggested this could be a good onboarding task for Nebojsa, so he could 
get familiar with part of the codebase.

> Levenshtein (all collations)
> 
>
> Key: SPARK-47415
> URL: https://issues.apache.org/jira/browse/SPARK-47415
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *Levenshtein* built-in string function in 
> Spark. First confirm what the expected behaviour for this function is when 
> given collated strings, and then move on to implementation and testing. 
> Implement the corresponding unit tests and E2E SQL tests to reflect how this 
> function should be used with collation in SparkSQL, and feel free to use your 
> chosen Spark SQL Editor to experiment with the existing functions to learn 
> more about how they work. In addition, look into the possible use-cases and 
> implementation of similar functions within other open-source DBMSs, such 
> as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *Levenshtein* function so 
> it supports all collation types currently supported in Spark. To understand 
> what changes were introduced in order to enable full collation support for 
> other existing functions in Spark, take a look at the Spark PRs and Jira 
> tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].






[jira] [Created] (SPARK-48215) DateFormatClass (all collations)

2024-05-09 Thread Uroš Bojanić (Jira)
Uroš Bojanić created SPARK-48215:


 Summary: DateFormatClass (all collations)
 Key: SPARK-48215
 URL: https://issues.apache.org/jira/browse/SPARK-48215
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić









[jira] [Updated] (SPARK-48215) DateFormatClass (all collations)

2024-05-09 Thread Uroš Bojanić (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48215:
-
Description: 
Enable collation support for the *DateFormatClass* built-in function in Spark. 
First confirm what the expected behaviour for this expression is when given 
collated strings, and then move on to implementation and testing. You will find 
this expression in the *datetimeExpressions.scala* file, and it should be 
considered a pass-through function with respect to collation awareness. 
Implement the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
reflect how this function should be used with collation in SparkSQL, and feel 
free to use your chosen Spark SQL Editor to experiment with the existing 
functions to learn more about how they work. In addition, look into the 
possible use-cases and implementation of similar functions within other 
open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *DateFormatClass* expression 
so that it supports all collation types currently supported in Spark. To 
understand what changes were introduced in order to enable full collation 
support for other existing functions in Spark, take a look at the Spark PRs and 
Jira tickets for completed tasks in this parent (for example: Ascii, Chr, 
Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, 
Sentences).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class. Also, refer to the Unicode Technical 
Standard for string 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
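A sketch of a pass-through check in the spirit of CollationSQLExpressionsSuite, issued from Python (the COLLATE syntax follows Spark 4.0; the pass-through expectation is this ticket's working assumption, not confirmed behaviour):

{code:python}
# Sketch: date_format should produce the same result regardless of the
# collation attached to its string arguments, since it is expected to be
# pass-through with respect to collation awareness.
spark.sql(
    "SELECT date_format(timestamp'2024-05-09 12:00:00', "
    "'yyyy-MM-dd' COLLATE UNICODE_CI) AS formatted"
).show()
{code}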

> DateFormatClass (all collations)
> 
>
> Key: SPARK-48215
> URL: https://issues.apache.org/jira/browse/SPARK-48215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *DateFormatClass* built-in function in 
> Spark. First confirm what the expected behaviour for this expression is when 
> given collated strings, and then move on to implementation and testing. You 
> will find this expression in the *datetimeExpressions.scala* file, and it 
> should be considered a pass-through function with respect to collation 
> awareness. Implement the corresponding E2E SQL tests 
> (CollationSQLExpressionsSuite) to reflect how this function should be used 
> with collation in SparkSQL, and feel free to use your chosen Spark SQL Editor 
> to experiment with the existing functions to learn more about how they work. 
> In addition, look into the possible use-cases and implementation of similar 
> functions within other open-source DBMSs, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *DateFormatClass* 
> expression so that it supports all collation types currently supported in 
> Spark. To understand what changes were introduced in order to enable full 
> collation support for other existing functions in Spark, take a look at the 
> Spark PRs and Jira tickets for completed tasks in this parent (for example: 
> Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, 
> FormatNumber, Sentences).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].






[jira] [Created] (SPARK-48216) Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable

2024-05-09 Thread Kent Yao (Jira)
Kent Yao created SPARK-48216:


 Summary: Remove overrides 
DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable
 Key: SPARK-48216
 URL: https://issues.apache.org/jira/browse/SPARK-48216
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Docker, Tests
Affects Versions: 4.0.0
Reporter: Kent Yao









[jira] [Created] (SPARK-48217) Spark stdout and stderr getting removed at end of spark job triggered from cloudera hue workflow

2024-05-09 Thread Noam Shemesh (Jira)
Noam Shemesh created SPARK-48217:


 Summary: Spark stdout and stderr getting removed at end of spark 
job triggered from cloudera hue workflow
 Key: SPARK-48217
 URL: https://issues.apache.org/jira/browse/SPARK-48217
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.6.0
Reporter: Noam Shemesh


Hello,

We are running a Spark job triggered from a Cloudera Hue workflow, and Spark 
prints stdout and stderr logs during execution as expected, e.g.:

!image-2024-05-09-12-55-55-477.png|width=638,height=332!

*But the stdout and stderr logs get cleaned up when the workflow finishes with 
status succeeded:*

!image-2024-05-09-12-57-12-144.png!

The following is the spark-submit command that the workflow triggers:

_/usr/bin/spark-submit \_
  _--master yarn-client \_
  _--driver-memory 4g \_
  _--executor-memory 16g \_
  _--executor-cores 4 \_
  _--class tst \_
  _--files `ls -m *.conf | tr -d '\n '` \_
  _--conf "spark.dynamicAllocation.maxExecutors=4" \_
  _--conf "spark.kryoserializer.buffer.max=1024" \_
  _tst.jar $*_

Is anyone familiar with this Spark job behavior, or can anyone advise on how 
to fix it?

Thanks in advance






[jira] [Updated] (SPARK-48216) Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48216:
---
Labels: pull-request-available  (was: )

> Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related 
> tests configurable
> 
>
> Key: SPARK-48216
> URL: https://issues.apache.org/jira/browse/SPARK-48216
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker, Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-48217) Spark stdout and stderr getting removed at end of spark job triggered from cloudera hue workflow

2024-05-09 Thread Noam Shemesh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noam Shemesh updated SPARK-48217:
-
Attachment: workflow_running_logs_printed.png
workflow_succeeded_logs_cleaned.png

> Spark stdout and stderr getting removed at end of spark job triggered from 
> cloudera hue workflow
> 
>
> Key: SPARK-48217
> URL: https://issues.apache.org/jira/browse/SPARK-48217
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0
>Reporter: Noam Shemesh
>Priority: Major
> Attachments: workflow_running_logs_printed.png, 
> workflow_succeeded_logs_cleaned.png
>
>
> Hello,
> We are running a Spark job triggered from a Cloudera Hue workflow, and Spark 
> prints stdout and stderr logs during execution as expected, e.g.:
> !image-2024-05-09-12-55-55-477.png|width=638,height=332!
> *But the stdout and stderr logs get cleaned up when the workflow finishes 
> with status succeeded:*
> !image-2024-05-09-12-57-12-144.png!
>  
> The following is the spark-submit command that the workflow triggers:
> _/usr/bin/spark-submit \_
>   _--master yarn-client \_
>   _--driver-memory 4g \_
>   _--executor-memory 16g \_
>   _--executor-cores 4 \_
>   _--class tst \_
>   _--files `ls -m *.conf | tr -d '\n '` \_
>   _--conf "spark.dynamicAllocation.maxExecutors=4" \_
>   _--conf "spark.kryoserializer.buffer.max=1024" \_
>   _tst.jar $*_
>  
> Is anyone familiar with this Spark job behavior, or can anyone advise on how 
> to fix it?
>  
> Thanks in advance






[jira] [Updated] (SPARK-48217) Spark stdout and stderr getting removed at end of spark job triggered from cloudera hue workflow

2024-05-09 Thread Noam Shemesh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noam Shemesh updated SPARK-48217:
-
Description: 
Hello,

We are running a Spark job triggered from a Cloudera Hue workflow, and Spark 
prints stdout and stderr logs during execution as expected:

e.g. -  !workflow_running_logs_printed.png!

*But the stdout and stderr logs get cleaned up when the workflow finishes with 
status succeeded:*

The following is the spark-submit command that the workflow triggers:

_/usr/bin/spark-submit _
  _--master yarn-client _
  _--driver-memory 4g _
  _--executor-memory 16g _
  _--executor-cores 4 _
  _--class tst _
  _--files `ls -m *.conf | tr -d '\n '` _
  _--conf "spark.dynamicAllocation.maxExecutors=4" _
  _--conf "spark.kryoserializer.buffer.max=1024" _
  _tst.jar $*_

Is anyone familiar with this Spark job behavior, or can anyone advise on how 
to fix it?

Thanks in advance

  was:
Hello,

We are running a Spark job triggered from a Cloudera Hue workflow, and Spark 
prints stdout and stderr logs during execution as expected, e.g.:

!image-2024-05-09-12-55-55-477.png|width=638,height=332!

*But the stdout and stderr logs get cleaned up when the workflow finishes with 
status succeeded:*

!image-2024-05-09-12-57-12-144.png!

The following is the spark-submit command that the workflow triggers:

_/usr/bin/spark-submit \_
  _--master yarn-client \_
  _--driver-memory 4g \_
  _--executor-memory 16g \_
  _--executor-cores 4 \_
  _--class tst \_
  _--files `ls -m *.conf | tr -d '\n '` \_
  _--conf "spark.dynamicAllocation.maxExecutors=4" \_
  _--conf "spark.kryoserializer.buffer.max=1024" \_
  _tst.jar $*_

Is anyone familiar with this Spark job behavior, or can anyone advise on how 
to fix it?

Thanks in advance


> Spark stdout and stderr getting removed at end of spark job triggered from 
> cloudera hue workflow
> 
>
> Key: SPARK-48217
> URL: https://issues.apache.org/jira/browse/SPARK-48217
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0
>Reporter: Noam Shemesh
>Priority: Major
> Attachments: workflow_running_logs_printed.png, 
> workflow_succeeded_logs_cleaned.png
>
>
> Hello,
> We are running a Spark job triggered from a Cloudera Hue workflow, and Spark 
> prints stdout and stderr logs during execution as expected:
> e.g. -  !workflow_running_logs_printed.png!
>  
> *But the stdout and stderr logs get cleaned up when the workflow finishes 
> with status succeeded:*
>  
> The following is the spark-submit command that the workflow triggers:
> _/usr/bin/spark-submit _
>   _--master yarn-client _
>   _--driver-memory 4g _
>   _--executor-memory 16g _
>   _--executor-cores 4 _
>   _--class tst _
>   _--files `ls -m *.conf | tr -d '\n '` _
>   _--conf "spark.dynamicAllocation.maxExecutors=4" _
>   _--conf "spark.kryoserializer.buffer.max=1024" _
>   _tst.jar $*_
>  
> Is anyone familiar with this Spark job behavior, or can anyone advise on how 
> to fix it?
>  
> Thanks in advance






[jira] [Updated] (SPARK-48217) Spark stdout and stderr getting removed at end of spark job triggered from cloudera hue workflow

2024-05-09 Thread Noam Shemesh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noam Shemesh updated SPARK-48217:
-
Description: 
Hello,

We are running a Spark job triggered from a Cloudera Hue workflow, and Spark 
prints stdout and stderr logs during execution as expected:

e.g. -  !workflow_running_logs_printed.png!

*But the stdout and stderr logs get cleaned up when the workflow finishes with 
status succeeded:*

!workflow_succeeded_logs_cleaned.png!

The following is the spark-submit command that the workflow triggers:

_/usr/bin/spark-submit _
  _--master yarn-client _
  _--driver-memory 4g _
  _--executor-memory 16g _
  _--executor-cores 4 _
  _--class tst _
  _--files `ls -m *.conf | tr -d '\n '` _
  _--conf "spark.dynamicAllocation.maxExecutors=4" _
  _--conf "spark.kryoserializer.buffer.max=1024" _
  _tst.jar $*_

Is anyone familiar with this Spark job behavior, or can anyone advise on how 
to fix it?

Thanks in advance

  was:
Hello,

We are running a Spark job triggered from a Cloudera Hue workflow, and Spark 
prints stdout and stderr logs during execution as expected:

e.g. -  !workflow_running_logs_printed.png!

*But the stdout and stderr logs get cleaned up when the workflow finishes with 
status succeeded:*

The following is the spark-submit command that the workflow triggers:

_/usr/bin/spark-submit _
  _--master yarn-client _
  _--driver-memory 4g _
  _--executor-memory 16g _
  _--executor-cores 4 _
  _--class tst _
  _--files `ls -m *.conf | tr -d '\n '` _
  _--conf "spark.dynamicAllocation.maxExecutors=4" _
  _--conf "spark.kryoserializer.buffer.max=1024" _
  _tst.jar $*_

Is anyone familiar with this Spark job behavior, or can anyone advise on how 
to fix it?

Thanks in advance


> Spark stdout and stderr getting removed at end of spark job triggered from 
> cloudera hue workflow
> 
>
> Key: SPARK-48217
> URL: https://issues.apache.org/jira/browse/SPARK-48217
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0
>Reporter: Noam Shemesh
>Priority: Major
> Attachments: workflow_running_logs_printed.png, 
> workflow_succeeded_logs_cleaned.png
>
>
> Hello,
> We are running a Spark job triggered from a Cloudera Hue workflow, and Spark 
> prints stdout and stderr logs during execution as expected:
> e.g. -  !workflow_running_logs_printed.png!
>  
> *But the stdout and stderr logs get cleaned up when the workflow finishes 
> with status succeeded:*
> !workflow_succeeded_logs_cleaned.png!
>  
> The following is the spark-submit command that the workflow triggers:
> _/usr/bin/spark-submit _
>   _--master yarn-client _
>   _--driver-memory 4g _
>   _--executor-memory 16g _
>   _--executor-cores 4 _
>   _--class tst _
>   _--files `ls -m *.conf | tr -d '\n '` _
>   _--conf "spark.dynamicAllocation.maxExecutors=4" _
>   _--conf "spark.kryoserializer.buffer.max=1024" _
>   _tst.jar $*_
>  
> Is anyone familiar with this Spark job behavior, or can anyone advise on how 
> to fix it?
>  
> Thanks in advance






[jira] [Created] (SPARK-48218) TransportClientFactory.createClient may NPE cause FetchFailedException

2024-05-09 Thread dzcxzl (Jira)
dzcxzl created SPARK-48218:
--

 Summary: TransportClientFactory.createClient may NPE cause 
FetchFailedException
 Key: SPARK-48218
 URL: https://issues.apache.org/jira/browse/SPARK-48218
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 4.0.0
Reporter: dzcxzl




{code:java}
org.apache.spark.shuffle.FetchFailedException
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1180)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:913)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)

Caused by: java.lang.NullPointerException
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:178)
at 
org.apache.spark.network.shuffle.ExternalBlockStoreClient.lambda$fetchBlocks$0(ExternalBlockStoreClient.java:128)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154)
at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.start(RetryingBlockTransferor.java:133)
at 
org.apache.spark.network.shuffle.ExternalBlockStoreClient.fetchBlocks(ExternalBlockStoreClient.java:139)
{code}







[jira] [Updated] (SPARK-48218) TransportClientFactory.createClient may NPE cause FetchFailedException

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48218:
---
Labels: pull-request-available  (was: )

> TransportClientFactory.createClient may NPE cause FetchFailedException
> --
>
> Key: SPARK-48218
> URL: https://issues.apache.org/jira/browse/SPARK-48218
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 4.0.0
>Reporter: dzcxzl
>Priority: Minor
>  Labels: pull-request-available
>
> {code:java}
> org.apache.spark.shuffle.FetchFailedException
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1180)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:913)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:178)
>   at 
> org.apache.spark.network.shuffle.ExternalBlockStoreClient.lambda$fetchBlocks$0(ExternalBlockStoreClient.java:128)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.start(RetryingBlockTransferor.java:133)
>   at 
> org.apache.spark.network.shuffle.ExternalBlockStoreClient.fetchBlocks(ExternalBlockStoreClient.java:139)
> {code}






[jira] [Commented] (SPARK-47415) Levenshtein (all collations)

2024-05-09 Thread Nebojsa Savic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844962#comment-17844962
 ] 

Nebojsa Savic commented on SPARK-47415:
---

Taking over this work.

> Levenshtein (all collations)
> 
>
> Key: SPARK-47415
> URL: https://issues.apache.org/jira/browse/SPARK-47415
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *Levenshtein* built-in string function in 
> Spark. First confirm what is the expected behaviour for this function when 
> given collated strings, and then move on to implementation and testing. 
> Implement the corresponding unit tests and E2E sql tests to reflect how this 
> function should be used with collation in SparkSQL, and feel free to use your 
> chosen Spark SQL Editor to experiment with the existing functions to learn 
> more about how they work. In addition, look into the possible use-cases and 
> implementation of similar functions within other other open-source DBMS, such 
> as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *Levenshtein* function so 
> it supports all collation types currently supported in Spark. To understand 
> what changes were introduced in order to enable full collation support for 
> other existing functions in Spark, take a look at the Spark PRs and Jira 
> tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].






[jira] [Commented] (SPARK-47008) Spark to support S3 Express One Zone Storage

2024-05-09 Thread Leo Timofeyev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844968#comment-17844968
 ] 

Leo Timofeyev commented on SPARK-47008:
---

Hey [~ste...@apache.org] 

What do you think about something like this variant?
{code:java}
import java.io.{FileNotFoundException, IOException}

import scala.util.{Failure, Success, Try}

import org.apache.hadoop.fs.{FileStatus, FileSystem}

import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.util.ArrayImplicits._

def listLeafStatuses(fs: FileSystem, baseStatus: FileStatus): Seq[FileStatus] = {
  def recurse(status: FileStatus): Seq[FileStatus] = {
    // Probe whether this store may report directories that turn out not to
    // exist when listed (e.g. S3 Express One Zone prefixes from pending uploads).
    val fsHasPathCapability = try {
      fs.hasPathCapability(status.getPath,
        SparkHadoopUtil.DIRECTORY_LISTING_INCONSISTENT)
    } catch {
      case _: IOException => false
    }
    val statusResult = Try {
      fs.listStatus(status.getPath)
    }
    statusResult match {
      case Failure(e) =>
        // Ignore a missing directory only on stores with inconsistent
        // listings; otherwise rethrow, preserving the existing behaviour.
        if (e.isInstanceOf[FileNotFoundException] && fsHasPathCapability) {
          Seq.empty[FileStatus]
        } else throw e
      case Success(sr) =>
        val (directories, leaves) = sr.partition(_.isDirectory)
        (leaves ++ directories.flatMap(f => listLeafStatuses(fs, f))).toImmutableArraySeq
    }
  }

  if (baseStatus.isDirectory) recurse(baseStatus) else Seq(baseStatus)
}
{code}

> Spark to support S3 Express One Zone Storage
> 
>
> Key: SPARK-47008
> URL: https://issues.apache.org/jira/browse/SPARK-47008
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Steve Loughran
>Priority: Major
>
> Hadoop 3.4.0 adds support for AWS S3 Express One Zone Storage.
> Most of this is transparent. However, one aspect which can surface as an 
> issue is that these stores report prefixes in a listing when there are 
> pending uploads, *even when there are no files underneath*
> This leads to a situation where a listStatus of a path returns a list of file 
> status entries which appears to contain one or more directories, but a 
> listStatus on that path raises a FileNotFoundException: there is nothing 
> there.
> HADOOP-18996 handles this in all of the hadoop code, including FileInputFormat. 
> A filesystem can now be probed for inconsistent directory listings through 
> {{fs.hasPathCapability(path, "fs.capability.directory.listing.inconsistent")}}
> If true, then treewalking code SHOULD NOT report a failure if, when walking 
> into a subdirectory, a list/getFileStatus on that directory raises a 
> FileNotFoundException.
> Although most of this is handled in the hadoop code, there are some places 
> where treewalking is done inside Spark. These need to be identified and made 
> resilient to failure on the recurse down the tree:
> * SparkHadoopUtil list methods, 
> * especially listLeafStatuses used by OrcFileOperator
> * org.apache.spark.util.Utils#fetchHcfsFile
> {{org.apache.hadoop.fs.FileUtil.maybeIgnoreMissingDirectory()}} can assist 
> here, or the logic can be replicated. Using the hadoop implementation would 
> be better from a maintenance perspective.






[jira] [Created] (SPARK-48219) StreamReader Charset fix with UTF8

2024-05-09 Thread xy (Jira)
xy created SPARK-48219:
--

 Summary: StreamReader Charset fix with UTF8
 Key: SPARK-48219
 URL: https://issues.apache.org/jira/browse/SPARK-48219
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.3
Reporter: xy
 Fix For: 4.0.0









[jira] [Updated] (SPARK-48219) StreamReader Charset fix with UTF8

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48219:
---
Labels: pull-request-available  (was: )

> StreamReader Charset fix with UTF8
> --
>
> Key: SPARK-48219
> URL: https://issues.apache.org/jira/browse/SPARK-48219
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.3
>Reporter: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Created] (SPARK-48220) Allow passing PyArrow Table to createDataFrame()

2024-05-09 Thread Ian Cook (Jira)
Ian Cook created SPARK-48220:


 Summary: Allow passing PyArrow Table to createDataFrame()
 Key: SPARK-48220
 URL: https://issues.apache.org/jira/browse/SPARK-48220
 Project: Spark
  Issue Type: Improvement
  Components: Connect, Input/Output, PySpark, SQL
Affects Versions: 3.5.1, 4.0.0
Reporter: Ian Cook


SPARK-47365 added support for returning a Spark DataFrame as a PyArrow Table.

It would be nice if we could also go in the opposite direction, enabling users 
to create a Spark DataFrame from a PyArrow Table by passing the PyArrow table 
to {{spark.createDataFrame()}}.
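A sketch of the proposed usage (this {{createDataFrame()}} overload is what the ticket asks for, so it does not exist at filing time):

{code:python}
import pyarrow as pa

tbl = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Proposed: accept a PyArrow Table directly, mirroring toArrow() from
# SPARK-47365 in the opposite direction.
df = spark.createDataFrame(tbl)
{code}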






[jira] [Updated] (SPARK-47466) Add PySpark DataFrame method to return iterator of PyArrow RecordBatches

2024-05-09 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-47466:
-
Description: 
As a follow-up to SPARK-47365:

{{toArrow()}} is useful when the data is relatively small. For larger data, the 
best way to return the contents of a PySpark DataFrame in Arrow format is to 
return an iterator of [PyArrow 
RecordBatches|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html].

  was:
As a follow-up to SPARK-47365:

*toArrow()* is useful when the data is relatively small. For larger data, the 
best way to return the contents of a PySpark DataFrame in Arrow format is to 
return an iterator of [PyArrow 
RecordBatches|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html].


> Add PySpark DataFrame method to return iterator of PyArrow RecordBatches
> 
>
> Key: SPARK-47466
> URL: https://issues.apache.org/jira/browse/SPARK-47466
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.1
>Reporter: Ian Cook
>Priority: Major
>
> As a follow-up to SPARK-47365:
> {{toArrow()}} is useful when the data is relatively small. For larger data, 
> the best way to return the contents of a PySpark DataFrame in Arrow format is 
> to return an iterator of [PyArrow 
> RecordBatches|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html].
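
For context, a minimal sketch of the difference, using only APIs that exist today ({{toArrow()}} from SPARK-47365); the proposed method itself is not shown because its name is not yet fixed:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# Existing API (SPARK-47365): the whole result is collected at once.
tbl = df.toArrow()              # returns a pyarrow.Table
for batch in tbl.to_batches():
    pass  # batches, but only after the full table was materialized

# Proposed: a method yielding pyarrow.RecordBatch objects incrementally,
# so the driver never needs to hold the complete table in memory.
{code}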



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48220) Allow passing PyArrow Table to createDataFrame()

2024-05-09 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-48220:
-
Description: 
SPARK-47365 added support for returning a Spark DataFrame as a PyArrow Table.

It would be nice if we could also go in the opposite direction, enabling users 
to create a Spark DataFrame from a PyArrow Table by passing the PyArrow Table 
to {{spark.createDataFrame()}}.

  was:
SPARK-47365 added support for returning a Spark DataFrame as a PyArrow Table.

It would be nice if we could also go in the opposite direction, enabling users 
to create a Spark DataFrame from a PyArrow Table by passing the PyArrow table 
to {{spark.createDataFrame()}}.


> Allow passing PyArrow Table to createDataFrame()
> 
>
> Key: SPARK-48220
> URL: https://issues.apache.org/jira/browse/SPARK-48220
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Input/Output, PySpark, SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Ian Cook
>Priority: Major
>
> SPARK-47365 added support for returning a Spark DataFrame as a PyArrow Table.
> It would be nice if we could also go in the opposite direction, enabling 
> users to create a Spark DataFrame from a PyArrow Table by passing the PyArrow 
> Table to {{spark.createDataFrame()}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47465) Remove experimental tag from toArrow() PySpark DataFrame method

2024-05-09 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-47465:
-
Fix Version/s: 4.0.0

> Remove experimental tag from toArrow() PySpark DataFrame method
> ---
>
> Key: SPARK-47465
> URL: https://issues.apache.org/jira/browse/SPARK-47465
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.1
>Reporter: Ian Cook
>Priority: Major
> Fix For: 4.0.0
>
>
> As a follow-up to SPARK-47365:
> What is needed to consider making the *toArrow()* PySpark DataFrame method 
> non-experimental?
> What can the Apache Arrow developers do to help with this?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34679) inferTimestamp option is missing from the list of options in DataFrameReader.json.

2024-05-09 Thread sadha chilukoori (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844995#comment-17844995
 ] 

sadha chilukoori commented on SPARK-34679:
--

[~gurwls223], [~P7hB] I'm interested in contributing to documentation. Is this 
issue resolved, or could I work on it?

Also, if you have any starter tasks, please let me know.

Thank you.

> inferTimestamp option is missing from the list of options in 
> DataFrameReader.json.
> --
>
> Key: SPARK-34679
> URL: https://issues.apache.org/jira/browse/SPARK-34679
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Prashanth Babu
>Priority: Minor
>  Labels: documentation, easyfix, newbie, starter
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The {color:#e01e5a}inferTimestamp{color} option is missing from the list of 
> options documented for the {color:#e01e5a}DataFrameReader.json{color} method: 
> it is absent from the 
> [Scaladocs|https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L432-L520] 
> and similarly from the [PySpark 
> docs|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameReader.json.html?highlight=json#pyspark.sql.DataFrameReader.json] 
> as well.
> However, we have this blurb in the [Spark 3.0 to 3.0.1 migration 
> guide|https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-301]:
> * In Spark 3.0, JSON datasource and JSON function {{schema_of_json}} infer 
> TimestampType from string values if they match to the pattern defined by the 
> JSON option {{timestampFormat}}. Since version 3.0.1, the timestamp type 
> inference is disabled by default. Set the JSON option {{inferTimestamp}} to 
> {{true}} to enable such type inference.
> We should add this to the documentation as well, since data engineers might 
> not be aware of this option.
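
A minimal usage sketch of the option described above (the input path is illustrative):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Timestamp inference is off by default since 3.0.1; opt in explicitly.
df = (spark.read
      .option("inferTimestamp", "true")
      .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
      .json("/tmp/events.json"))  # hypothetical path
{code}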



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48187) Run `docs` only in PR builders and `build_non_ansi` Daily CI

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48187:
--
Fix Version/s: (was: 4.0.0)

> Run `docs` only in PR builders and `build_non_ansi` Daily CI
> 
>
> Key: SPARK-48187
> URL: https://issues.apache.org/jira/browse/SPARK-48187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48187) Run `docs` only in PR builders and `build_non_ansi` Daily CI

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48187:
--
Fix Version/s: 4.0.0

> Run `docs` only in PR builders and `build_non_ansi` Daily CI
> 
>
> Key: SPARK-48187
> URL: https://issues.apache.org/jira/browse/SPARK-48187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Fix Version/s: (was: 4.0.0)

> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Umbrella
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 23.56.05.png
>
>
> h2. ASF INFRA POLICY
> - https://infra.apache.org/github-actions-policy.html
> h2. MONITORING
> - https://infra-reports.apache.org/#ghactions&project=spark&hours=168
>  !Screenshot 2024-05-02 at 23.56.05.png|width=100%! 
> h2. TARGET
> * All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> * All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> * The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> * The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> h2. DEADLINE
> bq. 17th of May, 2024
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Priority: Major  (was: Blocker)

> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Umbrella
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Screenshot 2024-05-02 at 23.56.05.png
>
>
> h2. ASF INFRA POLICY
> - https://infra.apache.org/github-actions-policy.html
> h2. MONITORING
> - https://infra-reports.apache.org/#ghactions&project=spark&hours=168
>  !Screenshot 2024-05-02 at 23.56.05.png|width=100%! 
> h2. TARGET
> * All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> * All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> * The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> * The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> h2. DEADLINE
> bq. 17th of May, 2024
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Description: 
h2. ASF INFRA POLICY
 - [https://infra.apache.org/github-actions-policy.html]

h2. MONITORING
 - [https://infra-reports.apache.org/#ghactions&project=spark&hours=168]

!Screenshot 2024-05-02 at 23.56.05.png|width=100!
h2. TARGET
 * All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
 * All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
 * The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours).
 * The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, 
or 3,600 hours).

h2. DEADLINE
{quote}17th of May, 2024
{quote}

  was:
h2. ASF INFRA POLICY
- https://infra.apache.org/github-actions-policy.html

h2. MONITORING
- https://infra-reports.apache.org/#ghactions&project=spark&hours=168

 !Screenshot 2024-05-02 at 23.56.05.png|width=100%! 

h2. TARGET
* All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
* All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
* The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours).
* The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, 
or 3,600 hours).

h2. DEADLINE
bq. 17th of May, 2024

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.




> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Umbrella
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Screenshot 2024-05-02 at 23.56.05.png
>
>
> h2. ASF INFRA POLICY
>  - [https://infra.apache.org/github-actions-policy.html]
> h2. MONITORING
>  - [https://infra-reports.apache.org/#ghactions&project=spark&hours=168]
> !Screenshot 2024-05-02 at 23.56.05.png|width=100!
> h2. TARGET
>  * All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
>  * All workflows SHOULD have a job concurrency level less than or equal to 
> 15. Just because 20 is the max, doesn't mean you should strive for 20.
>  * The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
>  * The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> h2. DEADLINE
> {quote}17th of May, 2024
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47409) StringTrim & StringTrimLeft/Right/Both (binary & lowercase collation only)

2024-05-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47409.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46206
[https://github.com/apache/spark/pull/46206]

> StringTrim & StringTrimLeft/Right/Both (binary & lowercase collation only)
> --
>
> Key: SPARK-47409
> URL: https://issues.apache.org/jira/browse/SPARK-47409
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Enable collation support for the *StringTrim* built-in string function in 
> Spark (including {*}StringTrimBoth{*}, {*}StringTrimLeft{*}, 
> {*}StringTrimRight{*}). First confirm what the expected behaviour is for 
> these functions when given collated strings, and then move on to 
> implementation and testing. One way to go about this is to consider using 
> {_}StringSearch{_}, an efficient ICU service for string matching. Implement 
> the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests 
> (CollationSuite) to reflect how this function should be used with collation 
> in Spark SQL, and feel free to use your chosen Spark SQL editor to experiment 
> with the existing functions to learn more about how they work. In addition, 
> look into the possible use-cases and implementations of similar functions 
> within other open-source DBMSs, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringTrim* function so it 
> supports binary & lowercase collation types currently supported in Spark. To 
> understand what changes were introduced in order to enable full collation 
> support for other existing functions in Spark, take a look at the Spark PRs 
> and Jira tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
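
As an illustrative sketch of the expected semantics (the exact behaviour is what this ticket is meant to confirm, so treat the result shown as an assumption):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Under a lowercase collation, trimming should ignore case (assumed semantics):
spark.sql(
    "SELECT TRIM(BOTH 'x' FROM ('XXhelloxx' COLLATE UTF8_BINARY_LCASE))"
).show()
# Expected: both 'X' and 'x' are stripped, leaving 'hello'
{code}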



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47409) StringTrim & StringTrimLeft/Right/Both (binary & lowercase collation only)

2024-05-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47409:
---

Assignee: David Milicevic

> StringTrim & StringTrimLeft/Right/Both (binary & lowercase collation only)
> --
>
> Key: SPARK-47409
> URL: https://issues.apache.org/jira/browse/SPARK-47409
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *StringTrim* built-in string function in 
> Spark (including {*}StringTrimBoth{*}, {*}StringTrimLeft{*}, 
> {*}StringTrimRight{*}). First confirm what the expected behaviour is for 
> these functions when given collated strings, and then move on to 
> implementation and testing. One way to go about this is to consider using 
> {_}StringSearch{_}, an efficient ICU service for string matching. Implement 
> the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests 
> (CollationSuite) to reflect how this function should be used with collation 
> in Spark SQL, and feel free to use your chosen Spark SQL editor to experiment 
> with the existing functions to learn more about how they work. In addition, 
> look into the possible use-cases and implementations of similar functions 
> within other open-source DBMSs, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringTrim* function so it 
> supports binary & lowercase collation types currently supported in Spark. To 
> understand what changes were introduced in order to enable full collation 
> support for other existing functions in Spark, take a look at the Spark PRs 
> and Jira tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48176) Fix name of FIELD_ALREADY_EXISTS error condition

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48176:
---
Labels: pull-request-available  (was: )

> Fix name of FIELD_ALREADY_EXISTS error condition
> 
>
> Key: SPARK-48176
> URL: https://issues.apache.org/jira/browse/SPARK-48176
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48221) Alter string search logic for UTF8_BINARY_LCASE collation

2024-05-09 Thread Jira
Uroš Bojanić created SPARK-48221:


 Summary: Alter string search logic for UTF8_BINARY_LCASE collation
 Key: SPARK-48221
 URL: https://issues.apache.org/jira/browse/SPARK-48221
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48221) Alter string search logic for UTF8_BINARY_LCASE collation

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48221:
---
Labels: pull-request-available  (was: )

> Alter string search logic for UTF8_BINARY_LCASE collation
> -
>
> Key: SPARK-48221
> URL: https://issues.apache.org/jira/browse/SPARK-48221
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48216) Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48216:
-

Assignee: Kent Yao

> Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related 
> tests configurable
> 
>
> Key: SPARK-48216
> URL: https://issues.apache.org/jira/browse/SPARK-48216
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker, Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48216) Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48216.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46505
[https://github.com/apache/spark/pull/46505]

> Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related 
> tests configurable
> 
>
> Key: SPARK-48216
> URL: https://issues.apache.org/jira/browse/SPARK-48216
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker, Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48222) Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

2024-05-09 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-48222:


 Summary: Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
 Key: SPARK-48222
 URL: https://issues.apache.org/jira/browse/SPARK-48222
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48222) Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

2024-05-09 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-48222:
-
Component/s: Documentation

> Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
> -
>
> Key: SPARK-48222
> URL: https://issues.apache.org/jira/browse/SPARK-48222
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48222) Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48222:
---
Labels: pull-request-available  (was: )

> Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
> -
>
> Key: SPARK-48222
> URL: https://issues.apache.org/jira/browse/SPARK-48222
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47008) Spark to support S3 Express One Zone Storage

2024-05-09 Thread Leo Timofeyev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844968#comment-17844968
 ] 

Leo Timofeyev edited comment on SPARK-47008 at 5/9/24 5:46 PM:
---

Hey [~ste...@apache.org] 

What do you think about something like this?
{code:java}
def listLeafStatuses(fs: FileSystem, baseStatus: FileStatus): Seq[FileStatus] = 
{
  def recurse(status: FileStatus): Seq[FileStatus] = {
val fsHasPathCapability = try {
  fs.hasPathCapability(status.getPath, 
SparkHadoopUtil.DIRECTORY_LISTING_INCONSISTENT)
} catch {
  case _: IOException => false
}
val statusResult = Try {
  fs.listStatus(status.getPath)
}
statusResult match {
  case Failure(e) =>
if (e.isInstanceOf[FileNotFoundException] && fsHasPathCapability) {
  Seq.empty[FileStatus]
}
else throw e
  case Success(sr) =>
val (directories, leaves) = sr.partition(_.isDirectory)
(leaves ++ directories.flatMap(f => listLeafStatuses(fs, 
f))).toImmutableArraySeq
}
  } {code}


was (Author: JIRAUSER303957):
Hey [~ste...@apache.org] 

What do you think about something like this variant?
{code:java}
def listLeafStatuses(fs: FileSystem, baseStatus: FileStatus): Seq[FileStatus] = 
{
  def recurse(status: FileStatus): Seq[FileStatus] = {
val fsHasPathCapability = try {
  fs.hasPathCapability(status.getPath, 
SparkHadoopUtil.DIRECTORY_LISTING_INCONSISTENT)
} catch {
  case _: IOException => false
}
val statusResult = Try {
  fs.listStatus(status.getPath)
}
statusResult match {
  case Failure(e) =>
if (e.isInstanceOf[FileNotFoundException] && fsHasPathCapability) {
  Seq.empty[FileStatus]
}
else throw e
  case Success(sr) =>
val (directories, leaves) = sr.partition(_.isDirectory)
(leaves ++ directories.flatMap(f => listLeafStatuses(fs, 
f))).toImmutableArraySeq
}
  } {code}

> Spark to support S3 Express One Zone Storage
> 
>
> Key: SPARK-47008
> URL: https://issues.apache.org/jira/browse/SPARK-47008
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Steve Loughran
>Priority: Major
>
> Hadoop 3.4.0 adds support for AWS S3 Express One Zone Storage.
> Most of this is transparent. However, one aspect which can surface as an 
> issue is that these stores report prefixes in a listing when there are 
> pending uploads, *even when there are no files underneath*.
> This leads to a situation where a listStatus of a path returns a list of file 
> status entries which appear to contain one or more directories, but a 
> listStatus on such a path raises a FileNotFoundException: there is nothing 
> there.
> HADOOP-18996 handles this in all of the Hadoop code, including FileInputFormat.
> A filesystem can now be probed for inconsistent directory listings through 
> {{fs.hasPathCapability(path, "fs.capability.directory.listing.inconsistent")}}.
> If true, then treewalking code SHOULD NOT report a failure if, when walking 
> into a subdirectory, a list/getFileStatus on that directory raises a 
> FileNotFoundException.
> Although most of this is handled in the Hadoop code, there are some places 
> where treewalking is done inside Spark. These need to be identified and made 
> resilient to failure on the recurse down the tree:
> * SparkHadoopUtil list methods, especially listLeafStatuses used by 
> OrcFileOperator
> * org.apache.spark.util.Utils#fetchHcfsFile
> {{org.apache.hadoop.fs.FileUtil.maybeIgnoreMissingDirectory()}} can assist 
> here, or the logic can be replicated. Using the Hadoop implementation would 
> be better from a maintenance perspective.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47008) Spark to support S3 Express One Zone Storage

2024-05-09 Thread Leo Timofeyev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844968#comment-17844968
 ] 

Leo Timofeyev edited comment on SPARK-47008 at 5/9/24 5:49 PM:
---

Hey [~ste...@apache.org] 

What do you think about something like this?
{code:java}
def listLeafStatuses(fs: FileSystem, baseStatus: FileStatus): Seq[FileStatus] = 
{
  def recurse(status: FileStatus): Seq[FileStatus] = {
val fsHasPathCapability = try {
  fs.hasPathCapability(status.getPath, 
SparkHadoopUtil.DIRECTORY_LISTING_INCONSISTENT)
} catch {
  case _: IOException => false
}
val statusResult = Try {
  fs.listStatus(status.getPath)
}
statusResult match {
  case Failure(e) =>
if (e.isInstanceOf[FileNotFoundException] && fsHasPathCapability) {
  Seq.empty[FileStatus]
}
else throw e
  case Success(sr) =>
val (directories, leaves) = sr.partition(_.isDirectory)
(leaves ++ directories.flatMap(f => listLeafStatuses(fs, 
f))).toImmutableArraySeq
}
  } {code}


was (Author: JIRAUSER303957):
Hey [~ste...@apache.org] 

What do you think about something like this?
{code:java}
def listLeafStatuses(fs: FileSystem, baseStatus: FileStatus): Seq[FileStatus] = 
{
  def recurse(status: FileStatus): Seq[FileStatus] = {
val fsHasPathCapability = try {
  fs.hasPathCapability(status.getPath, 
SparkHadoopUtil.DIRECTORY_LISTING_INCONSISTENT)
} catch {
  case _: IOException => false
}
val statusResult = Try {
  fs.listStatus(status.getPath)
}
statusResult match {
  case Failure(e) =>
if (e.isInstanceOf[FileNotFoundException] && fsHasPathCapability) {
  Seq.empty[FileStatus]
}
else throw e
  case Success(sr) =>
val (directories, leaves) = sr.partition(_.isDirectory)
(leaves ++ directories.flatMap(f => listLeafStatuses(fs, 
f))).toImmutableArraySeq
}
  } {code}

> Spark to support S3 Express One Zone Storage
> 
>
> Key: SPARK-47008
> URL: https://issues.apache.org/jira/browse/SPARK-47008
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Steve Loughran
>Priority: Major
>
> Hadoop 3.4.0 adds support for AWS S3 Express One Zone Storage.
> Most of this is transparent. However, one aspect which can surface as an 
> issue is that these stores report prefixes in a listing when there are 
> pending uploads, *even when there are no files underneath*.
> This leads to a situation where a listStatus of a path returns a list of file 
> status entries which appear to contain one or more directories, but a 
> listStatus on such a path raises a FileNotFoundException: there is nothing 
> there.
> HADOOP-18996 handles this in all of the Hadoop code, including FileInputFormat.
> A filesystem can now be probed for inconsistent directory listings through 
> {{fs.hasPathCapability(path, "fs.capability.directory.listing.inconsistent")}}.
> If true, then treewalking code SHOULD NOT report a failure if, when walking 
> into a subdirectory, a list/getFileStatus on that directory raises a 
> FileNotFoundException.
> Although most of this is handled in the Hadoop code, there are some places 
> where treewalking is done inside Spark. These need to be identified and made 
> resilient to failure on the recurse down the tree:
> * SparkHadoopUtil list methods, especially listLeafStatuses used by 
> OrcFileOperator
> * org.apache.spark.util.Utils#fetchHcfsFile
> {{org.apache.hadoop.fs.FileUtil.maybeIgnoreMissingDirectory()}} can assist 
> here, or the logic can be replicated. Using the Hadoop implementation would 
> be better from a maintenance perspective.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47982) Update code style plugins to latest version

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47982:
--
Parent: SPARK-47046
Issue Type: Sub-task  (was: Improvement)

> Update code style plugins to latest version
> 
>
> Key: SPARK-47982
> URL: https://issues.apache.org/jira/browse/SPARK-47982
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47930) Upgrade RoaringBitmap to 1.0.6

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47930:
--
Parent: SPARK-47046
Issue Type: Sub-task  (was: Improvement)

> Upgrade RoaringBitmap to 1.0.6
> --
>
> Key: SPARK-47930
> URL: https://issues.apache.org/jira/browse/SPARK-47930
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48223) PrometheusServlet does not work well with queryName in the metric name

2024-05-09 Thread Mason Chen (Jira)
Mason Chen created SPARK-48223:
--

 Summary: PrometheusServlet does not work well with queryName in 
the metric name
 Key: SPARK-48223
 URL: https://issues.apache.org/jira/browse/SPARK-48223
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.3, 3.3.4, 3.5.1
Reporter: Mason Chen


PrometheusServlet does not work well with queryName in the metric name.

For driver metrics, the configuration `spark.metrics.namespace` allows the user 
to remove the app id from the metric name to build reusable dashboards and 
metric queries in external systems.

Similarly, we need a feature to replace the query name in the spark streaming 
metric names to make dashboards and metric queries reusable. Ideally, the query 
name would be generated as a tag, instead of part of the metric name.
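
For comparison, the existing driver-side knob and the gap being described, as a sketch (the metric name shown is illustrative):
{code:python}
from pyspark.sql import SparkSession

# spark.metrics.namespace already lets users drop the app id from
# driver metric names, so dashboards survive application restarts.
spark = (SparkSession.builder
         .config("spark.metrics.namespace", "myapp")
         .getOrCreate())

# Streaming metric names, however, still embed the query name, e.g.
#   myapp.driver.spark.streaming.<queryName>.inputRate-total
# The request: expose <queryName> as a tag/label instead of baking it
# into the metric name.
{code}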



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48223) PrometheusServlet does not work well with queryName in the metric name

2024-05-09 Thread Mason Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mason Chen updated SPARK-48223:
---
Description: 
PrometheusServlet does not work well with queryName in the metric name.

For driver metrics, the configuration `spark.metrics.namespace` allows the user 
to remove the app id from the metric name to build reusable dashboards and 
metric queries in external systems.

Similarly, we need a feature to replace the query name in the spark streaming 
metric names to make dashboards and metric queries reusable. Ideally, the query 
name would be generated as a tag, instead of part of the metric name.

This can be gated behind a feature flag, so as not to break existing users 
who depend on the existing metric format.

  was:
PrometheusServlet does not work well with queryName in the metric name.

For driver metrics, the configuration `spark.metrics.namespace` allows the user 
to remove the app id from the metric name to build reusable dashboards and 
metric queries in external systems.

Similarly, we need a feature to replace the query name in the spark streaming 
metric names to make dashboards and metric queries reusable. Ideally, the query 
name would be generated as a tag, instead of part of the metric name.


> PrometheusServlet does not work well with queryName in the metric name
> --
>
> Key: SPARK-48223
> URL: https://issues.apache.org/jira/browse/SPARK-48223
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.1, 3.3.4, 3.4.3
>Reporter: Mason Chen
>Priority: Major
>
> PrometheusServlet does not work well with queryName in the metric name.
> For driver metrics, the configuration `spark.metrics.namespace` allows the 
> user to remove the app id from the metric name to build reusable dashboards 
> and metric queries in external systems.
> Similarly, we need a feature to replace the query name in the spark streaming 
> metric names to make dashboards and metric queries reusable. Ideally, the 
> query name would be generated as a tag, instead of part of the metric name.
> This can be gated behind a feature flag, so as not to break existing users 
> who depend on the existing metric format.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48223) PrometheusServlet does not work well with queryName in spark streaming metric names

2024-05-09 Thread Mason Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mason Chen updated SPARK-48223:
---
Summary: PrometheusServlet does not work well with queryName in spark 
streaming metric names  (was: PrometheusServlet does not work well with 
queryName in the metric name)

> PrometheusServlet does not work well with queryName in spark streaming metric 
> names
> ---
>
> Key: SPARK-48223
> URL: https://issues.apache.org/jira/browse/SPARK-48223
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.1, 3.3.4, 3.4.3
>Reporter: Mason Chen
>Priority: Major
>
> PrometheusServlet does not work well with queryName in the metric name.
> For driver metrics, the configuration `spark.metrics.namespace` allows the 
> user to remove the app id from the metric name to build reusable dashboards 
> and metric queries in external systems.
> Similarly, we need a feature to replace the query name in the spark streaming 
> metric names to make dashboards and metric queries reusable. Ideally, the 
> query name would be generated as a tag, instead of part of the metric name.
> This can be gated behind a feature flag, so as not to break existing users 
> who depend on the existing metric format.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47008) Spark to support S3 Express One Zone Storage

2024-05-09 Thread Leo Timofeyev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844968#comment-17844968
 ] 

Leo Timofeyev edited comment on SPARK-47008 at 5/9/24 7:12 PM:
---

Hey [~ste...@apache.org] 

What do you think about something like this?
{code:java}
def listLeafStatuses(fs: FileSystem, baseStatus: FileStatus): Seq[FileStatus] = 
{
  def recurse(status: FileStatus): Seq[FileStatus] = {
val fsHasPathCapability = try {
  fs.hasPathCapability(status.getPath, 
SparkHadoopUtil.DIRECTORY_LISTING_INCONSISTENT)
} catch {
  case _: IOException => false
}
val statusResult = Try {
  fs.listStatus(status.getPath)
}
statusResult match {
  case Failure(e) =>
if (e.isInstanceOf[FileNotFoundException] && fsHasPathCapability) {
  Seq.empty[FileStatus]
}
else throw e
  case Success(sr) =>
val (directories, leaves) = sr.partition(_.isDirectory)
(leaves ++ directories.flatMap(f => listLeafStatuses(fs, 
f))).toImmutableArraySeq
}
  }

  if (baseStatus.isDirectory) recurse(baseStatus) else Seq(baseStatus)
}{code}


was (Author: JIRAUSER303957):
Hey [~ste...@apache.org] 

What do you think about something like this?
{code:java}
def listLeafStatuses(fs: FileSystem, baseStatus: FileStatus): Seq[FileStatus] = 
{
  def recurse(status: FileStatus): Seq[FileStatus] = {
val fsHasPathCapability = try {
  fs.hasPathCapability(status.getPath, 
SparkHadoopUtil.DIRECTORY_LISTING_INCONSISTENT)
} catch {
  case _: IOException => false
}
val statusResult = Try {
  fs.listStatus(status.getPath)
}
statusResult match {
  case Failure(e) =>
if (e.isInstanceOf[FileNotFoundException] && fsHasPathCapability) {
  Seq.empty[FileStatus]
}
else throw e
  case Success(sr) =>
val (directories, leaves) = sr.partition(_.isDirectory)
(leaves ++ directories.flatMap(f => listLeafStatuses(fs, 
f))).toImmutableArraySeq
}
  } {code}

> Spark to support S3 Express One Zone Storage
> 
>
> Key: SPARK-47008
> URL: https://issues.apache.org/jira/browse/SPARK-47008
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Steve Loughran
>Priority: Major
>
> Hadoop 3.4.0 adds support for AWS S3 Express One Zone Storage.
> Most of this is transparent. However, one aspect which can surface as an 
> issue is that these stores report prefixes in a listing when there are 
> pending uploads, *even when there are no files underneath*.
> This leads to a situation where a listStatus of a path returns a list of file 
> status entries which appear to contain one or more directories, but a 
> listStatus on such a path raises a FileNotFoundException: there is nothing 
> there.
> HADOOP-18996 handles this in all of the Hadoop code, including FileInputFormat.
> A filesystem can now be probed for inconsistent directory listings through 
> {{fs.hasPathCapability(path, "fs.capability.directory.listing.inconsistent")}}.
> If true, then treewalking code SHOULD NOT report a failure if, when walking 
> into a subdirectory, a list/getFileStatus on that directory raises a 
> FileNotFoundException.
> Although most of this is handled in the Hadoop code, there are some places 
> where treewalking is done inside Spark. These need to be identified and made 
> resilient to failure on the recurse down the tree:
> * SparkHadoopUtil list methods, especially listLeafStatuses used by 
> OrcFileOperator
> * org.apache.spark.util.Utils#fetchHcfsFile
> {{org.apache.hadoop.fs.FileUtil.maybeIgnoreMissingDirectory()}} can assist 
> here, or the logic can be replicated. Using the Hadoop implementation would 
> be better from a maintenance perspective.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48224) Disable variant from being a part of a map key

2024-05-09 Thread Harsh Motwani (Jira)
Harsh Motwani created SPARK-48224:
-

 Summary: Disable variant from being a part of a map key
 Key: SPARK-48224
 URL: https://issues.apache.org/jira/browse/SPARK-48224
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Harsh Motwani


Creating a map object with a variant key currently works. However, this 
behavior should be disabled.
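
A minimal reproduction of the behaviour described, as a sketch (it assumes the {{parse_json}} variant constructor available in Spark 4.0):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Currently this succeeds; per this ticket it should be rejected,
# since comparison semantics for variant values are not defined.
spark.sql("SELECT map(parse_json('1'), 'v')").show()
{code}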



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47172) Upgrade Transport block cipher mode to GCM

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47172:
---
Labels: pull-request-available  (was: )

> Upgrade Transport block cipher mode to GCM
> --
>
> Key: SPARK-47172
> URL: https://issues.apache.org/jira/browse/SPARK-47172
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Steve Weis
>Priority: Minor
>  Labels: pull-request-available
>
> The cipher transformation currently used for encrypting RPC calls is an 
> unauthenticated mode (AES/CTR/NoPadding). This needs to be upgraded to an 
> authenticated mode (AES/GCM/NoPadding) to prevent ciphertext from being 
> modified in transit.
> The relevant line is here: 
> [https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L220]
> GCM is relatively more computationally expensive than CTR and adds a 16-byte 
> block of authentication tag data to each payload. 
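
A hedged configuration sketch: {{spark.network.crypto.enabled}} is an existing conf, while the cipher key name below is an assumption based on the TransportConf line linked above, so verify it against the PR before relying on it:
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Existing switch for AES-based RPC encryption.
         .config("spark.network.crypto.enabled", "true")
         # Assumed key for the cipher transformation (verify against the PR).
         # AES/GCM/NoPadding authenticates each payload at the cost of a
         # 16-byte tag and extra CPU relative to AES/CTR.
         .config("spark.network.crypto.cipher", "AES/GCM/NoPadding")
         .getOrCreate())
{code}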



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48224) Disable variant from being a part of a map key

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48224:
---
Labels: pull-request-available  (was: )

> Disable variant from being a part of a map key
> --
>
> Key: SPARK-48224
> URL: https://issues.apache.org/jira/browse/SPARK-48224
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Harsh Motwani
>Priority: Major
>  Labels: pull-request-available
>
> Creating a map object with a variant key currently works. However, this 
> behavior should be disabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44810) XML: ArrayType and MapType support in from_xml

2024-05-09 Thread HiuFung Kwok (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845125#comment-17845125
 ] 

HiuFung Kwok commented on SPARK-44810:
--

[~sandip.agarwala] Should this be closed as well? It seems the functionality has 
already been ported, and relevant tests exist under `XmlSuite.scala`.

> XML: ArrayType and MapType support in from_xml
> --
>
> Key: SPARK-44810
> URL: https://issues.apache.org/jira/browse/SPARK-44810
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44810) XML: ArrayType and MapType support in from_xml

2024-05-09 Thread HiuFung Kwok (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845125#comment-17845125
 ] 

HiuFung Kwok edited comment on SPARK-44810 at 5/9/24 8:49 PM:
--

[~sandip.agarwala] Should this be closed as well? It seems the functionality has 
already been ported, and relevant tests exist under `XmlSuite.scala`.

The same applies to the other sub-tasks under this epic.


was (Author: hf):
[~sandip.agarwala] Should this be closed also, it seems the functionality has 
already been ported, and relevant tests exist under `XmlSuite.scala`.

> XML: ArrayType and MapType support in from_xml
> --
>
> Key: SPARK-44810
> URL: https://issues.apache.org/jira/browse/SPARK-44810
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48164) Re-enable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command`

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48164:
--
Priority: Major  (was: Blocker)

> Re-enable `SparkConnectServiceSuite.SPARK-43923: commands send events - 
> get_resources_command`
> --
>
> Key: SPARK-48164
> URL: https://issues.apache.org/jira/browse/SPARK-48164
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48164) Re-enable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command`

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48164:
--
Parent: (was: SPARK-44111)
Issue Type: Bug  (was: Sub-task)

> Re-enable `SparkConnectServiceSuite.SPARK-43923: commands send events - 
> get_resources_command`
> --
>
> Key: SPARK-48164
> URL: https://issues.apache.org/jira/browse/SPARK-48164
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-48164) Re-enable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command`

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-48164.
-

> Re-enable `SparkConnectServiceSuite.SPARK-43923: commands send events - 
> get_resources_command`
> --
>
> Key: SPARK-48164
> URL: https://issues.apache.org/jira/browse/SPARK-48164
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48225) Upgrade `sbt` to 1.10.0

2024-05-09 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-48225:
-

 Summary: Upgrade `sbt` to 1.10.0
 Key: SPARK-48225
 URL: https://issues.apache.org/jira/browse/SPARK-48225
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48225) Upgrade `sbt` to 1.10.0

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48225:
---
Labels: pull-request-available  (was: )

> Upgrade `sbt` to 1.10.0
> ---
>
> Key: SPARK-48225
> URL: https://issues.apache.org/jira/browse/SPARK-48225
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48226) Add `spark-ganglia-lgpl` to `lint-java` & `spark-ganglia-lgpl` and `jvm-profiler` to `sbt-checkstyle`

2024-05-09 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48226:
---

 Summary: Add `spark-ganglia-lgpl` to `lint-java` & 
`spark-ganglia-lgpl` and `jvm-profiler` to `sbt-checkstyle`
 Key: SPARK-48226
 URL: https://issues.apache.org/jira/browse/SPARK-48226
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48226) Add `spark-ganglia-lgpl` to `lint-java` & `spark-ganglia-lgpl` and `jvm-profiler` to `sbt-checkstyle`

2024-05-09 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-48226:

Component/s: Build
 (was: Project Infra)

> Add `spark-ganglia-lgpl` to `lint-java` & `spark-ganglia-lgpl` and 
> `jvm-profiler` to `sbt-checkstyle`
> -
>
> Key: SPARK-48226
> URL: https://issues.apache.org/jira/browse/SPARK-48226
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48226) Add `spark-ganglia-lgpl` to `lint-java` & `spark-ganglia-lgpl` and `jvm-profiler` to `sbt-checkstyle`

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48226:
---
Labels: pull-request-available  (was: )

> Add `spark-ganglia-lgpl` to `lint-java` & `spark-ganglia-lgpl` and 
> `jvm-profiler` to `sbt-checkstyle`
> -
>
> Key: SPARK-48226
> URL: https://issues.apache.org/jira/browse/SPARK-48226
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48227) Document the requirement of seed in protos

2024-05-09 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-48227:
-

 Summary: Document the requirement of seed in protos
 Key: SPARK-48227
 URL: https://issues.apache.org/jira/browse/SPARK-48227
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48227) Document the requirement of seed in protos

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48227:
---
Labels: pull-request-available  (was: )

> Document the requirement of seed in protos
> --
>
> Key: SPARK-48227
> URL: https://issues.apache.org/jira/browse/SPARK-48227
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48082) Recover compatibility with Spark Connect client 3.5 <> Spark Connect server 4.0

2024-05-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48082.
--
  Assignee: Hyukjin Kwon
Resolution: Done

>  Recover compatibility with Spark Connect client 3.5 <> Spark Connect server 
> 4.0
> 
>
> Key: SPARK-48082
> URL: https://issues.apache.org/jira/browse/SPARK-48082
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/pull/46298#issuecomment-2087905857
> Test failures were identified when running the Spark 3.5 Spark Connect 
> client against a Spark Connect 4.0 server.
> The two should ideally be compatible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48089) Streaming query listener not working in 3.5 client <> 4.0 server

2024-05-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48089.
--
Fix Version/s: 3.5.2
   Resolution: Fixed

Issue resolved by pull request 46513
[https://github.com/apache/spark/pull/46513]

> Streaming query listener not working in 3.5 client <> 4.0 server
> 
>
> Key: SPARK-48089
> URL: https://issues.apache.org/jira/browse/SPARK-48089
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Wei Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.2
>
>
> {code}
> ==
> ERROR [1.488s]: test_listener_events 
> (pyspark.sql.tests.connect.streaming.test_parity_listener.StreamingListenerParityTests.test_listener_events)
> --
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/connect/streaming/test_parity_listener.py",
>  line 53, in test_listener_events
> self.spark.streams.addListener(test_listener)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py",
>  line 244, in addListener
> self._execute_streaming_query_manager_cmd(cmd)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py",
>  line 260, in _execute_streaming_query_manager_cmd
> (_, properties) = self._session.client.execute_command(exec_cmd)
>   ^^
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 982, in execute_command
> data, _, _, _, properties = self._execute_and_fetch(req)
> 
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1283, in _execute_and_fetch
> for response in self._execute_and_fetch_as_iterator(req):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1264, in _execute_and_fetch_as_iterator
> self._handle_error(error)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1503, in _handle_error
> self._handle_rpc_error(error)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1539, in _handle_rpc_error
> raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (java.io.EOFException) 
> --
> {code}
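For context, the failing call registers a Python StreamingQueryListener with the session. A minimal sketch of that usage, assuming an active SparkSession named {{spark}} (the listener body is illustrative, not the actual test_listener from the parity test):

{code}
from pyspark.sql.streaming import StreamingQueryListener

# Illustrative listener; the parity test defines its own test_listener.
class MyListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        print("started:", event.id)

    def onQueryProgress(self, event):
        print("batch:", event.progress.batchId)

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        print("terminated:", event.id)

# This is the registration call that fails with java.io.EOFException
# when a 3.5 client talks to a 4.0 Spark Connect server.
spark.streams.addListener(MyListener())
{code}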



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48089) Streaming query listener not working in 3.5 client <> 4.0 server

2024-05-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48089:


Assignee: Wei Liu

> Streaming query listener not working in 3.5 client <> 4.0 server
> 
>
> Key: SPARK-48089
> URL: https://issues.apache.org/jira/browse/SPARK-48089
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Wei Liu
>Priority: Major
>  Labels: pull-request-available
>
> {code}
> ==
> ERROR [1.488s]: test_listener_events 
> (pyspark.sql.tests.connect.streaming.test_parity_listener.StreamingListenerParityTests.test_listener_events)
> --
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/connect/streaming/test_parity_listener.py",
>  line 53, in test_listener_events
> self.spark.streams.addListener(test_listener)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py",
>  line 244, in addListener
> self._execute_streaming_query_manager_cmd(cmd)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py",
>  line 260, in _execute_streaming_query_manager_cmd
> (_, properties) = self._session.client.execute_command(exec_cmd)
>   ^^
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 982, in execute_command
> data, _, _, _, properties = self._execute_and_fetch(req)
> 
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1283, in _execute_and_fetch
> for response in self._execute_and_fetch_as_iterator(req):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1264, in _execute_and_fetch_as_iterator
> self._handle_error(error)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1503, in _handle_error
> self._handle_rpc_error(error)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1539, in _handle_rpc_error
> raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (java.io.EOFException) 
> --
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48148) JSON objects should not be modified when read as STRING

2024-05-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48148.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46408
[https://github.com/apache/spark/pull/46408]

> JSON objects should not be modified when read as STRING
> ---
>
> Key: SPARK-48148
> URL: https://issues.apache.org/jira/browse/SPARK-48148
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Eric Maynard
>Assignee: Eric Maynard
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, when reading a JSON object like this:
> bq. {"a": {"b": -999.995}}
> with the schema:
> bq. a STRING
> Spark yields a result like this:
> bq. {"b": -1000.0}
> This is due to how JacksonParser converts a non-string value to a string.
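A minimal PySpark sketch of the report, with the session setup assumed and the result annotations restating the ticket:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a nested JSON object with a top-level STRING schema; the nested
# object should come back as its original text, unmodified.
rdd = spark.sparkContext.parallelize(['{"a": {"b": -999.995}}'])
df = spark.read.schema("a STRING").json(rdd)
df.show(truncate=False)
# Reported (buggy) value of column a:  {"b":-1000.0}
# Expected value of column a:          {"b": -999.995}
{code}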



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48148) JSON objects should not be modified when read as STRING

2024-05-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48148:


Assignee: Eric Maynard

> JSON objects should not be modified when read as STRING
> ---
>
> Key: SPARK-48148
> URL: https://issues.apache.org/jira/browse/SPARK-48148
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Eric Maynard
>Assignee: Eric Maynard
>Priority: Major
>  Labels: pull-request-available
>
> Currently, when reading a JSON object like this:
> bq. {"a": {"b": -999.995}}
> with the schema:
> bq. a STRING
> Spark yields a result like this:
> bq. {"b": -1000.0}
> This is due to how JacksonParser converts a non-string value to a string.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48226) Add `spark-ganglia-lgpl` to `lint-java` & `spark-ganglia-lgpl` and `jvm-profiler` to `sbt-checkstyle`

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48226:
-

Assignee: BingKun Pan

> Add `spark-ganglia-lgpl` to `lint-java` & `spark-ganglia-lgpl` and 
> `jvm-profiler` to `sbt-checkstyle`
> -
>
> Key: SPARK-48226
> URL: https://issues.apache.org/jira/browse/SPARK-48226
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48226) Add `spark-ganglia-lgpl` to `lint-java` & `spark-ganglia-lgpl` and `jvm-profiler` to `sbt-checkstyle`

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48226.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46501
[https://github.com/apache/spark/pull/46501]

> Add `spark-ganglia-lgpl` to `lint-java` & `spark-ganglia-lgpl` and 
> `jvm-profiler` to `sbt-checkstyle`
> -
>
> Key: SPARK-48226
> URL: https://issues.apache.org/jira/browse/SPARK-48226
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48180) Analyzer bug with multiple ORDER BY items for input table argument

2024-05-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48180.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46451
[https://github.com/apache/spark/pull/46451]

> Analyzer bug with multiple ORDER BY items for input table argument
> --
>
> Key: SPARK-48180
> URL: https://issues.apache.org/jira/browse/SPARK-48180
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Steps to reproduce:
> {code}
> from pyspark.sql.functions import udtf
>
> @udtf(returnType="a: int, b: int")
> class tvf:
>   def eval(self, *args):
>     yield 1, 2
> {code}
> {code}
> SELECT * FROM tvf(
>   TABLE(
>     SELECT 1 AS device_id, 2 AS data_ds
>   )
>   WITH SINGLE PARTITION
>   ORDER BY device_id, data_ds
> )
> {code}
> This fails with:
> {code}
> [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_TABLE_ARGUMENT]
> Unsupported subquery expression: Table arguments are used in a function
> where they are not supported:
> 'UnresolvedTableValuedFunction [tvf], [table-argument#338 [], 'data_ds], false
>    +- Project [1 AS device_id#336, 2 AS data_ds#337]
>       +- OneRowRelation
> {code}
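For the SQL query above to resolve {{tvf}} by name, the UDTF presumably has to be registered with the session first; a sketch of that assumed setup step, which the ticket does not show:

{code}
# Assumed setup, not part of the ticket: register the Python UDTF so
# Spark SQL can resolve it by the name "tvf".
spark.udtf.register("tvf", tvf)
{code}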



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48180) Analyzer bug with multiple ORDER BY items for input table argument

2024-05-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48180:


Assignee: Daniel

> Analyzer bug with multiple ORDER BY items for input table argument
> --
>
> Key: SPARK-48180
> URL: https://issues.apache.org/jira/browse/SPARK-48180
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
>  Labels: pull-request-available
>
> Steps to reproduce:
> {code}
> from pyspark.sql.functions import udtf
>
> @udtf(returnType="a: int, b: int")
> class tvf:
>   def eval(self, *args):
>     yield 1, 2
> {code}
> {code}
> SELECT * FROM tvf(
>   TABLE(
>     SELECT 1 AS device_id, 2 AS data_ds
>   )
>   WITH SINGLE PARTITION
>   ORDER BY device_id, data_ds
> )
> {code}
> This fails with:
> {code}
> [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_TABLE_ARGUMENT]
> Unsupported subquery expression: Table arguments are used in a function
> where they are not supported:
> 'UnresolvedTableValuedFunction [tvf], [table-argument#338 [], 'data_ds], false
>    +- Project [1 AS device_id#336, 2 AS data_ds#337]
>       +- OneRowRelation
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48227) Document the requirement of seed in protos

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48227:
-

Assignee: Ruifeng Zheng

> Document the requirement of seed in protos
> --
>
> Key: SPARK-48227
> URL: https://issues.apache.org/jira/browse/SPARK-48227
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48227) Document the requirement of seed in protos

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48227.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46518
[https://github.com/apache/spark/pull/46518]

> Document the requirement of seed in protos
> --
>
> Key: SPARK-48227
> URL: https://issues.apache.org/jira/browse/SPARK-48227
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48222) Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

2024-05-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48222.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46512
[https://github.com/apache/spark/pull/46512]

> Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
> -
>
> Key: SPARK-48222
> URL: https://issues.apache.org/jira/browse/SPARK-48222
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48222) Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

2024-05-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48222:
---

Assignee: Nicholas Chammas

> Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
> -
>
> Key: SPARK-48222
> URL: https://issues.apache.org/jira/browse/SPARK-48222
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48176) Fix name of FIELD_ALREADY_EXISTS error condition

2024-05-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48176.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46510
[https://github.com/apache/spark/pull/46510]

> Fix name of FIELD_ALREADY_EXISTS error condition
> 
>
> Key: SPARK-48176
> URL: https://issues.apache.org/jira/browse/SPARK-48176
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48176) Fix name of FIELD_ALREADY_EXISTS error condition

2024-05-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48176:


Assignee: Nicholas Chammas

> Fix name of FIELD_ALREADY_EXISTS error condition
> 
>
> Key: SPARK-48176
> URL: https://issues.apache.org/jira/browse/SPARK-48176
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27900) Spark driver will not exit due to an oom error

2024-05-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-27900:
---
Labels: pull-request-available  (was: )

> Spark driver will not exit due to an oom error
> --
>
> Key: SPARK-27900
> URL: https://issues.apache.org/jira/browse/SPARK-27900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3, 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>  Labels: pull-request-available
>
> This affects Spark on K8s at least, as pods will run forever, which makes it 
> impossible for tools like the Spark Operator to report back job status.
> A spark pi job is running:
> spark-pi-driver 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
>  spark-pi2-1559309337787-exec-2 1/1 Running 0 1h
> with the following setup:
> {code}
> apiVersion: "sparkoperator.k8s.io/v1beta1"
> kind: SparkApplication
> metadata:
>   name: spark-pi
>   namespace: spark
> spec:
>   type: Scala
>   mode: cluster
>   image: "skonto/spark:k8s-3.0.0-sa"
>   imagePullPolicy: Always
>   mainClass: org.apache.spark.examples.SparkPi
>   mainApplicationFile: 
> "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
>   arguments:
>     - "100"
>   sparkVersion: "2.4.0"
>   restartPolicy:
>     type: Never
>   nodeSelector:
>     "spark": "autotune"
>   driver:
>     memory: "1g"
>     labels:
>       version: 2.4.0
>     serviceAccount: spark-sa
>   executor:
>     instances: 2
>     memory: "1g"
>     labels:
>       version: 2.4.0
> {code}
> At some point the driver fails with an OOM error, but its process keeps 
> running, so the pods are still running:
> 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 
> (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 3.0 KiB, free 110.0 MiB)
>  19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 1765.0 B, free 110.0 MiB)
>  19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 
> 110.0 MiB)
>  19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:1180
>  19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from 
> ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 
> tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
> 14))
>  19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 
> tasks
>  Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
> Java heap space
>  at 
> scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
>  at 
> scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
>  at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
>  Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached
> $ kubectl describe pod spark-pi2-driver -n spark
>  Name: spark-pi2-driver
>  Namespace: spark
>  Priority: 0
>  PriorityClassName: 
>  Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
>  Start Time: Fri, 31 May 2019 16:28:59 +0300
>  Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
>  spark-role=driver
>  sparkoperator.k8s.io/app-name=spark-pi2
>  sparkoperator.k8s.io/launched-by-spark-operator=true
>  sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
>  version=2.4.0
>  Annotations: 
>  Status: Running
>  IP: 10.12.103.4
>  Controlled By: SparkApplication/spark-pi2
>  Containers:
>  spark-kubernetes-driver:
>  Container ID: 
> docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
>  Image: skonto/spark:k8s-3.0.0-sa
>  Image ID: 
> docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
>  Ports: 7078/TCP, 7079/TCP, 4040/TCP
>  Host Ports: 0/TCP, 0/TCP, 0/TCP
>  Args:
>  driver
>  --properties-file
>  /opt/spark/conf/spark.properties
>  --class
>  org.apache.spark.examples.SparkPi
>  spark-internal
>  100
>  State: Running
> In the container, the processes are in _interruptible sleep_:
> PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
>  15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp 
> /opt/spark/conf/:/opt/spark/jars/* -Xmx500m 
> org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar
>  287 0 185 S 2344 0% 3 0% sh
>  294 287 185 R 1536 0% 3 0% top
>  1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.prope
> Liveness checks might be a workaround.
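A commonly suggested mitigation for this class of problem, not from the ticket: make the JVM exit on the first OutOfMemoryError so Kubernetes restarts the pod and the failure becomes observable. A sketch, assuming a HotSpot JVM:

{code}
# Note: spark.driver.extraJavaOptions must be set before the driver JVM
# starts (e.g. via spark-submit --conf or spark-defaults.conf); the
# builder form below only reliably affects executors launched afterwards.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.extraJavaOptions", "-XX:+ExitOnOutOfMemoryError")
    .getOrCreate()
)
{code}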

[jira] [Created] (SPARK-48228) Implement the missing function validation in ApplyInXXX

2024-05-09 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-48228:
-

 Summary: Implement the missing function validation in ApplyInXXX
 Key: SPARK-48228
 URL: https://issues.apache.org/jira/browse/SPARK-48228
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47834) Mark deprecated functions with `@deprecated` in `SQLImplicits`

2024-05-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47834.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46029
[https://github.com/apache/spark/pull/46029]

> Mark deprecated functions with `@deprecated` in `SQLImplicits`
> --
>
> Key: SPARK-47834
> URL: https://issues.apache.org/jira/browse/SPARK-47834
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


