[jira] [Resolved] (SPARK-46154) Add a new Daily testing GitHub Action job for Maven with Java 21
[ https://issues.apache.org/jira/browse/SPARK-46154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-46154. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44070 [https://github.com/apache/spark/pull/44070] > Add a new Daily testing GitHub Action job for Maven with Java 21 > > > Key: SPARK-46154 > URL: https://issues.apache.org/jira/browse/SPARK-46154 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46170) Support injecting adaptive query post-planner strategy rules in SparkSessionExtensions
[ https://issues.apache.org/jira/browse/SPARK-46170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46170: --- Labels: pull-request-available (was: ) > Support injecting adaptive query post-planner strategy rules in > SparkSessionExtensions > --- > > Key: SPARK-46170 > URL: https://issues.apache.org/jira/browse/SPARK-46170 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46170) Support injecting adaptive query post-planner strategy rules in SparkSessionExtensions
XiDuo You created SPARK-46170: - Summary: Support injecting adaptive query post-planner strategy rules in SparkSessionExtensions Key: SPARK-46170 URL: https://issues.apache.org/jira/browse/SPARK-46170 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46169) Assign appropriate JIRA numbers to unlabeled TODO items for DataFrame API.
[ https://issues.apache.org/jira/browse/SPARK-46169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46169: --- Labels: pull-request-available (was: ) > Assign appropriate JIRA numbers to unlabeled TODO items for DataFrame API. > -- > > Key: SPARK-46169 > URL: https://issues.apache.org/jira/browse/SPARK-46169 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > There are many TODO items that have no actual JIRA number. We should assign > proper numbers for better tracking. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46169) Assign appropriate JIRA numbers to unlabeled TODO items for DataFrame API.
Haejoon Lee created SPARK-46169: --- Summary: Assign appropriate JIRA numbers to unlabeled TODO items for DataFrame API. Key: SPARK-46169 URL: https://issues.apache.org/jira/browse/SPARK-46169 Project: Spark Issue Type: Bug Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee There are many TODO items that have no actual JIRA number. We should assign proper numbers for better tracking. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46168) Add axis parameter to DataFrame.idxmin & idxmax
Haejoon Lee created SPARK-46168: --- Summary: Add axis parameter to DataFrame.idxmin & idxmax Key: SPARK-46168 URL: https://issues.apache.org/jira/browse/SPARK-46168 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.idxmax.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
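For context, the sub-task above asks pandas-on-Spark to match the stock pandas axis semantics for idxmin/idxmax. A minimal sketch of the target behavior using plain pandas (the example data is illustrative, not from the ticket):
{code:python}
import pandas as pd

df = pd.DataFrame({"a": [1, 9], "b": [5, 2]}, index=["x", "y"])

print(df.idxmax())        # axis=0 (default): index of the max per column -> a: y, b: x
print(df.idxmax(axis=1))  # axis=1: column label of the max per row -> x: b, y: a
{code}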
[jira] [Created] (SPARK-46167) Add axis, pct and na_option parameters to DataFrame.rank
Haejoon Lee created SPARK-46167: --- Summary: Add axis, pct and na_option parameters to DataFrame.rank Key: SPARK-46167 URL: https://issues.apache.org/jira/browse/SPARK-46167 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
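For reference, the three pandas parameters named in SPARK-46167 behave as follows in stock pandas (illustrative sketch only; the eventual pandas-on-Spark signature may differ):
{code:python}
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [2.0, 2.0, 1.0]})

print(df.rank(pct=True))            # ranks rescaled to (0, 1] percentiles
print(df.rank(na_option="bottom"))  # NaN is ranked last instead of staying NaN
print(df.rank(axis=1))              # rank values within each row
{code}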
[jira] [Created] (SPARK-46165) Improve axis parameter for DataFrame.all to support columns.
Haejoon Lee created SPARK-46165: --- Summary: Improve axis parameter for DataFrame.all to support columns. Key: SPARK-46165 URL: https://issues.apache.org/jira/browse/SPARK-46165 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.all.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
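A quick sketch of the stock pandas behavior SPARK-46165 asks to match (illustrative data):
{code:python}
import pandas as pd

df = pd.DataFrame({"a": [True, True], "b": [True, False]})

print(df.all())        # axis=0 (default): per column -> a: True, b: False
print(df.all(axis=1))  # per row -> True, False
{code}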
[jira] [Created] (SPARK-46166) Add axis and skipna parameters to DataFrame.any
Haejoon Lee created SPARK-46166: --- Summary: Add axis and skipna parameters to DataFrame.any Key: SPARK-46166 URL: https://issues.apache.org/jira/browse/SPARK-46166 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
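For context, stock pandas already supports both parameters on DataFrame.any; a minimal illustrative sketch:
{code:python}
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0.0, np.nan], "b": [0.0, 0.0]})

print(df.any(axis=1))                # NaN skipped, zeros are falsy -> False, False
print(df.any(axis=1, skipna=False))  # NaN counts as True -> second row True
{code}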
[jira] [Created] (SPARK-46164) Add include and exclude parameters for DataFrame.describe
Haejoon Lee created SPARK-46164: --- Summary: Add include and exclude parameters for DataFrame.describe Key: SPARK-46164 URL: https://issues.apache.org/jira/browse/SPARK-46164 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
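The stock pandas parameters referenced by SPARK-46164, sketched with illustrative data:
{code:python}
import numpy as np
import pandas as pd

df = pd.DataFrame({"num": [1, 2, 3], "cat": ["x", "y", "x"]})

print(df.describe())                     # numeric columns only, by default
print(df.describe(include="all"))        # numeric and non-numeric together
print(df.describe(exclude=[np.number]))  # drop the numeric columns
{code}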
[jira] [Created] (SPARK-46163) Add filter_func and errors parameters for DataFrame.update
Haejoon Lee created SPARK-46163: --- Summary: Add filter_func and errors parameters for DataFrame.update Key: SPARK-46163 URL: https://issues.apache.org/jira/browse/SPARK-46163 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.update.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
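In stock pandas, the two parameters requested by SPARK-46163 work like this (illustrative sketch; the pandas-on-Spark semantics may end up differing):
{code:python}
import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3]})
other = pd.DataFrame({"a": [10, 20, 30]})

# filter_func returns True where values should be overwritten:
df.update(other, filter_func=lambda s: s < 0)
print(df)  # only the -2 became 20
# errors="raise" would instead raise a ValueError whenever both
# frames hold non-NA values at the same position (default is "ignore").
{code}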
[jira] [Created] (SPARK-46160) Add freq and axis parameters to DataFrame.shift
Haejoon Lee created SPARK-46160: --- Summary: Add freq and axis parameters to DataFrame.shift Key: SPARK-46160 URL: https://issues.apache.org/jira/browse/SPARK-46160 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
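A short stock-pandas sketch of the freq and axis parameters tracked by SPARK-46160 (illustrative data):
{code:python}
import pandas as pd

idx = pd.date_range("2023-01-01", periods=3, freq="D")
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=idx)

print(df.shift(1))            # data moves down one row; the first row is NaN
print(df.shift(1, freq="D"))  # the index moves one day; data stays aligned
print(df.shift(1, axis=1))    # values move across columns instead of rows
{code}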
[jira] [Created] (SPARK-46162) Improve axis parameter for DataFrame.nunique to support columns.
Haejoon Lee created SPARK-46162: --- Summary: Improve axis parameter for DataFrame.nunique to support columns. Key: SPARK-46162 URL: https://issues.apache.org/jira/browse/SPARK-46162 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
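The stock pandas behavior SPARK-46162 targets, as an illustrative sketch:
{code:python}
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [1, 2, 2]})

print(df.nunique())        # distinct values per column -> a: 2, b: 2
print(df.nunique(axis=1))  # distinct values per row -> 1, 2, 1
{code}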
[jira] [Created] (SPARK-46161) Improve axis parameter for DataFrame.diff to support columns.
Haejoon Lee created SPARK-46161: --- Summary: Improve axis parameter for DataFrame.diff to support columns. Key: SPARK-46161 URL: https://issues.apache.org/jira/browse/SPARK-46161 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
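For reference, stock pandas DataFrame.diff with axis=1 (illustrative sketch):
{code:python}
import pandas as pd

df = pd.DataFrame({"a": [1, 4, 9], "b": [2, 3, 5]})

print(df.diff())        # axis=0 (default): difference from the previous row
print(df.diff(axis=1))  # difference from the previous column, per row
{code}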
[jira] [Created] (SPARK-46159) Improve axis parameter for DataFrame.at_time to support columns.
Haejoon Lee created SPARK-46159: --- Summary: Improve axis parameter for DataFrame.at_time to support columns. Key: SPARK-46159 URL: https://issues.apache.org/jira/browse/SPARK-46159 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at_time.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
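A stock-pandas sketch of DataFrame.at_time for SPARK-46159 (illustrative data; axis=1 additionally requires a DatetimeIndex on the columns):
{code:python}
import pandas as pd

idx = pd.date_range("2023-01-01 09:00", periods=4, freq="12h")
df = pd.DataFrame({"v": range(4)}, index=idx)

print(df.at_time("09:00"))  # rows whose index falls at 09:00 (axis=0 default)
# df.at_time("09:00", axis=1) would select columns by time instead
{code}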
[jira] [Assigned] (SPARK-46150) Combine Python codegen check and protobuf breaking change
[ https://issues.apache.org/jira/browse/SPARK-46150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-46150: - Assignee: Ruifeng Zheng > Combine Python codegen check and protobuf breaking change > > > Key: SPARK-46150 > URL: https://issues.apache.org/jira/browse/SPARK-46150 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46150) Combine Python codegen check and protobuf breaking change
[ https://issues.apache.org/jira/browse/SPARK-46150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-46150. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44067 [https://github.com/apache/spark/pull/44067] > Combine Python codegen check and protobuf breaking change > > > Key: SPARK-46150 > URL: https://issues.apache.org/jira/browse/SPARK-46150 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46158) Improve axis parameter for DataFrame.xs to support columns.
Haejoon Lee created SPARK-46158: --- Summary: Improve axis parameter for DataFrame.xs to support columns. Key: SPARK-46158 URL: https://issues.apache.org/jira/browse/SPARK-46158 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.xs.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
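For context, stock pandas DataFrame.xs with axis=1 takes a cross-section of the columns (illustrative sketch):
{code:python}
import pandas as pd

cols = pd.MultiIndex.from_tuples([("x", "a"), ("x", "b"), ("y", "a")])
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=cols)

print(df.xs("x", axis=1))           # all columns under the first-level key "x"
print(df.xs("a", axis=1, level=1))  # every column whose second level is "a"
{code}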
[jira] [Created] (SPARK-46157) Add `axis` parameter for DataFrame.aggregate.
Haejoon Lee created SPARK-46157: --- Summary: Add `axis` parameter for DataFrame.aggregate. Key: SPARK-46157 URL: https://issues.apache.org/jira/browse/SPARK-46157 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.aggregate.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
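A minimal stock-pandas sketch of the axis parameter requested for DataFrame.aggregate (illustrative data):
{code:python}
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

print(df.aggregate("sum"))          # axis=0 (default): one result per column
print(df.aggregate("sum", axis=1))  # one result per row -> 11, 22
{code}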
[jira] [Updated] (SPARK-46154) Add a new Daily testing GitHub Action job for Maven with Java 21
[ https://issues.apache.org/jira/browse/SPARK-46154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46154: --- Labels: pull-request-available (was: ) > Add a new Daily testing GitHub Action job for Maven with Java 21 > > > Key: SPARK-46154 > URL: https://issues.apache.org/jira/browse/SPARK-46154 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46154) Add a new Daily testing GitHub Action job for Maven with Java 21
Yang Jie created SPARK-46154: Summary: Add a new Daily testing GitHub Action job for Maven with Java 21 Key: SPARK-46154 URL: https://issues.apache.org/jira/browse/SPARK-46154 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46152) XML: Add DecimalType support in schema inference
[ https://issues.apache.org/jira/browse/SPARK-46152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46152: --- Labels: pull-request-available (was: ) > XML: Add DecimalType support in schema inference > > > Key: SPARK-46152 > URL: https://issues.apache.org/jira/browse/SPARK-46152 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Sandip Agarwala >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46151) Hide the "More" drop-down button in the PySpark docs navigation bar
[ https://issues.apache.org/jira/browse/SPARK-46151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46151: --- Labels: pull-request-available (was: ) > Hide the "More" drop-down button in the PySpark docs navigation bar > --- > > Key: SPARK-46151 > URL: https://issues.apache.org/jira/browse/SPARK-46151 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46153) XML: Add TimestampNTZType support in schema inference
Sandip Agarwala created SPARK-46153: --- Summary: XML: Add TimestampNTZType support in schema inference Key: SPARK-46153 URL: https://issues.apache.org/jira/browse/SPARK-46153 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Sandip Agarwala -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46152) XML: Add DecimalType support in schema inference
Sandip Agarwala created SPARK-46152: --- Summary: XML: Add DecimalType support in schema inference Key: SPARK-46152 URL: https://issues.apache.org/jira/browse/SPARK-46152 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Sandip Agarwala -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46151) Hide the "More" drop-down button in the PySpark docs navigation bar
BingKun Pan created SPARK-46151: --- Summary: Hide the "More" drop-down button in the PySpark docs navigation bar Key: SPARK-46151 URL: https://issues.apache.org/jira/browse/SPARK-46151 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46150) Combine Python codegen check and protobuf breaking change
[ https://issues.apache.org/jira/browse/SPARK-46150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46150: --- Labels: pull-request-available (was: ) > Combine Python codegen check and protobuf breaking change > > > Key: SPARK-46150 > URL: https://issues.apache.org/jira/browse/SPARK-46150 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46150) Combine Python codegen check and protobuf breaking change
Ruifeng Zheng created SPARK-46150: - Summary: Combine Python codegen check and protobuf breaking change Key: SPARK-46150 URL: https://issues.apache.org/jira/browse/SPARK-46150 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46149) Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-46149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-46149: Assignee: Hyukjin Kwon > Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python > 3.12 > -- > > Key: SPARK-46149 > URL: https://issues.apache.org/jira/browse/SPARK-46149 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > > {code} > == > ERROR [12.635s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > == > ERROR [14.850s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > -- > {code} > https://github.com/apache/spark/actions/runs/7020654429/job/19100964890 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46149) Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-46149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-46149. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44065 [https://github.com/apache/spark/pull/44065] > Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python > 3.12 > -- > > Key: SPARK-46149 > URL: https://issues.apache.org/jira/browse/SPARK-46149 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > {code} > == > ERROR [12.635s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > == > ERROR [14.850s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > -- > {code} > https://github.com/apache/spark/actions/runs/7020654429/job/19100964890 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46148) Fix pyspark.pandas.mlflow.load_model test (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-46148. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44064 [https://github.com/apache/spark/pull/44064] > Fix pyspark.pandas.mlflow.load_model test (Python 3.12) > --- > > Key: SPARK-46148 > URL: https://issues.apache.org/jira/browse/SPARK-46148 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {code} > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in > pyspark.pandas.mlflow.load_model > Failed example: > prediction_df > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.10/doctest.py", line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > prediction_df > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13291, in > __repr__ > pdf = cast("DataFrame", > self._get_or_create_repr_pandas_cache(max_display_count)) > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13282, in > _get_or_create_repr_pandas_cache > self, "_repr_pandas_cache", {n: self.head(n + > 1)._to_internal_pandas()} > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13277, in > _to_internal_pandas > return self._internal.to_pandas_frame > File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 599, in > wrapped_lazy_property > setattr(self, attr_name, fn(self)) > File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1110, > in to_pandas_frame > pdf = sdf.toPandas() > File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line > 213, in toPandas > rows = self.collect() > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1369, in > collect > sock_info = self._jdf.collectToPython() > File > "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", > line 1322, in __call__ > return_value = get_return_value( > File "/__w/spark/spark/python/pyspark/errors/exceptions/captured.py", > line 188, in deco > raise converted from None > pyspark.errors.exceptions.captured.PythonException: > An exception was thrown from the Python worker. Please see the stack > trace below.
> Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1523, in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1515, in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 485, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 101, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 478, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1284, in func > for result_batch, result_type in result_iter: > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1619, in udf > yield _predict_row_batch(batch_predict_fn, row_batch_args) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1383, in _predict_row_batch > result = predict_fn(pdf, params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1601, in batch_predict_fn > return loaded_model.predict(pdf, params=params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 491, in predict > return _predict() > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 477, in _predict > return self._predict_fn(data, params=params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/sklearn/__init__.py", line > 517, in predict > return self.sklearn_model.predict(data) > File > "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line > 386, in predict > return self._decision_function(X)
[jira] [Assigned] (SPARK-46148) Fix pyspark.pandas.mlflow.load_model test (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-46148: Assignee: Hyukjin Kwon > Fix pyspark.pandas.mlflow.load_model test (Python 3.12) > --- > > Key: SPARK-46148 > URL: https://issues.apache.org/jira/browse/SPARK-46148 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > > {code} > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in > pyspark.pandas.mlflow.load_model > Failed example: > prediction_df > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.10/doctest.py", line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > prediction_df > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13291, in > __repr__ > pdf = cast("DataFrame", > self._get_or_create_repr_pandas_cache(max_display_count)) > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13282, in > _get_or_create_repr_pandas_cache > self, "_repr_pandas_cache", {n: self.head(n + > 1)._to_internal_pandas()} > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13277, in > _to_internal_pandas > return self._internal.to_pandas_frame > File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 599, in > wrapped_lazy_property > setattr(self, attr_name, fn(self)) > File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1110, > in to_pandas_frame > pdf = sdf.toPandas() > File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line > 213, in toPandas > rows = self.collect() > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1369, in > collect > sock_info = self._jdf.collectToPython() > File > "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", > line 1322, in __call__ > return_value = get_return_value( > File "/__w/spark/spark/python/pyspark/errors/exceptions/captured.py", > line 188, in deco > raise converted from None > pyspark.errors.exceptions.captured.PythonException: > An exception was thrown from the Python worker. Please see the stack > trace below.
> Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1523, in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1515, in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 485, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 101, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 478, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1284, in func > for result_batch, result_type in result_iter: > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1619, in udf > yield _predict_row_batch(batch_predict_fn, row_batch_args) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1383, in _predict_row_batch > result = predict_fn(pdf, params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1601, in batch_predict_fn > return loaded_model.predict(pdf, params=params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 491, in predict > return _predict() > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 477, in _predict > return self._predict_fn(data, params=params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/sklearn/__init__.py", line > 517, in predict > return self.sklearn_model.predict(data) > File > "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line > 386, in predict > return self._decision_function(X) > File > "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line > 369, in _decision_function >
[jira] [Resolved] (SPARK-46146) Unpin `markupsafe`
[ https://issues.apache.org/jira/browse/SPARK-46146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-46146. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44062 [https://github.com/apache/spark/pull/44062] > Unpin `markupsafe` > -- > > Key: SPARK-46146 > URL: https://issues.apache.org/jira/browse/SPARK-46146 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46003) Create a ui-test module with Jest
[ https://issues.apache.org/jira/browse/SPARK-46003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46003: - Assignee: Kent Yao > Create a ui-test module with Jest > -- > > Key: SPARK-46003 > URL: https://issues.apache.org/jira/browse/SPARK-46003 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests, UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46003) Create a ui-test module with Jest
[ https://issues.apache.org/jira/browse/SPARK-46003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46003. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43903 [https://github.com/apache/spark/pull/43903] > Create a ui-test module with Jest > -- > > Key: SPARK-46003 > URL: https://issues.apache.org/jira/browse/SPARK-46003 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests, UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46147) Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46147. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44063 [https://github.com/apache/spark/pull/44063] > Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12) > - > > Key: SPARK-46147 > URL: https://issues.apache.org/jira/browse/SPARK-46147 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > {code} > File "/__w/spark/spark/python/pyspark/pandas/series.py", line 1633, in > pyspark.pandas.series.Series.to_dict > Failed example: > s.to_dict(OrderedDict) > Expected: > OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)]) > Got: > OrderedDict({0: 1, 1: 2, 2: 3, 3: 4}) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46140) Remove comments about JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite`
[ https://issues.apache.org/jira/browse/SPARK-46140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46140: -- Summary: Remove comments about JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite` (was: Remove no longer needed JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite`) > Remove comments about JVM module options from the test submission options of > `HiveExternalCatalogVersionsSuite` > --- > > Key: SPARK-46140 > URL: https://issues.apache.org/jira/browse/SPARK-46140 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Complete TODO: > {code:java} > val args = Seq( > "--name", "prepare testing tables", > "--master", "local[2]", > "--conf", s"${UI_ENABLED.key}=false", > "--conf", s"${MASTER_REST_SERVER_ENABLED.key}=false", > "--conf", s"${HiveUtils.HIVE_METASTORE_VERSION.key}=$hiveMetastoreVersion", > "--conf", s"${HiveUtils.HIVE_METASTORE_JARS.key}=maven", > "--conf", s"${WAREHOUSE_PATH.key}=${wareHousePath.getCanonicalPath}", > "--conf", s"spark.sql.test.version.index=$index", > "--driver-java-options", > s"-Dderby.system.home=${wareHousePath.getCanonicalPath} " + > // TODO SPARK-37159 Consider to remove the following > // JVM module options once the Spark 3.2 line is EOL. > JavaModuleOptions.defaultModuleOptions(), > tempPyFile.getCanonicalPath) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46140) Remove no longer needed JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite`
[ https://issues.apache.org/jira/browse/SPARK-46140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46140: - Assignee: Yang Jie > Remove no longer needed JVM module options from the test submission options > of `HiveExternalCatalogVersionsSuite` > - > > Key: SPARK-46140 > URL: https://issues.apache.org/jira/browse/SPARK-46140 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > > Complete TODO: > {code:java} > val args = Seq( > "--name", "prepare testing tables", > "--master", "local[2]", > "--conf", s"${UI_ENABLED.key}=false", > "--conf", s"${MASTER_REST_SERVER_ENABLED.key}=false", > "--conf", s"${HiveUtils.HIVE_METASTORE_VERSION.key}=$hiveMetastoreVersion", > "--conf", s"${HiveUtils.HIVE_METASTORE_JARS.key}=maven", > "--conf", s"${WAREHOUSE_PATH.key}=${wareHousePath.getCanonicalPath}", > "--conf", s"spark.sql.test.version.index=$index", > "--driver-java-options", > s"-Dderby.system.home=${wareHousePath.getCanonicalPath} " + > // TODO SPARK-37159 Consider to remove the following > // JVM module options once the Spark 3.2 line is EOL. > JavaModuleOptions.defaultModuleOptions(), > tempPyFile.getCanonicalPath) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46140) Remove no longer needed JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite`
[ https://issues.apache.org/jira/browse/SPARK-46140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46140. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44057 [https://github.com/apache/spark/pull/44057] > Remove no longer needed JVM module options from the test submission options > of `HiveExternalCatalogVersionsSuite` > - > > Key: SPARK-46140 > URL: https://issues.apache.org/jira/browse/SPARK-46140 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Complete TODO: > {code:java} > val args = Seq( > "--name", "prepare testing tables", > "--master", "local[2]", > "--conf", s"${UI_ENABLED.key}=false", > "--conf", s"${MASTER_REST_SERVER_ENABLED.key}=false", > "--conf", s"${HiveUtils.HIVE_METASTORE_VERSION.key}=$hiveMetastoreVersion", > "--conf", s"${HiveUtils.HIVE_METASTORE_JARS.key}=maven", > "--conf", s"${WAREHOUSE_PATH.key}=${wareHousePath.getCanonicalPath}", > "--conf", s"spark.sql.test.version.index=$index", > "--driver-java-options", > s"-Dderby.system.home=${wareHousePath.getCanonicalPath} " + > // TODO SPARK-37159 Consider to remove the following > // JVM module options once the Spark 3.2 line is EOL. > JavaModuleOptions.defaultModuleOptions(), > tempPyFile.getCanonicalPath) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46149) Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-46149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46149: --- Labels: pull-request-available (was: ) > Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python > 3.12 > -- > > Key: SPARK-46149 > URL: https://issues.apache.org/jira/browse/SPARK-46149 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > > {code} > == > ERROR [12.635s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > == > ERROR [14.850s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > -- > {code} > https://github.com/apache/spark/actions/runs/7020654429/job/19100964890 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46149) Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-46149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46149: - Description: {code} == ERROR [12.635s]: test_end_to_end_run_locally (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) -- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", line 403, in test_end_to_end_run_locally output = TorchDistributor(num_processes=2, local_mode=True, use_gpu=False).run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, in run return self._run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, in _run output = self._run_local_training( ^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, in _run_local_training output = TorchDistributor._get_output_from_framework_wrapper( File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, in _get_output_from_framework_wrapper return framework_wrapper( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, in _run_training_on_pytorch_function raise RuntimeError( RuntimeError: TorchDistributor failed during training.View stdout logs for detailed error message. == ERROR [14.850s]: test_end_to_end_run_locally (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsOnConnect.test_end_to_end_run_locally) -- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", line 403, in test_end_to_end_run_locally output = TorchDistributor(num_processes=2, local_mode=True, use_gpu=False).run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, in run return self._run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, in _run output = self._run_local_training( ^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, in _run_local_training output = TorchDistributor._get_output_from_framework_wrapper( File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, in _get_output_from_framework_wrapper return framework_wrapper( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, in _run_training_on_pytorch_function raise RuntimeError( RuntimeError: TorchDistributor failed during training.View stdout logs for detailed error message.
-- {code} https://github.com/apache/spark/actions/runs/7020654429/job/19100964890 was: {code} == ERROR [12.635s]: test_end_to_end_run_locally (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) -- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", line 403, in test_end_to_end_run_locally output = TorchDistributor(num_processes=2, local_mode=True, use_gpu=False).run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, in run return self._run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, in _run output = self._run_local_training( ^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, in _run_local_training output = TorchDistributor._get_output_from_framework_wrapper( File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, in _get_output_from_framework_wrapper return framework_wrapper( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, in _run_training_on_pytorch_function
[jira] [Updated] (SPARK-46149) Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-46149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46149: - Summary: Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python 3.12 (was: Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally`) > Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python > 3.12 > -- > > Key: SPARK-46149 > URL: https://issues.apache.org/jira/browse/SPARK-46149 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > {code} > == > ERROR [12.635s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > == > ERROR [14.850s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > -- > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46149) Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally`
[ https://issues.apache.org/jira/browse/SPARK-46149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46149: - Priority: Minor (was: Major) > Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` > - > > Key: SPARK-46149 > URL: https://issues.apache.org/jira/browse/SPARK-46149 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > {code} > == > ERROR [12.635s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > == > ERROR [14.850s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > -- > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46149) Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally`
Hyukjin Kwon created SPARK-46149: Summary: Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` Key: SPARK-46149 URL: https://issues.apache.org/jira/browse/SPARK-46149 Project: Spark Issue Type: Sub-task Components: ML, PySpark, Tests Affects Versions: 4.0.0 Reporter: Hyukjin Kwon {code} == ERROR [12.635s]: test_end_to_end_run_locally (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) -- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", line 403, in test_end_to_end_run_locally output = TorchDistributor(num_processes=2, local_mode=True, use_gpu=False).run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, in run return self._run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, in _run output = self._run_local_training( ^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, in _run_local_training output = TorchDistributor._get_output_from_framework_wrapper( File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, in _get_output_from_framework_wrapper return framework_wrapper( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, in _run_training_on_pytorch_function raise RuntimeError( RuntimeError: TorchDistributor failed during training.View stdout logs for detailed error message. == ERROR [14.850s]: test_end_to_end_run_locally (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsOnConnect.test_end_to_end_run_locally) -- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", line 403, in test_end_to_end_run_locally output = TorchDistributor(num_processes=2, local_mode=True, use_gpu=False).run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, in run return self._run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, in _run output = self._run_local_training( ^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, in _run_local_training output = TorchDistributor._get_output_from_framework_wrapper( File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, in _get_output_from_framework_wrapper return framework_wrapper( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, in _run_training_on_pytorch_function raise RuntimeError( RuntimeError: TorchDistributor failed during training.View stdout logs for detailed error message. -- {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
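For context, a version-gated skip is the usual mechanism for this kind of change. The sketch below is illustrative only; the decorator placement, class body, and skip message are assumptions, not the actual patch that landed for SPARK-46149.

{code:python}
import sys
import unittest


class TorchDistributorLocalUnitTests(unittest.TestCase):
    # Hypothetical guard; the real test lives under pyspark.ml.torch.tests.
    @unittest.skipIf(
        sys.version_info >= (3, 12),
        "TorchDistributor local training fails on Python 3.12; see SPARK-46149",
    )
    def test_end_to_end_run_locally(self) -> None:
        ...
{code}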
[jira] [Updated] (SPARK-46148) Fix pyspark.pandas.mlflow.load_model test (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46148: - Description: {code} ** File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in pyspark.pandas.mlflow.load_model Failed example: prediction_df Exception raised: Traceback (most recent call last): File "/usr/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in prediction_df File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13291, in __repr__ pdf = cast("DataFrame", self._get_or_create_repr_pandas_cache(max_display_count)) File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13282, in _get_or_create_repr_pandas_cache self, "_repr_pandas_cache", {n: self.head(n + 1)._to_internal_pandas()} File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13277, in _to_internal_pandas return self._internal.to_pandas_frame File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 599, in wrapped_lazy_property setattr(self, attr_name, fn(self)) File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1110, in to_pandas_frame pdf = sdf.toPandas() File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line 213, in toPandas rows = self.collect() File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1369, in collect sock_info = self._jdf.collectToPython() File "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__ return_value = get_return_value( File "/__w/spark/spark/python/pyspark/errors/exceptions/captured.py", line 188, in deco raise converted from None pyspark.errors.exceptions.captured.PythonException: An exception was thrown from the Python worker. Please see the stack trace below. 
Traceback (most recent call last): File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1523, in main process() File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1515, in process serializer.dump_stream(out_iter, outfile) File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 485, in dump_stream return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream) File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 101, in dump_stream for batch in iterator: File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 478, in init_stream_yield_batches for series in iterator: File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1284, in func for result_batch, result_type in result_iter: File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 1619, in udf yield _predict_row_batch(batch_predict_fn, row_batch_args) File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 1383, in _predict_row_batch result = predict_fn(pdf, params) File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 1601, in batch_predict_fn return loaded_model.predict(pdf, params=params) File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 491, in predict return _predict() File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 477, in _predict return self._predict_fn(data, params=params) File "/usr/local/lib/python3.10/dist-packages/mlflow/sklearn/__init__.py", line 517, in predict return self.sklearn_model.predict(data) File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line 386, in predict return self._decision_function(X) File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line 369, in _decision_function X = self._validate_data(X, accept_sparse=["csr", "csc", "coo"], reset=False) File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 580, in _validate_data self._check_feature_names(X, reset=reset) File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 507, in _check_feature_names raise ValueError(message) ValueError: The feature names should match those that were passed during fit. Feature names unseen at fit time: - 0 - 1 Feature names seen at fit time, yet now missing: - x1 - x2 JVM stacktrace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent
[jira] [Updated] (SPARK-46148) Fix pyspark.pandas.mlflow.load_model test (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46148: --- Labels: pull-request-available (was: ) > Fix pyspark.pandas.mlflow.load_model test (Python 3.12) > --- > > Key: SPARK-46148 > URL: https://issues.apache.org/jira/browse/SPARK-46148 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > > {code} > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in > pyspark.pandas.mlflow.load_model > Failed example: > prediction_df > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.10/doctest.py", line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > prediction_df > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13291, in > __repr__ > pdf = cast("DataFrame", > self._get_or_create_repr_pandas_cache(max_display_count)) > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13282, in > _get_or_create_repr_pandas_cache > self, "_repr_pandas_cache", {n: self.head(n + > 1)._to_internal_pandas()} > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13277, in > _to_internal_pandas > return self._internal.to_pandas_frame > File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 599, in > wrapped_lazy_property > setattr(self, attr_name, fn(self)) > File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1110, > in to_pandas_frame > pdf = sdf.toPandas() > File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line > 213, in toPandas > rows = self.collect() > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1369, in > collect > sock_info = self._jdf.collectToPython() > File > "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", > line 1322, in __call__ > return_value = get_return_value( > File "/__w/spark/spark/python/pyspark/errors/exceptions/captured.py", > line 188, in deco > raise converted from None > pyspark.errors.exceptions.captured.PythonException: > An exception was thrown from the Python worker. Please see the stack > trace below. 
> Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1523, in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1515, in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 485, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 101, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 478, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1284, in func > for result_batch, result_type in result_iter: > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1619, in udf > yield _predict_row_batch(batch_predict_fn, row_batch_args) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1383, in _predict_row_batch > result = predict_fn(pdf, params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1601, in batch_predict_fn > return loaded_model.predict(pdf, params=params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 491, in predict > return _predict() > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 477, in _predict > return self._predict_fn(data, params=params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/sklearn/__init__.py", line > 517, in predict > return self.sklearn_model.predict(data) > File > "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line > 386, in predict > return self._decision_function(X) > File > "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line > 369, in _decision_function > X =
[jira] [Created] (SPARK-46148) Fix pyspark.pandas.mlflow.load_model test (Python 3.12)
Hyukjin Kwon created SPARK-46148: Summary: Fix pyspark.pandas.mlflow.load_model test (Python 3.12) Key: SPARK-46148 URL: https://issues.apache.org/jira/browse/SPARK-46148 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon {code} ** File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in pyspark.pandas.mlflow.load_model Failed example: prediction_df Exception raised: Traceback (most recent call last): File "/usr/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in prediction_df File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13291, in __repr__ pdf = cast("DataFrame", self._get_or_create_repr_pandas_cache(max_display_count)) File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13282, in _get_or_create_repr_pandas_cache self, "_repr_pandas_cache", {n: self.head(n + 1)._to_internal_pandas()} File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13277, in _to_internal_pandas return self._internal.to_pandas_frame File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 599, in wrapped_lazy_property setattr(self, attr_name, fn(self)) File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1110, in to_pandas_frame pdf = sdf.toPandas() File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line 213, in toPandas rows = self.collect() File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1369, in collect sock_info = self._jdf.collectToPython() File "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__ return_value = get_return_value( File "/__w/spark/spark/python/pyspark/errors/exceptions/captured.py", line 188, in deco raise converted from None pyspark.errors.exceptions.captured.PythonException: An exception was thrown from the Python worker. Please see the stack trace below. 
Traceback (most recent call last): File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1523, in main process() File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1515, in process serializer.dump_stream(out_iter, outfile) File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 485, in dump_stream return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream) File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 101, in dump_stream for batch in iterator: File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 478, in init_stream_yield_batches for series in iterator: File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1284, in func for result_batch, result_type in result_iter: File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 1619, in udf yield _predict_row_batch(batch_predict_fn, row_batch_args) File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 1383, in _predict_row_batch result = predict_fn(pdf, params) File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 1601, in batch_predict_fn return loaded_model.predict(pdf, params=params) File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 491, in predict return _predict() File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 477, in _predict return self._predict_fn(data, params=params) File "/usr/local/lib/python3.10/dist-packages/mlflow/sklearn/__init__.py", line 517, in predict return self.sklearn_model.predict(data) File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line 386, in predict return self._decision_function(X) File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line 369, in _decision_function X = self._validate_data(X, accept_sparse=["csr", "csc", "coo"], reset=False) File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 580, in _validate_data self._check_feature_names(X, reset=reset) File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 507, in _check_feature_names raise ValueError(message) ValueError: The feature names should match those that were passed during fit. Feature names unseen at fit time: - 0 - 1 Feature names seen
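The root cause visible in the trace above is a scikit-learn feature-name check: the model was fit on columns named x1 and x2, but at predict time the UDF hands it a frame whose columns are named 0 and 1. A standalone sketch of that mismatch, with the column names taken from the traceback and the data values invented for illustration:

{code:python}
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [2.0, 4.0, 6.0]})
model = LinearRegression().fit(train, [1.0, 2.0, 3.0])

# At predict time the columns are named "0" and "1", which no longer match
# the names recorded at fit time, so scikit-learn refuses to predict.
test = pd.DataFrame([[1.0, 2.0]], columns=["0", "1"])
model.predict(test)  # ValueError: The feature names should match ...
{code}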
[jira] [Updated] (SPARK-46147) Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46147: --- Labels: pull-request-available (was: ) > Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12) > - > > Key: SPARK-46147 > URL: https://issues.apache.org/jira/browse/SPARK-46147 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > > {code} > File "/__w/spark/spark/python/pyspark/pandas/series.py", line 1633, in > pyspark.pandas.series.Series.to_dict > Failed example: > s.to_dict(OrderedDict) > Expected: > OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)]) > Got: > OrderedDict({0: 1, 1: 2, 2: 3, 3: 4}) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46147) Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46147: - Fix Version/s: (was: 4.0.0) > Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12) > - > > Key: SPARK-46147 > URL: https://issues.apache.org/jira/browse/SPARK-46147 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > {code} > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 2515, in > pyspark.pandas.frame.DataFrame.to_dict > Failed example: > df.to_dict(into=OrderedDict) > Expected: > OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])), ('col2', > OrderedDict([('row1', 0.5), ('row2', 0.75)]))]) > Got: > OrderedDict({'col1': OrderedDict({'row1': 1, 'row2': 2}), 'col2': > OrderedDict({'row1': 0.5, 'row2': 0.75})}) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46147) Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46147: - Labels: (was: pull-request-available) > Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12) > - > > Key: SPARK-46147 > URL: https://issues.apache.org/jira/browse/SPARK-46147 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 4.0.0 > > > {code} > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 2515, in > pyspark.pandas.frame.DataFrame.to_dict > Failed example: > df.to_dict(into=OrderedDict) > Expected: > OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])), ('col2', > OrderedDict([('row1', 0.5), ('row2', 0.75)]))]) > Got: > OrderedDict({'col1': OrderedDict({'row1': 1, 'row2': 2}), 'col2': > OrderedDict({'row1': 0.5, 'row2': 0.75})}) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46147) Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46147: - Description: {code} File "/__w/spark/spark/python/pyspark/pandas/series.py", line 1633, in pyspark.pandas.series.Series.to_dict Failed example: s.to_dict(OrderedDict) Expected: OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)]) Got: OrderedDict({0: 1, 1: 2, 2: 3, 3: 4}) {code} was: {code} File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 2515, in pyspark.pandas.frame.DataFrame.to_dict Failed example: df.to_dict(into=OrderedDict) Expected: OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])), ('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))]) Got: OrderedDict({'col1': OrderedDict({'row1': 1, 'row2': 2}), 'col2': OrderedDict({'row1': 0.5, 'row2': 0.75})}) {code} > Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12) > - > > Key: SPARK-46147 > URL: https://issues.apache.org/jira/browse/SPARK-46147 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > {code} > File "/__w/spark/spark/python/pyspark/pandas/series.py", line 1633, in > pyspark.pandas.series.Series.to_dict > Failed example: > s.to_dict(OrderedDict) > Expected: > OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)]) > Got: > OrderedDict({0: 1, 1: 2, 2: 3, 3: 4}) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46147) Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12)
Hyukjin Kwon created SPARK-46147: Summary: Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12) Key: SPARK-46147 URL: https://issues.apache.org/jira/browse/SPARK-46147 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon Fix For: 4.0.0 {code} File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 2515, in pyspark.pandas.frame.DataFrame.to_dict Failed example: df.to_dict(into=OrderedDict) Expected: OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])), ('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))]) Got: OrderedDict({'col1': OrderedDict({'row1': 1, 'row2': 2}), 'col2': OrderedDict({'row1': 0.5, 'row2': 0.75})}) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
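The failure comes from Python 3.12 changing OrderedDict's repr from the list-of-pairs form to a dict-literal form. One version-agnostic way to write such a doctest, shown here as a sketch and not necessarily the fix that landed, is to compare contents instead of the repr:

{code:python}
from collections import OrderedDict

od = OrderedDict({0: 1, 1: 2, 2: 3, 3: 4})

# Python 3.11: OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)])
# Python 3.12: OrderedDict({0: 1, 1: 2, 2: 3, 3: 4})
print(repr(od))

# Comparing values rather than the repr keeps a doctest stable across both.
assert dict(od) == {0: 1, 1: 2, 2: 3, 3: 4}
{code}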
[jira] [Updated] (SPARK-46146) Unpin `markupsafe`
[ https://issues.apache.org/jira/browse/SPARK-46146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46146: --- Labels: pull-request-available (was: ) > Unpin `markupsafe` > -- > > Key: SPARK-46146 > URL: https://issues.apache.org/jira/browse/SPARK-46146 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46146) Unpin `markupsafe`
Ruifeng Zheng created SPARK-46146: - Summary: Unpin `markupsafe` Key: SPARK-46146 URL: https://issues.apache.org/jira/browse/SPARK-46146 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46029) Escape the single quote, _ and % for DS V2 pushdown
[ https://issues.apache.org/jira/browse/SPARK-46029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46029. - Fix Version/s: 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 43801 [https://github.com/apache/spark/pull/43801] > Escape the single quote, _ and % for DS V2 pushdown > --- > > Key: SPARK-46029 > URL: https://issues.apache.org/jira/browse/SPARK-46029 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0, 3.5.0, 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: correctness, pull-request-available > Fix For: 3.5.1, 4.0.0 > > > Spark supports pushing down startsWith, endsWith and contains to JDBC databases > with DS V2 pushdown. > But the V2ExpressionSQLBuilder didn't escape the single quote, _ and %, which > can cause unexpected results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
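The gist of the fix: when startsWith/endsWith/contains is compiled into a SQL LIKE, the literal must be escaped so a single quote cannot break out of the string and _ / % match themselves instead of acting as wildcards. A simplified Python sketch of that escaping; Spark's actual code is the Scala V2ExpressionSQLBuilder, and the function names here are invented for illustration:

{code:python}
def escape_like_pattern(literal: str, escape: str = "\\") -> str:
    """Escape LIKE metacharacters so they match literally."""
    out = []
    for ch in literal:
        if ch in (escape, "_", "%"):
            out.append(escape)
        out.append(ch)
    return "".join(out)


def starts_with_sql(column: str, prefix: str) -> str:
    # Single quotes are doubled for the SQL string literal itself; _ and %
    # are escaped so they no longer act as wildcards.
    pattern = escape_like_pattern(prefix).replace("'", "''")
    return f"{column} LIKE '{pattern}%' ESCAPE '\\'"


print(starts_with_sql("name", "50%_off"))  # name LIKE '50\%\_off%' ESCAPE '\'
{code}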
[jira] [Updated] (SPARK-46108) XML: keepInnerXmlAsRaw option
[ https://issues.apache.org/jira/browse/SPARK-46108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46108: --- Labels: pull-request-available (was: ) > XML: keepInnerXmlAsRaw option > - > > Key: SPARK-46108 > URL: https://issues.apache.org/jira/browse/SPARK-46108 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Ufuk Süngü >Priority: Major > Labels: pull-request-available > > The built-in XML data source returns the value and schema of inner or nested > elements. However, developers must manually perform additional operations to > convert that unstructured data into a structured, tabular format. If nested > elements are kept in a format that is valid XML (at each level), they can > easily be converted to a structured, tabular format with the existing methods > that have already been developed (the infer method of XmlInferSchema and the > parseColumn method of StaxXmlParser). Therefore, there should be an option > affecting the StaxXmlParser and XmlInferSchema classes that keeps inner XML > elements in their original, raw format. > https://github.com/apache/spark/pull/44022 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
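If the option lands as proposed, usage would presumably look something like the following. This is a hypothetical sketch: the option name is taken from this ticket, the file path is a placeholder, and the final API may differ.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("xml")
    .option("rowTag", "book")
    .option("keepInnerXmlAsRaw", "true")  # proposed: keep nested elements as raw XML
    .load("books.xml")  # placeholder path
)
df.printSchema()  # nested elements would surface as raw XML string columns
{code}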
[jira] [Resolved] (SPARK-46055) Refactor Catalog Database APIs implementation
[ https://issues.apache.org/jira/browse/SPARK-46055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46055. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43959 [https://github.com/apache/spark/pull/43959] > Refactor Catalog Database APIs implementation > - > > Key: SPARK-46055 > URL: https://issues.apache.org/jira/browse/SPARK-46055 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yihong He >Assignee: Yihong He >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46142) Remove `dev/ansible-for-test-node` directory
[ https://issues.apache.org/jira/browse/SPARK-46142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-46142: Assignee: Dongjoon Hyun > Remove `dev/ansible-for-test-node` directory > > > Key: SPARK-46142 > URL: https://issues.apache.org/jira/browse/SPARK-46142 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46142) Remove `dev/ansible-for-test-node` directory
[ https://issues.apache.org/jira/browse/SPARK-46142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-46142. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44059 [https://github.com/apache/spark/pull/44059] > Remove `dev/ansible-for-test-node` directory > > > Key: SPARK-46142 > URL: https://issues.apache.org/jira/browse/SPARK-46142 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46145) spark.catalog.listTables does not throw exception when the table or view is not found
[ https://issues.apache.org/jira/browse/SPARK-46145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46145: --- Labels: pull-request-available (was: ) > spark.catalog.listTables does not throw exception when the table or view is > not found > - > > Key: SPARK-46145 > URL: https://issues.apache.org/jira/browse/SPARK-46145 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46145) spark.catalog.listTables does not throw exception when the table or view is not found
Rui Wang created SPARK-46145: Summary: spark.catalog.listTables does not throw exception when the table or view is not found Key: SPARK-46145 URL: https://issues.apache.org/jira/browse/SPARK-46145 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
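A rough illustration of the contrast this ticket describes, sketched with placeholder table names: enumeration APIs return whatever currently exists, while point lookups raise when the object is missing.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Listing simply yields the tables and views that exist right now; it does
# not raise for anything that is absent.
for table in spark.catalog.listTables():
    print(table.name)

# A point lookup on a missing table raises AnalysisException instead.
spark.catalog.getTable("no_such_table")
{code}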
[jira] [Updated] (SPARK-46144) Fail INSERT INTO ... REPLACE statement if the condition contains subquery
[ https://issues.apache.org/jira/browse/SPARK-46144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46144: --- Labels: pull-request-available (was: ) > Fail INSERT INTO ... REPLACE statement if the condition contains subquery > - > > Key: SPARK-46144 > URL: https://issues.apache.org/jira/browse/SPARK-46144 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Priority: Major > Labels: pull-request-available > > For the following query: > {code:java} > INSERT INTO tbl REPLACE WHERE id = (select c2 from values(1) as t(c2)) SELECT > * FROM source{code} > There will be an analysis error: > {code:java} > [UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column, variable, or function > parameter with name `c2` cannot be resolved. SQLSTATE: 42703; line 1 pos 51; > 'OverwriteByExpression RelationV2[id#27L, data#28] testcat.tbl testcat.tbl, > (id#27L = scalar-subquery#26 []), false {code} > The error message is confusing. The actual reason is that the > OverwriteByExpression plan doesn't support subqueries. While supporting the > feature is non-trivial, we should improve the error message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
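Until subqueries are supported (or the error message improves), one possible workaround, sketched here under the ticket's example table names, is to evaluate the scalar subquery separately and inline the result so the REPLACE WHERE condition no longer contains a subquery:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Evaluate the scalar subquery on its own first ...
threshold = spark.sql("SELECT c2 FROM VALUES (1) AS t(c2)").first()[0]

# ... then inline the literal in the OverwriteByExpression condition.
spark.sql(
    f"INSERT INTO tbl REPLACE WHERE id = {threshold} SELECT * FROM source"
)
{code}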
[jira] [Updated] (SPARK-46144) Fail INSERT INTO ... REPLACE statement if the condition contains subquery
[ https://issues.apache.org/jira/browse/SPARK-46144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-46144: --- Description: For the following query: {code:java} INSERT INTO tbl REPLACE WHERE id = (select c2 from values(1) as t(c2)) SELECT * FROM source{code} There will be an analysis error: {code:java} [UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column, variable, or function parameter with name `c2` cannot be resolved. SQLSTATE: 42703; line 1 pos 51; 'OverwriteByExpression RelationV2[id#27L, data#28] testcat.tbl testcat.tbl, (id#27L = scalar-subquery#26 []), false {code} The error message is confusing. The actual reason is that the OverwriteByExpression plan doesn't support subqueries. While supporting the feature is non-trivial, we should improve the error message. was: For the following query: {code:java} INSERT INTO tbl REPLACE WHERE id = (select c2 from values(1) as t(c2)) SELECT * FROM source{code} There will be analysis > Fail INSERT INTO ... REPLACE statement if the condition contains subquery > - > > Key: SPARK-46144 > URL: https://issues.apache.org/jira/browse/SPARK-46144 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Priority: Major > > For the following query: > {code:java} > INSERT INTO tbl REPLACE WHERE id = (select c2 from values(1) as t(c2)) SELECT > * FROM source{code} > There will be an analysis error: > {code:java} > [UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column, variable, or function > parameter with name `c2` cannot be resolved. SQLSTATE: 42703; line 1 pos 51; > 'OverwriteByExpression RelationV2[id#27L, data#28] testcat.tbl testcat.tbl, > (id#27L = scalar-subquery#26 []), false {code} > The error message is confusing. The actual reason is that the > OverwriteByExpression plan doesn't support subqueries. While supporting the > feature is non-trivial, we should improve the error message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46144) Fail INSERT INTO ... REPLACE statement if the condition contains subquery
[ https://issues.apache.org/jira/browse/SPARK-46144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-46144: --- Description: For the following query: {code:java} INSERT INTO tbl REPLACE WHERE id = (select c2 from values(1) as t(c2)) SELECT * FROM source{code} There will be analysis > Fail INSERT INTO ... REPLACE statement if the condition contains subquery > - > > Key: SPARK-46144 > URL: https://issues.apache.org/jira/browse/SPARK-46144 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Priority: Major > > For the following query: > {code:java} > INSERT INTO tbl REPLACE WHERE id = (select c2 from values(1) as t(c2)) SELECT > * FROM source{code} > There will be analysis -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46144) Fail INSERT INTO ... REPLACE statement if the condition contains subquery
[ https://issues.apache.org/jira/browse/SPARK-46144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-46144: --- Affects Version/s: 4.0.0 (was: 3.5.0) > Fail INSERT INTO ... REPLACE statement if the condition contains subquery > - > > Key: SPARK-46144 > URL: https://issues.apache.org/jira/browse/SPARK-46144 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46144) Fail INSERT INTO ... REPLACE statement if the condition contains subquery
Gengliang Wang created SPARK-46144: -- Summary: Fail INSERT INTO ... REPLACE statement if the condition contains subquery Key: SPARK-46144 URL: https://issues.apache.org/jira/browse/SPARK-46144 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.5.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46125) Memory leak when using createDataFrame with persist
[ https://issues.apache.org/jira/browse/SPARK-46125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790751#comment-17790751 ] Josh Rosen commented on SPARK-46125: I think that this issue relates specifically to `createDataFrame` and other mechanisms for creating Datasets or RDDs from driver-side data. I was able to reproduce the memory effects that you reported using a synthetic dataset:
{code:python}
import numpy as np
import pandas as pd

n_rows = 100
n_cols = 1000  # column count not given in the original comment; any wide frame will do
data = np.random.randn(n_rows, n_cols)
pdf = pd.DataFrame(data, columns=[f'Column_{i}' for i in range(n_cols)])
{code}
I took heap dumps in the "with unpersist" and "without unpersist" cases and saw that most of the difference was due to `byte[]` arrays. That, in turn, is due to ParallelCollectionPartitions being kept alive in a ParallelCollectionRDD that is retained by the CacheManager. When you cache a query, Spark keeps the physical query plan alive so that it can recompute cached data if it is lost (e.g. due to a node failure). For Datasets or RDDs that are created from data on the driver, that driver-side data is kept alive. It's this CacheManager reference to the physical plan which is keeping the source RDD from being cleaned: this is why `del df` followed by GC does not clean up the RDD's memory. --- If you use `localCheckpoint` then Spark will persist the data to disk and truncate the RDD lineage, thereby avoiding driver-side memory consumption from the parallel collection RDD, but this will have the side effect of removing fault-tolerance: if any node is lost then the data will be lost and any attempts to access it will result in query failures. !image-2023-11-28-12-55-58-461.png! > Memory leak when using createDataFrame with persist > --- > > Key: SPARK-46125 > URL: https://issues.apache.org/jira/browse/SPARK-46125 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 3.5.0 >Reporter: Arman Yazdani >Priority: Major > Labels: PySpark, memory-leak, persist > Attachments: CreateDataFrameWithUnpersist.png, > CreateDataFrameWithoutUnpersist.png, ReadParquetWithoutUnpersist.png, > image-2023-11-28-12-55-58-461.png > > > When I create a dataset from pandas data frame and persisting it (DISK_ONLY), > some "byte[]" objects (total size of imported data frame) will still remain > in the driver's heap memory. > This is the sample code for reproducing it: > {code:python} > import pandas as pd > import gc > from pyspark.sql import SparkSession > from pyspark.storagelevel import StorageLevel > spark = SparkSession.builder \ > .config("spark.driver.memory", "4g") \ > .config("spark.executor.memory", "4g") \ > .config("spark.sql.execution.arrow.pyspark.enabled", "true") \ > .getOrCreate() > pdf = pd.read_pickle('tmp/input.pickle') > df = spark.createDataFrame(pdf) > df = df.persist(storageLevel=StorageLevel.DISK_ONLY) > df.count() > del pdf > del df > gc.collect() > spark.sparkContext._jvm.System.gc(){code} > After running this code, I will perform a manual GC in VisualVM, but the > driver memory usage will remain at 550 MBs (at start it was about 50 MBs). > !CreateDataFrameWithoutUnpersist.png|width=467,height=349! > Then I tested with adding {{"df = df.unpersist()"}} after the > {{"df.count()"}} line and everything was OK (Memory usage after performing > manual GC was about 50 MBs). > !CreateDataFrameWithUnpersist.png|width=468,height=300! 
> Also, I tried with reading from parquet file (without adding unpersist line) > with this code: > {code:python} > import gc > from pyspark.sql import SparkSession > from pyspark.storagelevel import StorageLevel > spark = SparkSession.builder \ > .config("spark.driver.memory", "4g") \ > .config("spark.executor.memory", "4g") \ > .config("spark.sql.execution.arrow.pyspark.enabled", "true") \ > .getOrCreate() > df = spark.read.parquet('tmp/input.parquet') > df = df.persist(storageLevel=StorageLevel.DISK_ONLY) > df.count() > del df > gc.collect() > spark.sparkContext._jvm.System.gc(){code} > Again everything was fine and memory usage was about 50 MBs after performing > manual GC. > !ReadParquetWithoutUnpersist.png|width=473,height=302! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
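A sketch of the localCheckpoint workaround described in the comment above; the frame size and column names are arbitrary, and the fault-tolerance caveat is repeated in the comments:

{code:python}
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame(np.random.randn(10_000, 10), columns=[f"c{i}" for i in range(10)])
df = spark.createDataFrame(pdf)

# localCheckpoint materializes the data on the executors and truncates the
# lineage, so the CacheManager no longer pins the driver-side
# ParallelCollectionRDD. Caveat: the truncated lineage cannot be recomputed,
# so losing an executor loses the data.
df = df.localCheckpoint(eager=True)
df.persist().count()
{code}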
[jira] [Updated] (SPARK-46125) Memory leak when using createDataFrame with persist
[ https://issues.apache.org/jira/browse/SPARK-46125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-46125: --- Attachment: image-2023-11-28-12-55-58-461.png > Memory leak when using createDataFrame with persist > --- > > Key: SPARK-46125 > URL: https://issues.apache.org/jira/browse/SPARK-46125 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 3.5.0 >Reporter: Arman Yazdani >Priority: Major > Labels: PySpark, memory-leak, persist > Attachments: CreateDataFrameWithUnpersist.png, > CreateDataFrameWithoutUnpersist.png, ReadParquetWithoutUnpersist.png, > image-2023-11-28-12-55-58-461.png > > > When I create a dataset from pandas data frame and persisting it (DISK_ONLY), > some "byte[]" objects (total size of imported data frame) will still remain > in the driver's heap memory. > This is the sample code for reproducing it: > {code:python} > import pandas as pd > import gc > from pyspark.sql import SparkSession > from pyspark.storagelevel import StorageLevel > spark = SparkSession.builder \ > .config("spark.driver.memory", "4g") \ > .config("spark.executor.memory", "4g") \ > .config("spark.sql.execution.arrow.pyspark.enabled", "true") \ > .getOrCreate() > pdf = pd.read_pickle('tmp/input.pickle') > df = spark.createDataFrame(pdf) > df = df.persist(storageLevel=StorageLevel.DISK_ONLY) > df.count() > del pdf > del df > gc.collect() > spark.sparkContext._jvm.System.gc(){code} > After running this code, I will perform a manual GC in VisualVM, but the > driver memory usage will remain at 550 MBs (at start it was about 50 MBs). > !CreateDataFrameWithoutUnpersist.png|width=467,height=349! > Then I tested with adding {{"df = df.unpersist()"}} after the > {{"df.count()"}} line and everything was OK (Memory usage after performing > manual GC was about 50 MBs). > !CreateDataFrameWithUnpersist.png|width=468,height=300! > Also, I tried with reading from parquet file (without adding unpersist line) > with this code: > {code:python} > import gc > from pyspark.sql import SparkSession > from pyspark.storagelevel import StorageLevel > spark = SparkSession.builder \ > .config("spark.driver.memory", "4g") \ > .config("spark.executor.memory", "4g") \ > .config("spark.sql.execution.arrow.pyspark.enabled", "true") \ > .getOrCreate() > df = spark.read.parquet('tmp/input.parquet') > df = df.persist(storageLevel=StorageLevel.DISK_ONLY) > df.count() > del df > gc.collect() > spark.sparkContext._jvm.System.gc(){code} > Again everything was fine and memory usage was about 50 MBs after performing > manual GC. > !ReadParquetWithoutUnpersist.png|width=473,height=302! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46105) df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above
[ https://issues.apache.org/jira/browse/SPARK-46105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790742#comment-17790742 ] Josh Rosen commented on SPARK-46105: {quote}The reason for raising this as a bug is I have a scenario where my final dataframe returns 0 records in EKS (local spark) with a single node (driver and executor on the same node) but it returns 1 in EMR; both use the same Spark version, 3.3.3. {quote} To clarify: by "returns 0 records", are you referring to the record count of the data frame (i.e. whether isEmpty returns true or false) or to the partition count? In other words, are you saying that EMR returns an incorrect record count or do you mean that it returns an unexpected partition count? > df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above > --- > > Key: SPARK-46105 > URL: https://issues.apache.org/jira/browse/SPARK-46105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.3 > Environment: EKS > EMR >Reporter: dharani_sugumar >Priority: Major > Attachments: Screenshot 2023-11-26 at 11.54.58 AM.png > > > Version: 3.3.3 > > scala> val df = spark.emptyDataFrame > df: org.apache.spark.sql.DataFrame = [] > scala> df.rdd.getNumPartitions > res0: Int = 0 > scala> df.repartition(1).rdd.getNumPartitions > res1: Int = 1 > scala> df.repartition(1).rdd.isEmpty() > [Stage 1:> > (0 + 1) / > res2: Boolean = true > Version: 3.2.4 > scala> val df = spark.emptyDataFrame > df: org.apache.spark.sql.DataFrame = [] > scala> df.rdd.getNumPartitions > res0: Int = 0 > scala> df.repartition(1).rdd.getNumPartitions > res1: Int = 0 > scala> df.repartition(1).rdd.isEmpty() > res2: Boolean = true > > Version: 3.5.0 > scala> val df = spark.emptyDataFrame > df: org.apache.spark.sql.DataFrame = [] > scala> df.rdd.getNumPartitions > res0: Int = 0 > scala> df.repartition(1).rdd.getNumPartitions > res1: Int = 1 > scala> df.repartition(1).rdd.isEmpty() > [Stage 1:> > (0 + 1) / > res2: Boolean = true > > When we do a repartition of 1 on an empty dataframe, the resultant partition count is > 1 in versions 3.3.x and 3.5.x, whereas when I do the same in version 3.2.x, the > resultant partition count is 0. May I know why this behaviour changed from 3.2.x > to higher versions. > > The reason for raising this as a bug is I have a scenario where my final > dataframe returns 0 records in EKS (local spark) with a single node (driver and > executor on the same node) but it returns 1 in EMR; both use the same Spark > version, 3.3.3. I'm not sure why this behaves differently in the two > environments. As an interim solution, I had to repartition an empty dataframe > if my final dataframe is empty, which returns 1 for 3.3.3. I would like to know > if this is really a bug, or whether this behaviour will persist in future versions and > cannot be changed? > > Because if we go for a Spark upgrade and this behaviour changes, we will > face the issue again. > Please confirm on this. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
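To separate the two quantities being discussed, a short check; this is a sketch in PySpark for consistency with the other examples, whereas the ticket's transcript uses the Scala shell:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([], StructType([])).repartition(1)

# Partition count and record count are independent: on Spark 3.3+ the
# repartitioned empty DataFrame reports one (empty) partition but zero rows.
print(df.rdd.getNumPartitions())  # 1
print(df.count())                 # 0
print(df.isEmpty())               # True
{code}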
[jira] [Updated] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matheus Pavanetti updated SPARK-46143: -- Environment: pyspark 3.4.1.5.3 build 20230713. Running on Microsoft Fabric workspace. was: Apache spark 3.4.1.5.3 build 20230713. Running on Microsoft Fabric workspace. > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: pyspark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace. > > >Reporter: Matheus Pavanetti >Priority: Major > Attachments: MicrosoftTeams-image.png, > image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png > > > Hello, > I would like to report an issue with pyspark.pandas implementation on > read_excel function. > Microsoft Fabric spark environment 1.2 (runtime) uses pyspark 3.4.1 which > potentially uses an older version of pandas on it's implementations of > pyspark.pandas. > The function read_excel from pandas doesn't expect a parameter called > "squeeze" however it's implemented as part of pyspark.pandas and the > parameter "squeeze" is being passed to the pandas function. > > !image-2023-11-28-13-20-40-275.png! > > I've been digging into it for further investigation into pyspark 3.4.1 > documentation > [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel|https://mcas-proxyweb.mcas.ms/certificate-checker?login=false=https%3A%2F%2Fspark.apache.org.mcas.ms%2Fdocs%2F3.4.1%2Fapi%2Fpython%2F_modules%2Fpyspark%2Fpandas%2Fnamespace.html%3FMcasTsid%3D20893%23read_excel=92c0f0a0811f59386edd92fd5f3fcb0ac451ce363b3f2e01ed076f45e2b20500] > > This is the point I found that "squeeze" parameter is being passed to pandas > read_excel function which is not expected. > It seems like it was deprecated as part of pyspark 3.4.0 but still being used > in the implementation. > > !image-2023-11-28-13-20-51-291.png! > > I believe this is an issue with pyspark implementation 3.4.1 not necessaily > with fabric. However fabric uses this version as its 1.2 build. > > I am able to work around that for now by download the excel from the one lake > to the spark driver, loading that to the memory with pandas and then > converting to a spark dataframe etc or I made it work downgrading the build > I downloaded the pyspark build 20230713 to my local, made the changes and > re-compiled it and it worked locally. So it means that is related to the > implementation and they would have to fix or I do a downgrade to older > version like 3.3.3 or try the latest 3.5.0 which is not the case for fabric > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matheus Pavanetti updated SPARK-46143: -- Description: Hello, I would like to report an issue with the pyspark.pandas implementation of the read_excel function. Microsoft Fabric Spark environment 1.2 (runtime) uses pyspark 3.4.1, which potentially uses an older version of pandas in its implementation of pyspark.pandas. The read_excel function from pandas doesn't expect a parameter called "squeeze"; however, it is implemented as part of pyspark.pandas, and the parameter "squeeze" is being passed to the pandas function. !image-2023-11-28-13-20-40-275.png! I've been digging into it for further investigation in the pyspark 3.4.1 documentation: [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel] This is the point where I found that the "squeeze" parameter is being passed to the pandas read_excel function, which is not expected. It seems it was deprecated as part of pyspark 3.4.0 but is still being used in the implementation. !image-2023-11-28-13-20-51-291.png! I believe this is an issue with the pyspark 3.4.1 implementation, not necessarily with Fabric. However, Fabric uses this version as its 1.2 build. I am able to work around it for now by downloading the Excel file from OneLake to the Spark driver, loading it into memory with pandas and then converting it to a Spark dataframe, etc. I also made it work by downgrading the build: I downloaded the pyspark build 20230713 to my local machine, made the changes, re-compiled it, and it worked locally. So it means this is related to the implementation and would have to be fixed there, or I would have to downgrade to an older version like 3.3.3 or try the latest 3.5.0, which is not the case for Fabric was: Hello, I would like to report an issue with the pyspark.pandas implementation of the read_excel function. Microsoft Fabric Spark environment 1.2 (runtime) uses pyspark 3.4.1, which potentially uses an older version of pandas in its implementation of pyspark.pandas. The read_excel function from pandas doesn't expect a parameter called "squeeze"; however, it is implemented as part of pyspark.pandas, and the parameter "squeeze" is being passed to the pandas function. !image-2023-11-28-13-20-40-275.png! I've been digging into it for further investigation in the pyspark 3.4.1 documentation: [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel] This is the point where I found that the "squeeze" parameter is being passed to the pandas read_excel function, which is not expected. It seems it was deprecated as part of pyspark 3.4.0 but is still being used in the implementation. !image-2023-11-28-13-20-51-291.png! I believe this is an issue with the pyspark 3.4.1 implementation, not necessarily with Fabric. However, Fabric uses this version as its 1.2 build. 
I am able to work around it for now by downloading the Excel file from OneLake to the Spark driver, loading it into memory with pandas and then converting it to a Spark dataframe, etc. I also made it work by downgrading the build: I downloaded the pyspark build 20230713 to my local machine, made the changes, re-compiled it, and it worked locally. So it means this is related to the implementation and would have to be fixed there, or I would have to downgrade to an older version like 3.3.0 or try the latest 3.5.0, which is not the case for Fabric > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: Apache spark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace. > > >Reporter: Matheus Pavanetti >Priority: Major > Attachments: MicrosoftTeams-image.png, > image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png > > > Hello, > I would like to report an issue with the pyspark.pandas implementation of the > read_excel function. > Microsoft Fabric Spark environment 1.2 (runtime) uses pyspark 3.4.1, which > potentially uses an older version of pandas in its implementation of > pyspark.pandas. >
[jira] [Updated] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matheus Pavanetti updated SPARK-46143: -- Attachment: image-2023-11-28-13-20-51-291.png > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: Apache spark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace. > > >Reporter: Matheus Pavanetti >Priority: Major > Attachments: MicrosoftTeams-image.png, > image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png > > > Hello, > I would like to report an issue with pyspark.pandas implementation on > read_excel function. > Microsoft Fabric spark environment 1.2 (runtime) uses pyspark 3.4.1 which > potentially uses an older version of pandas on it's implementations of > pyspark.pandas. > The function read_excel from pandas doesn't expect a parameter called > "squeeze" however it's implemented as part of pyspark.pandas and the > parameter "squeeze" is being passed to the pandas function. > > !Z! > > I've been digging into it for further investigation into pyspark 3.4.1 > documentation > [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel|https://mcas-proxyweb.mcas.ms/certificate-checker?login=false=https%3A%2F%2Fspark.apache.org.mcas.ms%2Fdocs%2F3.4.1%2Fapi%2Fpython%2F_modules%2Fpyspark%2Fpandas%2Fnamespace.html%3FMcasTsid%3D20893%23read_excel=92c0f0a0811f59386edd92fd5f3fcb0ac451ce363b3f2e01ed076f45e2b20500] > > This is the point I found that "squeeze" parameter is being passed to pandas > read_excel function which is not expected. > It seems like it was deprecated as part of pyspark 3.4.0 but still being used > in the implementation. > > !9k=! > > I believe this is an issue with pyspark implementation 3.4.1 not necessaily > with fabric. However fabric uses this version as its 1.2 build. > > I am able to work around that for now by download the excel from the one lake > to the spark driver, loading that to the memory with pandas and then > converting to a spark dataframe etc or I made it work downgrading the build > I downloaded the pyspark build 20230713 to my local, made the changes and > re-compiled it and it worked locally. So it means that is related to the > implementation and they would have to fix or I do a downgrade to older > version like 3.3.0 or try the latest 3.5.0 which is not the case for fabric > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matheus Pavanetti updated SPARK-46143: -- Attachment: image-2023-11-28-13-20-40-275.png > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: Apache Spark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace. > > >Reporter: Matheus Pavanetti >Priority: Major > Attachments: MicrosoftTeams-image.png, > image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png > > > Hello, > I would like to report an issue with the pyspark.pandas implementation of the read_excel function. > The Microsoft Fabric Spark environment 1.2 (runtime) uses PySpark 3.4.1, which potentially relies on an older version of pandas in its implementation of pyspark.pandas. > pandas' read_excel function does not accept a parameter called "squeeze"; however, pyspark.pandas still implements it and passes "squeeze" through to the pandas function. > > !image-2023-11-28-13-20-40-275.png! > > I dug into the PySpark 3.4.1 documentation for further investigation: > [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel] > > This is where I found that the "squeeze" parameter is passed to pandas' read_excel function, which does not expect it. > It appears the parameter was deprecated as of PySpark 3.4.0 but is still used in the implementation. > > !image-2023-11-28-13-20-51-291.png! > > I believe this is an issue with the PySpark 3.4.1 implementation, not necessarily with Fabric; however, Fabric uses this version in its 1.2 build. > > I am able to work around this for now by downloading the Excel file from OneLake to the Spark driver, loading it into memory with pandas, and then converting it to a Spark DataFrame. Alternatively, I made it work by downgrading the build: I downloaded the PySpark build 20230713 locally, made the changes, re-compiled it, and it worked. So the issue is in the PySpark implementation; either it gets fixed there, or I have to downgrade to an older version such as 3.3.0 or try the latest 3.5.0, neither of which is an option on Fabric > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matheus Pavanetti updated SPARK-46143: -- Description: Hello, I would like to report an issue with the pyspark.pandas implementation of the read_excel function. The Microsoft Fabric Spark environment 1.2 (runtime) uses PySpark 3.4.1, which potentially relies on an older version of pandas in its implementation of pyspark.pandas. pandas' read_excel function does not accept a parameter called "squeeze"; however, pyspark.pandas still implements it and passes "squeeze" through to the pandas function. !image-2023-11-28-13-20-40-275.png! I dug into the PySpark 3.4.1 documentation for further investigation: [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel] This is where I found that the "squeeze" parameter is passed to pandas' read_excel function, which does not expect it. It appears the parameter was deprecated as of PySpark 3.4.0 but is still used in the implementation. !image-2023-11-28-13-20-51-291.png! I believe this is an issue with the PySpark 3.4.1 implementation, not necessarily with Fabric; however, Fabric uses this version in its 1.2 build. I am able to work around this for now by downloading the Excel file from OneLake to the Spark driver, loading it into memory with pandas, and then converting it to a Spark DataFrame. Alternatively, I made it work by downgrading the build: I downloaded the PySpark build 20230713 locally, made the changes, re-compiled it, and it worked. So the issue is in the PySpark implementation; either it gets fixed there, or I have to downgrade to an older version such as 3.3.0 or try the latest 3.5.0, neither of which is an option on Fabric. was: Hello, I would like to report an issue with the pyspark.pandas implementation of the read_excel function. The Microsoft Fabric Spark environment 1.2 (runtime) uses PySpark 3.4.1, which potentially relies on an older version of pandas in its implementation of pyspark.pandas. pandas' read_excel function does not accept a parameter called "squeeze"; however, pyspark.pandas still implements it and passes "squeeze" through to the pandas function. !Z! I dug into the PySpark 3.4.1 documentation for further investigation: [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel] This is where I found that the "squeeze" parameter is passed to pandas' read_excel function, which does not expect it. It appears the parameter was deprecated as of PySpark 3.4.0 but is still used in the implementation. !9k=! I believe this is an issue with the PySpark 3.4.1 implementation, not necessarily with Fabric; however, Fabric uses this version in its 1.2 build. I am able to work around this for now by downloading the Excel file from OneLake to the Spark driver, loading it into memory with pandas, and then converting it to a Spark DataFrame. Alternatively, I made it work by downgrading the build: I downloaded the PySpark build 20230713 locally, made the changes, re-compiled it, and it worked. So the issue is in the PySpark implementation; either it gets fixed there, or I have to downgrade to an older version such as 3.3.0 or try the latest 3.5.0, neither of which is an option on Fabric. > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: Apache Spark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace. > > >Reporter: Matheus Pavanetti >Priority: Major > Attachments: MicrosoftTeams-image.png, > image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png > > > Hello, > I would like to report an issue with the pyspark.pandas implementation of the read_excel function. > The Microsoft Fabric Spark environment 1.2 (runtime) uses PySpark 3.4.1, which potentially relies on an older version of pandas in its implementation of pyspark.pandas.
[jira] [Updated] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matheus Pavanetti updated SPARK-46143: -- Attachment: MicrosoftTeams-image.png > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: Apache Spark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace. > > >Reporter: Matheus Pavanetti >Priority: Major > Attachments: MicrosoftTeams-image.png, > image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png > > > Hello, > I would like to report an issue with the pyspark.pandas implementation of the read_excel function. > The Microsoft Fabric Spark environment 1.2 (runtime) uses PySpark 3.4.1, which potentially relies on an older version of pandas in its implementation of pyspark.pandas. > pandas' read_excel function does not accept a parameter called "squeeze"; however, pyspark.pandas still implements it and passes "squeeze" through to the pandas function. > > !image-2023-11-28-13-20-40-275.png! > > I dug into the PySpark 3.4.1 documentation for further investigation: > [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel] > > This is where I found that the "squeeze" parameter is passed to pandas' read_excel function, which does not expect it. > It appears the parameter was deprecated as of PySpark 3.4.0 but is still used in the implementation. > > !image-2023-11-28-13-20-51-291.png! > > I believe this is an issue with the PySpark 3.4.1 implementation, not necessarily with Fabric; however, Fabric uses this version in its 1.2 build. > > I am able to work around this for now by downloading the Excel file from OneLake to the Spark driver, loading it into memory with pandas, and then converting it to a Spark DataFrame. Alternatively, I made it work by downgrading the build: I downloaded the PySpark build 20230713 locally, made the changes, re-compiled it, and it worked. So the issue is in the PySpark implementation; either it gets fixed there, or I have to downgrade to an older version such as 3.3.0 or try the latest 3.5.0, neither of which is an option on Fabric > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
Matheus Pavanetti created SPARK-46143: - Summary: pyspark.pandas read_excel implementation at version 3.4.1 Key: SPARK-46143 URL: https://issues.apache.org/jira/browse/SPARK-46143 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.4.1 Environment: Apache Spark 3.4.1.5.3 build 20230713. Running on Microsoft Fabric workspace. Reporter: Matheus Pavanetti Hello, I would like to report an issue with the pyspark.pandas implementation of the read_excel function. The Microsoft Fabric Spark environment 1.2 (runtime) uses PySpark 3.4.1, which potentially relies on an older version of pandas in its implementation of pyspark.pandas. pandas' read_excel function does not accept a parameter called "squeeze"; however, pyspark.pandas still implements it and passes "squeeze" through to the pandas function. !Z! I dug into the PySpark 3.4.1 documentation for further investigation: [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel] This is where I found that the "squeeze" parameter is passed to pandas' read_excel function, which does not expect it. It appears the parameter was deprecated as of PySpark 3.4.0 but is still used in the implementation. !9k=! I believe this is an issue with the PySpark 3.4.1 implementation, not necessarily with Fabric; however, Fabric uses this version in its 1.2 build. I am able to work around this for now by downloading the Excel file from OneLake to the Spark driver, loading it into memory with pandas, and then converting it to a Spark DataFrame. Alternatively, I made it work by downgrading the build: I downloaded the PySpark build 20230713 locally, made the changes, re-compiled it, and it worked. So the issue is in the PySpark implementation; either it gets fixed there, or I have to downgrade to an older version such as 3.3.0 or try the latest 3.5.0, neither of which is an option on Fabric. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
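As a minimal illustration of the failure mode described in this report (the file path is hypothetical, and the exact error text depends on the pandas version installed on the driver):

{code:python}
import pyspark.pandas as ps

# On PySpark 3.4.1 with a recent pandas (2.x) on the driver, this call fails
# because the pyspark.pandas wrapper forwards the `squeeze` keyword to
# pandas.read_excel, which no longer accepts it, e.g.:
#   TypeError: read_excel() got an unexpected keyword argument 'squeeze'
psdf = ps.read_excel("/tmp/report.xlsx")
{code}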
[jira] [Resolved] (SPARK-46134) Replace `df.take(1).toSeq(0)` with `df.first()` in `ProtobufFunctionsSuite`
[ https://issues.apache.org/jira/browse/SPARK-46134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46134. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44048 [https://github.com/apache/spark/pull/44048] > Replace `df.take(1).toSeq(0)` with `df.first()` in `ProtobufFunctionsSuite` > --- > > Key: SPARK-46134 > URL: https://issues.apache.org/jira/browse/SPARK-46134 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > In `ProtobufFunctionsSuite`, there are some cases where `df.take(1).toSeq(0)` is used to get the first row of a DataFrame. The same can be achieved with the `.first()` API, which is clearer and more concise. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46134) Replace `df.take(1).toSeq(0)` with `df.first()` in `ProtobufFunctionsSuite`
[ https://issues.apache.org/jira/browse/SPARK-46134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46134: - Assignee: Yang Jie > Replace `df.take(1).toSeq(0)` with `df.first()` in `ProtobufFunctionsSuite` > --- > > Key: SPARK-46134 > URL: https://issues.apache.org/jira/browse/SPARK-46134 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > > In `ProtobufFunctionsSuite`, there are some cases where `df.take(1).toSeq(0)` is used to get the first row of a DataFrame. The same can be achieved with the `.first()` API, which is clearer and more concise. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
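The suite change itself is in Scala; as a small illustration of the same idiom, here is a PySpark sketch showing that the two calls return the same row:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)

# take(1) collects a one-element list that then has to be indexed;
# first() returns the same Row and states the intent directly.
row_via_take = df.take(1)[0]
row_via_first = df.first()
assert row_via_take == row_via_first
{code}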
[jira] [Updated] (SPARK-46142) Remove `dev/ansible-for-test-node` directory
[ https://issues.apache.org/jira/browse/SPARK-46142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46142: --- Labels: pull-request-available (was: ) > Remove `dev/ansible-for-test-node` directory > > > Key: SPARK-46142 > URL: https://issues.apache.org/jira/browse/SPARK-46142 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46142) Remove `dev/ansible-for-test-node` directory
Dongjoon Hyun created SPARK-46142: - Summary: Remove `dev/ansible-for-test-node` directory Key: SPARK-46142 URL: https://issues.apache.org/jira/browse/SPARK-46142 Project: Spark Issue Type: Task Components: Project Infra Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46098) Reduce stack depth by replace (string|array).size with (string|array).length
[ https://issues.apache.org/jira/browse/SPARK-46098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-46098: - Priority: Minor (was: Major) > Reduce stack depth by replace (string|array).size with (string|array).length > > > Key: SPARK-46098 > URL: https://issues.apache.org/jira/browse/SPARK-46098 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Minor > Fix For: 4.0.0 > > > There are a lot of calls to (string|array).size. > In fact, .size delegates to the underlying .length, which increases the stack depth. > We should call (string|array).length directly. > We also get the compiler warning "Replace .size with .length on arrays and strings". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46098) Reduce stack depth by replace (string|array).size with (string|array).length
[ https://issues.apache.org/jira/browse/SPARK-46098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-46098. -- Fix Version/s: 4.0.0 Resolution: Fixed > Reduce stack depth by replace (string|array).size with (string|array).length > > > Key: SPARK-46098 > URL: https://issues.apache.org/jira/browse/SPARK-46098 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Fix For: 4.0.0 > > > There are a lot of calls to (string|array).size. > In fact, .size delegates to the underlying .length, which increases the stack depth. > We should call (string|array).length directly. > We also get the compiler warning "Replace .size with .length on arrays and strings". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46101) Replace (string|array).size with (string|array).length in all the modules
[ https://issues.apache.org/jira/browse/SPARK-46101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-46101. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44016 [https://github.com/apache/spark/pull/44016] > Replace (string|array).size with (string|array).length in all the modules > - > > Key: SPARK-46101 > URL: https://issues.apache.org/jira/browse/SPARK-46101 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46141) Change default of spark.sql.legacy.ctePrecedencePolicy from EXCEPTION to CORRECTED
[ https://issues.apache.org/jira/browse/SPARK-46141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46141: --- Labels: pull-request-available (was: ) > Change default of spark.sql.legacy.ctePrecedencePolicy from EXCEPTION to > CORRECTED > -- > > Key: SPARK-46141 > URL: https://issues.apache.org/jira/browse/SPARK-46141 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Priority: Major > Labels: pull-request-available > > spark.sql.legacy.ctePrecedencePolicy has been around for years and defaults to EXCEPTION. > It is high time we changed the default to CORRECTED. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46129) Add GitHub link icon to PySpark documentation header
[ https://issues.apache.org/jira/browse/SPARK-46129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46129: - Assignee: Haejoon Lee > Add GitHub link icon to PySpark documentation header > > > Key: SPARK-46129 > URL: https://issues.apache.org/jira/browse/SPARK-46129 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Add a GitHub link icon to the PySpark documentation header for better accessibility, as the pandas documentation does. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46129) Add GitHub link icon to PySpark documentation header
[ https://issues.apache.org/jira/browse/SPARK-46129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46129. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44043 [https://github.com/apache/spark/pull/44043] > Add GitHub link icon to PySpark documentation header > > > Key: SPARK-46129 > URL: https://issues.apache.org/jira/browse/SPARK-46129 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Add a GitHub link icon to the PySpark documentation header for better accessibility, as the pandas documentation does. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46119) Override toString method for UnresolvedAlias
[ https://issues.apache.org/jira/browse/SPARK-46119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46119: - Assignee: Yuming Wang > Override toString method for UnresolvedAlias > > > Key: SPARK-46119 > URL: https://issues.apache.org/jira/browse/SPARK-46119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46119) Override toString method for UnresolvedAlias
[ https://issues.apache.org/jira/browse/SPARK-46119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46119. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44038 [https://github.com/apache/spark/pull/44038] > Override toString method for UnresolvedAlias > > > Key: SPARK-46119 > URL: https://issues.apache.org/jira/browse/SPARK-46119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46139) Fix QueryExecutionErrorsSuite with Java 21
[ https://issues.apache.org/jira/browse/SPARK-46139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46139: - Assignee: Yang Jie > Fix QueryExecutionErrorsSuite with Java 21 > -- > > Key: SPARK-46139 > URL: https://issues.apache.org/jira/browse/SPARK-46139 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > > [https://github.com/apache/spark/actions/runs/7014008773/job/19081075487] > {code:java} > [info] - FAILED_EXECUTE_UDF: execute user defined function with registered > UDF *** FAILED *** (42 milliseconds) > 15247[info] java.lang.IllegalArgumentException: For parameter 'reason' > value 'java.lang.StringIndexOutOfBoundsException: Range [5, 6) out of bounds > for length 5' does not match: java.lang.StringIndexOutOfBoundsException: > begin 5, end 6, length 5 > 15248[info] at > org.apache.spark.SparkFunSuite.$anonfun$checkError$2(SparkFunSuite.scala:357) > 15249[info] at > org.apache.spark.SparkFunSuite.$anonfun$checkError$2$adapted(SparkFunSuite.scala:352) > 15250[info] at > scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > 15251[info] at > scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > 15252[info] at > scala.collection.AbstractIterable.foreach(Iterable.scala:933) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46139) Fix QueryExecutionErrorsSuite with Java 21
[ https://issues.apache.org/jira/browse/SPARK-46139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46139. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44056 [https://github.com/apache/spark/pull/44056] > Fix QueryExecutionErrorsSuite with Java 21 > -- > > Key: SPARK-46139 > URL: https://issues.apache.org/jira/browse/SPARK-46139 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [https://github.com/apache/spark/actions/runs/7014008773/job/19081075487] > {code:java} > [info] - FAILED_EXECUTE_UDF: execute user defined function with registered > UDF *** FAILED *** (42 milliseconds) > 15247[info] java.lang.IllegalArgumentException: For parameter 'reason' > value 'java.lang.StringIndexOutOfBoundsException: Range [5, 6) out of bounds > for length 5' does not match: java.lang.StringIndexOutOfBoundsException: > begin 5, end 6, length 5 > 15248[info] at > org.apache.spark.SparkFunSuite.$anonfun$checkError$2(SparkFunSuite.scala:357) > 15249[info] at > org.apache.spark.SparkFunSuite.$anonfun$checkError$2$adapted(SparkFunSuite.scala:352) > 15250[info] at > scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > 15251[info] at > scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > 15252[info] at > scala.collection.AbstractIterable.foreach(Iterable.scala:933) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46141) Change default of spark.sql.legacy.ctePrecedencePolicy from EXCEPTION to CORRECTED
Serge Rielau created SPARK-46141: Summary: Change default of spark.sql.legacy.ctePrecedencePolicy from EXCEPTION to CORRECTED Key: SPARK-46141 URL: https://issues.apache.org/jira/browse/SPARK-46141 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Serge Rielau spark.sql.legacy.ctePrecedencePolicy has been around for years and defaults to EXCEPTION. It is high time we changed the default to CORRECTED. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
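For context, a small sketch of what the flag controls, assuming the documented semantics of spark.sql.legacy.ctePrecedencePolicy (EXCEPTION raises an analysis error on conflicting CTE names in nested scopes, while CORRECTED lets the inner definition take precedence):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With the old default (EXCEPTION) the query below fails, because `t` is
# defined in both the outer and the inner scope.
spark.conf.set("spark.sql.legacy.ctePrecedencePolicy", "CORRECTED")

# Under CORRECTED the inner definition wins: this returns v = 2.
spark.sql("""
    WITH t AS (SELECT 1 AS v)
    SELECT * FROM (
        WITH t AS (SELECT 2 AS v)
        SELECT v FROM t
    )
""").show()
{code}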
[jira] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails
[ https://issues.apache.org/jira/browse/SPARK-43106 ] jeanlyn deleted comment on SPARK-43106: - was (Author: jeanlyn): I think we also encountered similar problems; we circumvented this by setting the parameter *spark.sql.hive.convertInsertingPartitionedTable=false* > Data lost from the table if the INSERT OVERWRITE query fails > > > Key: SPARK-43106 > URL: https://issues.apache.org/jira/browse/SPARK-43106 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vaibhav Beriwala >Priority: Major > > When we run an INSERT OVERWRITE query for an unpartitioned table on Spark 3, Spark has the following behavior: > 1) It first cleans up all the data from the actual table path. > 2) It then launches a job that performs the actual insert. > > There are 2 major issues with this approach: > 1) If the insert job launched in step 2 fails for any reason, the data from the original table is lost. > 2) If the insert job in step 2 takes a long time to complete, the table data is unavailable to other readers for the entire duration of the job. > This behavior is the same even for partitioned tables when using static partitioning. For dynamic partitioning, we do not delete the table data before the job launch. > > Is there a reason why we perform this delete before the job launch rather than as part of the job commit operation? This issue does not occur with Hive, where the data is presumably cleaned up as part of the job commit operation. As part of SPARK-19183, we did add a new hook in the commit protocol for this exact purpose, but its default behavior still seems to be to delete the table data before the job launch. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
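As an illustrative sketch of the pattern and the mitigation mentioned in the deleted comment above (the table names are hypothetical, and the flag only affects inserts into partitioned Hive tables):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Mitigation from the (deleted) comment: route inserts into partitioned Hive
# tables through Hive's own insertion path instead of Spark's converted
# datasource path.
spark.conf.set("spark.sql.hive.convertInsertingPartitionedTable", "false")

# The risky pattern from the report: on Spark 3 the target's existing data
# is removed before the rewrite job runs, so a mid-job failure can leave
# `sales` empty until the insert is retried successfully.
spark.sql("INSERT OVERWRITE TABLE sales SELECT * FROM staging_sales")
{code}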
[jira] [Updated] (SPARK-46140) Remove no longer needed JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite`
[ https://issues.apache.org/jira/browse/SPARK-46140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46140: --- Labels: pull-request-available (was: ) > Remove no longer needed JVM module options from the test submission options > of `HiveExternalCatalogVersionsSuite` > - > > Key: SPARK-46140 > URL: https://issues.apache.org/jira/browse/SPARK-46140 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > Labels: pull-request-available > > Complete TODO: > {code:java} > val args = Seq( > "--name", "prepare testing tables", > "--master", "local[2]", > "--conf", s"${UI_ENABLED.key}=false", > "--conf", s"${MASTER_REST_SERVER_ENABLED.key}=false", > "--conf", s"${HiveUtils.HIVE_METASTORE_VERSION.key}=$hiveMetastoreVersion", > "--conf", s"${HiveUtils.HIVE_METASTORE_JARS.key}=maven", > "--conf", s"${WAREHOUSE_PATH.key}=${wareHousePath.getCanonicalPath}", > "--conf", s"spark.sql.test.version.index=$index", > "--driver-java-options", > s"-Dderby.system.home=${wareHousePath.getCanonicalPath} " + > // TODO SPARK-37159 Consider to remove the following > // JVM module options once the Spark 3.2 line is EOL. > JavaModuleOptions.defaultModuleOptions(), > tempPyFile.getCanonicalPath) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46140) Remove no longer needed JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite`
Yang Jie created SPARK-46140: Summary: Remove no longer needed JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite` Key: SPARK-46140 URL: https://issues.apache.org/jira/browse/SPARK-46140 Project: Spark Issue Type: Task Components: SQL, Tests Affects Versions: 4.0.0 Reporter: Yang Jie Complete TODO: {code:java} val args = Seq( "--name", "prepare testing tables", "--master", "local[2]", "--conf", s"${UI_ENABLED.key}=false", "--conf", s"${MASTER_REST_SERVER_ENABLED.key}=false", "--conf", s"${HiveUtils.HIVE_METASTORE_VERSION.key}=$hiveMetastoreVersion", "--conf", s"${HiveUtils.HIVE_METASTORE_JARS.key}=maven", "--conf", s"${WAREHOUSE_PATH.key}=${wareHousePath.getCanonicalPath}", "--conf", s"spark.sql.test.version.index=$index", "--driver-java-options", s"-Dderby.system.home=${wareHousePath.getCanonicalPath} " + // TODO SPARK-37159 Consider to remove the following // JVM module options once the Spark 3.2 line is EOL. JavaModuleOptions.defaultModuleOptions(), tempPyFile.getCanonicalPath) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46139) Fix QueryExecutionErrorsSuite with Java 21
[ https://issues.apache.org/jira/browse/SPARK-46139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46139: --- Labels: pull-request-available (was: ) > Fix QueryExecutionErrorsSuite with Java 21 > -- > > Key: SPARK-46139 > URL: https://issues.apache.org/jira/browse/SPARK-46139 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > > [https://github.com/apache/spark/actions/runs/7014008773/job/19081075487] > {code:java} > [info] - FAILED_EXECUTE_UDF: execute user defined function with registered > UDF *** FAILED *** (42 milliseconds) > 15247[info] java.lang.IllegalArgumentException: For parameter 'reason' > value 'java.lang.StringIndexOutOfBoundsException: Range [5, 6) out of bounds > for length 5' does not match: java.lang.StringIndexOutOfBoundsException: > begin 5, end 6, length 5 > 15248[info] at > org.apache.spark.SparkFunSuite.$anonfun$checkError$2(SparkFunSuite.scala:357) > 15249[info] at > org.apache.spark.SparkFunSuite.$anonfun$checkError$2$adapted(SparkFunSuite.scala:352) > 15250[info] at > scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > 15251[info] at > scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > 15252[info] at > scala.collection.AbstractIterable.foreach(Iterable.scala:933) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46139) Fix QueryExecutionErrorsSuite with Java 21
Yang Jie created SPARK-46139: Summary: Fix QueryExecutionErrorsSuite with Java 21 Key: SPARK-46139 URL: https://issues.apache.org/jira/browse/SPARK-46139 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 4.0.0 Reporter: Yang Jie [https://github.com/apache/spark/actions/runs/7014008773/job/19081075487] {code:java} [info] - FAILED_EXECUTE_UDF: execute user defined function with registered UDF *** FAILED *** (42 milliseconds) 15247[info] java.lang.IllegalArgumentException: For parameter 'reason' value 'java.lang.StringIndexOutOfBoundsException: Range [5, 6) out of bounds for length 5' does not match: java.lang.StringIndexOutOfBoundsException: begin 5, end 6, length 5 15248[info] at org.apache.spark.SparkFunSuite.$anonfun$checkError$2(SparkFunSuite.scala:357) 15249[info] at org.apache.spark.SparkFunSuite.$anonfun$checkError$2$adapted(SparkFunSuite.scala:352) 15250[info] at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) 15251[info] at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) 15252[info] at scala.collection.AbstractIterable.foreach(Iterable.scala:933) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46138) Clean up the use of `SQLContext` in hive-thriftserver module
[ https://issues.apache.org/jira/browse/SPARK-46138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46138: --- Labels: pull-request-available (was: ) > Clean up the use of `SQLContext` in hive-thriftserver module > > > Key: SPARK-46138 > URL: https://issues.apache.org/jira/browse/SPARK-46138 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46138) Clean up the use of `SQLContext` in hive-thriftserver module
Yang Jie created SPARK-46138: Summary: Clean up the use of `SQLContext` in hive-thriftserver module Key: SPARK-46138 URL: https://issues.apache.org/jira/browse/SPARK-46138 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46137) Janino compiler has new version that fix issue with compilation
[ https://issues.apache.org/jira/browse/SPARK-46137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46137: --- Labels: pull-request-available (was: ) > Janino compiler has new version that fix issue with compilation > --- > > Key: SPARK-46137 > URL: https://issues.apache.org/jira/browse/SPARK-46137 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Izek Greenfield >Priority: Major > Labels: pull-request-available > > Janino released a new version (3.1.11) that fixes compilation of: > {code:java} > do { > } while (false) {code} > [Link to github issue|https://github.com/janino-compiler/janino/issues/208] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46137) Janino compiler has new version that fix issue with compilation
Izek Greenfield created SPARK-46137: --- Summary: Janino compiler has new version that fix issue with compilation Key: SPARK-46137 URL: https://issues.apache.org/jira/browse/SPARK-46137 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.0 Reporter: Izek Greenfield Janino released a new version (3.1.11) that fixes compilation of: {code:java} do { } while (false) {code} [Link to github issue|https://github.com/janino-compiler/janino/issues/208] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org