[jira] [Resolved] (SPARK-46154) Add a new Daily testing GitHub Action job for Maven with Java 21
[ https://issues.apache.org/jira/browse/SPARK-46154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-46154. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44070 [https://github.com/apache/spark/pull/44070] > Add a new Daily testing GitHub Action job for Maven with Java 21 > > > Key: SPARK-46154 > URL: https://issues.apache.org/jira/browse/SPARK-46154 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46170) Support injecting adaptive query post-planner strategy rules in SparkSessionExtensions
[ https://issues.apache.org/jira/browse/SPARK-46170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46170: --- Labels: pull-request-available (was: ) > Support injecting adaptive query post-planner strategy rules in > SparkSessionExtensions > --- > > Key: SPARK-46170 > URL: https://issues.apache.org/jira/browse/SPARK-46170 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46170) Support injecting adaptive query post-planner strategy rules in SparkSessionExtensions
XiDuo You created SPARK-46170: - Summary: Support injecting adaptive query post-planner strategy rules in SparkSessionExtensions Key: SPARK-46170 URL: https://issues.apache.org/jira/browse/SPARK-46170 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46169) Assign appropriate JIRA numbers to unlabeled TODO items for DataFrame API.
[ https://issues.apache.org/jira/browse/SPARK-46169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46169: --- Labels: pull-request-available (was: ) > Assign appropriate JIRA numbers to unlabeled TODO items for DataFrame API. > -- > > Key: SPARK-46169 > URL: https://issues.apache.org/jira/browse/SPARK-46169 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > There are many TODO items that have no actual JIRA number. We should assign > proper numbers for better tracking. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46169) Assign appropriate JIRA numbers to unlabeled TODO items for DataFrame API.
Haejoon Lee created SPARK-46169: --- Summary: Assign appropriate JIRA numbers to unlabeled TODO items for DataFrame API. Key: SPARK-46169 URL: https://issues.apache.org/jira/browse/SPARK-46169 Project: Spark Issue Type: Bug Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee There are many TODO items that have no actual JIRA number. We should assign proper numbers for better tracking. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46168) Add axis parameter to DataFrame.idxmin & idxmax
Haejoon Lee created SPARK-46168: --- Summary: Add axis parameter to DataFrame.idxmin & idxmax Key: SPARK-46168 URL: https://issues.apache.org/jira/browse/SPARK-46168 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.idxmax.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
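For context, the sub-task above asks pandas-on-Spark to match the stock pandas axis semantics for idxmin/idxmax. A minimal sketch of the target behavior using plain pandas (the example data is illustrative, not from the ticket):
{code:python}
import pandas as pd

df = pd.DataFrame({"a": [1, 9], "b": [5, 2]}, index=["x", "y"])

print(df.idxmax())        # axis=0 (default): index of the max per column -> a: y, b: x
print(df.idxmax(axis=1))  # axis=1: column label of the max per row -> x: b, y: a
{code}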
[jira] [Created] (SPARK-46167) Add axis, pct and na_option parameters to DataFrame.rank
Haejoon Lee created SPARK-46167: --- Summary: Add axis, pct and na_option parameters to DataFrame.rank Key: SPARK-46167 URL: https://issues.apache.org/jira/browse/SPARK-46167 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
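For reference, the three pandas parameters named in SPARK-46167 behave as follows in stock pandas (illustrative sketch only; the eventual pandas-on-Spark signature may differ):
{code:python}
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [2.0, 2.0, 1.0]})

print(df.rank(pct=True))            # ranks rescaled to (0, 1] percentiles
print(df.rank(na_option="bottom"))  # NaN is ranked last instead of staying NaN
print(df.rank(axis=1))              # rank values within each row
{code}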
[jira] [Created] (SPARK-46165) Improve axis parameter for DataFrame.all to support columns.
Haejoon Lee created SPARK-46165: --- Summary: Improve axis parameter for DataFrame.all to support columns. Key: SPARK-46165 URL: https://issues.apache.org/jira/browse/SPARK-46165 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.all.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
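A quick sketch of the stock pandas behavior SPARK-46165 asks to match (illustrative data):
{code:python}
import pandas as pd

df = pd.DataFrame({"a": [True, True], "b": [True, False]})

print(df.all())        # axis=0 (default): per column -> a: True, b: False
print(df.all(axis=1))  # per row -> True, False
{code}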
[jira] [Created] (SPARK-46166) Add axis and skipna parameters to DataFrame.any
Haejoon Lee created SPARK-46166: --- Summary: Add axis and skipna parameters to DataFrame.any Key: SPARK-46166 URL: https://issues.apache.org/jira/browse/SPARK-46166 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
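For context, stock pandas already supports both parameters on DataFrame.any; a minimal illustrative sketch:
{code:python}
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0.0, np.nan], "b": [0.0, 0.0]})

print(df.any(axis=1))                # NaN skipped, zeros are falsy -> False, False
print(df.any(axis=1, skipna=False))  # NaN counts as True -> second row True
{code}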
[jira] [Created] (SPARK-46164) Add include and exclude parameters for DataFrame.describe
Haejoon Lee created SPARK-46164: --- Summary: Add include and exclude parameters for DataFrame.describe Key: SPARK-46164 URL: https://issues.apache.org/jira/browse/SPARK-46164 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
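The stock pandas parameters referenced by SPARK-46164, sketched with illustrative data:
{code:python}
import numpy as np
import pandas as pd

df = pd.DataFrame({"num": [1, 2, 3], "cat": ["x", "y", "x"]})

print(df.describe())                     # numeric columns only, by default
print(df.describe(include="all"))        # numeric and non-numeric together
print(df.describe(exclude=[np.number]))  # drop the numeric columns
{code}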
[jira] [Created] (SPARK-46163) Add filter_func and errors parameters for DataFrame.update
Haejoon Lee created SPARK-46163: --- Summary: Add filter_func and errors parameters for DataFrame.update Key: SPARK-46163 URL: https://issues.apache.org/jira/browse/SPARK-46163 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.update.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
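In stock pandas, the two parameters requested by SPARK-46163 work like this (illustrative sketch; the pandas-on-Spark semantics may end up differing):
{code:python}
import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3]})
other = pd.DataFrame({"a": [10, 20, 30]})

# filter_func returns True where values should be overwritten:
df.update(other, filter_func=lambda s: s < 0)
print(df)  # only the -2 became 20
# errors="raise" would instead raise a ValueError whenever both
# frames hold non-NA values at the same position (default is "ignore").
{code}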
[jira] [Created] (SPARK-46160) Add freq and axis parameters to DataFrame.shift
Haejoon Lee created SPARK-46160: --- Summary: Add freq and axis parameters to DataFrame.shift Key: SPARK-46160 URL: https://issues.apache.org/jira/browse/SPARK-46160 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
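A short stock-pandas sketch of the freq and axis parameters tracked by SPARK-46160 (illustrative data):
{code:python}
import pandas as pd

idx = pd.date_range("2023-01-01", periods=3, freq="D")
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=idx)

print(df.shift(1))            # data moves down one row; the first row is NaN
print(df.shift(1, freq="D"))  # the index moves one day; data stays aligned
print(df.shift(1, axis=1))    # values move across columns instead of rows
{code}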
[jira] [Created] (SPARK-46162) Improve axis parameter for DataFrame.nunique to support columns.
Haejoon Lee created SPARK-46162: --- Summary: Improve axis parameter for DataFrame.nunique to support columns. Key: SPARK-46162 URL: https://issues.apache.org/jira/browse/SPARK-46162 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
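The stock pandas behavior SPARK-46162 targets, as an illustrative sketch:
{code:python}
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [1, 2, 2]})

print(df.nunique())        # distinct values per column -> a: 2, b: 2
print(df.nunique(axis=1))  # distinct values per row -> 1, 2, 1
{code}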
[jira] [Created] (SPARK-46161) Improve axis parameter for DataFrame.diff to support columns.
Haejoon Lee created SPARK-46161: --- Summary: Improve axis parameter for DataFrame.diff to support columns. Key: SPARK-46161 URL: https://issues.apache.org/jira/browse/SPARK-46161 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
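For reference, stock pandas DataFrame.diff with axis=1 (illustrative sketch):
{code:python}
import pandas as pd

df = pd.DataFrame({"a": [1, 4, 9], "b": [2, 3, 5]})

print(df.diff())        # axis=0 (default): difference from the previous row
print(df.diff(axis=1))  # difference from the previous column, per row
{code}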
[jira] [Created] (SPARK-46159) Improve axis parameter for DataFrame.at_time to support columns.
Haejoon Lee created SPARK-46159: --- Summary: Improve axis parameter for DataFrame.at_time to support columns. Key: SPARK-46159 URL: https://issues.apache.org/jira/browse/SPARK-46159 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at_time.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
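A stock-pandas sketch of DataFrame.at_time for SPARK-46159 (illustrative data; axis=1 additionally requires a DatetimeIndex on the columns):
{code:python}
import pandas as pd

idx = pd.date_range("2023-01-01 09:00", periods=4, freq="12h")
df = pd.DataFrame({"v": range(4)}, index=idx)

print(df.at_time("09:00"))  # rows whose index falls at 09:00 (axis=0 default)
# df.at_time("09:00", axis=1) would select columns by time instead
{code}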
[jira] [Assigned] (SPARK-46150) Combine Python codegen check and protobuf breaking change
[ https://issues.apache.org/jira/browse/SPARK-46150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-46150: - Assignee: Ruifeng Zheng > Combine Python codegen check and protobuf breaking change > > > Key: SPARK-46150 > URL: https://issues.apache.org/jira/browse/SPARK-46150 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46150) Combine Python codegen check and protobuf breaking change
[ https://issues.apache.org/jira/browse/SPARK-46150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-46150. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44067 [https://github.com/apache/spark/pull/44067] > Combine Python codegen check and protobuf breaking change > > > Key: SPARK-46150 > URL: https://issues.apache.org/jira/browse/SPARK-46150 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46158) Improve axis parameter for DataFrame.xs to support columns.
Haejoon Lee created SPARK-46158: --- Summary: Improve axis parameter for DataFrame.xs to support columns. Key: SPARK-46158 URL: https://issues.apache.org/jira/browse/SPARK-46158 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.xs.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
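For context, stock pandas DataFrame.xs with axis=1 takes a cross-section of the columns (illustrative sketch):
{code:python}
import pandas as pd

cols = pd.MultiIndex.from_tuples([("x", "a"), ("x", "b"), ("y", "a")])
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=cols)

print(df.xs("x", axis=1))           # all columns under the first-level key "x"
print(df.xs("a", axis=1, level=1))  # every column whose second level is "a"
{code}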
[jira] [Created] (SPARK-46157) Add `axis` parameter for DataFrame.aggregate.
Haejoon Lee created SPARK-46157: --- Summary: Add `axis` parameter for DataFrame.aggregate. Key: SPARK-46157 URL: https://issues.apache.org/jira/browse/SPARK-46157 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Haejoon Lee See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.aggregate.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
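A minimal stock-pandas sketch of the axis parameter requested for DataFrame.aggregate (illustrative data):
{code:python}
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

print(df.aggregate("sum"))          # axis=0 (default): one result per column
print(df.aggregate("sum", axis=1))  # one result per row -> 11, 22
{code}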
[jira] [Updated] (SPARK-46154) Add a new Daily testing GitHub Action job for Maven with Java 21
[ https://issues.apache.org/jira/browse/SPARK-46154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46154: --- Labels: pull-request-available (was: ) > Add a new Daily testing GitHub Action job for Maven with Java 21 > > > Key: SPARK-46154 > URL: https://issues.apache.org/jira/browse/SPARK-46154 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46154) Add a new Daily testing GitHub Action job for Maven with Java 21
Yang Jie created SPARK-46154: Summary: Add a new Daily testing GitHub Action job for Maven with Java 21 Key: SPARK-46154 URL: https://issues.apache.org/jira/browse/SPARK-46154 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46152) XML: Add DecimalType support in schema inference
[ https://issues.apache.org/jira/browse/SPARK-46152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46152: --- Labels: pull-request-available (was: ) > XML: Add DecimalType support in schema inference > > > Key: SPARK-46152 > URL: https://issues.apache.org/jira/browse/SPARK-46152 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Sandip Agarwala >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46151) Hide the "More" drop-down button in the PySpark docs navigation bar
[ https://issues.apache.org/jira/browse/SPARK-46151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46151: --- Labels: pull-request-available (was: ) > Hide the "More" drop-down button in the PySpark docs navigation bar > --- > > Key: SPARK-46151 > URL: https://issues.apache.org/jira/browse/SPARK-46151 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46153) XML: Add TimestampNTZType support in schema inference
Sandip Agarwala created SPARK-46153: --- Summary: XML: Add TimestampNTZType support in schema inference Key: SPARK-46153 URL: https://issues.apache.org/jira/browse/SPARK-46153 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Sandip Agarwala -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46152) XML: Add DecimalType support in schema inference
Sandip Agarwala created SPARK-46152: --- Summary: XML: Add DecimalType support in schema inference Key: SPARK-46152 URL: https://issues.apache.org/jira/browse/SPARK-46152 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Sandip Agarwala -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46151) Hide the "More" drop-down button in the PySpark docs navigation bar
BingKun Pan created SPARK-46151: --- Summary: Hide the "More" drop-down button in the PySpark docs navigation bar Key: SPARK-46151 URL: https://issues.apache.org/jira/browse/SPARK-46151 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46150) Combine Python codegen check and protobuf breaking change
[ https://issues.apache.org/jira/browse/SPARK-46150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46150: --- Labels: pull-request-available (was: ) > Combine Python codegen check and protobuf breaking change > > > Key: SPARK-46150 > URL: https://issues.apache.org/jira/browse/SPARK-46150 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46150) Combine Python codegen check and protobuf breaking change
Ruifeng Zheng created SPARK-46150: - Summary: Combine Python codegen check and protobuf breaking change Key: SPARK-46150 URL: https://issues.apache.org/jira/browse/SPARK-46150 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46149) Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-46149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-46149: Assignee: Hyukjin Kwon > Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python > 3.12 > -- > > Key: SPARK-46149 > URL: https://issues.apache.org/jira/browse/SPARK-46149 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > > {code} > == > ERROR [12.635s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > == > ERROR [14.850s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > -- > {code} > https://github.com/apache/spark/actions/runs/7020654429/job/19100964890 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46149) Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-46149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-46149. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44065 [https://github.com/apache/spark/pull/44065] > Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python > 3.12 > -- > > Key: SPARK-46149 > URL: https://issues.apache.org/jira/browse/SPARK-46149 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > {code} > == > ERROR [12.635s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > == > ERROR [14.850s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > -- > {code} > https://github.com/apache/spark/actions/runs/7020654429/job/19100964890 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46148) Fix pyspark.pandas.mlflow.load_model test (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-46148. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44064 [https://github.com/apache/spark/pull/44064] > Fix pyspark.pandas.mlflow.load_model test (Python 3.12) > --- > > Key: SPARK-46148 > URL: https://issues.apache.org/jira/browse/SPARK-46148 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {code} > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in > pyspark.pandas.mlflow.load_model > Failed example: > prediction_df > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.10/doctest.py", line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > prediction_df > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13291, in > __repr__ > pdf = cast("DataFrame", > self._get_or_create_repr_pandas_cache(max_display_count)) > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13282, in > _get_or_create_repr_pandas_cache > self, "_repr_pandas_cache", {n: self.head(n + > 1)._to_internal_pandas()} > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13277, in > _to_internal_pandas > return self._internal.to_pandas_frame > File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 599, in > wrapped_lazy_property > setattr(self, attr_name, fn(self)) > File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1110, > in to_pandas_frame > pdf = sdf.toPandas() > File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line > 213, in toPandas > rows = self.collect() > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1369, in > collect > sock_info = self._jdf.collectToPython() > File > "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", > line 1322, in __call__ > return_value = get_return_value( > File "/__w/spark/spark/python/pyspark/errors/exceptions/captured.py", > line 188, in deco > raise converted from None > pyspark.errors.exceptions.captured.PythonException: > An exception was thrown from the Python worker. Please see the stack > trace below.
> Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1523, in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1515, in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 485, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 101, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 478, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1284, in func > for result_batch, result_type in result_iter: > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1619, in udf > yield _predict_row_batch(batch_predict_fn, row_batch_args) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1383, in _predict_row_batch > result = predict_fn(pdf, params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1601, in batch_predict_fn > return loaded_model.predict(pdf, params=params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 491, in predict > return _predict() > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 477, in _predict > return self._predict_fn(data, params=params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/sklearn/__init__.py", line > 517, in predict > return self.sklearn_model.predict(data) > File > "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line > 386, in predict > return self._decision_function(X)
[jira] [Assigned] (SPARK-46148) Fix pyspark.pandas.mlflow.load_model test (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-46148: Assignee: Hyukjin Kwon > Fix pyspark.pandas.mlflow.load_model test (Python 3.12) > --- > > Key: SPARK-46148 > URL: https://issues.apache.org/jira/browse/SPARK-46148 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > > {code} > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in > pyspark.pandas.mlflow.load_model > Failed example: > prediction_df > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.10/doctest.py", line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > prediction_df > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13291, in > __repr__ > pdf = cast("DataFrame", > self._get_or_create_repr_pandas_cache(max_display_count)) > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13282, in > _get_or_create_repr_pandas_cache > self, "_repr_pandas_cache", {n: self.head(n + > 1)._to_internal_pandas()} > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13277, in > _to_internal_pandas > return self._internal.to_pandas_frame > File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 599, in > wrapped_lazy_property > setattr(self, attr_name, fn(self)) > File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1110, > in to_pandas_frame > pdf = sdf.toPandas() > File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line > 213, in toPandas > rows = self.collect() > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1369, in > collect > sock_info = self._jdf.collectToPython() > File > "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", > line 1322, in __call__ > return_value = get_return_value( > File "/__w/spark/spark/python/pyspark/errors/exceptions/captured.py", > line 188, in deco > raise converted from None > pyspark.errors.exceptions.captured.PythonException: > An exception was thrown from the Python worker. Please see the stack > trace below.
> Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1523, in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1515, in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 485, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 101, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 478, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1284, in func > for result_batch, result_type in result_iter: > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1619, in udf > yield _predict_row_batch(batch_predict_fn, row_batch_args) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1383, in _predict_row_batch > result = predict_fn(pdf, params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1601, in batch_predict_fn > return loaded_model.predict(pdf, params=params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 491, in predict > return _predict() > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 477, in _predict > return self._predict_fn(data, params=params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/sklearn/__init__.py", line > 517, in predict > return self.sklearn_model.predict(data) > File > "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line > 386, in predict > return self._decision_function(X) > File > "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line > 369, in _decision_function >
[jira] [Resolved] (SPARK-46146) Unpin `markupsafe`
[ https://issues.apache.org/jira/browse/SPARK-46146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-46146. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44062 [https://github.com/apache/spark/pull/44062] > Unpin `markupsafe` > -- > > Key: SPARK-46146 > URL: https://issues.apache.org/jira/browse/SPARK-46146 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46003) Create a ui-test module with Jest
[ https://issues.apache.org/jira/browse/SPARK-46003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46003: - Assignee: Kent Yao > Create a ui-test module with Jest > -- > > Key: SPARK-46003 > URL: https://issues.apache.org/jira/browse/SPARK-46003 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests, UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46003) Create a ui-test module with Jest
[ https://issues.apache.org/jira/browse/SPARK-46003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46003. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43903 [https://github.com/apache/spark/pull/43903] > Create a ui-test module with Jest > -- > > Key: SPARK-46003 > URL: https://issues.apache.org/jira/browse/SPARK-46003 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests, UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46147) Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46147. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44063 [https://github.com/apache/spark/pull/44063] > Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12) > - > > Key: SPARK-46147 > URL: https://issues.apache.org/jira/browse/SPARK-46147 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > {code} > File "/__w/spark/spark/python/pyspark/pandas/series.py", line 1633, in > pyspark.pandas.series.Series.to_dict > Failed example: > s.to_dict(OrderedDict) > Expected: > OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)]) > Got: > OrderedDict({0: 1, 1: 2, 2: 3, 3: 4}) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46140) Remove comments about JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite`
[ https://issues.apache.org/jira/browse/SPARK-46140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46140: -- Summary: Remove comments about JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite` (was: Remove no longer needed JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite`) > Remove comments about JVM module options from the test submission options of > `HiveExternalCatalogVersionsSuite` > --- > > Key: SPARK-46140 > URL: https://issues.apache.org/jira/browse/SPARK-46140 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Complete TODO: > {code:java} > val args = Seq( > "--name", "prepare testing tables", > "--master", "local[2]", > "--conf", s"${UI_ENABLED.key}=false", > "--conf", s"${MASTER_REST_SERVER_ENABLED.key}=false", > "--conf", s"${HiveUtils.HIVE_METASTORE_VERSION.key}=$hiveMetastoreVersion", > "--conf", s"${HiveUtils.HIVE_METASTORE_JARS.key}=maven", > "--conf", s"${WAREHOUSE_PATH.key}=${wareHousePath.getCanonicalPath}", > "--conf", s"spark.sql.test.version.index=$index", > "--driver-java-options", > s"-Dderby.system.home=${wareHousePath.getCanonicalPath} " + > // TODO SPARK-37159 Consider to remove the following > // JVM module options once the Spark 3.2 line is EOL. > JavaModuleOptions.defaultModuleOptions(), > tempPyFile.getCanonicalPath) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46140) Remove no longer needed JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite`
[ https://issues.apache.org/jira/browse/SPARK-46140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46140: - Assignee: Yang Jie > Remove no longer needed JVM module options from the test submission options > of `HiveExternalCatalogVersionsSuite` > - > > Key: SPARK-46140 > URL: https://issues.apache.org/jira/browse/SPARK-46140 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > > Complete TODO: > {code:java} > val args = Seq( > "--name", "prepare testing tables", > "--master", "local[2]", > "--conf", s"${UI_ENABLED.key}=false", > "--conf", s"${MASTER_REST_SERVER_ENABLED.key}=false", > "--conf", s"${HiveUtils.HIVE_METASTORE_VERSION.key}=$hiveMetastoreVersion", > "--conf", s"${HiveUtils.HIVE_METASTORE_JARS.key}=maven", > "--conf", s"${WAREHOUSE_PATH.key}=${wareHousePath.getCanonicalPath}", > "--conf", s"spark.sql.test.version.index=$index", > "--driver-java-options", > s"-Dderby.system.home=${wareHousePath.getCanonicalPath} " + > // TODO SPARK-37159 Consider to remove the following > // JVM module options once the Spark 3.2 line is EOL. > JavaModuleOptions.defaultModuleOptions(), > tempPyFile.getCanonicalPath) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46140) Remove no longer needed JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite`
[ https://issues.apache.org/jira/browse/SPARK-46140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46140. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44057 [https://github.com/apache/spark/pull/44057] > Remove no longer needed JVM module options from the test submission options > of `HiveExternalCatalogVersionsSuite` > - > > Key: SPARK-46140 > URL: https://issues.apache.org/jira/browse/SPARK-46140 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Complete TODO: > {code:java} > val args = Seq( > "--name", "prepare testing tables", > "--master", "local[2]", > "--conf", s"${UI_ENABLED.key}=false", > "--conf", s"${MASTER_REST_SERVER_ENABLED.key}=false", > "--conf", s"${HiveUtils.HIVE_METASTORE_VERSION.key}=$hiveMetastoreVersion", > "--conf", s"${HiveUtils.HIVE_METASTORE_JARS.key}=maven", > "--conf", s"${WAREHOUSE_PATH.key}=${wareHousePath.getCanonicalPath}", > "--conf", s"spark.sql.test.version.index=$index", > "--driver-java-options", > s"-Dderby.system.home=${wareHousePath.getCanonicalPath} " + > // TODO SPARK-37159 Consider to remove the following > // JVM module options once the Spark 3.2 line is EOL. > JavaModuleOptions.defaultModuleOptions(), > tempPyFile.getCanonicalPath) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46149) Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-46149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46149: --- Labels: pull-request-available (was: ) > Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python > 3.12 > -- > > Key: SPARK-46149 > URL: https://issues.apache.org/jira/browse/SPARK-46149 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > > {code} > == > ERROR [12.635s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > == > ERROR [14.850s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > -- > {code} > https://github.com/apache/spark/actions/runs/7020654429/job/19100964890 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46149) Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-46149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46149: - Description: {code} == ERROR [12.635s]: test_end_to_end_run_locally (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) -- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", line 403, in test_end_to_end_run_locally output = TorchDistributor(num_processes=2, local_mode=True, use_gpu=False).run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, in run return self._run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, in _run output = self._run_local_training( ^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, in _run_local_training output = TorchDistributor._get_output_from_framework_wrapper( File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, in _get_output_from_framework_wrapper return framework_wrapper( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, in _run_training_on_pytorch_function raise RuntimeError( RuntimeError: TorchDistributor failed during training.View stdout logs for detailed error message. == ERROR [14.850s]: test_end_to_end_run_locally (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsOnConnect.test_end_to_end_run_locally) -- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", line 403, in test_end_to_end_run_locally output = TorchDistributor(num_processes=2, local_mode=True, use_gpu=False).run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, in run return self._run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, in _run output = self._run_local_training( ^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, in _run_local_training output = TorchDistributor._get_output_from_framework_wrapper( File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, in _get_output_from_framework_wrapper return framework_wrapper( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, in _run_training_on_pytorch_function raise RuntimeError( RuntimeError: TorchDistributor failed during training.View stdout logs for detailed error message.
-- {code} https://github.com/apache/spark/actions/runs/7020654429/job/19100964890 was: {code} == ERROR [12.635s]: test_end_to_end_run_locally (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) -- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", line 403, in test_end_to_end_run_locally output = TorchDistributor(num_processes=2, local_mode=True, use_gpu=False).run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, in run return self._run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, in _run output = self._run_local_training( ^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, in _run_local_training output = TorchDistributor._get_output_from_framework_wrapper( File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, in _get_output_from_framework_wrapper return framework_wrapper( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, in _run_training_on_pytorch_function
[jira] [Updated] (SPARK-46149) Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-46149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46149: - Summary: Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python 3.12 (was: Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally`) > Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` with Python > 3.12 > -- > > Key: SPARK-46149 > URL: https://issues.apache.org/jira/browse/SPARK-46149 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > {code} > == > ERROR [12.635s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > == > ERROR [14.850s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > -- > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46149) Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally`
[ https://issues.apache.org/jira/browse/SPARK-46149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46149: - Priority: Minor (was: Major) > Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` > - > > Key: SPARK-46149 > URL: https://issues.apache.org/jira/browse/SPARK-46149 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > {code} > == > ERROR [12.635s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > == > ERROR [14.850s]: test_end_to_end_run_locally > (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsOnConnect.test_end_to_end_run_locally) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", > line 403, in test_end_to_end_run_locally > output = TorchDistributor(num_processes=2, local_mode=True, > use_gpu=False).run( > > ^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, > in run > return self._run( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, > in _run > output = self._run_local_training( > ^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, > in _run_local_training > output = TorchDistributor._get_output_from_framework_wrapper( > > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, > in _get_output_from_framework_wrapper > return framework_wrapper( >^^ > File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, > in _run_training_on_pytorch_function > raise RuntimeError( > RuntimeError: TorchDistributor failed during training.View stdout logs for > detailed error message. > -- > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46149) Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally`
Hyukjin Kwon created SPARK-46149: Summary: Skip `TorchDistributorLocalUnitTests.test_end_to_end_run_locally` Key: SPARK-46149 URL: https://issues.apache.org/jira/browse/SPARK-46149 Project: Spark Issue Type: Sub-task Components: ML, PySpark, Tests Affects Versions: 4.0.0 Reporter: Hyukjin Kwon {code} == ERROR [12.635s]: test_end_to_end_run_locally (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsIIOnConnect.test_end_to_end_run_locally) -- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", line 403, in test_end_to_end_run_locally output = TorchDistributor(num_processes=2, local_mode=True, use_gpu=False).run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, in run return self._run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, in _run output = self._run_local_training( ^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, in _run_local_training output = TorchDistributor._get_output_from_framework_wrapper( File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, in _get_output_from_framework_wrapper return framework_wrapper( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, in _run_training_on_pytorch_function raise RuntimeError( RuntimeError: TorchDistributor failed during training.View stdout logs for detailed error message. == ERROR [14.850s]: test_end_to_end_run_locally (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorLocalUnitTestsOnConnect.test_end_to_end_run_locally) -- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", line 403, in test_end_to_end_run_locally output = TorchDistributor(num_processes=2, local_mode=True, use_gpu=False).run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 969, in run return self._run( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 985, in _run output = self._run_local_training( ^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 593, in _run_local_training output = TorchDistributor._get_output_from_framework_wrapper( File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 567, in _get_output_from_framework_wrapper return framework_wrapper( ^^ File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 908, in _run_training_on_pytorch_function raise RuntimeError( RuntimeError: TorchDistributor failed during training.View stdout logs for detailed error message. -- {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
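For context, a version-gated skip is the usual mechanism for this kind of change. The sketch below is illustrative only; the decorator placement, class body, and skip message are assumptions, not the actual patch that landed for SPARK-46149.

{code:python}
import sys
import unittest


class TorchDistributorLocalUnitTests(unittest.TestCase):
    # Hypothetical guard; the real test lives under pyspark.ml.torch.tests.
    @unittest.skipIf(
        sys.version_info >= (3, 12),
        "TorchDistributor local training fails on Python 3.12; see SPARK-46149",
    )
    def test_end_to_end_run_locally(self) -> None:
        ...
{code}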
[jira] [Updated] (SPARK-46148) Fix pyspark.pandas.mlflow.load_model test (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46148: - Description: {code} ** File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in pyspark.pandas.mlflow.load_model Failed example: prediction_df Exception raised: Traceback (most recent call last): File "/usr/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in prediction_df File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13291, in __repr__ pdf = cast("DataFrame", self._get_or_create_repr_pandas_cache(max_display_count)) File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13282, in _get_or_create_repr_pandas_cache self, "_repr_pandas_cache", {n: self.head(n + 1)._to_internal_pandas()} File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13277, in _to_internal_pandas return self._internal.to_pandas_frame File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 599, in wrapped_lazy_property setattr(self, attr_name, fn(self)) File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1110, in to_pandas_frame pdf = sdf.toPandas() File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line 213, in toPandas rows = self.collect() File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1369, in collect sock_info = self._jdf.collectToPython() File "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__ return_value = get_return_value( File "/__w/spark/spark/python/pyspark/errors/exceptions/captured.py", line 188, in deco raise converted from None pyspark.errors.exceptions.captured.PythonException: An exception was thrown from the Python worker. Please see the stack trace below. 
Traceback (most recent call last): File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1523, in main process() File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1515, in process serializer.dump_stream(out_iter, outfile) File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 485, in dump_stream return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream) File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 101, in dump_stream for batch in iterator: File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 478, in init_stream_yield_batches for series in iterator: File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1284, in func for result_batch, result_type in result_iter: File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 1619, in udf yield _predict_row_batch(batch_predict_fn, row_batch_args) File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 1383, in _predict_row_batch result = predict_fn(pdf, params) File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 1601, in batch_predict_fn return loaded_model.predict(pdf, params=params) File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 491, in predict return _predict() File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 477, in _predict return self._predict_fn(data, params=params) File "/usr/local/lib/python3.10/dist-packages/mlflow/sklearn/__init__.py", line 517, in predict return self.sklearn_model.predict(data) File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line 386, in predict return self._decision_function(X) File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line 369, in _decision_function X = self._validate_data(X, accept_sparse=["csr", "csc", "coo"], reset=False) File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 580, in _validate_data self._check_feature_names(X, reset=reset) File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 507, in _check_feature_names raise ValueError(message) ValueError: The feature names should match those that were passed during fit. Feature names unseen at fit time: - 0 - 1 Feature names seen at fit time, yet now missing: - x1 - x2 JVM stacktrace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent
[jira] [Updated] (SPARK-46148) Fix pyspark.pandas.mlflow.load_model test (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46148: --- Labels: pull-request-available (was: ) > Fix pyspark.pandas.mlflow.load_model test (Python 3.12) > --- > > Key: SPARK-46148 > URL: https://issues.apache.org/jira/browse/SPARK-46148 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > > {code} > ** > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in > pyspark.pandas.mlflow.load_model > Failed example: > prediction_df > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.10/doctest.py", line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > prediction_df > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13291, in > __repr__ > pdf = cast("DataFrame", > self._get_or_create_repr_pandas_cache(max_display_count)) > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13282, in > _get_or_create_repr_pandas_cache > self, "_repr_pandas_cache", {n: self.head(n + > 1)._to_internal_pandas()} > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13277, in > _to_internal_pandas > return self._internal.to_pandas_frame > File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 599, in > wrapped_lazy_property > setattr(self, attr_name, fn(self)) > File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1110, > in to_pandas_frame > pdf = sdf.toPandas() > File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line > 213, in toPandas > rows = self.collect() > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1369, in > collect > sock_info = self._jdf.collectToPython() > File > "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", > line 1322, in __call__ > return_value = get_return_value( > File "/__w/spark/spark/python/pyspark/errors/exceptions/captured.py", > line 188, in deco > raise converted from None > pyspark.errors.exceptions.captured.PythonException: > An exception was thrown from the Python worker. Please see the stack > trace below. 
> Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1523, in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1515, in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 485, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 101, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 478, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 1284, in func > for result_batch, result_type in result_iter: > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1619, in udf > yield _predict_row_batch(batch_predict_fn, row_batch_args) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1383, in _predict_row_batch > result = predict_fn(pdf, params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 1601, in batch_predict_fn > return loaded_model.predict(pdf, params=params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 491, in predict > return _predict() > File > "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line > 477, in _predict > return self._predict_fn(data, params=params) > File > "/usr/local/lib/python3.10/dist-packages/mlflow/sklearn/__init__.py", line > 517, in predict > return self.sklearn_model.predict(data) > File > "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line > 386, in predict > return self._decision_function(X) > File > "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line > 369, in _decision_function > X =
[jira] [Created] (SPARK-46148) Fix pyspark.pandas.mlflow.load_model test (Python 3.12)
Hyukjin Kwon created SPARK-46148: Summary: Fix pyspark.pandas.mlflow.load_model test (Python 3.12) Key: SPARK-46148 URL: https://issues.apache.org/jira/browse/SPARK-46148 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon {code} ** File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in pyspark.pandas.mlflow.load_model Failed example: prediction_df Exception raised: Traceback (most recent call last): File "/usr/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in prediction_df File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13291, in __repr__ pdf = cast("DataFrame", self._get_or_create_repr_pandas_cache(max_display_count)) File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13282, in _get_or_create_repr_pandas_cache self, "_repr_pandas_cache", {n: self.head(n + 1)._to_internal_pandas()} File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13277, in _to_internal_pandas return self._internal.to_pandas_frame File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 599, in wrapped_lazy_property setattr(self, attr_name, fn(self)) File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1110, in to_pandas_frame pdf = sdf.toPandas() File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line 213, in toPandas rows = self.collect() File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1369, in collect sock_info = self._jdf.collectToPython() File "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__ return_value = get_return_value( File "/__w/spark/spark/python/pyspark/errors/exceptions/captured.py", line 188, in deco raise converted from None pyspark.errors.exceptions.captured.PythonException: An exception was thrown from the Python worker. Please see the stack trace below. 
Traceback (most recent call last): File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1523, in main process() File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1515, in process serializer.dump_stream(out_iter, outfile) File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 485, in dump_stream return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream) File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 101, in dump_stream for batch in iterator: File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 478, in init_stream_yield_batches for series in iterator: File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1284, in func for result_batch, result_type in result_iter: File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 1619, in udf yield _predict_row_batch(batch_predict_fn, row_batch_args) File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 1383, in _predict_row_batch result = predict_fn(pdf, params) File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 1601, in batch_predict_fn return loaded_model.predict(pdf, params=params) File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 491, in predict return _predict() File "/usr/local/lib/python3.10/dist-packages/mlflow/pyfunc/__init__.py", line 477, in _predict return self._predict_fn(data, params=params) File "/usr/local/lib/python3.10/dist-packages/mlflow/sklearn/__init__.py", line 517, in predict return self.sklearn_model.predict(data) File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line 386, in predict return self._decision_function(X) File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_base.py", line 369, in _decision_function X = self._validate_data(X, accept_sparse=["csr", "csc", "coo"], reset=False) File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 580, in _validate_data self._check_feature_names(X, reset=reset) File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 507, in _check_feature_names raise ValueError(message) ValueError: The feature names should match those that were passed during fit. Feature names unseen at fit time: - 0 - 1 Feature names seen
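The root cause visible in the trace above is a scikit-learn feature-name check: the model was fit on columns named x1 and x2, but at predict time the UDF hands it a frame whose columns are named 0 and 1. A standalone sketch of that mismatch, with the column names taken from the traceback and the data values invented for illustration:

{code:python}
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [2.0, 4.0, 6.0]})
model = LinearRegression().fit(train, [1.0, 2.0, 3.0])

# At predict time the columns are named "0" and "1", which no longer match
# the names recorded at fit time, so scikit-learn refuses to predict.
test = pd.DataFrame([[1.0, 2.0]], columns=["0", "1"])
model.predict(test)  # ValueError: The feature names should match ...
{code}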
[jira] [Updated] (SPARK-46147) Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46147: --- Labels: pull-request-available (was: ) > Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12) > - > > Key: SPARK-46147 > URL: https://issues.apache.org/jira/browse/SPARK-46147 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > > {code} > File "/__w/spark/spark/python/pyspark/pandas/series.py", line 1633, in > pyspark.pandas.series.Series.to_dict > Failed example: > s.to_dict(OrderedDict) > Expected: > OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)]) > Got: > OrderedDict({0: 1, 1: 2, 2: 3, 3: 4}) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46147) Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46147: - Fix Version/s: (was: 4.0.0) > Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12) > - > > Key: SPARK-46147 > URL: https://issues.apache.org/jira/browse/SPARK-46147 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > {code} > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 2515, in > pyspark.pandas.frame.DataFrame.to_dict > Failed example: > df.to_dict(into=OrderedDict) > Expected: > OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])), ('col2', > OrderedDict([('row1', 0.5), ('row2', 0.75)]))]) > Got: > OrderedDict({'col1': OrderedDict({'row1': 1, 'row2': 2}), 'col2': > OrderedDict({'row1': 0.5, 'row2': 0.75})}) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46147) Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46147: - Labels: (was: pull-request-available) > Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12) > - > > Key: SPARK-46147 > URL: https://issues.apache.org/jira/browse/SPARK-46147 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 4.0.0 > > > {code} > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 2515, in > pyspark.pandas.frame.DataFrame.to_dict > Failed example: > df.to_dict(into=OrderedDict) > Expected: > OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])), ('col2', > OrderedDict([('row1', 0.5), ('row2', 0.75)]))]) > Got: > OrderedDict({'col1': OrderedDict({'row1': 1, 'row2': 2}), 'col2': > OrderedDict({'row1': 0.5, 'row2': 0.75})}) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46147) Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12)
[ https://issues.apache.org/jira/browse/SPARK-46147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46147: - Description: {code} File "/__w/spark/spark/python/pyspark/pandas/series.py", line 1633, in pyspark.pandas.series.Series.to_dict Failed example: s.to_dict(OrderedDict) Expected: OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)]) Got: OrderedDict({0: 1, 1: 2, 2: 3, 3: 4}) {code} was: {code} File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 2515, in pyspark.pandas.frame.DataFrame.to_dict Failed example: df.to_dict(into=OrderedDict) Expected: OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])), ('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))]) Got: OrderedDict({'col1': OrderedDict({'row1': 1, 'row2': 2}), 'col2': OrderedDict({'row1': 0.5, 'row2': 0.75})}) {code} > Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12) > - > > Key: SPARK-46147 > URL: https://issues.apache.org/jira/browse/SPARK-46147 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > {code} > File "/__w/spark/spark/python/pyspark/pandas/series.py", line 1633, in > pyspark.pandas.series.Series.to_dict > Failed example: > s.to_dict(OrderedDict) > Expected: > OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)]) > Got: > OrderedDict({0: 1, 1: 2, 2: 3, 3: 4}) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46147) Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12)
Hyukjin Kwon created SPARK-46147: Summary: Fix the doctest in pyspark.pandas.series.Series.to_dict (Python 3.12) Key: SPARK-46147 URL: https://issues.apache.org/jira/browse/SPARK-46147 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon Fix For: 4.0.0 {code} File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 2515, in pyspark.pandas.frame.DataFrame.to_dict Failed example: df.to_dict(into=OrderedDict) Expected: OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])), ('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))]) Got: OrderedDict({'col1': OrderedDict({'row1': 1, 'row2': 2}), 'col2': OrderedDict({'row1': 0.5, 'row2': 0.75})}) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
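The failure comes from Python 3.12 changing OrderedDict's repr from the list-of-pairs form to a dict-literal form. One version-agnostic way to write such a doctest, shown here as a sketch and not necessarily the fix that landed, is to compare contents instead of the repr:

{code:python}
from collections import OrderedDict

od = OrderedDict({0: 1, 1: 2, 2: 3, 3: 4})

# Python 3.11: OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)])
# Python 3.12: OrderedDict({0: 1, 1: 2, 2: 3, 3: 4})
print(repr(od))

# Comparing values rather than the repr keeps a doctest stable across both.
assert dict(od) == {0: 1, 1: 2, 2: 3, 3: 4}
{code}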
[jira] [Updated] (SPARK-46146) Unpin `markupsafe`
[ https://issues.apache.org/jira/browse/SPARK-46146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46146: --- Labels: pull-request-available (was: ) > Unpin `markupsafe` > -- > > Key: SPARK-46146 > URL: https://issues.apache.org/jira/browse/SPARK-46146 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46146) Unpin `markupsafe`
Ruifeng Zheng created SPARK-46146: - Summary: Unpin `markupsafe` Key: SPARK-46146 URL: https://issues.apache.org/jira/browse/SPARK-46146 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46029) Escape the single quote, _ and % for DS V2 pushdown
[ https://issues.apache.org/jira/browse/SPARK-46029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46029. - Fix Version/s: 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 43801 [https://github.com/apache/spark/pull/43801] > Escape the single quote, _ and % for DS V2 pushdown > --- > > Key: SPARK-46029 > URL: https://issues.apache.org/jira/browse/SPARK-46029 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0, 3.5.0, 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: correctness, pull-request-available > Fix For: 3.5.1, 4.0.0 > > > Spark supports pushing down startsWith, endsWith and contains to JDBC databases > with DS V2 pushdown. > But the V2ExpressionSQLBuilder didn't escape the single quote, _ and %, which > can cause unexpected results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
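The gist of the fix: when startsWith/endsWith/contains is compiled into a SQL LIKE, the literal must be escaped so a single quote cannot break out of the string and _ / % match themselves instead of acting as wildcards. A simplified Python sketch of that escaping; Spark's actual code is the Scala V2ExpressionSQLBuilder, and the function names here are invented for illustration:

{code:python}
def escape_like_pattern(literal: str, escape: str = "\\") -> str:
    """Escape LIKE metacharacters so they match literally."""
    out = []
    for ch in literal:
        if ch in (escape, "_", "%"):
            out.append(escape)
        out.append(ch)
    return "".join(out)


def starts_with_sql(column: str, prefix: str) -> str:
    # Single quotes are doubled for the SQL string literal itself; _ and %
    # are escaped so they no longer act as wildcards.
    pattern = escape_like_pattern(prefix).replace("'", "''")
    return f"{column} LIKE '{pattern}%' ESCAPE '\\'"


print(starts_with_sql("name", "50%_off"))  # name LIKE '50\%\_off%' ESCAPE '\'
{code}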
[jira] [Updated] (SPARK-46108) XML: keepInnerXmlAsRaw option
[ https://issues.apache.org/jira/browse/SPARK-46108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46108: --- Labels: pull-request-available (was: ) > XML: keepInnerXmlAsRaw option > - > > Key: SPARK-46108 > URL: https://issues.apache.org/jira/browse/SPARK-46108 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Ufuk Süngü >Priority: Major > Labels: pull-request-available > > The built-in XML data source returns the value and schema of inner or nested > elements. However, developers must manually perform additional operations to > convert that unstructured data into a structured, tabular format. If nested > elements are kept in a format that is valid XML (at each level), they can > easily be converted to a structured, tabular format with the existing methods > that have already been developed (the infer method of XmlInferSchema and the > parseColumn method of StaxXmlParser). Therefore, there should be an option > affecting the StaxXmlParser and XmlInferSchema classes that keeps inner XML > elements in their original, raw format. > https://github.com/apache/spark/pull/44022 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
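If the option lands as proposed, usage would presumably look something like the following. This is a hypothetical sketch: the option name is taken from this ticket, the file path is a placeholder, and the final API may differ.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("xml")
    .option("rowTag", "book")
    .option("keepInnerXmlAsRaw", "true")  # proposed: keep nested elements as raw XML
    .load("books.xml")  # placeholder path
)
df.printSchema()  # nested elements would surface as raw XML string columns
{code}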
[jira] [Resolved] (SPARK-46055) Refactor Catalog Database APIs implementation
[ https://issues.apache.org/jira/browse/SPARK-46055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46055. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43959 [https://github.com/apache/spark/pull/43959] > Refactor Catalog Database APIs implementation > - > > Key: SPARK-46055 > URL: https://issues.apache.org/jira/browse/SPARK-46055 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yihong He >Assignee: Yihong He >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46142) Remove `dev/ansible-for-test-node` directory
[ https://issues.apache.org/jira/browse/SPARK-46142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-46142: Assignee: Dongjoon Hyun > Remove `dev/ansible-for-test-node` directory > > > Key: SPARK-46142 > URL: https://issues.apache.org/jira/browse/SPARK-46142 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46142) Remove `dev/ansible-for-test-node` directory
[ https://issues.apache.org/jira/browse/SPARK-46142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-46142. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44059 [https://github.com/apache/spark/pull/44059] > Remove `dev/ansible-for-test-node` directory > > > Key: SPARK-46142 > URL: https://issues.apache.org/jira/browse/SPARK-46142 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46145) spark.catalog.listTables does not throw exception when the table or view is not found
[ https://issues.apache.org/jira/browse/SPARK-46145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46145: --- Labels: pull-request-available (was: ) > spark.catalog.listTables does not throw exception when the table or view is > not found > - > > Key: SPARK-46145 > URL: https://issues.apache.org/jira/browse/SPARK-46145 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46145) spark.catalog.listTables does not throw exception when the table or view is not found
Rui Wang created SPARK-46145: Summary: spark.catalog.listTables does not throw exception when the table or view is not found Key: SPARK-46145 URL: https://issues.apache.org/jira/browse/SPARK-46145 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
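A rough illustration of the contrast this ticket describes, sketched with placeholder table names: enumeration APIs return whatever currently exists, while point lookups raise when the object is missing.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Listing simply yields the tables and views that exist right now; it does
# not raise for anything that is absent.
for table in spark.catalog.listTables():
    print(table.name)

# A point lookup on a missing table raises AnalysisException instead.
spark.catalog.getTable("no_such_table")
{code}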
[jira] [Updated] (SPARK-46144) Fail INSERT INTO ... REPLACE statement if the condition contains subquery
[ https://issues.apache.org/jira/browse/SPARK-46144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46144: --- Labels: pull-request-available (was: ) > Fail INSERT INTO ... REPLACE statement if the condition contains subquery > - > > Key: SPARK-46144 > URL: https://issues.apache.org/jira/browse/SPARK-46144 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Priority: Major > Labels: pull-request-available > > For the following query: > {code:java} > INSERT INTO tbl REPLACE WHERE id = (select c2 from values(1) as t(c2)) SELECT > * FROM source{code} > There will be an analysis error: > {code:java} > [UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column, variable, or function > parameter with name `c2` cannot be resolved. SQLSTATE: 42703; line 1 pos 51; > 'OverwriteByExpression RelationV2[id#27L, data#28] testcat.tbl testcat.tbl, > (id#27L = scalar-subquery#26 []), false {code} > The error message is confusing. The actual reason is that the > OverwriteByExpression plan doesn't support subqueries. While supporting the > feature is non-trivial, we should improve the error message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
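Until subqueries are supported (or the error message improves), one possible workaround, sketched here under the ticket's example table names, is to evaluate the scalar subquery separately and inline the result so the REPLACE WHERE condition no longer contains a subquery:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Evaluate the scalar subquery on its own first ...
threshold = spark.sql("SELECT c2 FROM VALUES (1) AS t(c2)").first()[0]

# ... then inline the literal in the OverwriteByExpression condition.
spark.sql(
    f"INSERT INTO tbl REPLACE WHERE id = {threshold} SELECT * FROM source"
)
{code}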
[jira] [Updated] (SPARK-46144) Fail INSERT INTO ... REPLACE statement if the condition contains subquery
[ https://issues.apache.org/jira/browse/SPARK-46144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-46144: --- Description: For the following query: {code:java} INSERT INTO tbl REPLACE WHERE id = (select c2 from values(1) as t(c2)) SELECT * FROM source{code} There will be an analysis error: {code:java} [UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column, variable, or function parameter with name `c2` cannot be resolved. SQLSTATE: 42703; line 1 pos 51; 'OverwriteByExpression RelationV2[id#27L, data#28] testcat.tbl testcat.tbl, (id#27L = scalar-subquery#26 []), false {code} The error message is confusing. The actual reason is that the OverwriteByExpression plan doesn't support subqueries. While supporting the feature is non-trivial, we should improve the error message. was: For the following query: {code:java} INSERT INTO tbl REPLACE WHERE id = (select c2 from values(1) as t(c2)) SELECT * FROM source{code} There will be analysis > Fail INSERT INTO ... REPLACE statement if the condition contains subquery > - > > Key: SPARK-46144 > URL: https://issues.apache.org/jira/browse/SPARK-46144 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Priority: Major > > For the following query: > {code:java} > INSERT INTO tbl REPLACE WHERE id = (select c2 from values(1) as t(c2)) SELECT > * FROM source{code} > There will be an analysis error: > {code:java} > [UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column, variable, or function > parameter with name `c2` cannot be resolved. SQLSTATE: 42703; line 1 pos 51; > 'OverwriteByExpression RelationV2[id#27L, data#28] testcat.tbl testcat.tbl, > (id#27L = scalar-subquery#26 []), false {code} > The error message is confusing. The actual reason is that the > OverwriteByExpression plan doesn't support subqueries. While supporting the > feature is non-trivial, we should improve the error message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46144) Fail INSERT INTO ... REPLACE statement if the condition contains subquery
[ https://issues.apache.org/jira/browse/SPARK-46144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-46144: --- Description: For the following query: {code:java} INSERT INTO tbl REPLACE WHERE id = (select c2 from values(1) as t(c2)) SELECT * FROM source{code} There will be analysis > Fail INSERT INTO ... REPLACE statement if the condition contains subquery > - > > Key: SPARK-46144 > URL: https://issues.apache.org/jira/browse/SPARK-46144 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Priority: Major > > For the following query: > {code:java} > INSERT INTO tbl REPLACE WHERE id = (select c2 from values(1) as t(c2)) SELECT > * FROM source{code} > There will be analysis -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46144) Fail INSERT INTO ... REPLACE statement if the condition contains subquery
[ https://issues.apache.org/jira/browse/SPARK-46144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-46144: --- Affects Version/s: 4.0.0 (was: 3.5.0) > Fail INSERT INTO ... REPLACE statement if the condition contains subquery > - > > Key: SPARK-46144 > URL: https://issues.apache.org/jira/browse/SPARK-46144 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46144) Fail INSERT INTO ... REPLACE statement if the condition contains subquery
Gengliang Wang created SPARK-46144: -- Summary: Fail INSERT INTO ... REPLACE statement if the condition contains subquery Key: SPARK-46144 URL: https://issues.apache.org/jira/browse/SPARK-46144 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.5.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46125) Memory leak when using createDataFrame with persist
[ https://issues.apache.org/jira/browse/SPARK-46125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790751#comment-17790751 ] Josh Rosen commented on SPARK-46125: I think that this issue relates specifically to `createDataFrame` and other mechanisms for creating Datasets or RDDs from driver-side data. I was able to reproduce the memory effects that you reported using a synthetic dataset:
{code:python}
import numpy as np
import pandas as pd

n_rows = 100
n_cols = 1000  # column count not given in the original comment; any wide frame will do
data = np.random.randn(n_rows, n_cols)
pdf = pd.DataFrame(data, columns=[f'Column_{i}' for i in range(n_cols)])
{code}
I took heap dumps in the "with unpersist" and "without unpersist" cases and saw that most of the difference was due to `byte[]` arrays. That, in turn, is due to ParallelCollectionPartitions being kept alive in a ParallelCollectionRDD that is retained by the CacheManager. When you cache a query, Spark keeps the physical query plan alive so that it can recompute cached data if it is lost (e.g. due to a node failure). For Datasets or RDDs that are created from data on the driver, that driver-side data is kept alive. It's this CacheManager reference to the physical plan which is keeping the source RDD from being cleaned: this is why `del df` followed by GC does not clean up the RDD's memory. --- If you use `localCheckpoint` then Spark will persist the data to disk and truncate the RDD lineage, thereby avoiding driver-side memory consumption from the parallel collection RDD, but this will have the side effect of removing fault-tolerance: if any node is lost then the data will be lost and any attempts to access it will result in query failures. !image-2023-11-28-12-55-58-461.png! > Memory leak when using createDataFrame with persist > --- > > Key: SPARK-46125 > URL: https://issues.apache.org/jira/browse/SPARK-46125 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 3.5.0 >Reporter: Arman Yazdani >Priority: Major > Labels: PySpark, memory-leak, persist > Attachments: CreateDataFrameWithUnpersist.png, > CreateDataFrameWithoutUnpersist.png, ReadParquetWithoutUnpersist.png, > image-2023-11-28-12-55-58-461.png > > > When I create a dataset from pandas data frame and persisting it (DISK_ONLY), > some "byte[]" objects (total size of imported data frame) will still remain > in the driver's heap memory. > This is the sample code for reproducing it: > {code:python} > import pandas as pd > import gc > from pyspark.sql import SparkSession > from pyspark.storagelevel import StorageLevel > spark = SparkSession.builder \ > .config("spark.driver.memory", "4g") \ > .config("spark.executor.memory", "4g") \ > .config("spark.sql.execution.arrow.pyspark.enabled", "true") \ > .getOrCreate() > pdf = pd.read_pickle('tmp/input.pickle') > df = spark.createDataFrame(pdf) > df = df.persist(storageLevel=StorageLevel.DISK_ONLY) > df.count() > del pdf > del df > gc.collect() > spark.sparkContext._jvm.System.gc(){code} > After running this code, I will perform a manual GC in VisualVM, but the > driver memory usage will remain at 550 MBs (at start it was about 50 MBs). > !CreateDataFrameWithoutUnpersist.png|width=467,height=349! > Then I tested with adding {{"df = df.unpersist()"}} after the > {{"df.count()"}} line and everything was OK (Memory usage after performing > manual GC was about 50 MBs). > !CreateDataFrameWithUnpersist.png|width=468,height=300! 
> Also, I tried with reading from parquet file (without adding unpersist line) > with this code: > {code:python} > import gc > from pyspark.sql import SparkSession > from pyspark.storagelevel import StorageLevel > spark = SparkSession.builder \ > .config("spark.driver.memory", "4g") \ > .config("spark.executor.memory", "4g") \ > .config("spark.sql.execution.arrow.pyspark.enabled", "true") \ > .getOrCreate() > df = spark.read.parquet('tmp/input.parquet') > df = df.persist(storageLevel=StorageLevel.DISK_ONLY) > df.count() > del df > gc.collect() > spark.sparkContext._jvm.System.gc(){code} > Again everything was fine and memory usage was about 50 MBs after performing > manual GC. > !ReadParquetWithoutUnpersist.png|width=473,height=302! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
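A sketch of the localCheckpoint workaround described in the comment above; the frame size and column names are arbitrary, and the fault-tolerance caveat is repeated in the comments:

{code:python}
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame(np.random.randn(10_000, 10), columns=[f"c{i}" for i in range(10)])
df = spark.createDataFrame(pdf)

# localCheckpoint materializes the data on the executors and truncates the
# lineage, so the CacheManager no longer pins the driver-side
# ParallelCollectionRDD. Caveat: the truncated lineage cannot be recomputed,
# so losing an executor loses the data.
df = df.localCheckpoint(eager=True)
df.persist().count()
{code}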
[jira] [Updated] (SPARK-46125) Memory leak when using createDataFrame with persist
[ https://issues.apache.org/jira/browse/SPARK-46125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-46125: --- Attachment: image-2023-11-28-12-55-58-461.png > Memory leak when using createDataFrame with persist > --- > > Key: SPARK-46125 > URL: https://issues.apache.org/jira/browse/SPARK-46125 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 3.5.0 >Reporter: Arman Yazdani >Priority: Major > Labels: PySpark, memory-leak, persist > Attachments: CreateDataFrameWithUnpersist.png, > CreateDataFrameWithoutUnpersist.png, ReadParquetWithoutUnpersist.png, > image-2023-11-28-12-55-58-461.png > > > When I create a dataset from pandas data frame and persisting it (DISK_ONLY), > some "byte[]" objects (total size of imported data frame) will still remain > in the driver's heap memory. > This is the sample code for reproducing it: > {code:python} > import pandas as pd > import gc > from pyspark.sql import SparkSession > from pyspark.storagelevel import StorageLevel > spark = SparkSession.builder \ > .config("spark.driver.memory", "4g") \ > .config("spark.executor.memory", "4g") \ > .config("spark.sql.execution.arrow.pyspark.enabled", "true") \ > .getOrCreate() > pdf = pd.read_pickle('tmp/input.pickle') > df = spark.createDataFrame(pdf) > df = df.persist(storageLevel=StorageLevel.DISK_ONLY) > df.count() > del pdf > del df > gc.collect() > spark.sparkContext._jvm.System.gc(){code} > After running this code, I will perform a manual GC in VisualVM, but the > driver memory usage will remain at 550 MBs (at start it was about 50 MBs). > !CreateDataFrameWithoutUnpersist.png|width=467,height=349! > Then I tested with adding {{"df = df.unpersist()"}} after the > {{"df.count()"}} line and everything was OK (Memory usage after performing > manual GC was about 50 MBs). > !CreateDataFrameWithUnpersist.png|width=468,height=300! > Also, I tried with reading from parquet file (without adding unpersist line) > with this code: > {code:python} > import gc > from pyspark.sql import SparkSession > from pyspark.storagelevel import StorageLevel > spark = SparkSession.builder \ > .config("spark.driver.memory", "4g") \ > .config("spark.executor.memory", "4g") \ > .config("spark.sql.execution.arrow.pyspark.enabled", "true") \ > .getOrCreate() > df = spark.read.parquet('tmp/input.parquet') > df = df.persist(storageLevel=StorageLevel.DISK_ONLY) > df.count() > del df > gc.collect() > spark.sparkContext._jvm.System.gc(){code} > Again everything was fine and memory usage was about 50 MBs after performing > manual GC. > !ReadParquetWithoutUnpersist.png|width=473,height=302! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46105) df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above
[ https://issues.apache.org/jira/browse/SPARK-46105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790742#comment-17790742 ] Josh Rosen commented on SPARK-46105: {quote}The reason for raising this as a bug is I have a scenario where my final dataframe returns 0 records in EKS (local spark) with a single node (driver and executor on the same node) but it returns 1 in EMR; both use the same Spark version, 3.3.3. {quote} To clarify: by "returns 0 records", are you referring to the record count of the data frame (i.e. whether isEmpty returns true or false) or to the partition count? In other words, are you saying that EMR returns an incorrect record count or do you mean that it returns an unexpected partition count? > df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above > --- > > Key: SPARK-46105 > URL: https://issues.apache.org/jira/browse/SPARK-46105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.3 > Environment: EKS > EMR >Reporter: dharani_sugumar >Priority: Major > Attachments: Screenshot 2023-11-26 at 11.54.58 AM.png > > > Version: 3.3.3 > > scala> val df = spark.emptyDataFrame > df: org.apache.spark.sql.DataFrame = [] > scala> df.rdd.getNumPartitions > res0: Int = 0 > scala> df.repartition(1).rdd.getNumPartitions > res1: Int = 1 > scala> df.repartition(1).rdd.isEmpty() > [Stage 1:> > (0 + 1) / > res2: Boolean = true > Version: 3.2.4 > scala> val df = spark.emptyDataFrame > df: org.apache.spark.sql.DataFrame = [] > scala> df.rdd.getNumPartitions > res0: Int = 0 > scala> df.repartition(1).rdd.getNumPartitions > res1: Int = 0 > scala> df.repartition(1).rdd.isEmpty() > res2: Boolean = true > > Version: 3.5.0 > scala> val df = spark.emptyDataFrame > df: org.apache.spark.sql.DataFrame = [] > scala> df.rdd.getNumPartitions > res0: Int = 0 > scala> df.repartition(1).rdd.getNumPartitions > res1: Int = 1 > scala> df.repartition(1).rdd.isEmpty() > [Stage 1:> > (0 + 1) / > res2: Boolean = true > > When we do a repartition of 1 on an empty dataframe, the resultant partition count is > 1 in versions 3.3.x and 3.5.x, whereas when I do the same in version 3.2.x, the > resultant partition count is 0. May I know why this behaviour changed from 3.2.x > to higher versions. > > The reason for raising this as a bug is I have a scenario where my final > dataframe returns 0 records in EKS (local spark) with a single node (driver and > executor on the same node) but it returns 1 in EMR; both use the same Spark > version, 3.3.3. I'm not sure why this behaves differently in the two > environments. As an interim solution, I had to repartition an empty dataframe > if my final dataframe is empty, which returns 1 for 3.3.3. I would like to know > if this is really a bug, or whether this behaviour will persist in future versions and > cannot be changed? > > Because if we go for a Spark upgrade and this behaviour changes, we will > face the issue again. > Please confirm on this. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
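To separate the two quantities being discussed, a short check; this is a sketch in PySpark for consistency with the other examples, whereas the ticket's transcript uses the Scala shell:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([], StructType([])).repartition(1)

# Partition count and record count are independent: on Spark 3.3+ the
# repartitioned empty DataFrame reports one (empty) partition but zero rows.
print(df.rdd.getNumPartitions())  # 1
print(df.count())                 # 0
print(df.isEmpty())               # True
{code}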
[jira] [Updated] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matheus Pavanetti updated SPARK-46143: -- Environment: pyspark 3.4.1.5.3 build 20230713. Running on Microsoft Fabric workspace. was: Apache spark 3.4.1.5.3 build 20230713. Running on Microsoft Fabric workspace. > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: pyspark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace. > > >Reporter: Matheus Pavanetti >Priority: Major > Attachments: MicrosoftTeams-image.png, > image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png > > > Hello, > I would like to report an issue with pyspark.pandas implementation on > read_excel function. > Microsoft Fabric spark environment 1.2 (runtime) uses pyspark 3.4.1 which > potentially uses an older version of pandas on it's implementations of > pyspark.pandas. > The function read_excel from pandas doesn't expect a parameter called > "squeeze" however it's implemented as part of pyspark.pandas and the > parameter "squeeze" is being passed to the pandas function. > > !image-2023-11-28-13-20-40-275.png! > > I've been digging into it for further investigation into pyspark 3.4.1 > documentation > [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel|https://mcas-proxyweb.mcas.ms/certificate-checker?login=false=https%3A%2F%2Fspark.apache.org.mcas.ms%2Fdocs%2F3.4.1%2Fapi%2Fpython%2F_modules%2Fpyspark%2Fpandas%2Fnamespace.html%3FMcasTsid%3D20893%23read_excel=92c0f0a0811f59386edd92fd5f3fcb0ac451ce363b3f2e01ed076f45e2b20500] > > This is the point I found that "squeeze" parameter is being passed to pandas > read_excel function which is not expected. > It seems like it was deprecated as part of pyspark 3.4.0 but still being used > in the implementation. > > !image-2023-11-28-13-20-51-291.png! > > I believe this is an issue with pyspark implementation 3.4.1 not necessaily > with fabric. However fabric uses this version as its 1.2 build. > > I am able to work around that for now by download the excel from the one lake > to the spark driver, loading that to the memory with pandas and then > converting to a spark dataframe etc or I made it work downgrading the build > I downloaded the pyspark build 20230713 to my local, made the changes and > re-compiled it and it worked locally. So it means that is related to the > implementation and they would have to fix or I do a downgrade to older > version like 3.3.3 or try the latest 3.5.0 which is not the case for fabric > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matheus Pavanetti updated SPARK-46143: -- Description: Hello, I would like to report an issue with the pyspark.pandas implementation of the read_excel function. Microsoft Fabric Spark environment 1.2 (runtime) uses pyspark 3.4.1, which potentially uses an older version of pandas in its implementation of pyspark.pandas. The read_excel function from pandas doesn't expect a parameter called "squeeze"; however, it is implemented as part of pyspark.pandas, and the parameter "squeeze" is being passed to the pandas function. !image-2023-11-28-13-20-40-275.png! I've been digging into it for further investigation in the pyspark 3.4.1 documentation: [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel] This is the point where I found that the "squeeze" parameter is being passed to the pandas read_excel function, which is not expected. It seems it was deprecated as part of pyspark 3.4.0 but is still being used in the implementation. !image-2023-11-28-13-20-51-291.png! I believe this is an issue with the pyspark 3.4.1 implementation, not necessarily with Fabric. However, Fabric uses this version as its 1.2 build. I am able to work around it for now by downloading the Excel file from OneLake to the Spark driver, loading it into memory with pandas and then converting it to a Spark dataframe, etc. I also made it work by downgrading the build: I downloaded the pyspark build 20230713 to my local machine, made the changes, re-compiled it, and it worked locally. So it means this is related to the implementation and would have to be fixed there, or I would have to downgrade to an older version like 3.3.3 or try the latest 3.5.0, which is not the case for Fabric was: Hello, I would like to report an issue with the pyspark.pandas implementation of the read_excel function. Microsoft Fabric Spark environment 1.2 (runtime) uses pyspark 3.4.1, which potentially uses an older version of pandas in its implementation of pyspark.pandas. The read_excel function from pandas doesn't expect a parameter called "squeeze"; however, it is implemented as part of pyspark.pandas, and the parameter "squeeze" is being passed to the pandas function. !image-2023-11-28-13-20-40-275.png! I've been digging into it for further investigation in the pyspark 3.4.1 documentation: [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel] This is the point where I found that the "squeeze" parameter is being passed to the pandas read_excel function, which is not expected. It seems it was deprecated as part of pyspark 3.4.0 but is still being used in the implementation. !image-2023-11-28-13-20-51-291.png! I believe this is an issue with the pyspark 3.4.1 implementation, not necessarily with Fabric. However, Fabric uses this version as its 1.2 build. 
I am able to work around it for now by downloading the Excel file from OneLake to the Spark driver, loading it into memory with pandas and then converting it to a Spark dataframe, etc. I also made it work by downgrading the build: I downloaded the pyspark build 20230713 to my local machine, made the changes, re-compiled it, and it worked locally. So it means this is related to the implementation and would have to be fixed there, or I would have to downgrade to an older version like 3.3.0 or try the latest 3.5.0, which is not the case for Fabric > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: Apache spark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace. > > >Reporter: Matheus Pavanetti >Priority: Major > Attachments: MicrosoftTeams-image.png, > image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png > > > Hello, > I would like to report an issue with the pyspark.pandas implementation of the > read_excel function. > Microsoft Fabric Spark environment 1.2 (runtime) uses pyspark 3.4.1, which > potentially uses an older version of pandas in its implementation of > pyspark.pandas. >
[jira] [Updated] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matheus Pavanetti updated SPARK-46143: -- Attachment: image-2023-11-28-13-20-51-291.png > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: Apache spark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace. > > >Reporter: Matheus Pavanetti >Priority: Major > Attachments: MicrosoftTeams-image.png, > image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png > > > Hello, > I would like to report an issue with pyspark.pandas implementation on > read_excel function. > Microsoft Fabric spark environment 1.2 (runtime) uses pyspark 3.4.1 which > potentially uses an older version of pandas on it's implementations of > pyspark.pandas. > The function read_excel from pandas doesn't expect a parameter called > "squeeze" however it's implemented as part of pyspark.pandas and the > parameter "squeeze" is being passed to the pandas function. > > !Z! > > I've been digging into it for further investigation into pyspark 3.4.1 > documentation > [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel|https://mcas-proxyweb.mcas.ms/certificate-checker?login=false=https%3A%2F%2Fspark.apache.org.mcas.ms%2Fdocs%2F3.4.1%2Fapi%2Fpython%2F_modules%2Fpyspark%2Fpandas%2Fnamespace.html%3FMcasTsid%3D20893%23read_excel=92c0f0a0811f59386edd92fd5f3fcb0ac451ce363b3f2e01ed076f45e2b20500] > > This is the point I found that "squeeze" parameter is being passed to pandas > read_excel function which is not expected. > It seems like it was deprecated as part of pyspark 3.4.0 but still being used > in the implementation. > > !9k=! > > I believe this is an issue with pyspark implementation 3.4.1 not necessaily > with fabric. However fabric uses this version as its 1.2 build. > > I am able to work around that for now by download the excel from the one lake > to the spark driver, loading that to the memory with pandas and then > converting to a spark dataframe etc or I made it work downgrading the build > I downloaded the pyspark build 20230713 to my local, made the changes and > re-compiled it and it worked locally. So it means that is related to the > implementation and they would have to fix or I do a downgrade to older > version like 3.3.0 or try the latest 3.5.0 which is not the case for fabric > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matheus Pavanetti updated SPARK-46143: -- Attachment: image-2023-11-28-13-20-40-275.png > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: Apache Spark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace. > > >Reporter: Matheus Pavanetti >Priority: Major > Attachments: MicrosoftTeams-image.png, > image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png > > > Hello, > I would like to report an issue with the pyspark.pandas implementation of the read_excel function. > The Microsoft Fabric Spark environment 1.2 (runtime) uses PySpark 3.4.1, which potentially relies on an older version of pandas in its implementation of pyspark.pandas. > pandas' read_excel function does not accept a parameter called "squeeze"; however, pyspark.pandas still implements it and passes "squeeze" through to the pandas function. > > !image-2023-11-28-13-20-40-275.png! > > I dug into the PySpark 3.4.1 documentation for further investigation: > [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel] > > This is where I found that the "squeeze" parameter is passed to pandas' read_excel function, which does not expect it. > It appears the parameter was deprecated as of PySpark 3.4.0 but is still used in the implementation. > > !image-2023-11-28-13-20-51-291.png! > > I believe this is an issue with the PySpark 3.4.1 implementation, not necessarily with Fabric; however, Fabric uses this version in its 1.2 build. > > I am able to work around this for now by downloading the Excel file from OneLake to the Spark driver, loading it into memory with pandas, and then converting it to a Spark DataFrame. Alternatively, I made it work by downgrading the build: I downloaded the PySpark build 20230713 locally, made the changes, re-compiled it, and it worked. So the issue is in the PySpark implementation; either it gets fixed there, or I have to downgrade to an older version such as 3.3.0 or try the latest 3.5.0, neither of which is an option on Fabric > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matheus Pavanetti updated SPARK-46143: -- Description: Hello, I would like to report an issue with the pyspark.pandas implementation of the read_excel function. The Microsoft Fabric Spark environment 1.2 (runtime) uses PySpark 3.4.1, which potentially relies on an older version of pandas in its implementation of pyspark.pandas. pandas' read_excel function does not accept a parameter called "squeeze"; however, pyspark.pandas still implements it and passes "squeeze" through to the pandas function. !image-2023-11-28-13-20-40-275.png! I dug into the PySpark 3.4.1 documentation for further investigation: [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel] This is where I found that the "squeeze" parameter is passed to pandas' read_excel function, which does not expect it. It appears the parameter was deprecated as of PySpark 3.4.0 but is still used in the implementation. !image-2023-11-28-13-20-51-291.png! I believe this is an issue with the PySpark 3.4.1 implementation, not necessarily with Fabric; however, Fabric uses this version in its 1.2 build. I am able to work around this for now by downloading the Excel file from OneLake to the Spark driver, loading it into memory with pandas, and then converting it to a Spark DataFrame. Alternatively, I made it work by downgrading the build: I downloaded the PySpark build 20230713 locally, made the changes, re-compiled it, and it worked. So the issue is in the PySpark implementation; either it gets fixed there, or I have to downgrade to an older version such as 3.3.0 or try the latest 3.5.0, neither of which is an option on Fabric. was: Hello, I would like to report an issue with the pyspark.pandas implementation of the read_excel function. The Microsoft Fabric Spark environment 1.2 (runtime) uses PySpark 3.4.1, which potentially relies on an older version of pandas in its implementation of pyspark.pandas. pandas' read_excel function does not accept a parameter called "squeeze"; however, pyspark.pandas still implements it and passes "squeeze" through to the pandas function. !Z! I dug into the PySpark 3.4.1 documentation for further investigation: [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel] This is where I found that the "squeeze" parameter is passed to pandas' read_excel function, which does not expect it. It appears the parameter was deprecated as of PySpark 3.4.0 but is still used in the implementation. !9k=! I believe this is an issue with the PySpark 3.4.1 implementation, not necessarily with Fabric; however, Fabric uses this version in its 1.2 build. I am able to work around this for now by downloading the Excel file from OneLake to the Spark driver, loading it into memory with pandas, and then converting it to a Spark DataFrame. Alternatively, I made it work by downgrading the build: I downloaded the PySpark build 20230713 locally, made the changes, re-compiled it, and it worked. So the issue is in the PySpark implementation; either it gets fixed there, or I have to downgrade to an older version such as 3.3.0 or try the latest 3.5.0, neither of which is an option on Fabric. > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: Apache Spark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace. > > >Reporter: Matheus Pavanetti >Priority: Major > Attachments: MicrosoftTeams-image.png, > image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png > > > Hello, > I would like to report an issue with the pyspark.pandas implementation of the read_excel function. > The Microsoft Fabric Spark environment 1.2 (runtime) uses PySpark 3.4.1, which potentially relies on an older version of pandas in its implementation of pyspark.pandas.
[jira] [Updated] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matheus Pavanetti updated SPARK-46143: -- Attachment: MicrosoftTeams-image.png > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: Apache Spark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace. > > >Reporter: Matheus Pavanetti >Priority: Major > Attachments: MicrosoftTeams-image.png, > image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png > > > Hello, > I would like to report an issue with the pyspark.pandas implementation of the read_excel function. > The Microsoft Fabric Spark environment 1.2 (runtime) uses PySpark 3.4.1, which potentially relies on an older version of pandas in its implementation of pyspark.pandas. > pandas' read_excel function does not accept a parameter called "squeeze"; however, pyspark.pandas still implements it and passes "squeeze" through to the pandas function. > > !image-2023-11-28-13-20-40-275.png! > > I dug into the PySpark 3.4.1 documentation for further investigation: > [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel] > > This is where I found that the "squeeze" parameter is passed to pandas' read_excel function, which does not expect it. > It appears the parameter was deprecated as of PySpark 3.4.0 but is still used in the implementation. > > !image-2023-11-28-13-20-51-291.png! > > I believe this is an issue with the PySpark 3.4.1 implementation, not necessarily with Fabric; however, Fabric uses this version in its 1.2 build. > > I am able to work around this for now by downloading the Excel file from OneLake to the Spark driver, loading it into memory with pandas, and then converting it to a Spark DataFrame. Alternatively, I made it work by downgrading the build: I downloaded the PySpark build 20230713 locally, made the changes, re-compiled it, and it worked. So the issue is in the PySpark implementation; either it gets fixed there, or I have to downgrade to an older version such as 3.3.0 or try the latest 3.5.0, neither of which is an option on Fabric > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
Matheus Pavanetti created SPARK-46143: - Summary: pyspark.pandas read_excel implementation at version 3.4.1 Key: SPARK-46143 URL: https://issues.apache.org/jira/browse/SPARK-46143 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.4.1 Environment: Apache Spark 3.4.1.5.3 build 20230713. Running on Microsoft Fabric workspace. Reporter: Matheus Pavanetti Hello, I would like to report an issue with the pyspark.pandas implementation of the read_excel function. The Microsoft Fabric Spark environment 1.2 (runtime) uses PySpark 3.4.1, which potentially relies on an older version of pandas in its implementation of pyspark.pandas. pandas' read_excel function does not accept a parameter called "squeeze"; however, pyspark.pandas still implements it and passes "squeeze" through to the pandas function. !Z! I dug into the PySpark 3.4.1 documentation for further investigation: [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel] This is where I found that the "squeeze" parameter is passed to pandas' read_excel function, which does not expect it. It appears the parameter was deprecated as of PySpark 3.4.0 but is still used in the implementation. !9k=! I believe this is an issue with the PySpark 3.4.1 implementation, not necessarily with Fabric; however, Fabric uses this version in its 1.2 build. I am able to work around this for now by downloading the Excel file from OneLake to the Spark driver, loading it into memory with pandas, and then converting it to a Spark DataFrame. Alternatively, I made it work by downgrading the build: I downloaded the PySpark build 20230713 locally, made the changes, re-compiled it, and it worked. So the issue is in the PySpark implementation; either it gets fixed there, or I have to downgrade to an older version such as 3.3.0 or try the latest 3.5.0, neither of which is an option on Fabric. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
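As a minimal illustration of the failure mode described in this report (the file path is hypothetical, and the exact error text depends on the pandas version installed on the driver):

{code:python}
import pyspark.pandas as ps

# On PySpark 3.4.1 with a recent pandas (2.x) on the driver, this call fails
# because the pyspark.pandas wrapper forwards the `squeeze` keyword to
# pandas.read_excel, which no longer accepts it, e.g.:
#   TypeError: read_excel() got an unexpected keyword argument 'squeeze'
psdf = ps.read_excel("/tmp/report.xlsx")
{code}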
[jira] [Resolved] (SPARK-46134) Replace `df.take(1).toSeq(0)` with `df.first()` in `ProtobufFunctionsSuite`
[ https://issues.apache.org/jira/browse/SPARK-46134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46134. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44048 [https://github.com/apache/spark/pull/44048] > Replace `df.take(1).toSeq(0)` with `df.first()` in `ProtobufFunctionsSuite` > --- > > Key: SPARK-46134 > URL: https://issues.apache.org/jira/browse/SPARK-46134 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > In `ProtobufFunctionsSuite`, there are some cases where `df.take(1).toSeq(0)` is used to get the first row of a DataFrame. The same can be achieved with the `.first()` API, which is clearer and more concise. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46134) Replace `df.take(1).toSeq(0)` with `df.first()` in `ProtobufFunctionsSuite`
[ https://issues.apache.org/jira/browse/SPARK-46134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46134: - Assignee: Yang Jie > Replace `df.take(1).toSeq(0)` with `df.first()` in `ProtobufFunctionsSuite` > --- > > Key: SPARK-46134 > URL: https://issues.apache.org/jira/browse/SPARK-46134 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > > In `ProtobufFunctionsSuite`, there are some cases where `df.take(1).toSeq(0)` is used to get the first row of a DataFrame. The same can be achieved with the `.first()` API, which is clearer and more concise. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
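The suite change itself is in Scala; as a small illustration of the same idiom, here is a PySpark sketch showing that the two calls return the same row:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)

# take(1) collects a one-element list that then has to be indexed;
# first() returns the same Row and states the intent directly.
row_via_take = df.take(1)[0]
row_via_first = df.first()
assert row_via_take == row_via_first
{code}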
[jira] [Updated] (SPARK-46142) Remove `dev/ansible-for-test-node` directory
[ https://issues.apache.org/jira/browse/SPARK-46142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46142: --- Labels: pull-request-available (was: ) > Remove `dev/ansible-for-test-node` directory > > > Key: SPARK-46142 > URL: https://issues.apache.org/jira/browse/SPARK-46142 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46142) Remove `dev/ansible-for-test-node` directory
Dongjoon Hyun created SPARK-46142: - Summary: Remove `dev/ansible-for-test-node` directory Key: SPARK-46142 URL: https://issues.apache.org/jira/browse/SPARK-46142 Project: Spark Issue Type: Task Components: Project Infra Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46098) Reduce stack depth by replace (string|array).size with (string|array).length
[ https://issues.apache.org/jira/browse/SPARK-46098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-46098: - Priority: Minor (was: Major) > Reduce stack depth by replace (string|array).size with (string|array).length > > > Key: SPARK-46098 > URL: https://issues.apache.org/jira/browse/SPARK-46098 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Minor > Fix For: 4.0.0 > > > There are a lot of calls to (string|array).size. > In fact, .size delegates to the underlying .length, which increases the stack depth. > We should call (string|array).length directly. > We also get the compiler warning "Replace .size with .length on arrays and strings". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46098) Reduce stack depth by replace (string|array).size with (string|array).length
[ https://issues.apache.org/jira/browse/SPARK-46098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-46098. -- Fix Version/s: 4.0.0 Resolution: Fixed > Reduce stack depth by replace (string|array).size with (string|array).length > > > Key: SPARK-46098 > URL: https://issues.apache.org/jira/browse/SPARK-46098 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Fix For: 4.0.0 > > > There are a lot of calls to (string|array).size. > In fact, .size delegates to the underlying .length, which increases the stack depth. > We should call (string|array).length directly. > We also get the compiler warning "Replace .size with .length on arrays and strings". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46101) Replace (string|array).size with (string|array).length in all the modules
[ https://issues.apache.org/jira/browse/SPARK-46101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-46101. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44016 [https://github.com/apache/spark/pull/44016] > Replace (string|array).size with (string|array).length in all the modules > - > > Key: SPARK-46101 > URL: https://issues.apache.org/jira/browse/SPARK-46101 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46141) Change default of spark.sql.legacy.ctePrecedencePolicy from EXCEPTION to CORRECTED
[ https://issues.apache.org/jira/browse/SPARK-46141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46141: --- Labels: pull-request-available (was: ) > Change default of spark.sql.legacy.ctePrecedencePolicy from EXCEPTION to > CORRECTED > -- > > Key: SPARK-46141 > URL: https://issues.apache.org/jira/browse/SPARK-46141 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Priority: Major > Labels: pull-request-available > > spark.sql.legacy.ctePrecedencePolicy has been around for years and defaults to EXCEPTION. > It is high time we changed the default to CORRECTED. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46129) Add GitHub link icon to PySpark documentation header
[ https://issues.apache.org/jira/browse/SPARK-46129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46129: - Assignee: Haejoon Lee > Add GitHub link icon to PySpark documentation header > > > Key: SPARK-46129 > URL: https://issues.apache.org/jira/browse/SPARK-46129 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Add a GitHub link icon to the PySpark documentation header for better accessibility, as the pandas documentation does. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46129) Add GitHub link icon to PySpark documentation header
[ https://issues.apache.org/jira/browse/SPARK-46129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46129. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44043 [https://github.com/apache/spark/pull/44043] > Add GitHub link icon to PySpark documentation header > > > Key: SPARK-46129 > URL: https://issues.apache.org/jira/browse/SPARK-46129 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Add a GitHub link icon to the PySpark documentation header for better accessibility, as the pandas documentation does. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46119) Override toString method for UnresolvedAlias
[ https://issues.apache.org/jira/browse/SPARK-46119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46119: - Assignee: Yuming Wang > Override toString method for UnresolvedAlias > > > Key: SPARK-46119 > URL: https://issues.apache.org/jira/browse/SPARK-46119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46119) Override toString method for UnresolvedAlias
[ https://issues.apache.org/jira/browse/SPARK-46119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46119. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44038 [https://github.com/apache/spark/pull/44038] > Override toString method for UnresolvedAlias > > > Key: SPARK-46119 > URL: https://issues.apache.org/jira/browse/SPARK-46119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46139) Fix QueryExecutionErrorsSuite with Java 21
[ https://issues.apache.org/jira/browse/SPARK-46139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46139: - Assignee: Yang Jie > Fix QueryExecutionErrorsSuite with Java 21 > -- > > Key: SPARK-46139 > URL: https://issues.apache.org/jira/browse/SPARK-46139 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > > [https://github.com/apache/spark/actions/runs/7014008773/job/19081075487] > {code:java} > [info] - FAILED_EXECUTE_UDF: execute user defined function with registered > UDF *** FAILED *** (42 milliseconds) > 15247[info] java.lang.IllegalArgumentException: For parameter 'reason' > value 'java.lang.StringIndexOutOfBoundsException: Range [5, 6) out of bounds > for length 5' does not match: java.lang.StringIndexOutOfBoundsException: > begin 5, end 6, length 5 > 15248[info] at > org.apache.spark.SparkFunSuite.$anonfun$checkError$2(SparkFunSuite.scala:357) > 15249[info] at > org.apache.spark.SparkFunSuite.$anonfun$checkError$2$adapted(SparkFunSuite.scala:352) > 15250[info] at > scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > 15251[info] at > scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > 15252[info] at > scala.collection.AbstractIterable.foreach(Iterable.scala:933) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46139) Fix QueryExecutionErrorsSuite with Java 21
[ https://issues.apache.org/jira/browse/SPARK-46139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46139. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44056 [https://github.com/apache/spark/pull/44056] > Fix QueryExecutionErrorsSuite with Java 21 > -- > > Key: SPARK-46139 > URL: https://issues.apache.org/jira/browse/SPARK-46139 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [https://github.com/apache/spark/actions/runs/7014008773/job/19081075487] > {code:java} > [info] - FAILED_EXECUTE_UDF: execute user defined function with registered > UDF *** FAILED *** (42 milliseconds) > 15247[info] java.lang.IllegalArgumentException: For parameter 'reason' > value 'java.lang.StringIndexOutOfBoundsException: Range [5, 6) out of bounds > for length 5' does not match: java.lang.StringIndexOutOfBoundsException: > begin 5, end 6, length 5 > 15248[info] at > org.apache.spark.SparkFunSuite.$anonfun$checkError$2(SparkFunSuite.scala:357) > 15249[info] at > org.apache.spark.SparkFunSuite.$anonfun$checkError$2$adapted(SparkFunSuite.scala:352) > 15250[info] at > scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > 15251[info] at > scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > 15252[info] at > scala.collection.AbstractIterable.foreach(Iterable.scala:933) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46141) Change default of spark.sql.legacy.ctePrecedencePolicy from EXCEPTION to CORRECTED
Serge Rielau created SPARK-46141: Summary: Change default of spark.sql.legacy.ctePrecedencePolicy from EXCEPTION to CORRECTED Key: SPARK-46141 URL: https://issues.apache.org/jira/browse/SPARK-46141 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Serge Rielau spark.sql.legacy.ctePrecedencePolicy has been around for years and defaults to EXCEPTION. It is high time we changed the default to CORRECTED. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
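For context, a small sketch of what the flag controls, assuming the documented semantics of spark.sql.legacy.ctePrecedencePolicy (EXCEPTION raises an analysis error on conflicting CTE names in nested scopes, while CORRECTED lets the inner definition take precedence):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With the old default (EXCEPTION) the query below fails, because `t` is
# defined in both the outer and the inner scope.
spark.conf.set("spark.sql.legacy.ctePrecedencePolicy", "CORRECTED")

# Under CORRECTED the inner definition wins: this returns v = 2.
spark.sql("""
    WITH t AS (SELECT 1 AS v)
    SELECT * FROM (
        WITH t AS (SELECT 2 AS v)
        SELECT v FROM t
    )
""").show()
{code}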
[jira] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails
[ https://issues.apache.org/jira/browse/SPARK-43106 ] jeanlyn deleted comment on SPARK-43106: - was (Author: jeanlyn): I think we also encountered similar problems; we circumvented this by setting the parameter *spark.sql.hive.convertInsertingPartitionedTable=false* > Data lost from the table if the INSERT OVERWRITE query fails > > > Key: SPARK-43106 > URL: https://issues.apache.org/jira/browse/SPARK-43106 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vaibhav Beriwala >Priority: Major > > When we run an INSERT OVERWRITE query for an unpartitioned table on Spark 3, Spark has the following behavior: > 1) It first cleans up all the data from the actual table path. > 2) It then launches a job that performs the actual insert. > > There are 2 major issues with this approach: > 1) If the insert job launched in step 2 fails for any reason, the data from the original table is lost. > 2) If the insert job in step 2 takes a long time to complete, the table data is unavailable to other readers for the entire duration of the job. > This behavior is the same even for partitioned tables when using static partitioning. For dynamic partitioning, we do not delete the table data before the job launch. > > Is there a reason why we perform this delete before the job launch rather than as part of the job commit operation? This issue does not occur with Hive, where the data is presumably cleaned up as part of the job commit operation. As part of SPARK-19183, we did add a new hook in the commit protocol for this exact purpose, but its default behavior still seems to be to delete the table data before the job launch. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
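As an illustrative sketch of the pattern and the mitigation mentioned in the deleted comment above (the table names are hypothetical, and the flag only affects inserts into partitioned Hive tables):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Mitigation from the (deleted) comment: route inserts into partitioned Hive
# tables through Hive's own insertion path instead of Spark's converted
# datasource path.
spark.conf.set("spark.sql.hive.convertInsertingPartitionedTable", "false")

# The risky pattern from the report: on Spark 3 the target's existing data
# is removed before the rewrite job runs, so a mid-job failure can leave
# `sales` empty until the insert is retried successfully.
spark.sql("INSERT OVERWRITE TABLE sales SELECT * FROM staging_sales")
{code}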
[jira] [Updated] (SPARK-46140) Remove no longer needed JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite`
[ https://issues.apache.org/jira/browse/SPARK-46140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46140: --- Labels: pull-request-available (was: ) > Remove no longer needed JVM module options from the test submission options > of `HiveExternalCatalogVersionsSuite` > - > > Key: SPARK-46140 > URL: https://issues.apache.org/jira/browse/SPARK-46140 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > Labels: pull-request-available > > Complete TODO: > {code:java} > val args = Seq( > "--name", "prepare testing tables", > "--master", "local[2]", > "--conf", s"${UI_ENABLED.key}=false", > "--conf", s"${MASTER_REST_SERVER_ENABLED.key}=false", > "--conf", s"${HiveUtils.HIVE_METASTORE_VERSION.key}=$hiveMetastoreVersion", > "--conf", s"${HiveUtils.HIVE_METASTORE_JARS.key}=maven", > "--conf", s"${WAREHOUSE_PATH.key}=${wareHousePath.getCanonicalPath}", > "--conf", s"spark.sql.test.version.index=$index", > "--driver-java-options", > s"-Dderby.system.home=${wareHousePath.getCanonicalPath} " + > // TODO SPARK-37159 Consider to remove the following > // JVM module options once the Spark 3.2 line is EOL. > JavaModuleOptions.defaultModuleOptions(), > tempPyFile.getCanonicalPath) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46140) Remove no longer needed JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite`
Yang Jie created SPARK-46140: Summary: Remove no longer needed JVM module options from the test submission options of `HiveExternalCatalogVersionsSuite` Key: SPARK-46140 URL: https://issues.apache.org/jira/browse/SPARK-46140 Project: Spark Issue Type: Task Components: SQL, Tests Affects Versions: 4.0.0 Reporter: Yang Jie Complete TODO: {code:java} val args = Seq( "--name", "prepare testing tables", "--master", "local[2]", "--conf", s"${UI_ENABLED.key}=false", "--conf", s"${MASTER_REST_SERVER_ENABLED.key}=false", "--conf", s"${HiveUtils.HIVE_METASTORE_VERSION.key}=$hiveMetastoreVersion", "--conf", s"${HiveUtils.HIVE_METASTORE_JARS.key}=maven", "--conf", s"${WAREHOUSE_PATH.key}=${wareHousePath.getCanonicalPath}", "--conf", s"spark.sql.test.version.index=$index", "--driver-java-options", s"-Dderby.system.home=${wareHousePath.getCanonicalPath} " + // TODO SPARK-37159 Consider to remove the following // JVM module options once the Spark 3.2 line is EOL. JavaModuleOptions.defaultModuleOptions(), tempPyFile.getCanonicalPath) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46139) Fix QueryExecutionErrorsSuite with Java 21
[ https://issues.apache.org/jira/browse/SPARK-46139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46139: --- Labels: pull-request-available (was: ) > Fix QueryExecutionErrorsSuite with Java 21 > -- > > Key: SPARK-46139 > URL: https://issues.apache.org/jira/browse/SPARK-46139 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > > [https://github.com/apache/spark/actions/runs/7014008773/job/19081075487] > {code:java} > [info] - FAILED_EXECUTE_UDF: execute user defined function with registered > UDF *** FAILED *** (42 milliseconds) > 15247[info] java.lang.IllegalArgumentException: For parameter 'reason' > value 'java.lang.StringIndexOutOfBoundsException: Range [5, 6) out of bounds > for length 5' does not match: java.lang.StringIndexOutOfBoundsException: > begin 5, end 6, length 5 > 15248[info] at > org.apache.spark.SparkFunSuite.$anonfun$checkError$2(SparkFunSuite.scala:357) > 15249[info] at > org.apache.spark.SparkFunSuite.$anonfun$checkError$2$adapted(SparkFunSuite.scala:352) > 15250[info] at > scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > 15251[info] at > scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > 15252[info] at > scala.collection.AbstractIterable.foreach(Iterable.scala:933) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46139) Fix QueryExecutionErrorsSuite with Java 21
Yang Jie created SPARK-46139: Summary: Fix QueryExecutionErrorsSuite with Java 21 Key: SPARK-46139 URL: https://issues.apache.org/jira/browse/SPARK-46139 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 4.0.0 Reporter: Yang Jie [https://github.com/apache/spark/actions/runs/7014008773/job/19081075487] {code:java} [info] - FAILED_EXECUTE_UDF: execute user defined function with registered UDF *** FAILED *** (42 milliseconds) 15247[info] java.lang.IllegalArgumentException: For parameter 'reason' value 'java.lang.StringIndexOutOfBoundsException: Range [5, 6) out of bounds for length 5' does not match: java.lang.StringIndexOutOfBoundsException: begin 5, end 6, length 5 15248[info] at org.apache.spark.SparkFunSuite.$anonfun$checkError$2(SparkFunSuite.scala:357) 15249[info] at org.apache.spark.SparkFunSuite.$anonfun$checkError$2$adapted(SparkFunSuite.scala:352) 15250[info] at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) 15251[info] at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) 15252[info] at scala.collection.AbstractIterable.foreach(Iterable.scala:933) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46138) Clean up the use of `SQLContext` in hive-thriftserver module
[ https://issues.apache.org/jira/browse/SPARK-46138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46138: --- Labels: pull-request-available (was: ) > Clean up the use of `SQLContext` in hive-thriftserver module > > > Key: SPARK-46138 > URL: https://issues.apache.org/jira/browse/SPARK-46138 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46138) Clean up the use of `SQLContext` in hive-thriftserver module
Yang Jie created SPARK-46138: Summary: Clean up the use of `SQLContext` in hive-thriftserver module Key: SPARK-46138 URL: https://issues.apache.org/jira/browse/SPARK-46138 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46137) Janino compiler has new version that fix issue with compilation
[ https://issues.apache.org/jira/browse/SPARK-46137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46137: --- Labels: pull-request-available (was: ) > Janino compiler has new version that fix issue with compilation > --- > > Key: SPARK-46137 > URL: https://issues.apache.org/jira/browse/SPARK-46137 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Izek Greenfield >Priority: Major > Labels: pull-request-available > > Janino released a new version (3.1.11) that fixes compilation of: > {code:java} > do { > } while (false) {code} > [Link to github issue|https://github.com/janino-compiler/janino/issues/208] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46137) Janino compiler has new version that fix issue with compilation
Izek Greenfield created SPARK-46137: --- Summary: Janino compiler has new version that fix issue with compilation Key: SPARK-46137 URL: https://issues.apache.org/jira/browse/SPARK-46137 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.0 Reporter: Izek Greenfield Janino released a new version (3.1.11) that fixes compilation of: {code:java} do { } while (false) {code} [Link to github issue|https://github.com/janino-compiler/janino/issues/208] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org