[ https://issues.apache.org/jira/browse/SPARK-39822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-39822:
------------------------------------

    Assignee:     (was: Apache Spark)

> Provides a good error during create Index with different dtype elements
> -----------------------------------------------------------------------
>
>                 Key: SPARK-39822
>                 URL: https://issues.apache.org/jira/browse/SPARK-39822
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.2
>            Reporter: bo zhao
>            Priority: Minor
>
> PANDAS
> {code:java}
> >>> import pandas as pd
> >>> pd.Index([1,2,'3',4])
> Index([1, 2, '3', 4], dtype='object')
> {code}
> PYSPARK
> {code:java}
> Using Python version 3.8.13 (default, Jun 29 2022 11:50:19)
> Spark context Web UI available at http://172.25.179.45:4042
> Spark context available as 'sc' (master = local[*], app id = local-1658301116572).
> SparkSession available as 'spark'.
> >>> from pyspark import pandas as ps
> WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
> >>> ps.Index([1,2,'3',4])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/spark/spark/python/pyspark/pandas/indexes/base.py", line 184, in __new__
>     ps.from_pandas(
>   File "/home/spark/spark/python/pyspark/pandas/namespace.py", line 155, in from_pandas
>     return DataFrame(pd.DataFrame(index=pobj)).index
>   File "/home/spark/spark/python/pyspark/pandas/frame.py", line 463, in __init__
>     internal = InternalFrame.from_pandas(pdf)
>   File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1469, in from_pandas
>     ) = InternalFrame.prepare_pandas_frame(pdf, prefer_timestamp_ntz=prefer_timestamp_ntz)
>   File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1570, in prepare_pandas_frame
>     spark_type = infer_pd_series_spark_type(reset_index[col], dtype, prefer_timestamp_ntz)
>   File "/home/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 360, in infer_pd_series_spark_type
>     return from_arrow_type(pa.Array.from_pandas(pser).type, prefer_timestamp_ntz)
>   File "pyarrow/array.pxi", line 1033, in pyarrow.lib.Array.from_pandas
>   File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Could not convert '3' with type str: tried to convert to int64
> {code}
> I understand that pandas-on-Spark needs all elements to share one dtype, but we need a good error message, or something that tells the user how to avoid this.
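A friendlier message could be raised at the point where the traceback above fails, i.e. where Arrow infers the element type. Below is a minimal sketch of that idea; the helper name, the error wording, and the wrapping of the pyarrow call are all hypothetical and only illustrate the shape of a fix, not the actual patch for this ticket:

{code:python}
import pandas as pd
import pyarrow as pa

def infer_arrow_type_or_explain(pser: pd.Series) -> pa.DataType:
    """Infer the Arrow type of a pandas Series, turning Arrow's low-level
    conversion error into an actionable message.

    Hypothetical helper for illustration only.
    """
    try:
        # The same conversion that the traceback above fails in.
        return pa.Array.from_pandas(pser).type
    except pa.lib.ArrowInvalid as e:
        # Name the offending element types so the user knows what to cast.
        found = sorted({type(v).__name__ for v in pser.dropna()})
        raise TypeError(
            f"Cannot infer a single Spark type from mixed element types {found}. "
            "pandas-on-Spark requires all elements to share one type; cast them "
            "first, e.g. ps.Index([str(x) for x in data])."
        ) from e
{code}

With a check like this, ps.Index([1,2,'3',4]) would fail with a TypeError naming the mixed types instead of the bare ArrowInvalid above. Until something like it lands, the user-side workaround is presumably to normalize the elements up front, e.g. ps.Index([str(x) for x in [1, 2, '3', 4]]).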