[ https://issues.apache.org/jira/browse/SPARK-39822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-39822:
------------------------------------

    Assignee:     (was: Apache Spark)

> Provides a good error during create Index with different dtype elements
> -----------------------------------------------------------------------
>
>                 Key: SPARK-39822
>                 URL: https://issues.apache.org/jira/browse/SPARK-39822
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.2
>            Reporter: bo zhao
>            Priority: Minor
>
> PANDAS
> {code:java}
> >>> import pandas as pd
> >>> pd.Index([1,2,'3',4])
> Index([1, 2, '3', 4], dtype='object')
> {code}
> PYSPARK
> {code:java}
> Using Python version 3.8.13 (default, Jun 29 2022 11:50:19)
> Spark context Web UI available at http://172.25.179.45:4042
> Spark context available as 'sc' (master = local[*], app id = local-1658301116572).
> SparkSession available as 'spark'.
> >>> from pyspark import pandas as ps
> WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
> >>> ps.Index([1,2,'3',4])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/spark/spark/python/pyspark/pandas/indexes/base.py", line 184, in __new__
>     ps.from_pandas(
>   File "/home/spark/spark/python/pyspark/pandas/namespace.py", line 155, in from_pandas
>     return DataFrame(pd.DataFrame(index=pobj)).index
>   File "/home/spark/spark/python/pyspark/pandas/frame.py", line 463, in __init__
>     internal = InternalFrame.from_pandas(pdf)
>   File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1469, in from_pandas
>     ) = InternalFrame.prepare_pandas_frame(pdf, prefer_timestamp_ntz=prefer_timestamp_ntz)
>   File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1570, in prepare_pandas_frame
>     spark_type = infer_pd_series_spark_type(reset_index[col], dtype, prefer_timestamp_ntz)
>   File "/home/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 360, in infer_pd_series_spark_type
>     return from_arrow_type(pa.Array.from_pandas(pser).type, prefer_timestamp_ntz)
>   File "pyarrow/array.pxi", line 1033, in pyarrow.lib.Array.from_pandas
>   File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Could not convert '3' with type str: tried to convert to int64
> {code}
> I understand that pandas-on-Spark needs all elements to share one dtype, but we need a good error message, or something that tells the user how to avoid this.
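A friendlier message could be raised at the point where the traceback above fails, i.e. where Arrow infers the element type. Below is a minimal sketch of that idea; the helper name, the error wording, and the wrapping of the pyarrow call are all hypothetical and only illustrate the shape of a fix, not the actual patch for this ticket:

{code:python}
import pandas as pd
import pyarrow as pa

def infer_arrow_type_or_explain(pser: pd.Series) -> pa.DataType:
    """Infer the Arrow type of a pandas Series, turning Arrow's low-level
    conversion error into an actionable message.

    Hypothetical helper for illustration only.
    """
    try:
        # The same conversion that the traceback above fails in.
        return pa.Array.from_pandas(pser).type
    except pa.lib.ArrowInvalid as e:
        # Name the offending element types so the user knows what to cast.
        found = sorted({type(v).__name__ for v in pser.dropna()})
        raise TypeError(
            f"Cannot infer a single Spark type from mixed element types {found}. "
            "pandas-on-Spark requires all elements to share one type; cast them "
            "first, e.g. ps.Index([str(x) for x in data])."
        ) from e
{code}

With a check like this, ps.Index([1,2,'3',4]) would fail with a TypeError naming the mixed types instead of the bare ArrowInvalid above. Until something like it lands, the user-side workaround is presumably to normalize the elements up front, e.g. ps.Index([str(x) for x in [1, 2, '3', 4]]).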