Devin Petersohn created SPARK-55977:
---------------------------------------
Summary: isin() should not match values of incompatible types
Key: SPARK-55977
URL: https://issues.apache.org/jira/browse/SPARK-55977
Project: Spark
Issue Type: Bug
Components: Pandas API on Spark
Affects Versions: 4.1.1
Reporter: Devin Petersohn

In the pandas API on Spark, DataFrame.isin() and Series.isin() return True when
comparing values of incompatible types. Spark's implicit type coercion causes
the string "1" to match the integer 1, whereas pandas matches types strictly
and returns False.

import pandas as pd
import pyspark.pandas as ps

# DataFrame.isin with list
pdf = pd.DataFrame({"a": [1, 2, 3]})
psdf = ps.from_pandas(pdf)
pdf.isin(["1", "2"])["a"].tolist()   # [False, False, False]
psdf.isin(["1", "2"])["a"].tolist()  # [True, True, False]

# DataFrame.isin with dict
pdf.isin({"a": ["1", "2"]})["a"].tolist()   # [False, False, False]
psdf.isin({"a": ["1", "2"]})["a"].tolist()  # [True, True, False]

# Series.isin
pd.Series([1, 2, 3]).isin(["1", "2"]).tolist()  # [False, False, False]
ps.Series([1, 2, 3]).isin(["1", "2"]).tolist()  # [True, True, False]

# Numeric cross-type works correctly in both (int col, float values)
pd.Series([1, 2, 3]).isin([1.0, 2.0]).tolist()  # [True, True, False]
ps.Series([1, 2, 3]).isin([1.0, 2.0]).tolist()  # [True, True, False]
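
Until the behavior is aligned, one possible caller-side workaround is to drop candidate values whose Python type cannot match the column before calling isin(). The sketch below is an illustration only, not the fix proposed for this issue: compatible_values is a hypothetical helper, it handles numeric columns only, and it is shown against plain pandas. Passing the filtered list to ps.Series.isin() is assumed to avoid the coercion, but that is not verified here.

```python
import numbers

import pandas as pd
from pandas.api.types import is_numeric_dtype


def compatible_values(values, dtype):
    """Hypothetical helper: keep only candidates that could match `dtype`.

    For numeric columns, keep numeric candidates (int/float cross-matching
    is expected behavior) and drop everything else, such as strings.
    Other dtypes are passed through unchanged in this sketch.
    """
    if is_numeric_dtype(dtype):
        return [v for v in values if isinstance(v, numbers.Number)]
    return list(values)


s = pd.Series([1, 2, 3])

# String candidates are filtered out before the membership test.
result = s.isin(compatible_values(["1", "2"], s.dtype)).tolist()

# Mixed candidates: the string is dropped, the float still matches int 2.
mixed = s.isin(compatible_values(["1", 2.0], s.dtype)).tolist()

print(result)  # [False, False, False]
print(mixed)   # [False, True, False]
```

Filtering on the caller side keeps the numeric cross-type case from the report (int column, float values) working while never handing a string to a numeric comparison.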