asddfl created SPARK-54665:
------------------------------
Summary: pandas-on-Spark Boolean vs String comparison yields
inconsistent result with pandas
Key: SPARK-54665
URL: https://issues.apache.org/jira/browse/SPARK-54665
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 4.0.1
Environment: Platform: Ubuntu 24.04
Linux-6.14.0-35-generic-x86_64-with-glibc2.39
Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:29:10) [GCC
14.3.0]
openjdk version "17.0.17-internal" 2025-10-21
OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode,
sharing)
pyspark 4.0.1
pandas 2.3.3
pyarrow 22.0.0
Reporter: asddfl

When using pandas-on-Spark (pyspark.pandas / pandas API on Spark), comparing a
boolean Series with a string literal produces a result that is inconsistent
with native pandas: for a Series containing True, the comparison with the
string 'True' evaluates to False in pandas but to True in pandas-on-Spark.
This diverges from pandas semantics and can silently change program logic
when pandas-compatible code is run on Spark.
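{code:python}
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.pandas as ps

# Native pandas: boolean Series compared with a string literal.
pd_t1 = pd.DataFrame(
    {
        'c1': [True]
    }
)
print("Pandas:")
print(pd_t1['c1'] == 'True')

# Start Spark with ANSI mode disabled, as in the reported environment.
spark = (
    SparkSession.builder
    .config("spark.sql.ansi.enabled", "false")
    .getOrCreate()
)

# pandas-on-Spark: the same data and the same comparison.
ps_t1 = ps.DataFrame(
    {
        'c1': [True]
    }
)
print("PySpark Pandas:")
print(ps_t1['c1'] == 'True')
{code}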
{code:bash}
Pandas:
0    False
Name: c1, dtype: bool
PySpark Pandas:
0    True
Name: c1, dtype: bool
{code}
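
The report does not state a root cause. One plausible explanation, offered here only as an assumption, is that with ANSI mode off Spark SQL implicitly casts the string literal to boolean before comparing, whereas pandas compares each element using ordinary Python equality. The pure-Python sketch below models the two behaviours on a plain list; the helper names (python_style_eq, cast_string_to_bool, cast_then_compare_eq) are hypothetical, and the string-to-boolean rule is deliberately simplified and is not Spark's actual implementation.

```python
def python_style_eq(values, literal):
    """Element-wise Python equality, which is how pandas treats bool == str."""
    return [v == literal for v in values]


def cast_string_to_bool(text):
    """Simplified, hypothetical string-to-boolean cast.

    Only 'true' / 'false' (case-insensitive) are recognised here; anything
    else maps to None, standing in for a SQL NULL.
    """
    lowered = text.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    return None


def cast_then_compare_eq(values, literal):
    """Assumed engine behaviour: cast the literal first, then compare."""
    as_bool = cast_string_to_bool(literal)
    if as_bool is None:
        # An unparseable literal yields a NULL-like result for every row.
        return [None for _ in values]
    return [v == as_bool for v in values]


print(python_style_eq([True], "True"))       # [False], matches pandas
print(cast_then_compare_eq([True], "True"))  # [True], matches the reported Spark output
```

If this assumption holds, the divergence comes from the implicit cast rather than from the comparison itself, which would explain why the result differs without any error being raised.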