asddfl created SPARK-54665:
------------------------------

             Summary: pandas-on-Spark Boolean vs String comparison yields inconsistent result with pandas
                 Key: SPARK-54665
                 URL: https://issues.apache.org/jira/browse/SPARK-54665
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 4.0.1
         Environment: Platform: Ubuntu 24.04 Linux-6.14.0-35-generic-x86_64-with-glibc2.39
Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:29:10) [GCC 14.3.0]
openjdk version "17.0.17-internal" 2025-10-21
OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode, sharing)
pyspark 4.0.1
pandas 2.3.3
pyarrow 22.0.0
            Reporter: asddfl


When using pandas-on-Spark (pyspark.pandas, the pandas API on Spark), comparing a 
boolean Series with a string literal gives a different result from native 
pandas: for a Series holding True, pandas evaluates == 'True' as False, while 
pandas-on-Spark returns True.

This diverges from pandas semantics and can silently change the logic of 
pandas-compatible code when it runs on Spark.

{code:python}
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.pandas as ps

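# Native pandas: one boolean column holding a single True value.
pd_t1 = pd.DataFrame({'c1': [True]})

print("Pandas:")
print(pd_t1['c1'] == 'True')  # bool Series compared with a str literal

# Start Spark with ANSI mode disabled.
spark = (
    SparkSession.builder
    .config("spark.sql.ansi.enabled", "false")
    .getOrCreate()
)

# pandas-on-Spark: same data, same comparison.
ps_t1 = ps.DataFrame({'c1': [True]})

print("PySpark Pandas:")
print(ps_t1['c1'] == 'True')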
{code}

{code:none}
Pandas:
0    False
Name: c1, dtype: bool

PySpark Pandas:
0    True
Name: c1, dtype: bool
{code}
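
One possible explanation for the difference, offered as a guess and not confirmed in this report, is that the comparison is pushed down to Spark SQL, which with ANSI mode off may implicitly cast the string operand to boolean before comparing, whereas pandas compares the Python objects without coercion. The sketch below models both semantics in plain Python. The helper names `pandas_style_eq` and `coerced_eq` and the sets of accepted string spellings are hypothetical and exist only for illustration; they are not part of pandas or PySpark.

```python
# Hypothetical helpers: they only model two comparison semantics that
# could explain the output above. Neither is a pandas or PySpark API.

def pandas_style_eq(value, literal):
    """Element-wise equality without type coercion, as plain Python does."""
    return value == literal


# ASSUMPTION: with ANSI mode off, Spark SQL casts the string operand to
# boolean before comparing. The accepted spellings below are illustrative.
_TRUE_STRINGS = {"t", "true", "y", "yes", "1"}
_FALSE_STRINGS = {"f", "false", "n", "no", "0"}


def coerced_eq(value, literal):
    """Equality after casting a string literal to bool.

    Returns None when the string cannot be cast, modelling a NULL result.
    """
    if isinstance(literal, str):
        text = literal.strip().lower()
        if text in _TRUE_STRINGS:
            literal = True
        elif text in _FALSE_STRINGS:
            literal = False
        else:
            return None
    return value == literal


print(pandas_style_eq(True, "True"))  # False, matches the pandas output
print(coerced_eq(True, "True"))       # True, matches the pandas-on-Spark output
```

If this guess is right, a pandas-compatible result would require the comparison to avoid the implicit string-to-boolean cast when the operand types differ.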




