asddfl created SPARK-54666:
------------------------------
Summary: pandas-on-Spark to_numeric silently downcasts int64 to
float32, causing precision loss and value corruption
Key: SPARK-54666
URL: https://issues.apache.org/jira/browse/SPARK-54666
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 4.0.1
Environment: Platform: Ubuntu 24.04
Linux-6.14.0-35-generic-x86_64-with-glibc2.39
Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:29:10) [GCC
14.3.0]
openjdk version "17.0.17-internal" 2025-10-21
OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode,
sharing)
pyspark 4.0.1
pandas 2.3.3
pyarrow 22.0.0
Reporter: asddfl
When using the pandas API on Spark (pyspark.pandas), calling to_numeric on an
int64 Series unexpectedly downcasts the data to float32. Because float32's
24-bit significand cannot represent all int64 values exactly, this causes
silent precision loss and value corruption, diverging from pandas, which
returns the Series unchanged as int64.
{code:python}
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.pandas as ps

pd_t0 = pd.DataFrame({'c0': [-1554478299, 2]})
pd.set_option('display.float_format', lambda x: f"{x:.4f}")

# pandas keeps the original int64 dtype
print("Pandas:")
result = pd.to_numeric(pd_t0['c0'])
print(result)

spark = (
    SparkSession.builder
    .config("spark.sql.ansi.enabled", "false")
    .getOrCreate()
)
ps_t0 = ps.DataFrame({'c0': [-1554478299, 2]})

# pandas-on-Spark downcasts to float32, corrupting the first value
print("PySpark Pandas:")
result = ps.to_numeric(ps_t0['c0'])
print(result)
{code}
{code:bash}
Pandas:
0 -1554478299
1 2
Name: c0, dtype: int64
PySpark Pandas:
0 -1554478336.0000
1 2.0000
Name: c0, dtype: float32
{code}
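For reference, the corrupted value -1554478336 is exactly the nearest float32
to -1554478299 (the float32 spacing at this magnitude is 128), so the
corruption is pure float32 rounding from the downcast rather than an
arithmetic bug elsewhere. A minimal NumPy sketch, independent of Spark:

{code:python}
import numpy as np

# float32 has a 24-bit significand, so integers with magnitude above
# 2**24 (16_777_216) are not all exactly representable.
v = -1554478299
f32 = np.float32(v)
assert int(f32) == -1554478336   # nearest representable float32, off by 37

# float64's 53-bit significand represents every integer up to 2**53 exactly,
# so widening to float64 instead would have preserved this value.
assert int(np.float64(v)) == v
{code}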
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]