asddfl created SPARK-54666:
------------------------------
Summary: pandas-on-Spark to_numeric silently downcasts int64 to
float32, causing precision loss and value corruption
Key: SPARK-54666
URL: https://issues.apache.org/jira/browse/SPARK-54666
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 4.0.1
Environment: Platform: Ubuntu 24.04
Linux-6.14.0-35-generic-x86_64-with-glibc2.39
Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:29:10) [GCC
14.3.0]
openjdk version "17.0.17-internal" 2025-10-21
OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode,
sharing)
pyspark 4.0.1
pandas 2.3.3
pyarrow 22.0.0
Reporter: asddfl
When using the pandas API on Spark (pyspark.pandas), calling to_numeric on an
int64 Series unexpectedly downcasts the data to float32. Because float32's
24-bit significand cannot represent all int64 values exactly, this causes
silent precision loss and value corruption, diverging from pandas, which
returns the Series unchanged as int64.
{code:python}
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.pandas as ps

pd_t0 = pd.DataFrame({'c0': [-1554478299, 2]})
pd.set_option('display.float_format', lambda x: f"{x:.4f}")

# pandas keeps the original int64 dtype
print("Pandas:")
result = pd.to_numeric(pd_t0['c0'])
print(result)

spark = (
    SparkSession.builder
    .config("spark.sql.ansi.enabled", "false")
    .getOrCreate()
)
ps_t0 = ps.DataFrame({'c0': [-1554478299, 2]})

# pandas-on-Spark downcasts to float32, corrupting the first value
print("PySpark Pandas:")
result = ps.to_numeric(ps_t0['c0'])
print(result)
{code}
{code:bash}
Pandas:
0 -1554478299
1 2
Name: c0, dtype: int64
PySpark Pandas:
0 -1554478336.0000
1 2.0000
Name: c0, dtype: float32
{code}
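For reference, the corrupted value -1554478336 is exactly the nearest float32
to -1554478299 (the float32 spacing at this magnitude is 128), so the
corruption is pure float32 rounding from the downcast rather than an
arithmetic bug elsewhere. A minimal NumPy sketch, independent of Spark:

{code:python}
import numpy as np

# float32 has a 24-bit significand, so integers with magnitude above
# 2**24 (16_777_216) are not all exactly representable.
v = -1554478299
f32 = np.float32(v)
assert int(f32) == -1554478336   # nearest representable float32, off by 37

# float64's 53-bit significand represents every integer up to 2**53 exactly,
# so widening to float64 instead would have preserved this value.
assert int(np.float64(v)) == v
{code}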
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]