[
https://issues.apache.org/jira/browse/SPARK-54518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Charles Carlson updated SPARK-54518:
------------------------------------
Attachment: Screenshot 2025-11-25 at 6.47.38 PM.png
> PySpark 4.0.1 DataFrame Column Type Mismatch
> --------------------------------------------
>
> Key: SPARK-54518
> URL: https://issues.apache.org/jira/browse/SPARK-54518
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 4.0.0, 4.0.1
> Environment: I have a MacBook Pro with an M2 Pro chip. I'm using Python
> 3.10.18 and PySpark 4.0.1. My Java/JDK info is pasted below.
>
> openjdk 17.0.16 2025-07-15
> OpenJDK Runtime Environment Homebrew (build 17.0.16+0)
> OpenJDK 64-Bit Server VM Homebrew (build 17.0.16+0, mixed mode, sharing)
> Reporter: Charles Carlson
> Priority: Major
> Attachments: DataFrame Creation Bug.html, DataFrame Creation
> Bug.ipynb, Screenshot 2025-11-25 at 6.47.38 PM.png
>
>
> It is possible to create a DataFrame whose schema includes IntegerType and
> DoubleType columns and have those columns incorrectly cast to StringType. The
> attached notebook screenshot shows a DataFrame created in two ordinary ways
> (from a list of tuples with an explicit schema, and from a pandas DataFrame)
> with integer and float values that are then inexplicably cast to strings,
> with no apparent way to reverse the cast. The desired behavior is for
> `INT_COL` to be an `IntegerType` column and `DOUBLE_COL` to be a `DoubleType`
> column.
> !image-2025-11-25-18-47-57-623.png!
>
>
> Code to replicate this:
>
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import (
>     StructType,
>     StructField,
>     IntegerType,
>     DoubleType,
>     StringType,
> )
> from pyspark.sql.functions import col
> import pandas as pd
>
> # Reuse the active session if there is one, otherwise start a new one.
> spark = SparkSession.builder.getOrCreate()
> {code}
>
> {code:python}
> data_types = StructType(
>     [
>         StructField("STRING_COL", StringType()),
>         StructField("INT_COL", IntegerType()),
>         StructField("DOUBLE_COL", DoubleType()),
>     ]
> )
> sdf = spark.createDataFrame(
>     [("Hello World", 1, 1 / 2), (None, None, None)],
>     schema=data_types,
> )
> sdf.describe()
> {code}
> {code:python}
> cast_sdf = sdf.withColumn("NEW_INT_COL", col("INT_COL").cast(IntegerType()))
> cast_sdf.describe()
> {code}
> {code:python}
> pdf = pd.DataFrame(
>     [("Hello World", 1, 1 / 2), (None, None, None)],
>     columns=["STRING_COL", "INT_COL", "DOUBLE_COL"],
> )
> pdf.describe()
> new_sdf = spark.createDataFrame(pdf)
> new_sdf.describe()
> {code}
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)