[jira] [Updated] (SPARK-54518) PySpark 4.0.1 DataFrame Column Type Mismatch

Charles Carlson (Jira) Tue, 25 Nov 2025 15:56:52 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-54518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Charles Carlson updated SPARK-54518:
------------------------------------
    Attachment: Screenshot 2025-11-25 at 6.47.38 PM.png

> PySpark 4.0.1 DataFrame Column Type Mismatch
> --------------------------------------------
>
>                 Key: SPARK-54518
>                 URL: https://issues.apache.org/jira/browse/SPARK-54518
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 4.0.0, 4.0.1
>         Environment: I have a MacBook Pro with an M2 Pro chip. I'm using 
> Python 3.10.18 and PySpark 4.0.1. My Java/JDK info is pasted below.
>  
> openjdk 17.0.16 2025-07-15
> OpenJDK Runtime Environment Homebrew (build 17.0.16+0)
> OpenJDK 64-Bit Server VM Homebrew (build 17.0.16+0, mixed mode, sharing)
>            Reporter: Charles Carlson
>            Priority: Major
>         Attachments: DataFrame Creation Bug.html, DataFrame Creation 
> Bug.ipynb, Screenshot 2025-11-25 at 6.47.38 PM.png
>
>
> It is possible to create a DataFrame whose schema includes IntegerType and 
> DoubleType columns, only to have their values incorrectly cast to 
> StringType. In the attached notebook screenshot, a DataFrame is created in 
> two ordinary ways from integers and floats, which are then inexplicably cast 
> to strings with no way to reverse the cast. The desired behavior is for the 
> DataFrame to be created with `INT_COL` as an `IntegerType` and `DOUBLE_COL` 
> as a `DoubleType`.
> !image-2025-11-25-18-47-57-623.png!
>  
>  
> Code to replicate this:
>  
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col
> from pyspark.sql.types import (
>     DoubleType,
>     IntegerType,
>     StringType,
>     StructField,
>     StructType,
> )
>
> spark = SparkSession.builder.getOrCreate()
> {code}
>  
> {code:python}
> data_types = StructType(
>     [
>         StructField("STRING_COL", StringType()),
>         StructField("INT_COL", IntegerType()),
>         StructField("DOUBLE_COL", DoubleType()),
>     ]
> )
> sdf = spark.createDataFrame(
>     [("Hello World", 1, 1 / 2), (None, None, None)],
>     schema=data_types,
> )
> sdf.describe()
> {code}
> {code:python}
> cast_sdf = sdf.withColumn("NEW_INT_COL", col("INT_COL").cast(IntegerType()))
> cast_sdf.describe()
> {code}
> {code:python}
> pdf = pd.DataFrame(
>     [("Hello World", 1, 1 / 2), (None, None, None)],
>     columns=["STRING_COL", "INT_COL", "DOUBLE_COL"],
> )
> pdf.describe()
> new_sdf = spark.createDataFrame(pdf)
> new_sdf.describe()
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

