[https://issues.apache.org/jira/browse/SPARK-54518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050689#comment-18050689]

Sahil Kumar Singh commented on SPARK-54518:
-------------------------------------------

Hi Charles Carlson, this is not a bug.

 

When you run *sdf.describe()*, it returns a new DataFrame that holds summary 
statistics for the given DataFrame.

([https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.describe.html])

 

Output like _DataFrame[summary: string, INT_COL: string, STRING_COL: string, 
DOUBLE_COL: string]_ is the schema of that returned statistics DataFrame, not 
the schema of your original DataFrame. Its columns are StringType by design, 
so that values such as the mean of a numeric column and the min/max of a 
string column can share the same column.

 

Only when you run *sdf.describe().show()* will you see the count, mean, 
stddev, min, and max of the numeric and string columns.
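
For illustration, a minimal sketch (assuming a local SparkSession and the same 
schema as in your notebook):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("STRING_COL", StringType()),
    StructField("INT_COL", IntegerType()),
    StructField("DOUBLE_COL", DoubleType()),
])
sdf = spark.createDataFrame([("Hello World", 1, 0.5), (None, None, None)], schema=schema)

stats = sdf.describe()  # a new DataFrame of statistics; its columns are all strings
print(stats)            # DataFrame[summary: string, STRING_COL: string, INT_COL: string, DOUBLE_COL: string]
stats.show()            # triggers computation and prints count, mean, stddev, min, max
{code}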

 

To retrieve the data types of the columns, use *sdf.printSchema()*.

ref: 
[https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.printSchema.html]
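
Continuing the sketch above, the original DataFrame still reports its declared 
types:

{code:python}
sdf.printSchema()
# root
#  |-- STRING_COL: string (nullable = true)
#  |-- INT_COL: integer (nullable = true)
#  |-- DOUBLE_COL: double (nullable = true)

print(sdf.dtypes)
# [('STRING_COL', 'string'), ('INT_COL', 'int'), ('DOUBLE_COL', 'double')]
{code}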

 

> PySpark 4.0.1 DataFrame Column Type Mismatch
> --------------------------------------------
>
>                 Key: SPARK-54518
>                 URL: https://issues.apache.org/jira/browse/SPARK-54518
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 4.0.0, 4.0.1
>            Environment: I have a MacBook Pro with an M2 Pro chip. I'm using 
> Python 3.10.18 and PySpark 4.0.1. My java/jdk info is pasted below.
>  
> openjdk 17.0.16 2025-07-15
> OpenJDK Runtime Environment Homebrew (build 17.0.16+0)
> OpenJDK 64-Bit Server VM Homebrew (build 17.0.16+0, mixed mode, sharing)
>            Reporter: Charles Carlson
>            Priority: Major
>         Attachments: DataFrame Creation Bug.html, DataFrame Creation 
> Bug.ipynb, Screenshot 2025-11-25 at 6.47.38 PM.png
>
>
> It is possible to create a DataFrame with a schema including IntegerType and 
> DoubleType values that are then cast to StringType incorrectly. In the 
> attached notebook (also viewable via the html file) we can see that a 
> DataFrame is created in two normal ways with integers and floats that are 
> then inexplicably cast to strings without a path for reversal. The desired 
> behavior is for the DataFrame to be created with the column `INT_COL` as an 
> `IntegerType` and `DOUBLE_COL` as a `DoubleType`. 
> [^DataFrame Creation Bug.ipynb]
> [^DataFrame Creation Bug.html]
>  
> Code to replicate this:
>  
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType
> from pyspark.sql.functions import col
> import pandas as pd
>
> spark = SparkSession.builder.getOrCreate()
> {code}
>  
> {code:java}
> data_types = StructType(
>     [
>         StructField("STRING_COL", StringType()),
>         StructField("INT_COL", IntegerType()),
>         StructField("DOUBLE_COL", DoubleType()),
>     ]
> )
> sdf = spark.createDataFrame(
>     [("Hello World", 1, 1 / 2), (None, None, None)],
>     schema=data_types,
> )
> sdf.describe()
> {code}
> When this cell is run, a DataFrame is returned with only StringType columns. 
> This is an error as `INT_COL` and `DOUBLE_COL` should be `IntegerType` and 
> `DoubleType` respectively.
> {code:java}
> cast_sdf = sdf.withColumn("NEW_INT_COL", col("INT_COL").cast(IntegerType()))
> cast_sdf.describe()
> {code}
> When this cell is run, it shows that `NEW_INT_COL` is still a `StringType`, 
> which is a bug as it was cast as an `IntegerType`.
> {code:java}
> pdf = pd.DataFrame(
>     [("Hello World", 1, 1 / 2), (None, None, None)],
>     columns=["STRING_COL", "INT_COL", "DOUBLE_COL"],
> )
> pdf.describe()
> new_sdf = spark.createDataFrame(pdf)
> new_sdf.describe()
> {code}
> When this cell is run, the pandas DataFrame has correctly typed columns while 
> the Spark DataFrame has only StringType columns, which is again an issue. 
>  


