[jira] [Updated] (SPARK-54518) PySpark 4.0.1 DataFrame Column Type Mismatch

Charles Carlson (Jira) Tue, 25 Nov 2025 16:18:22 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-54518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Charles Carlson updated SPARK-54518:
------------------------------------
    Description: 
It is possible to create a DataFrame with a schema that includes IntegerType 
and DoubleType columns and then see those columns reported as StringType. In 
the attached notebook (also viewable via the HTML file), a DataFrame is created 
in two standard ways from integers and floats, and in both cases the values 
appear to be cast to strings with no way to reverse the conversion. The desired 
behavior is for `INT_COL` to be an `IntegerType` column and `DOUBLE_COL` to be 
a `DoubleType` column. 

[^DataFrame Creation Bug.ipynb]

[^DataFrame Creation Bug.html]

 

Code to replicate this:

 
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, IntegerType, DoubleType, StringType,
)
from pyspark.sql.functions import col
import pandas as pd

# Reuse the active Spark session or start a new one.
spark = SparkSession.builder.getOrCreate()
{code}
 
{code:python}
data_types = StructType(
    [
        StructField("STRING_COL", StringType()),
        StructField("INT_COL", IntegerType()),
        StructField("DOUBLE_COL", DoubleType()),
    ]
)

# Build the DataFrame from two rows, one of them entirely null.
sdf = spark.createDataFrame(
    [("Hello World", 1, 1 / 2), (None, None, None)],
    schema=data_types,
)
sdf.describe()
{code}
When this cell is run, a DataFrame is returned with only StringType columns. 
This is an error as `INT_COL` and `DOUBLE_COL` should be `IntegerType` and 
`DoubleType` respectively.
{code:python}
# Attempt to cast the column back to an integer type.
cast_sdf = sdf.withColumn("NEW_INT_COL", col("INT_COL").cast(IntegerType()))
cast_sdf.describe()
{code}
{code:python}
# Same data built through pandas, then converted to a Spark DataFrame.
pdf = pd.DataFrame(
    [("Hello World", 1, 1 / 2), (None, None, None)],
    columns=["STRING_COL", "INT_COL", "DOUBLE_COL"],
)
pdf.describe()
new_sdf = spark.createDataFrame(pdf)
new_sdf.describe()
{code}
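
A side note on the pandas path: because the second row contains `None`, pandas itself typically stores `INT_COL` as `float64` before Spark ever sees the data, so an integer column is unlikely to come out of `spark.createDataFrame(pdf)` without an explicit schema or cast. The sketch below uses pandas only. The cast to the nullable `Int64` dtype is one possible workaround and is an assumption of this example, not something taken from the attached notebook; how Spark maps that dtype may depend on configuration and is not shown.

```python
import pandas as pd

pdf = pd.DataFrame(
    [("Hello World", 1, 1 / 2), (None, None, None)],
    columns=["STRING_COL", "INT_COL", "DOUBLE_COL"],
)

# The missing value forces the integer column to float64
# (1 is stored as 1.0 and None as NaN).
int_dtype = str(pdf["INT_COL"].dtype)
double_dtype = str(pdf["DOUBLE_COL"].dtype)
print(int_dtype, double_dtype)

# Hypothetical workaround: pandas' nullable integer dtype keeps
# whole numbers and missing values in the same column.
nullable_int = pdf["INT_COL"].astype("Int64")
print(nullable_int.dtype)
```

Checking `pdf.dtypes` before handing the frame to Spark helps separate what pandas inferred from what Spark inferred.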
 

 


> PySpark 4.0.1 DataFrame Column Type Mismatch
> --------------------------------------------
>
>                 Key: SPARK-54518
>                 URL: https://issues.apache.org/jira/browse/SPARK-54518
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 4.0.0, 4.0.1
>         Environment: I have a MacBook Pro with an M2 Pro chip. I'm using 
> Python 3.10.18 and PySpark 4.0.1. My Java/JDK info is pasted below.
>  
> openjdk 17.0.16 2025-07-15
> OpenJDK Runtime Environment Homebrew (build 17.0.16+0)
> OpenJDK 64-Bit Server VM Homebrew (build 17.0.16+0, mixed mode, sharing)
>            Reporter: Charles Carlson
>            Priority: Major
>         Attachments: DataFrame Creation Bug.html, DataFrame Creation 
> Bug.ipynb, Screenshot 2025-11-25 at 6.47.38 PM.png



