[
https://issues.apache.org/jira/browse/SPARK-54518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Charles Carlson updated SPARK-54518:
------------------------------------
Description:
It is possible to create a DataFrame whose schema includes IntegerType and
DoubleType columns, only to have those columns incorrectly cast to StringType.
In the attached notebook (also viewable via the HTML file), a DataFrame is
created in two normal ways with integers and floats that are then inexplicably
cast to strings, with no way to reverse the cast. The desired behavior is for
the DataFrame to be created with `INT_COL` as an `IntegerType` column and
`DOUBLE_COL` as a `DoubleType` column.
[^DataFrame Creation Bug.ipynb]
[^DataFrame Creation Bug.html]
Code to replicate this:
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType
from pyspark.sql.functions import col
import pandas as pd

spark = SparkSession.builder.getOrCreate()
{code}
{code:java}
data_types = StructType(
    [
        StructField("STRING_COL", StringType()),
        StructField("INT_COL", IntegerType()),
        StructField("DOUBLE_COL", DoubleType()),
    ]
)
sdf = spark.createDataFrame(
    [("Hello World", 1, 1 / 2), (None, None, None)],
    schema=data_types,
)
sdf.describe()
{code}
When this cell is run, a DataFrame is returned with only StringType columns.
This is an error as `INT_COL` and `DOUBLE_COL` should be `IntegerType` and
`DoubleType` respectively.
{code:java}
cast_sdf = sdf.withColumn("NEW_INT_COL", col("INT_COL").cast(IntegerType()))
cast_sdf.describe()
{code}
{code:java}
pdf = pd.DataFrame(
    [("Hello World", 1, 1 / 2), (None, None, None)],
    columns=["STRING_COL", "INT_COL", "DOUBLE_COL"],
)
pdf.describe()
new_sdf = spark.createDataFrame(pdf)
new_sdf.describe()
{code}
> PySpark 4.0.1 DataFrame Column Type Mismatch
> --------------------------------------------
>
> Key: SPARK-54518
> URL: https://issues.apache.org/jira/browse/SPARK-54518
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 4.0.0, 4.0.1
> Environment: I have a MacBook Pro with an M2 Pro chip. I'm using Python
> 3.10.18 and PySpark 4.0.1. My Java/JDK info is pasted below.
>
> openjdk 17.0.16 2025-07-15
> OpenJDK Runtime Environment Homebrew (build 17.0.16+0)
> OpenJDK 64-Bit Server VM Homebrew (build 17.0.16+0, mixed mode, sharing)
> Reporter: Charles Carlson
> Priority: Major
> Attachments: DataFrame Creation Bug.html, DataFrame Creation
> Bug.ipynb, Screenshot 2025-11-25 at 6.47.38 PM.png
>