Ruifeng Zheng created SPARK-41855:
-------------------------------------

             Summary: `createDataFrame` doesn't handle None properly
                 Key: SPARK-41855
                 URL: https://issues.apache.org/jira/browse/SPARK-41855
             Project: Spark
          Issue Type: Sub-task
          Components: Connect, PySpark
    Affects Versions: 3.4.0
            Reporter: Ruifeng Zheng

{code:python}
from pyspark.sql import Row

data = [Row(id=1, value=float("NaN")), Row(id=2, value=42.0), Row(id=3, value=None)]

# Expected result:
# +---+-----+
# | id|value|
# +---+-----+
# |  1|  NaN|
# |  2| 42.0|
# |  3| null|
# +---+-----+

# Build the same DataFrame through Spark Connect (cdf) and regular PySpark (sdf)
cdf = self.connect.createDataFrame(data)
sdf = self.spark.createDataFrame(data)

print()
print()
print(cdf._show_string(100, 100, False))
print()
print(cdf.schema)
print()
print(sdf._jdf.showString(100, 100, False))
print()
print(sdf.schema)

self.compare_by_show(cdf, sdf)
{code}

Output (Spark Connect first, then regular PySpark):

{code:java}
+---+-----+
| id|value|
+---+-----+
|  1| null|
|  2| 42.0|
|  3| null|
+---+-----+

StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)])

+---+-----+
| id|value|
+---+-----+
|  1|  NaN|
|  2| 42.0|
|  3| null|
+---+-----+

StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)])
{code}

The schemas match, but the Connect result shows null for the row with id=1 where regular PySpark shows NaN. This happens because `createDataFrame` can't handle None properly:

1. In the conversion from local data to pd.DataFrame, None is automatically converted to NaN.
2. In the subsequent conversion from pd.DataFrame to pa.Table, NaN is always converted to null.