Mukul Murthy created SPARK-24438:
------------------------------------

             Summary: Empty strings and null strings are written to the same partition
                 Key: SPARK-24438
                 URL: https://issues.apache.org/jira/browse/SPARK-24438
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Mukul Murthy

When you partition on a string column that contains both empty strings and nulls, both are written to the same default partition. When you read the data back, all of those values come back as null.

{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Three rows with an empty string in column b, one with a value, and one with null.
val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, null))
val schema = new StructType().add("a", IntegerType).add("b", StringType)
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

// display() is a notebook helper; df.show() is the equivalent elsewhere.
display(df)
=>
a b
1
2
3
4 hello
5 null

// Partition by the string column, then read the data back.
df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
val df2 = spark.read.load("/home/mukul/weird_test_data4")

display(df2)
=>
a b
4 hello
3 null
2 null
1 null
5 null
{code}

This seems to affect multiple types of tables.
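One plausible reading of the behavior above is that each partition value is encoded into a directory name, and that the directory name cannot distinguish an empty string from null. The sketch below is a simplified model under that assumption, not Spark's implementation: `PartitionPathSketch` and `partitionDir` are hypothetical names, and `__HIVE_DEFAULT_PARTITION__` is assumed to be the name of the default partition directory.

```scala
// Simplified model of partition directory naming. This is an illustration
// under stated assumptions, not Spark's actual code.
object PartitionPathSketch {
  // Assumed name of the default partition directory.
  val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

  // Hypothetical helper: maps a partition column value to a directory name.
  // Both null and the empty string fall back to the default partition.
  def partitionDir(column: String, value: String): String = {
    val rendered =
      if (value == null || value.isEmpty) DefaultPartitionName
      else value
    column + "=" + rendered
  }

  def main(args: Array[String]): Unit = {
    val values: Seq[String] = Seq("", "hello", null)
    values.foreach { v =>
      val label = if (v == null) "null" else "\"" + v + "\""
      println(label + " -> " + partitionDir("b", v))
    }
  }
}
```

If the writer does behave this way, the reader sees a single default-partition directory for both inputs and has no information left to tell them apart, which would match every affected row coming back as null.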