[ https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536682#comment-16536682 ]
Marco Gaido edited comment on SPARK-24438 at 7/9/18 8:37 AM:
-------------------------------------------------------------

IIRC, Hive has a placeholder string (\_\_HIVE_DEFAULT_PARTITION\_\_) for null value in partitions.


was (Author: mgaido):
    IIRC, Hive has a placeholder string (__HIVE_DEFAULT_PARTITION__) for null value in partitions.

> Empty strings and null strings are written to the same partition
> ----------------------------------------------------------------
>
>                 Key: SPARK-24438
>                 URL: https://issues.apache.org/jira/browse/SPARK-24438
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Mukul Murthy
>            Priority: Major
>
> When you partition on a string column that has empty strings and nulls, they
> are both written to the same default partition. When you read the data back,
> all those values get read back as null.
> {code:java}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df)
> =>
> a b
> 1
> 2
> 3
> 4 hello
> 5 null
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> =>
> a b
> 4 hello
> 3 null
> 2 null
> 1 null
> 5 null
> {code}
> Seems to affect multiple types of tables.
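
To make the reported round trip concrete, here is a minimal sketch in plain Scala (no Spark dependency) of how a single placeholder directory could produce the output shown in the issue. It is only a model under stated assumptions: `PartitionPathSketch`, `partitionDir`, and `readBack` are hypothetical names, not Spark or Hive APIs, and path escaping is left out. The one thing taken from this thread is the placeholder string itself.

```scala
// Simplified model of the behavior reported in SPARK-24438.
// Hypothetical helpers for illustration; this is not Spark's implementation.
object PartitionPathSketch {
  // Placeholder string mentioned in the comment on this issue.
  val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

  // Assumption: both null and "" are rendered as the placeholder,
  // which matches the "same default partition" symptom in the report.
  def partitionDir(column: String, value: String): String = {
    val rendered =
      if (value == null || value.isEmpty) DefaultPartitionName else value
    s"$column=$rendered"
  }

  // On read, the placeholder can only be decoded one way, so the
  // distinction between "" and null is already lost at this point.
  def readBack(dir: String): String = {
    val raw = dir.substring(dir.indexOf('=') + 1)
    if (raw == DefaultPartitionName) null else raw
  }

  def main(args: Array[String]): Unit = {
    Seq("", "hello", null).foreach { v =>
      val shown = if (v == null) "null" else "'" + v + "'"
      val dir = partitionDir("b", v)
      println(s"$shown -> $dir -> ${readBack(dir)}")
    }
  }
}
```

In this model the information is lost at write time, so a fix on the read side alone could not recover the empty strings; whether that holds for the actual Spark code path would need to be confirmed against the source.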