[jira] [Commented] (SPARK-24438) Empty strings and null strings are written to the same partition
[ https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536682#comment-16536682 ]

Marco Gaido commented on SPARK-24438:
-------------------------------------

IIRC, Hive has a placeholder string (__HIVE_DEFAULT_PARTITION__) for null values in partitions.

> Empty strings and null strings are written to the same partition
> ----------------------------------------------------------------
>
>                 Key: SPARK-24438
>                 URL: https://issues.apache.org/jira/browse/SPARK-24438
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Mukul Murthy
>            Priority: Major
>
> When you partition on a string column that contains both empty strings and
> nulls, they are all written to the same default partition. When you read the
> data back, all of those values are read back as null.
> {code:scala}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
>
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df)
> // =>
> // a  b
> // 1
> // 2
> // 3
> // 4  hello
> // 5  null
>
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> // =>
> // a  b
> // 4  hello
> // 3  null
> // 2  null
> // 1  null
> // 5  null
> {code}
> Seems to affect multiple types of tables.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
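The write-side collapse Marco describes can be sketched without Spark. The helper below is a hypothetical, simplified stand-in for Spark's partition-path naming (the function name and the percent-encoding detail are assumptions; only the __HIVE_DEFAULT_PARTITION__ placeholder itself comes from the comment above). The point it demonstrates: once None and "" map to the same directory name, the distinction is unrecoverable on disk.

```python
# Hypothetical sketch of partition-value -> directory-name mapping.
# Only the placeholder string is taken from the discussion above; the
# escaping of non-null values is a simplified assumption.
from urllib.parse import quote

DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__"

def partition_dir(column: str, value) -> str:
    # Both None and "" are treated as "no usable partition value" and
    # collapse into the same default partition directory.
    if value is None or value == "":
        return f"{column}={DEFAULT_PARTITION_NAME}"
    # Real implementations escape characters that are unsafe in file
    # paths; URL-style percent-encoding stands in for that here.
    return f"{column}={quote(str(value), safe='')}"

print(partition_dir("b", None))     # b=__HIVE_DEFAULT_PARTITION__
print(partition_dir("b", ""))       # b=__HIVE_DEFAULT_PARTITION__  (same directory)
print(partition_dir("b", "hello"))  # b=hello
```

Rows 1, 2, 3, and 5 from the report all land in `b=__HIVE_DEFAULT_PARTITION__`, while row 4 gets its own `b=hello` directory, which matches the observed on-disk layout.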
[ https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533977#comment-16533977 ]

Mukul Murthy commented on SPARK-24438:
--------------------------------------

Are null and empty string both invalid partition values? I kind of dislike that the actual data gets changed, even if only slightly, but as you said it's actually a Hive bug, so I don't think it's straightforward to fix.
[ https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533929#comment-16533929 ]

Dongjoon Hyun commented on SPARK-24438:
---------------------------------------

Yep. It works as designed for now. Shall we close this issue?
[ https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532775#comment-16532775 ]

Wenchen Fan commented on SPARK-24438:
-------------------------------------

AFAIK this is the same behavior as Hive: null and the empty string are both invalid partition values, so they are treated as the same value when used as partition values. cc [~gatorsmile] [~dongjoon]
[ https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532392#comment-16532392 ]

Liang-Chi Hsieh commented on SPARK-24438:
-----------------------------------------

From the code, it looks like we intentionally treat the empty string and null the same, as the default partition name, even though the dataframe read back that way doesn't make much sense. cc [~cloud_fan] do you think this is a bug and we should fix it?
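The read-back symptom in the report (every empty string coming back as null) follows from the decode direction of the same scheme. A hypothetical sketch, mirroring the assumptions above: a single placeholder directory name can only be decoded to one value, so the reader must pick null, and the "" / null distinction is lost.

```python
# Hypothetical sketch of the read side: directory name -> partition value.
# The placeholder string is from the discussion above; the rest is a
# simplified assumption, not Spark's actual implementation.
from urllib.parse import unquote

DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__"

def parse_partition_dir(dirname: str):
    column, _, raw = dirname.partition("=")
    # There is no way to tell "" and null apart any more: the one
    # placeholder can only be decoded back to one value, null.
    value = None if raw == DEFAULT_PARTITION_NAME else unquote(raw)
    return column, value

print(parse_partition_dir("b=__HIVE_DEFAULT_PARTITION__"))  # ('b', None)
print(parse_partition_dir("b=hello"))                       # ('b', 'hello')
```

This is why rows 1, 2, and 3 (written with "") and row 5 (written with null) all read back as null, while row 4 round-trips intact.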