[jira] [Commented] (SPARK-24438) Empty strings and null strings are written to the same partition

2018-07-09 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536682#comment-16536682
 ] 

Marco Gaido commented on SPARK-24438:
-

IIRC, Hive has a placeholder string (__HIVE_DEFAULT_PARTITION__) for null value 
in partitions.

> Empty strings and null strings are written to the same partition
> 
>
> Key: SPARK-24438
> URL: https://issues.apache.org/jira/browse/SPARK-24438
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Mukul Murthy
>Priority: Major
>
> When you partition on a string column that has empty strings and nulls, they 
> are both written to the same default partition. When you read the data back, 
> all those values get read back as null.
> {code:java}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, 
> null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df) 
> => 
> a b
> 1 
> 2 
> 3 
> 4 hello
> 5 null
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> => 
> a b
> 4 hello
> 3 null
> 2 null
> 1 null
> 5 null
> {code}
> Seems to affect multiple types of tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24438) Empty strings and null strings are written to the same partition

2018-07-05 Thread Mukul Murthy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533977#comment-16533977
 ] 

Mukul Murthy commented on SPARK-24438:
--

Are null and empty string both invalid partition values? I kind of dislike that 
it's causing the actual data to be changed, although it is minor, and as you 
guys said it's actually a Hive bug so I don't think it's straightforward to fix.

 

> Empty strings and null strings are written to the same partition
> 
>
> Key: SPARK-24438
> URL: https://issues.apache.org/jira/browse/SPARK-24438
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Mukul Murthy
>Priority: Major
>
> When you partition on a string column that has empty strings and nulls, they 
> are both written to the same default partition. When you read the data back, 
> all those values get read back as null.
> {code:java}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, 
> null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df) 
> => 
> a b
> 1 
> 2 
> 3 
> 4 hello
> 5 null
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> => 
> a b
> 4 hello
> 3 null
> 2 null
> 1 null
> 5 null
> {code}
> Seems to affect multiple types of tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24438) Empty strings and null strings are written to the same partition

2018-07-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533929#comment-16533929
 ] 

Dongjoon Hyun commented on SPARK-24438:
---

Yep. It works as designed for now. Shall we close this issue?

> Empty strings and null strings are written to the same partition
> 
>
> Key: SPARK-24438
> URL: https://issues.apache.org/jira/browse/SPARK-24438
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Mukul Murthy
>Priority: Major
>
> When you partition on a string column that has empty strings and nulls, they 
> are both written to the same default partition. When you read the data back, 
> all those values get read back as null.
> {code:java}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, 
> null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df) 
> => 
> a b
> 1 
> 2 
> 3 
> 4 hello
> 5 null
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> => 
> a b
> 4 hello
> 3 null
> 2 null
> 1 null
> 5 null
> {code}
> Seems to affect multiple types of tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24438) Empty strings and null strings are written to the same partition

2018-07-04 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532775#comment-16532775
 ] 

Wenchen Fan commented on SPARK-24438:
-

AFAIK this is the same behavior from Hive. null and empty string are both 
invalid partition values, so they are same when used as partition values. cc  
[~gatorsmile] [~dongjoon]

> Empty strings and null strings are written to the same partition
> 
>
> Key: SPARK-24438
> URL: https://issues.apache.org/jira/browse/SPARK-24438
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Mukul Murthy
>Priority: Major
>
> When you partition on a string column that has empty strings and nulls, they 
> are both written to the same default partition. When you read the data back, 
> all those values get read back as null.
> {code:java}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, 
> null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df) 
> => 
> a b
> 1 
> 2 
> 3 
> 4 hello
> 5 null
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> => 
> a b
> 4 hello
> 3 null
> 2 null
> 1 null
> 5 null
> {code}
> Seems to affect multiple types of tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24438) Empty strings and null strings are written to the same partition

2018-07-04 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532392#comment-16532392
 ] 

Liang-Chi Hsieh commented on SPARK-24438:
-

>From the code, looks like we intentionally treat empty string and null the 
>same as default partition name, though the dataframe read back doesn't make 
>such sense.

cc [~cloud_fan] do you think this is a bug and we should fix it?

> Empty strings and null strings are written to the same partition
> 
>
> Key: SPARK-24438
> URL: https://issues.apache.org/jira/browse/SPARK-24438
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Mukul Murthy
>Priority: Major
>
> When you partition on a string column that has empty strings and nulls, they 
> are both written to the same default partition. When you read the data back, 
> all those values get read back as null.
> {code:java}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, 
> null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df) 
> => 
> a b
> 1 
> 2 
> 3 
> 4 hello
> 5 null
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> => 
> a b
> 4 hello
> 3 null
> 2 null
> 1 null
> 5 null
> {code}
> Seems to affect multiple types of tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org