[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2018-05-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20068


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2018-01-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r161378950
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
@@ -152,7 +152,11 @@ class CSVOptions(
 writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
 writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
 writerSettings.setNullValue(nullValue)
-writerSettings.setEmptyValue(nullValue)
+// The Univocity parser parses empty strings as `null` by default. This is the default behavior
+// for Spark too, since `nullValue` defaults to an empty string and takes precedence over
+// setEmptyValue(). But when `nullValue` is set to a different value, the empty string should be
+// parsed not as `null` but as an empty string.
+writerSettings.setEmptyValue("")
--- End diff --

This conf is needed to avoid behavior changes.


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2018-01-01 Thread aa8y
Github user aa8y commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r159160508
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
@@ -152,7 +152,11 @@ class CSVOptions(
 writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
 writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
 writerSettings.setNullValue(nullValue)
-writerSettings.setEmptyValue(nullValue)
+// The Univocity parser parses empty strings as `null` by default. This is the default behavior
+// for Spark too, since `nullValue` defaults to an empty string and takes precedence over
+// setEmptyValue(). But when `nullValue` is set to a different value, the empty string should be
+// parsed not as `null` but as an empty string.
+writerSettings.setEmptyValue("")
--- End diff --

I talked about this with Hyukjin Kwon before. I think the previous behavior should _not_ be exposed as an option, as it was a bug. All it did was _always_ coerce empty values to `null`s. If `nullValue` was not set, it defaulted to `""`, which coerced `""` to `null`; setting the empty value to `""` had no effect in this case. If `nullValue` was set to something else, say `\N`, then the empty value was also set to `\N`, which resulted in parsing both `\N` and `""` to `null`, since `""` was no longer considered an empty value and coercing `""` to `null` is the Univocity parser's default.

Setting the empty value explicitly to the `""` literal ensures that an empty string is always parsed as an empty string, unless `nullValue` is unset or set to `""`, which is what people would do if they want `""` to be parsed as `null`, i.e. the old behavior.
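The two-stage coercion described above can be sketched in plain Scala. This is a hypothetical model for illustration only, not Spark's or Univocity's actual implementation; `parseField` and the stage split are assumptions made to show the precedence:

```scala
// Hypothetical model of the coercion: the parser first substitutes
// `emptyValue` for empty fields, then anything equal to `nullValue`
// becomes null. Illustrative only, not the real parser code.
def parseField(raw: String, nullValue: String, emptyValue: String): String = {
  // Stage 1: empty fields are replaced by `emptyValue`.
  val afterParser = if (raw != null && raw.isEmpty) emptyValue else raw
  // Stage 2: values equal to `nullValue` are coerced to null.
  if (afterParser == nullValue) null else afterParser
}

// Old (buggy) behavior: emptyValue was set to nullValue, so "" always became null.
assert(parseField("", "\\N", emptyValue = "\\N") == null)
// New behavior: emptyValue pinned to the "" literal, so "" survives as "".
assert(parseField("", "\\N", emptyValue = "") == "")
// With the default nullValue of "", an empty field is still coerced to null.
assert(parseField("", "", emptyValue = "") == null)
assert(parseField("\\N", "\\N", emptyValue = "") == null)
assert(parseField("bar", "\\N", emptyValue = "") == "bar")
```

In this model the old behavior falls out of `emptyValue == nullValue`, which is exactly why the fix pins the empty value to the `""` literal instead of reusing `nullValue`.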


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-29 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r159118238
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
@@ -152,7 +152,11 @@ class CSVOptions(
 writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
 writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
 writerSettings.setNullValue(nullValue)
-writerSettings.setEmptyValue(nullValue)
+// The Univocity parser parses empty strings as `null` by default. This is the default behavior
+// for Spark too, since `nullValue` defaults to an empty string and takes precedence over
+// setEmptyValue(). But when `nullValue` is set to a different value, the empty string should be
+// parsed not as `null` but as an empty string.
+writerSettings.setEmptyValue("")
--- End diff --

Could you make this a conf, like what we did for `nullValue`?


https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L613
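The option-with-default pattern being pointed at can be sketched like this. A minimal sketch under stated assumptions: a plain `Map` stands in for Spark's case-insensitive options map, and the `emptyValue` key with its `""` default is illustrative, not merged Spark behavior:

```scala
// Hypothetical sketch of reading an `emptyValue` option with a fallback,
// mirroring how `nullValue` is read from the user-supplied options.
// Names and defaults here are illustrative assumptions.
def csvWriteSettings(parameters: Map[String, String]): (String, String) = {
  val nullValue = parameters.getOrElse("nullValue", "")
  // Falls back to the "" literal when unset, so the old coercion could
  // still be opted into by setting emptyValue equal to nullValue.
  val emptyValue = parameters.getOrElse("emptyValue", "")
  (nullValue, emptyValue)
}

assert(csvWriteSettings(Map("nullValue" -> "\\N")) == (("\\N", "")))
assert(csvWriteSettings(Map("nullValue" -> "\\N", "emptyValue" -> "\\N")) == (("\\N", "\\N")))
```

The design question in this thread is whether such a fallback should preserve the old coercion by default or adopt the fixed behavior unconditionally.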


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158616812
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
 )
   }
+
+  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
+val sparkSession = spark
+
+val elems = Seq(("bar"), (""), (null: String))
+
+// Checks for new behavior where an empty string is not coerced to null.
+withTempDir { dir =>
+  val outDir = new File(dir, "out").getCanonicalPath
+  val nullValue = "\\N"
+
+  import sparkSession.implicits._
--- End diff --

I think we don't need this (by `import testImplicits._` above).


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158616834
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
 )
   }
+
+  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
+val sparkSession = spark
+
+val elems = Seq(("bar"), (""), (null: String))
+
+// Checks for new behavior where an empty string is not coerced to null.
+withTempDir { dir =>
+  val outDir = new File(dir, "out").getCanonicalPath
+  val nullValue = "\\N"
+
+  import sparkSession.implicits._
+  val dsIn = spark.createDataset(elems)
--- End diff --

`Seq(("bar"), (""), (null: String)).toDS`?


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158616740
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
@@ -152,7 +152,7 @@ class CSVOptions(
 writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
 writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
 writerSettings.setNullValue(nullValue)
-writerSettings.setEmptyValue(nullValue)
+writerSettings.setEmptyValue("")
--- End diff --

Could we leave some comments here and update the PR description too?


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158616591
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
 )
   }
+
+  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
+val sparkSession = spark
--- End diff --

I think we can just use `spark`.


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158617095
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
 )
   }
+
+  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
+val sparkSession = spark
+
+val elems = Seq(("bar"), (""), (null: String))
+
+// Checks for new behavior where an empty string is not coerced to null.
+withTempDir { dir =>
+  val outDir = new File(dir, "out").getCanonicalPath
+  val nullValue = "\\N"
+
+  import sparkSession.implicits._
+  val dsIn = spark.createDataset(elems)
+  dsIn.write
+.option("nullValue", nullValue)
+.csv(outDir)
+  val dsOut = spark.read
+.option("nullValue", nullValue)
+.schema(dsIn.schema)
+.csv(outDir)
+.as[(String)]
+  val computed = dsOut.collect.toSeq
+  val expected = Seq(("bar"), (null: String))
--- End diff --

I don't think this is quite the expected output. Could we use the examples provided in the JIRA rather than single-row ones?


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158616912
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
 )
   }
+
+  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
+val sparkSession = spark
+
+val elems = Seq(("bar"), (""), (null: String))
+
+// Checks for new behavior where an empty string is not coerced to null.
+withTempDir { dir =>
+  val outDir = new File(dir, "out").getCanonicalPath
+  val nullValue = "\\N"
+
+  import sparkSession.implicits._
+  val dsIn = spark.createDataset(elems)
+  dsIn.write
+.option("nullValue", nullValue)
+.csv(outDir)
+  val dsOut = spark.read
+.option("nullValue", nullValue)
+.schema(dsIn.schema)
+.csv(outDir)
+.as[(String)]
+  val computed = dsOut.collect.toSeq
+  val expected = Seq(("bar"), (null: String))
+
+  assert(computed.size === 2)
+  assert(computed.sameElements(expected))
--- End diff --

We can use `checkAnswer(..: DataFrame, .. : DataFrame)`


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158616872
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1248,4 +1248,49 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
 )
   }
+
+  test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
+val sparkSession = spark
+
+val elems = Seq(("bar"), (""), (null: String))
+
+// Checks for new behavior where an empty string is not coerced to null.
+withTempDir { dir =>
--- End diff --

We could do this `withTempPath { path =>`


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-24 Thread aa8y
Github user aa8y commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158606107
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
@@ -152,7 +152,7 @@ class CSVOptions(
 writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
 writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
 writerSettings.setNullValue(nullValue)
-writerSettings.setEmptyValue(nullValue)
+writerSettings.setEmptyValue("")
--- End diff --

I disagree. I don't think the previous behavior should be exposed as an option, as it was a bug. All it did was _always_ coerce empty values to `null`s. If `nullValue` was not set, it defaulted to `""`, which coerced `""` to `null`; setting the empty value to `""` had no effect in this case. If `nullValue` was set to something else, say `\N`, then the empty value was also set to `\N`, which resulted in parsing both `\N` and `""` to `null`, since `""` was no longer considered an empty value and coercing `""` to `null` is the Univocity parser's default.

Setting the empty value explicitly to the `""` literal ensures that an empty string is always parsed as an empty string, unless `nullValue` is unset or set to `""`, which is what people would do if they want `""` to be parsed as `null`, i.e. the old behavior.


---




[GitHub] spark pull request #20068: [SPARK-17916][SQL] Fix empty string being parsed ...

2017-12-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20068#discussion_r158593580
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
@@ -152,7 +152,7 @@ class CSVOptions(
 writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
 writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
 writerSettings.setNullValue(nullValue)
-writerSettings.setEmptyValue(nullValue)
+writerSettings.setEmptyValue("")
--- End diff --

Can we simply expose this as an option and keep the previous behaviour if 
this option is not set explicitly by the user?


---
