[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...

2017-02-07 Thread dhunziker
Github user dhunziker commented on a diff in the pull request:

https://github.com/apache/spark/pull/16812#discussion_r99953924
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -139,6 +140,20 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 assert(result.schema === expectedSchema)
   }
 
+  test("test inferring booleans with custom values") {
+val result = spark.read
+  .format("csv")
+  .option("header", "true")
--- End diff --

Sure, will make the change. If preferred, I can change the whole suite to 
the typesafe options as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...

2017-02-07 Thread dhunziker
Github user dhunziker commented on a diff in the pull request:

https://github.com/apache/spark/pull/16812#discussion_r99953321
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala
 ---
@@ -73,6 +73,12 @@ class CSVInferSchemaSuite extends SparkFunSuite {
 assert(CSVInferSchema.inferField(LongType, "2015-08 14:49:00", 
options) == StringType)
   }
 
+  test("Boolean field types are inferred correctly from custom value") {
+val options = new CSVOptions(Map("trueValue" -> "yes"))
+assert(CSVInferSchema.inferField(BooleanType, "yES", options) == 
BooleanType)
--- End diff --

No assert in this suite uses `===`. Should I make this consistent and 
replace all occurrences of `==` in that suite with `===` or just the one in the 
lines I've changed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...

2017-02-07 Thread dhunziker
Github user dhunziker commented on a diff in the pull request:

https://github.com/apache/spark/pull/16812#discussion_r99951876
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala
 ---
@@ -110,7 +110,11 @@ private[csv] class UnivocityParser(
   }
 
 case _: BooleanType => (d: String) =>
-  nullSafeDatum(d, name, nullable, options)(_.toBoolean)
+  nullSafeDatum(d, name, nullable, options) { datum =>
+if (datum.equalsIgnoreCase(options.trueValue)) true
--- End diff --

Just seems more obvious that `datum` is guaranteed to be not null given the 
name of `nullSafeDatum`. Probably should switch around the values in 
`CSVInferSchema.scala`, since there, `field` is null checked at the beginning 
of the parsing chain. Do you prefer one over the other?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...

2017-02-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/16812#discussion_r99935530
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -139,6 +140,20 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 assert(result.schema === expectedSchema)
   }
 
+  test("test inferring booleans with custom values") {
+val result = spark.read
+  .format("csv")
+  .option("header", "true")
--- End diff --

Could you use `true` (boolean literal) instead to promote the more 
type-safe approach?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...

2017-02-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/16812#discussion_r99935349
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala
 ---
@@ -73,6 +73,12 @@ class CSVInferSchemaSuite extends SparkFunSuite {
 assert(CSVInferSchema.inferField(LongType, "2015-08 14:49:00", 
options) == StringType)
   }
 
+  test("Boolean field types are inferred correctly from custom value") {
+val options = new CSVOptions(Map("trueValue" -> "yes"))
+assert(CSVInferSchema.inferField(BooleanType, "yES", options) == 
BooleanType)
--- End diff --

What's the reason for `==` not `===` as it's in the other asserts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...

2017-02-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/16812#discussion_r99934599
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala
 ---
@@ -110,7 +110,11 @@ private[csv] class UnivocityParser(
   }
 
 case _: BooleanType => (d: String) =>
-  nullSafeDatum(d, name, nullable, options)(_.toBoolean)
+  nullSafeDatum(d, name, nullable, options) { datum =>
+if (datum.equalsIgnoreCase(options.trueValue)) true
--- End diff --

Is there any reason why you compare `datum` to `options.trueValue` and not 
vice versa as it's in `CSVInferSchema.scala` above?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...

2017-02-05 Thread dhunziker
GitHub user dhunziker opened a pull request:

https://github.com/apache/spark/pull/16812

[SPARK-19465][SQL] Added options for custom boolean values in CSV

## What changes were proposed in this pull request?

It adds trueValue and falseValue options for customising the values used 
for Boolean types in CSV files. The default behaviour remains unchanged but the 
two new options allow for more flexibility. As it is currently the case, the 
values are matched ignoring their case.  

## How was this patch tested?

Extended existing unit tests for custom Boolean values and added a new test 
for asserting the IllegalArgumentException thrown by Scala's 
StringLike.toBoolean function.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dhunziker/spark feature/SPARK-19465

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16812.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16812


commit 0926626f3ab6411e405fb31c4a291d49092a19a4
Author: Dennis Hunziker 
Date:   2017-02-05T20:06:49Z

[SPARK-19465][SQL] Added options for custom boolean values in CSV




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org