[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...
Github user dhunziker commented on a diff in the pull request: https://github.com/apache/spark/pull/16812#discussion_r99953924 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -139,6 +140,20 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { assert(result.schema === expectedSchema) } + test("test inferring booleans with custom values") { +val result = spark.read + .format("csv") + .option("header", "true") --- End diff -- Sure, will make the change. If preferred, I can change the whole suite to the typesafe options as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...
Github user dhunziker commented on a diff in the pull request: https://github.com/apache/spark/pull/16812#discussion_r99953321 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala --- @@ -73,6 +73,12 @@ class CSVInferSchemaSuite extends SparkFunSuite { assert(CSVInferSchema.inferField(LongType, "2015-08 14:49:00", options) == StringType) } + test("Boolean field types are inferred correctly from custom value") { +val options = new CSVOptions(Map("trueValue" -> "yes")) +assert(CSVInferSchema.inferField(BooleanType, "yES", options) == BooleanType) --- End diff -- No assert in this suite uses `===`. Should I make this consistent and replace all occurrences of `==` in that suite with `===` or just the one in the lines I've changed? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...
Github user dhunziker commented on a diff in the pull request: https://github.com/apache/spark/pull/16812#discussion_r99951876 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -110,7 +110,11 @@ private[csv] class UnivocityParser( } case _: BooleanType => (d: String) => - nullSafeDatum(d, name, nullable, options)(_.toBoolean) + nullSafeDatum(d, name, nullable, options) { datum => +if (datum.equalsIgnoreCase(options.trueValue)) true --- End diff -- Just seems more obvious that `datum` is guaranteed to be not null given the name of `nullSafeDatum`. Probably should switch around the values in `CSVInferSchema.scala`, since there, `field` is null checked at the beginning of the parsing chain. Do you prefer one over the other? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/16812#discussion_r99935530 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -139,6 +140,20 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { assert(result.schema === expectedSchema) } + test("test inferring booleans with custom values") { +val result = spark.read + .format("csv") + .option("header", "true") --- End diff -- Could you use `true` (boolean literal) instead to promote the more type-safe approach? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/16812#discussion_r99935349 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala --- @@ -73,6 +73,12 @@ class CSVInferSchemaSuite extends SparkFunSuite { assert(CSVInferSchema.inferField(LongType, "2015-08 14:49:00", options) == StringType) } + test("Boolean field types are inferred correctly from custom value") { +val options = new CSVOptions(Map("trueValue" -> "yes")) +assert(CSVInferSchema.inferField(BooleanType, "yES", options) == BooleanType) --- End diff -- What's the reason for `==` not `===` as it's in the other asserts? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/16812#discussion_r99934599 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -110,7 +110,11 @@ private[csv] class UnivocityParser( } case _: BooleanType => (d: String) => - nullSafeDatum(d, name, nullable, options)(_.toBoolean) + nullSafeDatum(d, name, nullable, options) { datum => +if (datum.equalsIgnoreCase(options.trueValue)) true --- End diff -- Is there any reason why you compare `datum` to `options.trueValue` and not vice versa as it's in `CSVInferSchema.scala` above? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16812: [SPARK-19465][SQL] Added options for custom boole...
GitHub user dhunziker opened a pull request: https://github.com/apache/spark/pull/16812 [SPARK-19465][SQL] Added options for custom boolean values in CSV ## What changes were proposed in this pull request? It adds trueValue and falseValue options for customising the values used for Boolean types in CSV files. The default behaviour remains unchanged but the two new options allow for more flexibility. As it is currently the case, the values are matched ignoring their case. ## How was this patch tested? Extended existing unit tests for custom Boolean values and added a new test for asserting the IllegalArgumentException thrown by Scala's StringLike.toBoolean function. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dhunziker/spark feature/SPARK-19465 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16812.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16812 commit 0926626f3ab6411e405fb31c4a291d49092a19a4 Author: Dennis Hunziker Date: 2017-02-05T20:06:49Z [SPARK-19465][SQL] Added options for custom boolean values in CSV --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org