[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22979
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r235618859

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala ---
@@ -79,4 +83,22 @@ object CSVExprUtils {
       throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
     }
   }
+
+  def getDecimalParser(useLegacyParser: Boolean, locale: Locale): String => java.math.BigDecimal = {
+    if (useLegacyParser) {
+      (s: String) => new BigDecimal(s.replaceAll(",", ""))
+    } else {
+      val df = new DecimalFormat("", new DecimalFormatSymbols(locale))
--- End diff --

renamed
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r235588064

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala ---
@@ -79,4 +83,22 @@ object CSVExprUtils {
       throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
     }
   }
+
+  def getDecimalParser(useLegacyParser: Boolean, locale: Locale): String => java.math.BigDecimal = {
+    if (useLegacyParser) {
+      (s: String) => new BigDecimal(s.replaceAll(",", ""))
+    } else {
+      val df = new DecimalFormat("", new DecimalFormatSymbols(locale))
--- End diff --

let's name `df` `decimalFormat`.
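For reference, the non-legacy branch of the parser under discussion can be sketched in plain Java (a minimal sketch only; the class and method names here are made up for illustration and are not the Spark code). `setParseBigDecimal(true)` makes `DecimalFormat.parse` return `BigDecimal`, and the `DecimalFormatSymbols` of the target locale decide which characters act as decimal and grouping separators:

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

public class LocaleDecimalParser {
    // Parse a decimal string using the separators of the given locale.
    // An empty pattern with locale-specific symbols mirrors the PR's
    // `new DecimalFormat("", new DecimalFormatSymbols(locale))` approach.
    static BigDecimal parse(String s, Locale locale) {
        DecimalFormat decimalFormat = new DecimalFormat("", new DecimalFormatSymbols(locale));
        // Return BigDecimal objects instead of Long/Double from parse().
        decimalFormat.setParseBigDecimal(true);
        try {
            return (BigDecimal) decimalFormat.parse(s);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse '" + s + "' as a decimal", e);
        }
    }

    public static void main(String[] args) {
        // In de-DE the decimal separator is ','; in en-US it is '.'.
        System.out.println(parse("1000,001", Locale.GERMANY)); // 1000.001
        System.out.println(parse("1000.001", Locale.US));      // 1000.001
    }
}
```

The legacy branch, by contrast, simply strips `,` characters and delegates to the `BigDecimal(String)` constructor, which only understands the locale-independent `.` separator.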
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r233353589

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -104,6 +106,14 @@ class UnivocityParser(
     requiredSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
   }
 
+  private val decimalParser = if (SQLConf.get.legacyDecimalParsing) {
+    (s: String) => new BigDecimal(s.replaceAll(",", ""))
--- End diff --

I made parsing more strict. It should be fine now.
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r233242705

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -104,6 +106,14 @@ class UnivocityParser(
     requiredSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
   }
 
+  private val decimalParser = if (SQLConf.get.legacyDecimalParsing) {
+    (s: String) => new BigDecimal(s.replaceAll(",", ""))
--- End diff --

I made some changes, but a few tests in `CSVSuite` began failing because a Decimal type was inferred from timestamps. For example, `2015` from `2015-08-20 15:57:00.0`.
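The inference problem described here, and the stricter parsing that fixes it, can be illustrated in isolation (a hypothetical Java sketch, not the actual Spark change; the class and method names are invented). The one-argument `DecimalFormat.parse` consumes as many leading characters as it can, so a timestamp like `2015-08-20` "parses" as `2015`; passing a `ParsePosition` and requiring that the entire input was consumed rejects such values:

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParsePosition;
import java.util.Locale;

public class StrictParseDemo {
    // Strict decimal parsing: succeed only if the whole string is a decimal.
    static BigDecimal parseStrict(String s, Locale locale) {
        DecimalFormat decimalFormat = new DecimalFormat("", new DecimalFormatSymbols(locale));
        decimalFormat.setParseBigDecimal(true);
        ParsePosition pos = new ParsePosition(0);
        BigDecimal result = (BigDecimal) decimalFormat.parse(s, pos);
        // Lenient parse stops at the first unparseable character; require
        // that it consumed everything and reported no error.
        if (pos.getErrorIndex() != -1 || pos.getIndex() != s.length()) {
            return null; // not a pure decimal: reject instead of truncating
        }
        return result;
    }

    public static void main(String[] args) {
        // Lenient parsing would have returned 2015 here; strict parsing rejects it.
        System.out.println(parseStrict("2015-08-20", Locale.US)); // null
        System.out.println(parseStrict("2015.5", Locale.US));     // 2015.5
    }
}
```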
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r233009506

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -104,6 +106,14 @@ class UnivocityParser(
     requiredSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
   }
 
+  private val decimalParser = if (SQLConf.get.legacyDecimalParsing) {
+    (s: String) => new BigDecimal(s.replaceAll(",", ""))
--- End diff --

@MaxGekk, it looks like we should do the same thing in the schema inference path as well.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r232520110

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -149,8 +156,8 @@ class UnivocityParser(
     case dt: DecimalType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
-        val value = new BigDecimal(datum.replaceAll(",", ""))
-        Decimal(value, dt.precision, dt.scale)
+        val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
--- End diff --

Sounds good, if that's not difficult.
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r232496879

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -149,8 +156,8 @@ class UnivocityParser(
     case dt: DecimalType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
-        val value = new BigDecimal(datum.replaceAll(",", ""))
-        Decimal(value, dt.precision, dt.scale)
+        val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
--- End diff --

> so I suggested to add a configuration or fallback for now ...

What about a SQL config `spark.sql.legacy.decimalParsing.enabled` with default value `false`?
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r232487851

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -149,8 +156,8 @@ class UnivocityParser(
     case dt: DecimalType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
-        val value = new BigDecimal(datum.replaceAll(",", ""))
-        Decimal(value, dt.precision, dt.scale)
+        val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
--- End diff --

For instance, there was a similar attempt to change the date parsing library. I already know the difference is quite breaking, so I suggested adding a configuration or fallback for now. Probably we should similarly just document the behaviour change in the migration guide, but I am actually less sure yet even about this. Anyway, I will take another look shortly.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r232487778

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -149,8 +156,8 @@ class UnivocityParser(
     case dt: DecimalType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
-        val value = new BigDecimal(datum.replaceAll(",", ""))
-        Decimal(value, dt.precision, dt.scale)
+        val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
--- End diff --

Ah, right. The previous code would throw an exception anyway, I see. One thing I am a little unsure about is how different the behaviour is. For instance, it looks like the previous one handles sign characters as well (`+` and `-`). Let me take a closer look. I think I need to.
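The sign-handling concern raised here can be checked directly. A standalone sketch (class and helper names are illustrative, not Spark code), assuming the legacy path is `new BigDecimal(s.replaceAll(",", ""))` as in the diff above: the `BigDecimal(String)` constructor accepts an explicit leading `+`, while `DecimalFormat` with the default empty positive prefix does not recognize it and fails to parse.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

public class SignHandlingDemo {
    // Legacy behaviour: strip ',' and use the BigDecimal(String) constructor,
    // which accepts an explicit '+' or '-' sign.
    static BigDecimal legacyParse(String s) {
        return new BigDecimal(s.replaceAll(",", ""));
    }

    // DecimalFormat-based behaviour: default positive prefix is the empty
    // string, so a leading '+' is unparseable; '-' (the default negative
    // prefix) still works. Returns null on a parse failure.
    static BigDecimal formatParse(String s) {
        DecimalFormat decimalFormat = new DecimalFormat("", new DecimalFormatSymbols(Locale.US));
        decimalFormat.setParseBigDecimal(true);
        try {
            return (BigDecimal) decimalFormat.parse(s);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(legacyParse("+1.23")); // 1.23
        System.out.println(formatParse("+1.23")); // null: '+' not accepted
        System.out.println(formatParse("-1.23")); // -1.23
    }
}
```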
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r232487559

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -149,8 +156,8 @@ class UnivocityParser(
     case dt: DecimalType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
-        val value = new BigDecimal(datum.replaceAll(",", ""))
-        Decimal(value, dt.precision, dt.scale)
+        val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
--- End diff --

> is it safe that we assume this Number is BigDecimal?

I am not absolutely sure that it always returns `BigDecimal`. Found this at https://docs.oracle.com/javase/8/docs/api/java/text/DecimalFormat.html#parse(java.lang.String,java.text.ParsePosition) :

```
If isParseBigDecimal() is true, values are returned as BigDecimal objects. The values are the ones constructed by BigDecimal.BigDecimal(String) for corresponding strings in locale-independent format. The special cases negative and positive infinity and NaN are returned as Double instances holding the values of the corresponding Double constants.
```

So, `isParseBigDecimal()` returns `true` when `setParseBigDecimal` was called with `true`, as in the PR.

> Looks there are some possibilities that it can return other types.

In that case we just fail with a cast exception, and the record will be handled as a bad record. Or do you want to see a clearer message in the exception?
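The Javadoc behaviour quoted above is easy to verify in isolation (a minimal Java sketch; the class and helper names are made up): without `setParseBigDecimal(true)`, `DecimalFormat.parse` returns `Long` for integral input and `Double` otherwise; with it, finite values come back as `BigDecimal`, which is what makes the `asInstanceOf[BigDecimal]` cast safe for ordinary decimals.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

public class ParseBigDecimalDemo {
    // Report the runtime class DecimalFormat.parse returns for the input,
    // with or without setParseBigDecimal(true).
    static String parsedClass(String s, boolean parseBigDecimal) {
        DecimalFormat decimalFormat = new DecimalFormat("", new DecimalFormatSymbols(Locale.US));
        decimalFormat.setParseBigDecimal(parseBigDecimal);
        try {
            return decimalFormat.parse(s).getClass().getSimpleName();
        } catch (ParseException e) {
            return "ParseException";
        }
    }

    public static void main(String[] args) {
        System.out.println(parsedClass("10", false));  // Long
        System.out.println(parsedClass("1.5", false)); // Double
        System.out.println(parsedClass("1.5", true));  // BigDecimal
        // Per the Javadoc, ±Infinity and NaN would still come back as Double
        // even with setParseBigDecimal(true).
    }
}
```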
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r232486778

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -9,6 +9,8 @@ displayTitle: Spark SQL Upgrading Guide
 ## Upgrading From Spark SQL 2.4 to 3.0
 
+  - Since Spark 3.0, to parse decimals in locale specific format from CSV, set the `locale` option to proper value.
--- End diff --

While we are here, let's format this migration guide item like the others, something like: "In Spark version 2.4 and earlier, it only deals with locale specific decimal notation like `,`. Since Spark 3.0, the locale can be set by `locale` and the default locale is" Please feel free to change it as you see fit.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r232486670

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -149,8 +156,8 @@ class UnivocityParser(
     case dt: DecimalType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
-        val value = new BigDecimal(datum.replaceAll(",", ""))
-        Decimal(value, dt.precision, dt.scale)
+        val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
--- End diff --

@MaxGekk, is it safe to assume this `Number` is a `BigDecimal`? It looks like there are some possibilities that it can return other types.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r232486599

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -104,6 +105,12 @@ class UnivocityParser(
     requiredSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
   }
 
+  private val decimalParser = {
+    val df = new DecimalFormat("", new DecimalFormatSymbols(options.locale))
--- End diff --

not a big deal, but I would just name it `decimalFormat`
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r232484798

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CsvExpressionsSuite.scala ---
@@ -226,4 +227,17 @@ class CsvExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with P
         InternalRow(17836)) // number of days from 1970-01-01
     }
   }
+
+  test("parse decimals using locale") {
+    Seq("en-US", "ko-KR", "ru-RU", "de-DE").foreach { langTag =>
+      val schema = new StructType().add("d", DecimalType(10, 5))
+      val options = Map("locale" -> langTag, "sep" -> "|")
+      val expected = Decimal(1000.001, 10, 5)
+      val df = new DecimalFormat("", new DecimalFormatSymbols(Locale.forLanguageTag(langTag)))
+      val input = df.format(expected.toBigDecimal)
+      checkEvaluation(
+        CsvToStructs(schema, options, Literal.create(input), gmtId),
+        InternalRow(expected))
+    }
+  }
--- End diff --

@MaxGekk, there's `UnivocityParserSuite`.
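The round-trip idea in the Scala test above (format a value with a locale's `DecimalFormat`, feed the result back through the parser, and expect the original value) can be sketched in plain Java without the Spark harness. This is an illustrative sketch only; the class and method names are invented, and it assumes the same empty-pattern `DecimalFormat` construction as the PR:

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

public class LocaleRoundTripDemo {
    // Format a value in the given locale, then parse it back the same way
    // the CSV parser would; the round trip must preserve the value.
    static BigDecimal roundTrip(BigDecimal value, String langTag) {
        DecimalFormat decimalFormat =
            new DecimalFormat("", new DecimalFormatSymbols(Locale.forLanguageTag(langTag)));
        decimalFormat.setParseBigDecimal(true);
        String input = decimalFormat.format(value); // e.g. "1000,001" in de-DE
        try {
            return (BigDecimal) decimalFormat.parse(input);
        } catch (ParseException e) {
            throw new IllegalStateException("Round trip failed for " + langTag, e);
        }
    }

    public static void main(String[] args) {
        BigDecimal expected = new BigDecimal("1000.001");
        for (String langTag : new String[] {"en-US", "ko-KR", "ru-RU", "de-DE"}) {
            BigDecimal parsed = roundTrip(expected, langTag);
            System.out.println(langTag + " round trip ok: " + (parsed.compareTo(expected) == 0));
        }
    }
}
```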
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r232484751

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -9,6 +9,8 @@ displayTitle: Spark SQL Upgrading Guide
 ## Upgrading From Spark SQL 2.4 to 3.0
 
+  - Since Spark 3.0, to parse decimals in locale specific format from CSV, set the `locale` option to proper value.
--- End diff --

@MaxGekk, it's not a behaviour change, I guess.
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/22979

[SPARK-25977][SQL] Parsing decimals from CSV using locale

## What changes were proposed in this pull request?

In the PR, I propose using the `locale` option to parse decimals from CSV input. After the changes, `UnivocityParser` converts an input string to `BigDecimal` and then to Spark's `Decimal` by using `java.text.DecimalFormat`.

## How was this patch tested?

Added a test for the `en-US`, `ko-KR`, `ru-RU` and `de-DE` locales.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 decimal-parsing-locale

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22979.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22979

commit c567dccc487d9f495eb00ca412d5b48d7899f401
Author: Maxim Gekk
Date: 2018-11-08T12:31:06Z
    Add a test

commit 2b41eba3c666efc2b6860fd31e387839afe378d5
Author: Maxim Gekk
Date: 2018-11-08T12:31:24Z
    Fix decimal parsing

commit cf438ae2071f952eb72aab6dddf47c2aeb930d84
Author: Maxim Gekk
Date: 2018-11-08T12:31:35Z
    Add locale option

commit f9438c4b94d8a1dcf8b51d087cabae5a3c420db7
Author: Maxim Gekk
Date: 2018-11-08T12:36:57Z
    Updating the migration guide