[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22979


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-21 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r235618859
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala ---
@@ -79,4 +83,22 @@ object CSVExprUtils {
         throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
     }
   }
+
+  def getDecimalParser(useLegacyParser: Boolean, locale: Locale): String => java.math.BigDecimal = {
+    if (useLegacyParser) {
+      (s: String) => new BigDecimal(s.replaceAll(",", ""))
+    } else {
+      val df = new DecimalFormat("", new DecimalFormatSymbols(locale))
--- End diff --

renamed
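The `getDecimalParser` diff above boils down to two parsing strategies. A minimal Java sketch of the contrast (illustrative only: the PR's helper is Scala, and the method names here are made up):

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

public class DecimalParsers {
    // Legacy behaviour: strip grouping commas, then let BigDecimal parse the rest.
    static BigDecimal legacyParse(String s) {
        return new BigDecimal(s.replaceAll(",", ""));
    }

    // Locale-aware behaviour: let DecimalFormat interpret the separators
    // according to the given locale.
    static BigDecimal localeParse(String s, Locale locale) throws ParseException {
        DecimalFormat decimalFormat = new DecimalFormat("", new DecimalFormatSymbols(locale));
        decimalFormat.setParseBigDecimal(true); // make parse() return BigDecimal
        return (BigDecimal) decimalFormat.parse(s);
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(legacyParse("1,000.001"));              // 1000.001
        // German locale: ',' is the decimal separator.
        System.out.println(localeParse("1000,001", Locale.GERMANY)); // 1000.001
    }
}
```

Note that whether grouping separators are accepted during parsing depends on the format's grouping settings, which is part of the behaviour difference discussed in this thread.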


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r235588064
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala ---
@@ -79,4 +83,22 @@ object CSVExprUtils {
         throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
     }
   }
+
+  def getDecimalParser(useLegacyParser: Boolean, locale: Locale): String => java.math.BigDecimal = {
+    if (useLegacyParser) {
+      (s: String) => new BigDecimal(s.replaceAll(",", ""))
+    } else {
+      val df = new DecimalFormat("", new DecimalFormatSymbols(locale))
--- End diff --

let's name `df` `decimalFormat` ..


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-14 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r233353589
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -104,6 +106,14 @@ class UnivocityParser(
     requiredSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
   }
 
+  private val decimalParser = if (SQLConf.get.legacyDecimalParsing) {
+    (s: String) => new BigDecimal(s.replaceAll(",", ""))
--- End diff --

I made parsing more strict. Should be fine now.
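`DecimalFormat.parse` is lenient: it stops at the first character it cannot consume, which is how the leading `2015` of `2015-08-20 15:57:00.0` could be inferred as a decimal. One way to make parsing strict (a sketch of the idea, assuming a `ParsePosition` check; the PR may implement it differently) is to reject input that was only partially consumed:

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParsePosition;
import java.util.Locale;

public class StrictDecimalParse {
    static BigDecimal strictParse(String s, Locale locale) {
        DecimalFormat decimalFormat = new DecimalFormat("", new DecimalFormatSymbols(locale));
        decimalFormat.setParseBigDecimal(true);
        ParsePosition pos = new ParsePosition(0);
        BigDecimal result = (BigDecimal) decimalFormat.parse(s, pos);
        // Reject partially-consumed input, e.g. a timestamp whose leading
        // digits happen to look like a number.
        if (result == null || pos.getIndex() != s.length()) {
            throw new NumberFormatException("Cannot parse '" + s + "' as a decimal");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(strictParse("1000.001", Locale.US)); // 1000.001
        try {
            strictParse("2015-08-20 15:57:00.0", Locale.US);    // lenient parse would yield 2015
        } catch (NumberFormatException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```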


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-13 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r233242705
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -104,6 +106,14 @@ class UnivocityParser(
     requiredSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
   }
 
+  private val decimalParser = if (SQLConf.get.legacyDecimalParsing) {
+    (s: String) => new BigDecimal(s.replaceAll(",", ""))
--- End diff --

I made some changes but a few tests in `CSVSuite` began failing because the Decimal type is inferred from timestamps. For example, `2015` from `2015-08-20 15:57:00.0`.


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r233009506
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -104,6 +106,14 @@ class UnivocityParser(
     requiredSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
   }
 
+  private val decimalParser = if (SQLConf.get.legacyDecimalParsing) {
+    (s: String) => new BigDecimal(s.replaceAll(",", ""))
--- End diff --

@MaxGekk, it looks like we should do the same thing in the schema inference path as well.


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r232520110
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -149,8 +156,8 @@ class UnivocityParser(
 
     case dt: DecimalType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
-        val value = new BigDecimal(datum.replaceAll(",", ""))
-        Decimal(value, dt.precision, dt.scale)
+        val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
--- End diff --

Sounds good if that's not difficult.


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-11 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r232496879
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -149,8 +156,8 @@ class UnivocityParser(
 
     case dt: DecimalType => (d: String) =>
      nullSafeDatum(d, name, nullable, options) { datum =>
-        val value = new BigDecimal(datum.replaceAll(",", ""))
-        Decimal(value, dt.precision, dt.scale)
+        val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
--- End diff --

> so I suggested to add a configuration or fallback for now ...

What about a SQL config `spark.sql.legacy.decimalParsing.enabled` with the default value `false`?


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r232487851
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -149,8 +156,8 @@ class UnivocityParser(
 
     case dt: DecimalType => (d: String) =>
      nullSafeDatum(d, name, nullable, options) { datum =>
-        val value = new BigDecimal(datum.replaceAll(",", ""))
-        Decimal(value, dt.precision, dt.scale)
+        val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
--- End diff --

For instance, there was a similar attempt to change the date parsing library. I already know the difference is quite breaking, so I suggested adding a configuration or fallback for now. Probably we should similarly just document the behaviour change in the migration guide, but I am actually less sure about even this. Anyway, I will take another look shortly.


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r232487778
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -149,8 +156,8 @@ class UnivocityParser(
 
     case dt: DecimalType => (d: String) =>
      nullSafeDatum(d, name, nullable, options) { datum =>
-        val value = new BigDecimal(datum.replaceAll(",", ""))
-        Decimal(value, dt.precision, dt.scale)
+        val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
--- End diff --

Ah, right. The previous code would throw an exception anyway, I see. One thing I am a little unsure about is how different the behaviour is. For instance, it looks like the previous one handles sign characters as well (`+` and `-`).

Let me take a closer look. I think I need to.


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-11 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r232487559
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -149,8 +156,8 @@ class UnivocityParser(
 
     case dt: DecimalType => (d: String) =>
      nullSafeDatum(d, name, nullable, options) { datum =>
-        val value = new BigDecimal(datum.replaceAll(",", ""))
-        Decimal(value, dt.precision, dt.scale)
+        val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
--- End diff --

> is it safe that we assume this Number is BigDecimal?

I am not absolutely sure that it always returns `BigDecimal`. Found this at 
https://docs.oracle.com/javase/8/docs/api/java/text/DecimalFormat.html#parse(java.lang.String,java.text.ParsePosition):
```
If isParseBigDecimal() is true, values are returned as BigDecimal objects. The values are the ones constructed by BigDecimal.BigDecimal(String) for corresponding strings in locale-independent format. The special cases negative and positive infinity and NaN are returned as Double instances holding the values of the corresponding Double constants.
```
So, `isParseBigDecimal()` returns `true` when `setParseBigDecimal` was called with `true`, as in the PR.

> Looks there are some possibilities that it can return other types.

In that case we just fail with a cast exception and the record will be handled as a bad record. Or do you want to see a clearer message in the exception?
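The quoted Javadoc behaviour can be checked directly (a standalone illustration, not Spark code):

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

public class ParseBigDecimalCheck {
    public static void main(String[] args) throws ParseException {
        DecimalFormat decimalFormat =
            new DecimalFormat("", new DecimalFormatSymbols(Locale.US));

        // By default, parse() returns Long or Double, never BigDecimal.
        Number lenient = decimalFormat.parse("1000.001");
        System.out.println(lenient.getClass().getSimpleName()); // Double

        // With parseBigDecimal enabled, finite values come back as BigDecimal...
        decimalFormat.setParseBigDecimal(true);
        Number exact = decimalFormat.parse("1000.001");
        System.out.println(exact.getClass().getSimpleName()); // BigDecimal

        // ...but NaN (and the infinities) are still returned as Double
        // instances, which is why the cast is not guaranteed to succeed.
        Number nan = decimalFormat.parse(decimalFormat.format(Double.NaN));
        System.out.println(nan.getClass().getSimpleName()); // Double
    }
}
```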


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r232486778
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -9,6 +9,8 @@ displayTitle: Spark SQL Upgrading Guide
 
 ## Upgrading From Spark SQL 2.4 to 3.0
 
+  - Since Spark 3.0, to parse decimals in locale specific format from CSV, set the `locale` option to proper value.
--- End diff --

While we are here, let's format this migration guide item like the others, e.g.:

In Spark version 2.4 and earlier, it only deals with locale specific decimal notation like `,`. Since Spark 3.0, the locale can be set by `locale` and the default locale is 

Please feel free to change it as you think is right.


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r232486670
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -149,8 +156,8 @@ class UnivocityParser(
 
     case dt: DecimalType => (d: String) =>
      nullSafeDatum(d, name, nullable, options) { datum =>
-        val value = new BigDecimal(datum.replaceAll(",", ""))
-        Decimal(value, dt.precision, dt.scale)
+        val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
--- End diff --

@MaxGekk, is it safe to assume this `Number` is a `BigDecimal`? It looks like there are some cases where it can return other types.


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r232486599
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -104,6 +105,12 @@ class UnivocityParser(
     requiredSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
   }
 
+  private val decimalParser = {
+    val df = new DecimalFormat("", new DecimalFormatSymbols(options.locale))
--- End diff --

not a big deal but I would just name it `decimalFormat`


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r232484798
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CsvExpressionsSuite.scala ---
@@ -226,4 +227,17 @@ class CsvExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with P
         InternalRow(17836)) // number of days from 1970-01-01
     }
   }
+
+  test("parse decimals using locale") {
+    Seq("en-US", "ko-KR", "ru-RU", "de-DE").foreach { langTag =>
+      val schema = new StructType().add("d", DecimalType(10, 5))
+      val options = Map("locale" -> langTag, "sep" -> "|")
+      val expected = Decimal(1000.001, 10, 5)
+      val df = new DecimalFormat("", new DecimalFormatSymbols(Locale.forLanguageTag(langTag)))
+      val input = df.format(expected.toBigDecimal)
+      checkEvaluation(
+        CsvToStructs(schema, options, Literal.create(input), gmtId),
+        InternalRow(expected))
+    }
+  }
--- End diff --

@MaxGekk, there's `UnivocityParserSuite`.
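For reference, the value exercised by the test above formats quite differently per locale. A quick Java illustration (using `NumberFormat.getInstance`, which also emits grouping separators when formatting):

```java
import java.text.NumberFormat;
import java.util.Locale;

public class LocaleFormats {
    public static void main(String[] args) {
        double value = 1000.001;
        for (String langTag : new String[] {"en-US", "ko-KR", "ru-RU", "de-DE"}) {
            NumberFormat nf = NumberFormat.getInstance(Locale.forLanguageTag(langTag));
            System.out.println(langTag + " -> " + nf.format(value));
        }
        // en-US -> 1,000.001
        // de-DE -> 1.000,001
        // ru-RU uses a space (typically non-breaking) as the grouping separator
    }
}
```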


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r232484751
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -9,6 +9,8 @@ displayTitle: Spark SQL Upgrading Guide
 
 ## Upgrading From Spark SQL 2.4 to 3.0
 
+  - Since Spark 3.0, to parse decimals in locale specific format from CSV, set the `locale` option to proper value.
--- End diff --

@MaxGekk, it's not a behaviour change I guess.


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-08 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/22979

[SPARK-25977][SQL] Parsing decimals from CSV using locale

## What changes were proposed in this pull request?

In the PR, I propose using the `locale` option to parse decimals from CSV input. After the changes, `UnivocityParser` converts the input string to `BigDecimal` and then to Spark's `Decimal` by using `java.text.DecimalFormat`.

## How was this patch tested?

Added a test for the `en-US`, `ko-KR`, `ru-RU`, `de-DE` locales.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 decimal-parsing-locale

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22979.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22979


commit c567dccc487d9f495eb00ca412d5b48d7899f401
Author: Maxim Gekk 
Date:   2018-11-08T12:31:06Z

Add a test

commit 2b41eba3c666efc2b6860fd31e387839afe378d5
Author: Maxim Gekk 
Date:   2018-11-08T12:31:24Z

Fix decimal parsing

commit cf438ae2071f952eb72aab6dddf47c2aeb930d84
Author: Maxim Gekk 
Date:   2018-11-08T12:31:35Z

Add locale option

commit f9438c4b94d8a1dcf8b51d087cabae5a3c420db7
Author: Maxim Gekk 
Date:   2018-11-08T12:36:57Z

Updating the migration guide




---
