[ https://issues.apache.org/jira/browse/SPARK-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-18076.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0

Issue resolved by pull request 15610
[https://github.com/apache/spark/pull/15610]

> Fix default Locale used in DateFormat, NumberFormat to Locale.US
> ----------------------------------------------------------------
>
>                 Key: SPARK-18076
>                 URL: https://issues.apache.org/jira/browse/SPARK-18076
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, Spark Core, SQL
>    Affects Versions: 2.0.1
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>              Labels: releasenotes
>             Fix For: 2.1.0
>
>
> Many parts of the code use {{DateFormat}} and {{NumberFormat}} instances.
> Although the behavior of these formats is mostly determined by things like
> format strings, the exact behavior can vary according to the platform's
> default locale. Although the locale defaults to "en", it can be set to
> something else by env variables. If it is, the same code can succeed or
> fail based on the locale alone:
> {code}
> import java.text._
> import java.util._
> def parse(s: String, l: Locale) = new SimpleDateFormat("yyyyMMMdd", l).parse(s)
> parse("1989Dec31", Locale.US)
> Sun Dec 31 00:00:00 GMT 1989
> parse("1989Dec31", Locale.UK)
> Sun Dec 31 00:00:00 GMT 1989
> parse("1989Dec31", Locale.CHINA)
> java.text.ParseException: Unparseable date: "1989Dec31"
>   at java.text.DateFormat.parse(DateFormat.java:366)
>   at .parse(<console>:18)
>   ... 32 elided
> parse("1989Dec31", Locale.GERMANY)
> java.text.ParseException: Unparseable date: "1989Dec31"
>   at java.text.DateFormat.parse(DateFormat.java:366)
>   at .parse(<console>:18)
>   ... 32 elided
> {code}
> Where not otherwise specified, I believe all instances in the code should
> default to some fixed value, and that should probably be {{Locale.US}}. This
> matches the JVM's default, and specifies both language ("en") and region
> ("US") to remove ambiguity.
> This most closely matches the current behavior of the code (unless the
> default locale was changed), because it currently defaults to "en".
> This affects SQL date/time functions. At the moment, the only SQL function
> that lets the user specify language/country is "sentences", which is
> consistent with Hive.
> It affects dates passed in the JSON API.
> It affects some strings rendered in the UI, potentially. Although this isn't
> a correctness issue, there may be an argument for not letting that vary (?)
> It affects a bunch of instances where dates are formatted into strings for
> things like IDs or file names, which is far less likely to cause a problem,
> but worth making consistent.
> The other occurrences are in tests.
> The downside to this change is also its upside: the behavior doesn't depend
> on the default JVM locale, but it also can't be affected by the default JVM
> locale. For example, if you wanted to parse some dates in a way that
> depended on a non-US locale (not just the format string), that would no
> longer be possible. There's no means of specifying this, for example, in SQL
> functions for parsing dates. However, controlling this by globally changing
> the locale isn't exactly great either.
> The purpose of this change is to make the current default behavior
> deterministic and fixed. PR coming.
> CC [~hyukjin.kwon]

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
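
For readers following along, here is a minimal sketch of the idea in the description: pass {{Locale.US}} explicitly when constructing formatters instead of relying on the JVM default. It is not taken from the pull request. The object name {{FixedLocaleSketch}} and the helpers {{usDateFormat}} and {{usNumberFormat}} are hypothetical, and pinning the time zone to UTC is an extra assumption made only so the printed value does not depend on the machine.

```scala
import java.text.{NumberFormat, SimpleDateFormat}
import java.util.{Locale, TimeZone}

// Illustrative only: these helper names are hypothetical, not Spark APIs.
object FixedLocaleSketch {

  // Build a date format that ignores the JVM's default locale.
  def usDateFormat(pattern: String): SimpleDateFormat = {
    val fmt = new SimpleDateFormat(pattern, Locale.US)
    // Assumption for this sketch: fix the time zone too, so results are reproducible.
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    fmt
  }

  // Build a number format that ignores the JVM's default locale.
  def usNumberFormat(): NumberFormat = NumberFormat.getInstance(Locale.US)

  def main(args: Array[String]): Unit = {
    // Simulate a JVM whose default locale is not US.
    Locale.setDefault(Locale.GERMANY)

    // Still parses, because the formatter was given Locale.US explicitly.
    val date = usDateFormat("yyyyMMMdd").parse("1989Dec31")
    println(date.getTime)

    // Uses US grouping and decimal separators regardless of the default locale.
    println(usNumberFormat().format(1234.5))
  }
}
```

Compare this with the {{parse}} helper in the description, which fails when handed {{Locale.GERMANY}}. Here the default locale is changed, but the formatter never consults it.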