Thank you!

I tried again setting the locale options in different ways, but they don't
propagate to the JVM. I tested these strategies (alone and all together; see
also the sketch after the list):
- bin/spark-submit --conf "spark.executor.extraJavaOptions=-Duser.language=en
-Duser.region=US -Duser.country=US -Duser.timezone=GMT" test.py
- spark = SparkSession \
        .builder \
        .appName("My app") \
        .config("spark.executor.extraJavaOptions", "-Duser.language=en 
-Duser.region=US -Duser.country=US -Duser.timezone=GMT") \
        .config("user.country", "US") \
        .config("user.region", "US") \
        .config("user.language", "en") \
        .config("user.timezone", "GMT") \
        .config("-Duser.country", "US") \
        .config("-Duser.region", "US") \
        .config("-Duser.language", "en") \
        .config("-Duser.timezone", "GMT") \
        .getOrCreate()
- export JAVA_OPTS="-Duser.language=en -Duser.region=US -Duser.country=US
-Duser.timezone=GMT"
- export LANG="en_US.UTF-8"
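
One thing I could not verify: as far as I understand, spark.executor.extraJavaOptions
only reaches the executor JVMs, while in local mode the parsing happens in the
driver JVM. A sketch of passing the same flags to the driver too (an assumption
on my side, untested):

bin/spark-submit \
  --driver-java-options "-Duser.language=en -Duser.country=US -Duser.timezone=GMT" \
  --conf "spark.executor.extraJavaOptions=-Duser.language=en -Duser.country=US -Duser.timezone=GMT" \
  test.py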

After running export LANG="en_US.UTF-8" from the same terminal session I use
to launch spark-submit, running the locale command shows the correct values:
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

While my pyspark script is running, the Spark UI shows the locale correctly
set under Environment -> Spark Properties:
- user.country: US
- user.language: en
- user.region: US
- user.timezone: GMT

but Environment -> System Properties still reports the system locale, not the
session locale I set:
- user.country: IT
- user.language: it
- user.timezone: Europe/Rome

Am I wrong, or do the options not propagate to the JVM correctly?
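
For reference, this is a minimal way to inspect the effective defaults of the
driver JVM from pyspark; note that _jvm goes through the internal py4j
gateway, so it is a diagnostic hack, not public API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._jvm
# Locale and timezone the driver JVM actually uses
print(jvm.java.util.Locale.getDefault().toString())       # e.g. it_IT
print(jvm.java.lang.System.getProperty("user.timezone"))  # e.g. Europe/Rome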

> On 24 Oct 2016, at 16:49, Sean Owen <so...@cloudera.com> wrote:
> 
> This is more of an OS-level thing, but I think that if you can manage to set 
> -Duser.language=en to the JVM, it might do the trick.
> 
> I summarized what I think I know about this at
> https://issues.apache.org/jira/browse/SPARK-18076 and so we can decide what
> to do, if anything, there.
> 
> Sean
> 
> On Mon, Oct 24, 2016 at 3:08 PM Pietro Pugni <pietro.pu...@gmail.com> wrote:
> Thank you, I'd appreciate that. I have no experience with Python, Java and
> Spark, so the question can be translated to: "How can I set the JVM locale
> when using spark-submit and pyspark?". Probably this is possible only by
> changing the system default locale and not within the Spark session, right?
> 
> Thank you
>  Pietro
> 
>> On 24 Oct 2016, at 14:51, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>> 
>> I am also interested in this issue. I will try to look into this too within
>> the coming few days.
>> 
>> 2016-10-24 21:32 GMT+09:00 Sean Owen <so...@cloudera.com>:
>> I actually think this is a general problem with usage of DateFormat and 
>> SimpleDateFormat across the code, in that it relies on the default locale of 
>> the JVM. I believe this needs to, at least, default consistently to 
>> Locale.US so that behavior is consistent; otherwise it's possible that 
>> parsing and formatting of dates could work subtly differently across 
>> environments.
>> 
>> There's a similar question about some code that formats dates for the UI.
>> It's more reasonable to let that use the platform-default locale, but I'd
>> still favor standardizing it, I think.
>> 
>> Anyway, let me test it out a bit and possibly open a JIRA with this change 
>> for discussion.
>> 
>> On Mon, Oct 24, 2016 at 1:03 PM pietrop <pietro.pu...@gmail.com> wrote:
>> Hi there,
>> I opened a question on StackOverflow at this link:
>> http://stackoverflow.com/questions/40007972/pyspark-doesnt-recognize-mmm-dateformat-pattern-in-spark-read-load-for-dates?noredirect=1#comment67297930_40007972
>> 
>> I didn’t get any useful answer, so I’m writing here hoping that someone can
>> help me.
>> 
>> In short, I'm trying to read a CSV containing date columns stored using the
>> pattern "yyyyMMMdd". What doesn't work for me is "MMM". I've done some
>> testing and discovered that it's a localization issue. As you can read in
>> the StackOverflow question, I ran a simple Java program to parse the date
>> "1989Dec31", and it works only if I specify Locale.US in the
>> SimpleDateFormat() constructor.
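>>
>> (A sketch of the same check done from pyspark through the internal py4j
>> gateway; _jvm is not public API, so this is only for diagnosis:)
>>
>> jvm = spark.sparkContext._jvm
>> fmt_us = jvm.java.text.SimpleDateFormat("yyyyMMMdd", jvm.java.util.Locale.US)
>> print(fmt_us.parse("1989Dec31"))       # parses fine
>> fmt_default = jvm.java.text.SimpleDateFormat("yyyyMMMdd")
>> print(fmt_default.parse("1989Dec31"))  # raises ParseException under an
>>                                        # it_IT default locale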
>> 
>> I would like pyspark to work. I tried setting a different locale from the
>> console (LANG="en_US"), but it doesn't work. I also tried setting it using
>> the locale package from Python.
>> 
>> So, is there a way to set the locale in Spark when using pyspark? The issue
>> is Java-related, not Python-related (the function that parses the dates is
>> invoked by spark.read.load(dateFormat="yyyyMMMdd", …)). I don't want to use
>> other solutions to encode the data because they are slower (from what I've
>> seen so far).
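>>
>> (For context, the full read call looks roughly like this; the file name and
>> the schema are placeholders of mine:)
>>
>> from pyspark.sql.types import StructType, StructField, DateType
>>
>> schema = StructType([StructField("d", DateType(), True)])
>> df = spark.read.load("dates.csv", format="csv", schema=schema,
>>                      dateFormat="yyyyMMMdd")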
>> 
>> Thank you
>> Pietro
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-doesn-t-recognize-MMM-dateFormat-pattern-in-spark-read-load-for-dates-like-1989Dec31-and-31D9-tp27951.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
>> 
> 
