Re: Very long pause/hang at end of execution

2016-11-16 Thread Pietro Pugni
I have the same issue with Spark 2.0.1, Java 1.8.x and pyspark. I also use
SparkSQL and JDBC. My application runs locally. It happens only if I connect
to the UI during Spark execution, even if I close the browser before the
execution ends. I observed this behaviour both on macOS Sierra and Red Hat 6.7.

On 16 Nov 2016, 3:09 AM, "Michael Johnson" wrote:

> The extremely long hang/pause has started happening again. I've been
> running on a small remote cluster, so I used the UI to grab thread dumps
> rather than doing it from the command line. There seems to be one executor
> still alive, along with the driver; I grabbed 4 thread dumps from each, a
> couple of seconds apart. I'd greatly appreciate any help tracking down
> what's going on! (I've attached them, but I can paste them somewhere if
> that's more convenient.)
>
> Thanks,
> Michael
>
>
>
>
> On Sunday, November 6, 2016 10:49 PM, Michael Johnson <
> mjjohnson@yahoo.com.INVALID> wrote:
>
>
> Hm. Something must have changed, as it was happening quite consistently
> and now I can't get it to reproduce. Thank you for the offer, and if it
> happens again I will try grabbing thread dumps and I will see if I can
> figure out what is going on.
>
>
> On Sunday, November 6, 2016 10:02 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>
> I doubt it's GC, as you mentioned that the pause is several minutes. Since
> it's reproducible in local mode, can you run the Spark application locally
> and, once your job is complete (and the application appears paused), take 5
> thread dumps (using jstack or jcmd on the local Spark JVM process) with a
> one-second delay between each dump and attach them? I can take a look.
>
> Thanks,
> Aniket
>
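A minimal sketch of capturing those dumps from the command line, assuming the JDK tools are on the PATH and the Spark JVM's process id has been looked up first (for example with jps -l); the pid below is a placeholder:

# Capture 5 thread dumps of the local Spark JVM, one second apart.
# `pid` is a placeholder; find the real value with `jps -l`.
# (`jcmd <pid> Thread.print` is an equivalent alternative to jstack.)
import subprocess
import time

pid = 12345  # hypothetical pid of the local Spark driver JVM

for i in range(1, 6):
    with open("threaddump_%d.txt" % i, "w") as out:
        subprocess.call(["jstack", str(pid)], stdout=out)
    time.sleep(1)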
> On Sun, Nov 6, 2016 at 2:21 PM Michael Johnson 
> wrote:
>
> Thanks; I tried looking at the thread dumps for the driver and the one
> executor that had that option in the UI, but I'm afraid I don't know how to
> interpret what I saw... I don't think it could be my code directly, since
> at this point my code has all completed. Could GC be taking that long?
>
> (I could also try grabbing the thread dumps and pasting them here, if that
> would help?)
>
> On Sunday, November 6, 2016 8:36 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>
> In order to know what's going on, you can study the thread dumps, either
> from the Spark UI or from any other thread dump analysis tool.
>
> Thanks,
> Aniket
>
> On Sun, Nov 6, 2016 at 1:31 PM Michael Johnson wrote:
>
> I'm doing some processing and then clustering of a small dataset (~150
> MB). Everything seems to work fine until the end; the last few lines of my
> program are log statements, but after printing those, nothing seems to
> happen for a long time... many minutes. I'm not usually patient enough to
> let it go, but I think one time when I did just wait, it took over an hour
> (and did eventually exit on its own). Any ideas on what's happening, or how
> to troubleshoot?
>
> (This happens both when running locally, in local mode, and on a small
> cluster with four 4-processor nodes, each with 15GB of RAM; in both cases
> the executors have 2GB+ of RAM, and none of the inputs/outputs on any of
> the stages is more than 75 MB...)
>
> Thanks,
> Michael


Re: TaskMemoryManager: Failed to allocate a page

2016-10-27 Thread Pietro Pugni
Thank you Davies,
this worked! But what are the consequences of setting 
spark.sql.autoBroadcastJoinThreshold=0?
Will it degrade or boost performance?
Thank you again
 Pietro
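A minimal sketch of the setting being discussed, assuming pyspark and Spark 2.x (the app and script names are placeholders); per Davies' note below, -1 or 0 disables automatic broadcast joins, so joins that would have been broadcast are executed as shuffle joins instead of shipping one side of the join to every executor:

from pyspark.sql import SparkSession

# Disable automatic broadcast joins (placeholder app name).
spark = (SparkSession.builder
         .appName("etl")
         .config("spark.sql.autoBroadcastJoinThreshold", -1)
         .getOrCreate())

# Equivalent at submit time:
#   spark-submit --conf spark.sql.autoBroadcastJoinThreshold=-1 etl.py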

> On 27 Oct 2016, at 18:54, Davies Liu <dav...@databricks.com> wrote:
> 
> I think this is caused by BroadcastHashJoin trying to use more memory
> than the driver has. Could you decrease
> spark.sql.autoBroadcastJoinThreshold? (-1 or 0 means disable it.)
> 
> On Thu, Oct 27, 2016 at 9:19 AM, Pietro Pugni <pietro.pu...@gmail.com> wrote:
>> I’m sorry, here’s the formatted message text:

Re: TaskMemoryManager: Failed to allocate a page

2016-10-27 Thread Pietro Pugni
I’m sorry, here’s the formatted message text:



I'm running an ETL process that joins table1 with other tables (CSV files), one
table at a time (for example table1 with table2, table1 with table3, and so on).
The result of each join is written to a PostgreSQL instance using JDBC.
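The actual write call isn't included in the message; a rough sketch of that kind of JDBC append (connection URL and table name below are placeholders; the two driver options are the ones that appear in the traceback further down):

def write_join_result(joined_df):
    # Append a joined DataFrame to a PostgreSQL table over JDBC.
    # URL and table are placeholders, not the author's values;
    # disableColumnSanitiser/reWriteBatchedInserts come from the traceback below.
    joined_df.write.jdbc(
        url="jdbc:postgresql://localhost:5432/mydb",
        table="myschema.mytable",
        mode="append",
        properties={
            "disableColumnSanitiser": "true",
            "reWriteBatchedInserts": "true",
        },
    )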

The entire process runs successfully if I use table2, table3 and table4. If I
add table5, table6 and table7, the process runs successfully with table5, table6
and table7, but as soon as it reaches table2 it starts displaying a lot of
messages like this:

16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page (33554432 bytes), try again.
16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page (33554432 bytes), try again.
16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page (33554432 bytes), try again.
...
16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page (33554432 bytes), try again.
...
Traceback (most recent call last):
  File "/Volumes/Data/www/beaver/tmp/ETL_Spark/etl.py", line 1200, in 
    sparkdf2database(flusso['sparkdf'], schema + "." + postgresql_tabella, "append")
  File "/Volumes/Data/www/beaver/tmp/ETL_Spark/etl.py", line 144, in sparkdf2database
    properties={"ApplicationName":info["nome"] + " - Scrittura della tabella " + dest, "disableColumnSanitiser":"true", "reWriteBatchedInserts":"true"}
  File "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 762, in jdbc
  File "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o301.jdbc.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
    at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
    at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenSemi(BroadcastHashJoinExec.scala:318)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:84)
    at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
    at org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:79)
    at org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:194)
    at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
    at org.apache.spark.sql.execution.RowDataSourceScanExec.consume(ExistingRDD.scala:150)
    at org.apache.spark.sql.execution.RowDataSourceScanExec.doProduce(ExistingRDD.scala:217)
    at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
    at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
    at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
    at org.apache.spark.sql.execution.RowDataSourceScanExec.produce(ExistingRDD.scala:150)
    at org.apache.spark.sql.execution.FilterExec.doProduce(basicPhysicalOperators.scala:113)
    at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
    at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
    at
Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-26 Thread Pietro Pugni
And what if the month abbreviation is upper-case? Java doesn't parse the
month name, for example if it's "JAN" instead of "Jan" or "DEC" instead of
"Dec". Is it possible to solve this issue without using UDFs?

Many thanks again
 Pietro


> On 24 Oct 2016, at 17:33, Pietro Pugni <pietro.pu...@gmail.com> wrote:
> 
> This worked without setting other options:
> spark/bin/spark-submit --conf 
> "spark.driver.extraJavaOptions=-Duser.language=en" test.py
> 
> Thank you again!
>  Pietro


Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Pietro Pugni
This worked without setting other options:
spark/bin/spark-submit --conf 
"spark.driver.extraJavaOptions=-Duser.language=en" test.py

Thank you again!
 Pietro

> On 24 Oct 2016, at 17:18, Sean Owen <so...@cloudera.com> wrote:
> 
> I believe it will be too late to set it there, and these are JVM flags, not 
> app or Spark flags. See spark.driver.extraJavaOptions and likewise for the 
> executor.
> 



Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Pietro Pugni
Thank you!

I tried again setting locale options in different ways, but they don't propagate
to the JVM. I tested these strategies (alone and all together):
- bin/spark-submit --conf "spark.executor.extraJavaOptions=-Duser.language=en
  -Duser.region=US -Duser.country=US -Duser.timezone=GMT" test.py
- spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.executor.extraJavaOptions", "-Duser.language=en -Duser.region=US -Duser.country=US -Duser.timezone=GMT") \
    .config("user.country", "US") \
    .config("user.region", "US") \
    .config("user.language", "en") \
    .config("user.timezone", "GMT") \
    .config("-Duser.country", "US") \
    .config("-Duser.region", "US") \
    .config("-Duser.language", "en") \
    .config("-Duser.timezone", "GMT") \
    .getOrCreate()
- export JAVA_OPTS="-Duser.language=en -Duser.region=US -Duser.country=US -Duser.timezone=GMT"
- export LANG="en_US.UTF-8"

After running export LANG="en_US.UTF-8" from the same terminal session I use to
launch spark-submit, running the locale command shows the correct values:
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

While my pyspark script is running, the Spark UI, under Environment -> Spark
Properties, shows the locale apparently set correctly:
- user.country: US
- user.language: en
- user.region: US
- user.timezone: GMT

but Environment -> System Properties still reports the system locale and not
the session locale I previously set:
- user.country: IT
- user.language: it
- user.timezone: Europe/Rome

Am I wrong, or do the options not propagate to the JVM correctly?
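One way to double-check what the driver JVM actually ended up with is to ask it directly from pyspark; a small diagnostic sketch (spark.sparkContext._jvm is an internal py4j handle, so treat this as a debugging aid only):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("locale-check").getOrCreate()
jvm = spark.sparkContext._jvm  # internal py4j gateway to the driver JVM

# These are the values SimpleDateFormat and friends will actually use,
# i.e. what the UI shows under Environment -> System Properties.
print(jvm.java.util.Locale.getDefault().toString())       # e.g. it_IT vs en_US
print(jvm.java.lang.System.getProperty("user.language"))
print(jvm.java.lang.System.getProperty("user.country"))
print(jvm.java.lang.System.getProperty("user.timezone"))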




> On 24 Oct 2016, at 16:49, Sean Owen <so...@cloudera.com> wrote:
> 
> This is more of an OS-level thing, but I think that if you can manage to set 
> -Duser.language=en to the JVM, it might do the trick.
> 
> I summarized what I think I know about this at
> https://issues.apache.org/jira/browse/SPARK-18076 and so we can decide what
> to do, if anything, there.
> 
> Sean
> 

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Pietro Pugni
Thank you, I'd appreciate that. I have no experience with Python, Java and
Spark, so the question can be translated to: "How can I set the JVM locale when
using spark-submit and pyspark?". Probably this is possible only by changing
the system default locale and not within the Spark session, right?

Thank you
 Pietro

> On 24 Oct 2016, at 14:51, Hyukjin Kwon wrote:
> 
> I am also interested in this issue. I will try to look into this too within
> the coming few days.
> 
> 2016-10-24 21:32 GMT+09:00 Sean Owen:
> I actually think this is a general problem with usage of DateFormat and 
> SimpleDateFormat across the code, in that it relies on the default locale of 
> the JVM. I believe this needs to, at least, default consistently to Locale.US 
> so that behavior is consistent; otherwise it's possible that parsing and 
> formatting of dates could work subtly differently across environments.
> 
> There's a similar question about some code that formats dates for the UI. 
> It's more reasonable to let that use the platform-default locale, but, I'd 
> still favor standardizing it I think.
> 
> Anyway, let me test it out a bit and possibly open a JIRA with this change 
> for discussion.
> 



pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-13 Thread Pietro Pugni
Hi there,
I opened a question on StackOverflow at this link: 
http://stackoverflow.com/questions/40007972/pyspark-doesnt-recognize-mmm-dateformat-pattern-in-spark-read-load-for-dates?noredirect=1#comment67297930_40007972

I didn’t get any useful answer, so I’m writing here hoping that someone can 
help me.

In short, I'm trying to read a CSV containing date columns stored using the
pattern "MMMdd". What doesn't work for me is "MMM". I've done some testing
and discovered that it's a localization issue. As you can read from the
StackOverflow question, I ran a simple Java program to parse the date "1989Dec31",
and it works only if I specify Locale.US in the SimpleDateFormat() constructor.
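The Java snippet itself isn't reproduced here; the same experiment can be driven from pyspark through the py4j gateway (spark.sparkContext._jvm is internal, so this is purely illustrative, and "yyyyMMMdd" is an assumed full pattern for values like 1989Dec31):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("locale-test").getOrCreate()
jvm = spark.sparkContext._jvm

# With an explicit US locale the month abbreviation "Dec" parses fine,
# whatever the JVM default locale is.
us = jvm.java.util.Locale("en", "US")
fmt_us = jvm.java.text.SimpleDateFormat("yyyyMMMdd", us)
print(fmt_us.parse("1989Dec31").toString())

# With the JVM default locale (e.g. it_IT) the same parse fails with a
# ParseException (surfaced as a Py4JJavaError), because "Dec" is not an
# Italian month abbreviation.
fmt_default = jvm.java.text.SimpleDateFormat("yyyyMMMdd")
print(fmt_default.parse("1989Dec31").toString())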

I would like pyspark to work. I tried setting a different locale from the console
(LANG="en_US"), but it doesn't work. I also tried setting it using the locale
package in Python.

So, is there a way to set the locale in Spark when using pyspark? The issue is
Java-related and not Python-related (the function that parses the dates is
invoked by spark.read.load(dateFormat="MMMdd", ...)). I don't want to use other solutions
in order to encode data because they are slower (from what I’ve seen so far).

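As a concrete sketch of that read (file name, column names and the full date pattern are assumptions; only the locale-sensitive "MMM" token comes from the question), loading such a CSV with an explicit schema would look roughly like this:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType

spark = SparkSession.builder.appName("csv-dates").getOrCreate()

# DateType columns are parsed with the CSV reader's dateFormat option,
# which is where the "MMM" month abbreviation ends up in SimpleDateFormat.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("birth_date", DateType(), True),
])

df = spark.read.load(
    "dates.csv",             # placeholder path
    format="csv",
    schema=schema,
    header="true",
    dateFormat="yyyyMMMdd",  # assumed full pattern for values like 1989Dec31
)
df.show()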
Thank you
 Pietro