[ 
https://issues.apache.org/jira/browse/SPARK-20489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002762#comment-16002762
 ] 

Rick Moritz commented on SPARK-20489:
-------------------------------------

Okay, I had a look at timezones, and it turns out my driver was in a different 
timezone than my executors. After putting all machines into UTC, I now get the 
expected result.
It helped once I realized that unix timestamps aren't always computed relative 
to UTC: unix_timestamp parses using the JVM's local timezone (thanks Jacek, 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-Expression-UnixTimestamp.html).
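To illustrate the underlying behavior (a minimal sketch using plain java.text.SimpleDateFormat, which backs unix_timestamp in these Spark versions, rather than Spark itself): the same string parses to different epoch seconds depending on the JVM's timezone.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

object TimezoneDemo {
  def main(args: Array[String]): Unit = {
    // SimpleDateFormat parses in the JVM's default timezone unless one is
    // set explicitly, so identical strings yield different epochs on
    // machines configured with different zones.
    val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS")

    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    val utcEpoch = fmt.parse("2017-04-26T10:45:02.200").getTime / 1000

    fmt.setTimeZone(TimeZone.getTimeZone("America/New_York"))
    val nyEpoch = fmt.parse("2017-04-26T10:45:02.200").getTime / 1000

    // New York is on EDT (UTC-4) on this date, so the two epochs
    // differ by exactly 4 hours.
    println(nyEpoch - utcEpoch) // 14400
  }
}
```

This is exactly the discrepancy one hits when driver and executors parse the same loadDTS strings in different zones.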

I wonder what could be done to avoid these quite bizarre error situations. 
Especially in big data, where clusters could reasonably span timezones, and 
where drivers (in client mode, for example) can sit well outside the actual 
cluster, relying on the local timezone settings strikes me as quite fragile.

As an initial measure, I wonder how much effort it would take to make sure 
that, whenever dates are used, any discrepancy in timezone settings results in 
an exception being thrown. I would consider that a reasonable assertion, with 
not too much overhead.
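Such an assertion is not an existing Spark API; as a hedged sketch, the check itself could look like the hypothetical helper below, fed with zone IDs collected from the executors (e.g. via sc.parallelize(1 to numSlices).map(_ => TimeZone.getDefault.getID).collect()):

```scala
import java.util.TimeZone

object TimezoneCheck {
  // Hypothetical helper: fail fast if any executor reports a default
  // timezone different from the driver's.
  def assertConsistentTimezones(driverZone: String,
                                executorZones: Seq[String]): Unit = {
    val mismatched = executorZones.distinct.filterNot(_ == driverZone)
    if (mismatched.nonEmpty)
      throw new IllegalStateException(
        s"Driver timezone $driverZone differs from executor timezone(s): " +
          mismatched.mkString(", "))
  }

  def main(args: Array[String]): Unit = {
    // Passes silently when all zones agree:
    assertConsistentTimezones("UTC", Seq("UTC", "UTC"))
    // Throws IllegalStateException on a mismatch:
    // assertConsistentTimezones("UTC", Seq("UTC", "Europe/Paris"))
  }
}
```

Running this once at job start would have turned the silent wrong results above into an immediate, explainable failure.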

> Different results in local mode and yarn mode when working with dates (race 
> condition with SimpleDateFormat?)
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20489
>                 URL: https://issues.apache.org/jira/browse/SPARK-20489
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2
>         Environment: yarn-client mode in Zeppelin, Cloudera 
> Spark2-distribution
>            Reporter: Rick Moritz
>            Priority: Critical
>
> Running the following code (in Zeppelin, or spark-shell), I get different 
> results, depending on whether I am using local[*] -mode or yarn-client mode:
> {code:title=test case|borderStyle=solid}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import spark.implicits._
> val counter = 1 to 2
> val size = 1 to 3
> val sampleText = spark.createDataFrame(
>     sc.parallelize(size)
>     .map(Row(_)),
>     StructType(Array(StructField("id", IntegerType, nullable=false))
>         )
>     )
>     .withColumn("loadDTS",lit("2017-04-25T10:45:02.2"))
>     
> val rddList = counter.map(
>             count => sampleText
>             .withColumn("loadDTS2", 
> date_format(date_add(col("loadDTS"),count),"yyyy-MM-dd'T'HH:mm:ss.SSS"))
>             .drop(col("loadDTS"))
>             .withColumnRenamed("loadDTS2","loadDTS")
>             .coalesce(4)
>             .rdd
>         )
> val resultText = spark.createDataFrame(
>     spark.sparkContext.union(rddList),
>     sampleText.schema
> )
> val testGrouped = resultText.groupBy("id")
> val timestamps = testGrouped.agg(
>     max(unix_timestamp($"loadDTS", "yyyy-MM-dd'T'HH:mm:ss.SSS")) as 
> "timestamp"
> )
> val loadDateResult = resultText.join(timestamps, "id")
> val filteredresult = loadDateResult.filter($"timestamp" === 
> unix_timestamp($"loadDTS", "yyyy-MM-dd'T'HH:mm:ss.SSS"))
> filteredresult.count
> {code}
> The expected result, *3*, is what I obtain in local mode, but as soon as I 
> run fully distributed, I get *0*. If I increase size to {{1 to 32000}}, I do 
> get some results (depending on the size of counter) - none of which makes 
> any sense.
> Up to the application of the last filter, at first glance everything looks 
> okay, but then something goes wrong. Potentially this is due to lingering 
> re-use of SimpleDateFormats, but I can't get it to happen in a 
> non-distributed mode. The generated execution plan is the same in each case, 
> as expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
