[ https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750292#comment-16750292 ]
Chaitanya P Chandurkar edited comment on SPARK-17914 at 1/23/19 6:25 PM:
-------------------------------------------------------------------------

I'm still seeing this issue in Spark 2.4.0 when using the from_json() function. For ISO-8601 Zulu-format datetimes, the fractional seconds are not interpreted correctly beyond a certain number of digits: every digit after the third adds extra seconds to the parsed datetime. For example, the datetime "2019-01-23T17:50:29.9991Z", when parsed with Spark's built-in from_json() function, results in "2019-01-23T17:50:38.991+0000" (note the extra seconds added). If I'm not wrong, from_json() internally uses the Jackson JSON library; I'm not sure whether the bug is in Jackson or in Spark.

{code:java}
// Create schema to parse the JSON
val sc = StructType( StructField( "date", TimestampType ) :: Nil )
{code}

{code:java}
// Sample JSON parsing using the schema created above
Seq( """{"date": "2019-01-22T18:33:39.134232733Z"}""" )
  .toDF( "data" )
  .withColumn( "parsed", from_json( $"data", sc ) )
{code}

This results in the date being "2019-01-24T07:50:51.733+0000" (note the difference of 2 days).
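A possible workaround, until the parsing is fixed, is to normalize the string to microsecond precision before Spark casts it. This is a minimal sketch only (it uses plain java.time rather than Spark, and the variable names are illustrative, not from Spark's code):

{code:java}
import java.time.Instant
import java.time.temporal.ChronoUnit

// Truncate a nanosecond-precision ISO-8601 instant to microsecond
// precision before letting Spark cast it to TimestampType.
val raw = "2019-01-22T18:33:39.134232733Z"
val normalized = Instant.parse(raw).truncatedTo(ChronoUnit.MICROS).toString
println(normalized)  // 2019-01-22T18:33:39.134232Z
{code}

Applied as a UDF (or a regexp_replace on the raw string) before the cast, this keeps the timestamp value correct at the cost of sub-microsecond precision, which TimestampType cannot represent anyway.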
> Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-17914
>                 URL: https://issues.apache.org/jira/browse/SPARK-17914
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Oksana Romankova
>            Assignee: Anton Okolnychyi
>            Priority: Major
>             Fix For: 2.2.0, 2.3.0
>
> In some cases when timestamps contain nanoseconds they will be parsed incorrectly.
> Examples:
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It assumes that only a 6-digit fraction of a second will be passed.
> With this being the case, I would suggest either discarding nanoseconds automatically, or throwing an exception prompting to pre-format timestamps to microsecond precision before casting to Timestamp.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
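The misparse pattern in the examples above is consistent with a parser that reads the fractional field as a plain integer count of microseconds regardless of how many digits it has. This is an illustrative sketch only, not Spark's actual DateTimeUtils.stringToTimestamp() code, and the function names are invented for the example:

{code:java}
// Buggy interpretation: treat the fraction digits as microseconds as-is.
def naiveMicros(frac: String): Long = frac.toLong

// Intended interpretation: scale the fraction to exactly 6 digits,
// truncating any sub-microsecond precision.
def scaledMicros(frac: String): Long = (frac + "000000").take(6).toLong

println(naiveMicros("0034567"))   // 34567  -> .034567, the buggy output
println(scaledMicros("0034567"))  // 3456   -> .003456, the intended value
{code}

This reproduces both reported examples: "0034567" becomes .034567 and "000345678" becomes .345678 under the naive reading, exactly as in the issue description.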