[jira] [Commented] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset

Hyukjin Kwon (JIRA) Thu, 15 Sep 2016 22:19:33 -0700

    [ https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495408#comment-15495408 ]


Hyukjin Kwon commented on SPARK-17545:
--------------------------------------

Hi [~nbeyer], the ISO format currently supported follows 
https://www.w3.org/TR/NOTE-datetime

That note gives

{quote}
1997-07-16T19:20:30.45+01:00
{quote}

as a valid ISO format, where the time zone designator is defined as

{quote}
TZD  = time zone designator (Z or +hh:mm or -hh:mm)
{quote}

To make sure, I double-checked the full ISO 8601:2004 specification at 
http://www.uai.cl/images/sitio/biblioteca/citas/ISO_8601_2004en.pdf

It says:

{quote}
...
the expression shall either be completely in basic format, in which case the 
minimum number of separators necessary for the required expression is used, or 
completely in extended format, in which case additional separators shall be used
...
{quote}

where the basic format is {{20160707T211822+0300}}, whereas the extended format 
is {{2016-07-07T21:18:22+03:00}}.
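
To make the two forms concrete, here is a minimal Python sketch that formats the same instant both ways. Python is used only for illustration (the reporter's data came from a Python system), and the variable names are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sample instant: 2016-07-07 21:18:22 at UTC+03:00.
ts = datetime(2016, 7, 7, 21, 18, 22, tzinfo=timezone(timedelta(hours=3)))

# Basic format: no separators in the date, the time, or the offset.
basic = ts.strftime("%Y%m%dT%H%M%S%z")

# Extended format: separators everywhere, including the colon in the offset.
extended = ts.isoformat()

print(basic)     # 20160707T211822+0300
print(extended)  # 2016-07-07T21:18:22+03:00
```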

In addition, the basic format even seems to be discouraged in plain text:

{quote}
NOTE : The basic format should be avoided in plain text.
{quote}

Therefore, {{2016-07-07T21:18:22+03:00}} is valid ISO 8601:2004, whereas 
{{2016-07-07T21:18:22+0300}} is not, because the zone designator may not be in 
basic format when the date and time of day are in extended format.
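
As a possible workaround on the data side, the mixed form could be rewritten before the file reaches Spark. The following is only a sketch, not a Spark API: the function name {{to_extended_offset}} and the regular expression are my own assumptions, and it only handles a trailing +hhmm or -hhmm offset after an extended-format time of day.

```python
import re

# An extended-format time of day (hh:mm:ss, optional fraction) followed by
# a basic-format offset (+hhmm or -hhmm) at the end of the string.
_MIXED = re.compile(r"(T\d{2}:\d{2}:\d{2}(?:\.\d+)?[+-]\d{2})(\d{2})$")


def to_extended_offset(value: str) -> str:
    """Insert the missing colon into a trailing basic-format offset.

    Strings that do not match the mixed form are returned unchanged.
    """
    return _MIXED.sub(r"\1:\2", value)


print(to_extended_offset("2015-07-20T15:09:23.736-0500"))
# 2015-07-20T15:09:23.736-05:00
```

Values that already use the extended offset, or that end in {{Z}}, pass through untouched, so the function can be applied to a whole column without first checking each value.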




> Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
> -----------------------------------------------------------------------
>
>                 Key: SPARK-17545
>                 URL: https://issues.apache.org/jira/browse/SPARK-17545
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nathan Beyer
>
> When parsing a CSV with a date/time column that contains an ISO 8601 variant 
> that doesn't include a colon in the offset, casting to Timestamp fails.
> Here's some simple example CSV content.
> {quote}
> time
> "2015-07-20T15:09:23.736-0500"
> "2015-07-20T15:10:51.687-0500"
> "2015-11-21T23:15:01.499-0600"
> {quote}
> Here's the stack trace that results from processing this data.
> {quote}
> 16/09/14 15:22:59 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600
>       at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown Source)
>       at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown Source)
>       at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown Source)
>       at org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown Source)
>       at javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>       at javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>       at javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>       at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>       at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
> {quote}
> Somewhat related, I believe Python standard libraries can produce this form 
> of zone offset. The system I got the data from is written in Python.
> https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior



