[jira] [Comment Edited] (ARROW-13625) [C++][CSV] Timestamp parsing should accept any valid ISO 8601 without requiring custom parse strings

Joris Van den Bossche (Jira) Thu, 10 Feb 2022 02:18:04 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-13625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490074#comment-17490074
 ]


Joris Van den Bossche edited comment on ARROW-13625 at 2/10/22, 10:17 AM:
--------------------------------------------------------------------------

I think this issue is indeed resolved. By default, Arrow will now infer a 
"timestamp with timezone UTC" type for such data:

{code:python}
import io
import pyarrow as pa
from pyarrow import csv

s = """a
2021-08-11T17:39:50-04:00"""

>>> csv.read_csv(io.BytesIO(s.encode()))
pyarrow.Table
a: timestamp[s, tz=UTC]
----
a: [[2021-08-11 21:39:50]]
{code}

which seems like a good default behaviour?

It indeed fails when specifying "timestamp" as the type for that column:

{code}
>>> csv.read_csv(io.BytesIO(s.encode()),
...     convert_options=csv.ConvertOptions(column_types={'a': pa.timestamp('ns')}))
...
ArrowInvalid: In CSV column #0: CSV conversion error to timestamp[ns]: expected no zone offset in '2021-08-11T17:39:50-04:00'
{code}

But that is probably fine? Otherwise it would be ambiguous whether the 
tz-naive timestamp should take the UTC value or the wall-clock value as 
printed in the string. 
The error message is also clear that no timezone offset was expected. We could 
maybe improve it further by adding something like "... for a timestamp without 
timezone", to stress that the failure is because the specified type has no 
timezone. 
(The reverse case, expecting a zone offset when there is none, already has an 
expanded message: 
https://github.com/apache/arrow/blob/c0bae8daea2ace51c64f6db38cfb3d04c5bed657/cpp/src/arrow/csv/converter.cc#L382-L391)
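
To make the ambiguity concrete, here is a small sketch using only Python's 
standard library (not Arrow itself; the variable names are just for 
illustration). It shows the two different tz-naive values one could reasonably 
derive from the same string:

```python
from datetime import datetime, timezone

raw = "2021-08-11T17:39:50-04:00"
parsed = datetime.fromisoformat(raw)  # tz-aware, offset -04:00

# Candidate 1: normalize to UTC, then drop the timezone information.
utc_naive = parsed.astimezone(timezone.utc).replace(tzinfo=None)

# Candidate 2: keep the wall-clock value exactly as printed in the string.
wall_naive = parsed.replace(tzinfo=None)

print(utc_naive)   # 2021-08-11 21:39:50
print(wall_naive)  # 2021-08-11 17:39:50
```

Both are defensible, and they differ by four hours, so raising an error 
instead of silently picking one seems the safer choice.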

And it works fine if you specify a timestamp type with timezone for the column:

{code}
>>> csv.read_csv(io.BytesIO(s.encode()),
...     convert_options=csv.ConvertOptions(column_types={'a': pa.timestamp('ns', tz="UTC")}))
pyarrow.Table
a: timestamp[ns, tz=UTC]
----
a: [[2021-08-11 21:39:50.000000000]]
{code}

I don't think it is directly related to ARROW-14442 because here it is about 
parsing a string, while ARROW-14442 is about converting an R object that 
already represents some form of timestamp.

bq. I think the timestamp Neil is referring to contains the timezone as offset 
from UTC (the %z in R's format) which doesn't seem to be recognised by arrow 
(which recognises only string timezones - R's %Z).

Arrow actually does "support" such offset timezones ({{%z}}), at least in the 
format specification (see 
https://github.com/apache/arrow/blob/c0bae8daea2ace51c64f6db38cfb3d04c5bed657/format/Schema.fbs#L341-L351). 
However, we currently don't really have support for that kind of timezone in 
any kernel.

So specifying such an offset timezone works:

{code}
>>> table = csv.read_csv(io.BytesIO(s.encode()),
...     convert_options=csv.ConvertOptions(column_types={'a': pa.timestamp('ns', tz="+05:00")}))
>>> table
pyarrow.Table
a: timestamp[ns, tz=+05:00]
----
a: [[2021-08-11 21:39:50.000000000]]
{code}

(Now, this is a bit confusing, because the timezone doesn't actually matter 
here: I could also have used {{tz="blabla"}} and it would have worked just as 
well. It doesn't matter because once the column is a timestamp-with-timezone 
type, Arrow just converts the string to the underlying UTC value, which is the 
same regardless of the actual timezone.)
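
The same point can be sketched without Arrow, using only the standard library 
(so this is an analogy, not Arrow's actual code path): the zone attached to a 
value changes how it is displayed, but not the underlying instant.

```python
from datetime import datetime, timedelta, timezone

parsed = datetime.fromisoformat("2021-08-11T17:39:50-04:00")

# View the same instant under two different timezone "labels".
as_utc = parsed.astimezone(timezone.utc)
as_plus5 = parsed.astimezone(timezone(timedelta(hours=5)))

# Seconds since the Unix epoch, which is what a tz-aware column stores.
epoch_utc = int(as_utc.timestamp())
epoch_plus5 = int(as_plus5.timestamp())

print(epoch_utc == epoch_plus5)      # True: the stored value is identical
print(as_utc.hour, as_plus5.hour)    # 21 2: only the local fields differ
```

That is why the parsing step never needs to look the timezone up, while 
anything that extracts local fields (such as the hour) does.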

But once you do an operation, it currently fails:

{code}
>>> import pyarrow.compute as pc
>>> pc.hour(table['a'])
...
ArrowInvalid: Cannot locate timezone '+05:00': +05:00 not found in timezone database
{code}

The JIRA for this is ARROW-14477
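
As a rough illustration of what resolving such a timezone would involve, here 
is a standard-library sketch. {{parse_fixed_offset}} is a hypothetical helper 
written for this example, not an Arrow function, and it only assumes offsets of 
the form {{+HH:MM}} / {{-HH:MM}}:

```python
from datetime import datetime, timedelta, timezone


def parse_fixed_offset(tz: str) -> timezone:
    """Turn a '+HH:MM' / '-HH:MM' string into a fixed-offset timezone.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    if len(tz) != 6 or tz[0] not in "+-" or tz[3] != ":":
        raise ValueError(f"not a fixed-offset timezone: {tz!r}")
    sign = 1 if tz[0] == "+" else -1
    hours, minutes = int(tz[1:3]), int(tz[4:6])
    return timezone(sign * timedelta(hours=hours, minutes=minutes))


# The UTC value stored in the column from the example above.
utc_value = datetime(2021, 8, 11, 21, 39, 50, tzinfo=timezone.utc)

# Shift to the fixed offset before extracting the local hour.
local = utc_value.astimezone(parse_fixed_offset("+05:00"))
local_hour = local.hour

print(local)       # 2021-08-12 02:39:50+05:00
print(local_hour)  # 2
```

No timezone database lookup is needed for this kind of timezone, since the 
offset is fully described by the string itself.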


> [C++][CSV] Timestamp parsing should accept any valid ISO 8601 without 
> requiring custom parse strings
> ----------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13625
>                 URL: https://issues.apache.org/jira/browse/ARROW-13625
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Neal Richardson
>            Priority: Major
>             Fix For: 8.0.0
>
>
> I was trying to read in some git logs and got this parse error for a column I 
> had declared as timestamp type:
> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: 
> invalid value '2021-08-11T17:39:50-04:00'
> This is valid ISO 8601 and is what git log produces with the {{I}} "strict 
> ISO 8601 format" option (https://git-scm.com/docs/pretty-formats). 
> I see mentioned on ARROW-10343 that timezone indicators are not supported--is 
> that still true? And I recognize that it's not trivial because a timestamp 
> array has to have the same timezone for all values, so if some rows in this 
> CSV had different timezones listed, we would have to handle that (converting 
> everything to UTC is probably the most useful thing but technically loses 
> information).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
