[
https://issues.apache.org/jira/browse/ARROW-15883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-15883:
------------------------------------------
Description:
Currently, we can't parse "our own" string representation of a timestamp array
with the timestamp parser {{strptime}}:
{code:python}
import datetime
import pyarrow as pa
import pyarrow.compute as pc
>>> pa.array([datetime.datetime(2022, 3, 5, 9)])
<pyarrow.lib.TimestampArray object at 0x7f00c1d53dc0>
[
2022-03-05 09:00:00.000000
]
# trying to parse the above representation as string
>>> pc.strptime(["2022-03-05 09:00:00.000000"], format="%Y-%m-%d %H:%M:%S", unit="us")
...
ArrowInvalid: Failed to parse string: '2022-03-05 09:00:00.000000' as a scalar of type timestamp[us]
{code}
The reason for this is the fractional second part, so the following works:
{code:python}
>>> pc.strptime(["2022-03-05 09:00:00"], format="%Y-%m-%d %H:%M:%S", unit="us")
<pyarrow.lib.TimestampArray object at 0x7f00c1d6f940>
[
2022-03-05 09:00:00.000000
]
{code}
I think this fails because {{strptime}} only supports parsing seconds as an
integer (https://man7.org/linux/man-pages/man3/strptime.3.html).
But, it creates a strange situation where the timestamp parser cannot parse the
representation we use for timestamps.
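To illustrate one way fractional seconds could be layered on top of an integer-seconds parser, here is a minimal sketch in plain Python. It is not the Arrow C++ implementation: the helper name {{parse_with_fraction}} is hypothetical, and it assumes the only "." in the input is the separator before the fractional digits.

```python
from datetime import datetime, timedelta


def parse_with_fraction(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse `text` using an integer-seconds format, accepting an optional
    trailing fractional-second part such as '.000000'.

    Hypothetical helper for illustration only. Assumes the first '.' in
    `text` (if any) introduces the fractional seconds.
    """
    whole, sep, frac = text.partition(".")
    # The integer-seconds part is handled by the regular strptime.
    parsed = datetime.strptime(whole, fmt)
    if not sep:
        return parsed
    if not (frac.isascii() and frac.isdigit()):
        raise ValueError(f"invalid fractional seconds: {frac!r}")
    # Pad or truncate the digits to microsecond precision.
    micros = int(frac[:6].ljust(6, "0"))
    return parsed + timedelta(microseconds=micros)


ts = parse_with_fraction("2022-03-05 09:00:00.000000")
```

For comparison, Python's own {{datetime.strptime}} handles this case through the {{%f}} directive (format {{"%Y-%m-%d %H:%M:%S.%f"}}), which the C {{strptime}} linked above does not list.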
In addition, for CSV we have a custom ISO parser (used by default), so when
parsing the strings while reading a CSV file, the same string with fractional
seconds does work:
{code:python}
import io
from pyarrow import csv
s = b"""a
2022-03-05 09:00:00.000000"""
>>> csv.read_csv(io.BytesIO(s))
pyarrow.Table
a: timestamp[ns]
----
a: [[2022-03-05 09:00:00.000000000]]
{code}
I realize that you can use the generic "cast" for doing this string parsing:
{code:python}
>>> pc.cast(["2022-03-05 09:00:00.000000"], pa.timestamp("us"))
<pyarrow.lib.TimestampArray object at 0x7f00c1d53d60>
[
2022-03-05 09:00:00.000000
]
{code}
But this was not the first approach I thought of (I think it is quite typical to
first think of {{strptime}}, and it is confusing that it doesn't work; the
error message is also not helpful).
cc [~apitrou] [~rokm]
was:
Currently, we can't parse "our own" string representation of a timestamp array
with the timestamp parser {{strptime}}:
{code:python}
import datetime
import pyarrow as pa
import pyarrow.compute as pc
>>> pa.array([datetime.datetime(2022, 3, 5, 9)])
<pyarrow.lib.TimestampArray object at 0x7f00c1d53dc0>
[
2022-03-05 09:00:00.000000
]
# trying to parse the above representation as string
>>> pc.strptime(["2022-03-05 09:00:00.000000"], format="%Y-%m-%d %H:%M:%S", unit="us")
...
ArrowInvalid: Failed to parse string: '2022-03-05 09:00:00.000000' as a scalar of type timestamp[us]
{code}
The reason for this is the fractional second part, so the following works:
{code:python}
>>> pc.strptime(["2022-03-05 09:00:00"], format="%Y-%m-%d %H:%M:%S", unit="us")
<pyarrow.lib.TimestampArray object at 0x7f00c1d6f940>
[
2022-03-05 09:00:00.000000
]
{code}
I think this fails because {{strptime}} only supports parsing seconds as an
integer (https://man7.org/linux/man-pages/man3/strptime.3.html).
But, it creates a strange situation where the timestamp parser cannot parse the
representation we use for timestamps.
In addition, for CSV we have a custom ISO parser (used by default), so when
parsing the strings while reading a CSV file, the same string with fractional
seconds does work:
{code:python}
import io
from pyarrow import csv
s = b"""a
2022-03-05 09:00:00.000000"""
>>> csv.read_csv(io.BytesIO(s))
pyarrow.Table
a: timestamp[ns]
----
a: [[2022-03-05 09:00:00.000000000]]
{code}
cc [~apitrou] [~rokm]
> [C++] Support for fractional seconds in strptime() for ISO format?
> ------------------------------------------------------------------
>
> Key: ARROW-15883
> URL: https://issues.apache.org/jira/browse/ARROW-15883
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: kernel
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)