> *Question 1*: For my own understanding: what purpose does the
> millisecond date64 type serve?
I don't actually know the answer to this one.
> *Question 2* Relates to the definition and implementation of the
> date64 data type:
> ...
> Shouldn't (Py)Arrow either reject the input, or
> convert it when explicitly asked to?
Yes. There was a past discussion on this topic and a vote agreeing that
such values are invalid; see [1]. Feel free to file JIRAs where this
doesn't happen. The validation picture has improved somewhat in [2],
which should be part of the next release:
```
# When given an array, we will not always validate automatically,
# because checking requires inspecting the values, which is expensive.
>>> pa.array([86400], pa.time32('s'))
<pyarrow.lib.Time32Array object at 0x7f7c4a8eae80>
[
<value out of range: 86400>
]
# There is a validate() method that can be called to do this
>>> pa.array([86400], pa.time32('s')).validate(full=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/array.pxi", line 1435, in pyarrow.lib.Array.validate
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: time32[s] 86400 is not within the acceptable
range of [0, 86400) s
# Even on the latest master it seems we do not apply this validation
# to scalar values. Please do file a JIRA for this.
>>> pa.scalar(86400, pa.time32('s'))
<pyarrow.Time32Scalar: datetime.time(0, 0)>
```
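For reference, the invariants that full validation enforces for these
types boil down to simple arithmetic. A plain-Python sketch (the helper
names are mine, not Arrow's):

```python
MS_PER_DAY = 86_400_000
SECONDS_PER_DAY = 86_400

def valid_date64(ms):
    # date64 stores milliseconds since the epoch and must be evenly
    # divisible by 86_400_000, i.e. it must land on a day boundary.
    return ms % MS_PER_DAY == 0

def time32_s_in_range(seconds):
    # time32[s] stores a second-of-day, which must lie in [0, 86400).
    return 0 <= seconds < SECONDS_PER_DAY
```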
> *Question 3*: both the time32 and time64 time-of-day types, in either
> precision, accept and store integer input that falls outside of the 24-hour
> window.
> ...
> What's the
> desirable behaviour from the Arrow specification perspective? Is it the
> current behaviour, or should the input either be rejected or explicitly
> converted?
The invalid values should be rejected, ideally at the boundaries.
However, when data already has the correct memory layout, we need to
allow for zero-copy construction, and so we may not validate implicitly.
For example, I would expect the following will always pass without error:
```
pa.array([86400], pa.int32()).cast(pa.time32('s'))
```
On the other hand this should always fail:
```
pa.array([86400], pa.int32()).cast(pa.time32('s')).validate(full=True)
```
Users should generally validate any data that they don't know for sure
is correct.
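If wrapping rather than rejection is what a caller wants, that should be
an explicit conversion applied before the data is handed to Arrow. A
plain-Python sketch (the helper name is mine, not an Arrow API):

```python
SECONDS_PER_DAY = 86_400

def wrap_seconds_to_time_of_day(seconds):
    # Explicit, opt-in normalization into [0, 86400): -1 becomes 86399
    # and 86400 becomes 0. Arrow does not do this implicitly;
    # out-of-range values are simply invalid.
    return seconds % SECONDS_PER_DAY
```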
[1] https://lists.apache.org/thread/0yks6lkv0p7kd3b46gcbc3cbr2y4kl95
[2] https://issues.apache.org/jira/browse/ARROW-10924
On Thu, Mar 31, 2022 at 11:27 PM Marnix van den Broek
<[email protected]> wrote:
>
> hi all,
>
> I'm working on type conversions between different systems, and the details
> of both the time and date data types raised some questions about their
> behaviour and a potential impact on interoperability:
>
> *Question 1*: For my own understanding: what purpose does the millisecond
> date64 type serve?
>
> *Question 2* Relates to the definition and implementation of the date64
> data type:
>
> The definition of date64 from Schema.fbs[1] is:
> *Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no
> leap seconds), where the values are evenly divisible by 86400000*
>
> However, in PyArrow I can create Date64 instances using integer input
> values that are not evenly divisible by 86400000, and the original input
> persists in the Arrow data. That seems very counterintuitive and a
> potential cause of bugs in low-level transformations and when moving data
> between systems with Arrow. Shouldn't (Py)Arrow either reject the input, or
> convert it when explicitly asked to?
>
> >>> pa.scalar(86499999, pa.date64())
> <pyarrow.Date64Scalar: datetime.date(1970, 1, 2)>
> >>> pa.scalar(86499999, pa.date64()).cast(pa.int64())
> <pyarrow.Int64Scalar: 86499999>
>
>
> *Question 3*: both the time32 and time64 time-of-day types, in either
> precision, accept and store integer input that falls outside of the 24-hour
> window. Like the issue raised about the date64 type, this seems like
> unexpected behavior, possibly even impacting interoperability. I expected
> the boundaries of these values to be enforced. What's the
> desirable behaviour from the Arrow specification perspective? Is it the
> current behaviour, or should the input either be rejected or explicitly
> converted?
>
> See:
>
> >>> pa.scalar(-1,pa.time32('s')) # expected: exception or warning
> <pyarrow.Time32Scalar: datetime.time(23, 59, 59)>
> >>> pa.scalar(-1,pa.time32('s')).cast(pa.int32()) # expected: 86399
> <pyarrow.Int32Scalar: -1>
> >>> pa.scalar(86400,pa.time32('s')) # expected: exception or warning
> <pyarrow.Time32Scalar: datetime.time(0, 0)>
> >>> pa.scalar(86400,pa.time32('s')).cast(pa.int32()) # expected: 0
> <pyarrow.Int32Scalar: 86400>
>
>
> I'm looking for answers to understand the intended behaviour. If question 2
> and 3 are actually issues with the implementations, let me know and I'll
> raise them on Github (or Jira if that's where they belong).
>
> Thanks,
> Marnix van den Broek
>
> Data Engineer at bundlesandbatches.io
>
> [1]
> https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/format/Schema.fbs#L200-L201