On Fri, Apr 1, 2022 at 2:00 PM Weston Pace <[email protected]> wrote:
>
> > *Question 1*: For my own understanding: what purpose does the
> > millisecond date64 type serve?
>
> I don't actually know the answer to this one.
The rationale IIRC was that some systems represent dates this way, and
so the purpose was to provide a serialization-free path for such data.
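
To make that concrete: date64 is just epoch milliseconds constrained to day boundaries, so the encoding is easy to sketch with the standard library alone. This is an illustration of the layout, not pyarrow code, and the helper name is mine:

```python
from datetime import date

MS_PER_DAY = 86_400_000  # milliseconds in one day

def date_to_date64_ms(d: date) -> int:
    """Encode a calendar date as the date64 representation:
    milliseconds since the UNIX epoch, always a multiple of 86_400_000."""
    return (d - date(1970, 1, 1)).days * MS_PER_DAY

print(date_to_date64_ms(date(1970, 1, 2)))  # 86400000
```

A system that already stores dates this way can hand the buffer to Arrow without rescaling, which is the serialization-free path mentioned above.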
> > *Question 2* Relates to the definition and implementation of the
> > date64 data type:
> > ...
> > Shouldn't (Py)Arrow either reject the input, or
> > convert it when explicitly asked to?
>
> Yes. There was a past discussion on this topic and a vote to agree
> that these are invalid. See [1]. Feel free to file JIRAs where this
> doesn't happen. The validation picture has improved somewhat in [2]
> which should be a part of the next release:
>
> ```
> # When given an array, we sometimes will not automatically
> # validate if the validation requires inspecting the values,
> # which is expensive
> >>> pa.array([86400], pa.time32('s'))
> <pyarrow.lib.Time32Array object at 0x7f7c4a8eae80>
> [
> <value out of range: 86400>
> ]
>
> # There is a validate() method that can be called to do this
> >>> pa.array([86400], pa.time32('s')).validate(full=True)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "pyarrow/array.pxi", line 1435, in pyarrow.lib.Array.validate
> File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: time32[s] 86400 is not within the acceptable
> range of [0, 86400) s
>
> # Even in the latest master it seems we do not apply this validation
> # on scalar values. Please do file a JIRA for this
> >>> pa.scalar(86400, pa.time32('s'))
> <pyarrow.Time32Scalar: datetime.time(0, 0)>
> ```
>
> > *Question 3*: both the time32 and time64 time-of-day types, in either
> > precision, accept and store integer input that falls outside of the 24-hour
> > window.
> > ...
> > What's the
> > desirable behaviour from the Arrow specification perspective? Is it the
> > current behaviour, or should the input either be rejected or explicitly
> > converted?
>
> The invalid values should be rejected, ideally at the boundaries. However,
> when data is already in the correct memory layout, we need to allow the
> possibility of zero-copy construction, and so we may not implicitly
> validate.
>
> For example, I would expect the following will always pass without error:
>
> ```
> pa.array([86400], pa.int32()).cast(pa.time32('s'))
> ```
>
> On the other hand this should always fail:
>
> ```
> pa.array([86400], pa.int32()).cast(pa.time32('s')).validate(full=True)
> ```
>
> Users should generally validate any data that they don't know for sure
> is correct.
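
A minimal sketch of that boundary check, written against plain Python integers rather than pyarrow arrays (a real version would just call `arr.validate(full=True)` as shown earlier; the helper name is hypothetical):

```python
SECONDS_PER_DAY = 86_400

def check_time32_seconds(values):
    """Reject any value outside the valid time32[s] range [0, 86400)."""
    bad = [v for v in values if not 0 <= v < SECONDS_PER_DAY]
    if bad:
        raise ValueError(f"time32[s] values out of range [0, 86400): {bad}")
    return values

check_time32_seconds([0, 86_399])  # passes through unchanged
# check_time32_seconds([86_400])   # would raise ValueError
```
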
>
> [1] https://lists.apache.org/thread/0yks6lkv0p7kd3b46gcbc3cbr2y4kl95
> [2] https://issues.apache.org/jira/browse/ARROW-10924
>
> On Thu, Mar 31, 2022 at 11:27 PM Marnix van den Broek
> <[email protected]> wrote:
> >
> > hi all,
> >
> > I'm working on type conversions between different systems, and the details
> > of both the time and date data types raised some questions about their
> > behaviour and a potential impact on interoperability:
> >
> > *Question 1*: For my own understanding: what purpose does the millisecond
> > date64 type serve?
> >
> > *Question 2* Relates to the definition and implementation of the date64
> > data type:
> >
> > The definition of date64 from Schema.fbs [1] is:
> > *Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no
> > leap seconds), where the values are evenly divisible by 86400000*
> >
> > However, in PyArrow I can create date64 instances using integer input
> > values that are not evenly divisible by 86400000, and the original input
> > persists in the Arrow data. That seems very counterintuitive and a
> > potential cause of bugs in low-level transformations and when moving data
> > between systems with Arrow. Shouldn't (Py)Arrow either reject the input, or
> > convert it when explicitly asked to?
> >
> > >>> pa.scalar(86499999, pa.date64())
> > <pyarrow.Date64Scalar: datetime.date(1970, 1, 2)>
> > >>> pa.scalar(86499999, pa.date64()).cast(pa.int64())
> > <pyarrow.Int64Scalar: 86499999>
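
(For context: explicit conversion here would presumably mean flooring the value to a day boundary, which is a one-liner; the helper name below is made up for illustration.)

```python
MS_PER_DAY = 86_400_000

def floor_to_day(ms: int) -> int:
    """Floor an epoch-millisecond value to the previous day boundary,
    so the result is evenly divisible by 86_400_000 as date64 requires.
    Floor division also handles pre-epoch (negative) values correctly."""
    return (ms // MS_PER_DAY) * MS_PER_DAY

print(floor_to_day(86_499_999))  # 86400000
```
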
> >
> >
> > *Question 3*: both the time32 and time64 time-of-day types, in either
> > precision, accept and store integer input that falls outside of the 24-hour
> > window. Like the issue raised about the date64 type, this seems like
> > unexpected behaviour, possibly even impacting interoperability. I expected
> > the boundaries of these values to be enforced. What's the desirable
> > behaviour from the Arrow specification perspective? Is it the current
> > behaviour, or should the input either be rejected or explicitly converted?
> >
> > See:
> >
> > >>> pa.scalar(-1, pa.time32('s')) # expected: exception or warning
> > <pyarrow.Time32Scalar: datetime.time(23, 59, 59)>
> > >>> pa.scalar(-1, pa.time32('s')).cast(pa.int32()) # expected: 86399
> > <pyarrow.Int32Scalar: -1>
> > >>> pa.scalar(86400, pa.time32('s')) # expected: exception or warning
> > <pyarrow.Time32Scalar: datetime.time(0, 0)>
> > >>> pa.scalar(86400, pa.time32('s')).cast(pa.int32()) # expected: 0
> > <pyarrow.Int32Scalar: 86400>
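
(The conversions expected in the comments above amount to taking the value modulo 86400; Python's `%` already yields a non-negative result for a positive modulus. The helper name is mine, for illustration only.)

```python
SECONDS_PER_DAY = 86_400

def wrap_time_of_day(seconds: int) -> int:
    """Wrap an arbitrary second count into the valid
    time32[s] range [0, 86400)."""
    return seconds % SECONDS_PER_DAY

print(wrap_time_of_day(-1))      # 86399
print(wrap_time_of_day(86_400))  # 0
```
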
> >
> >
> > I'm looking for answers to understand the intended behaviour. If question 2
> > and 3 are actually issues with the implementations, let me know and I'll
> > raise them on GitHub (or JIRA if that's where they belong).
> >
> > Thanks,
> > Marnix van den Broek
> >
> > Data Engineer at bundlesandbatches.io
> >
> > [1]
> > https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/format/Schema.fbs#L200-L201