[ 
https://issues.apache.org/jira/browse/ARROW-16022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518650#comment-17518650
 ] 

Joris Van den Bossche commented on ARROW-16022:
-----------------------------------------------

> If they must fail, it should be when the pyarrow.Timestamp is created.

I would like to point out that Arrow actually _does_ "validate" the time upon 
creation, in the sense that we convert the timezone-aware Python datetime 
object into an unambiguous UTC value (which is guaranteed to exist) when 
creating the pyarrow timestamp array.

The problem is only that the current implementation of temporal rounding 
operates in local time: in your case the unambiguous UTC timestamp is converted 
to a local time that is ambiguous, and the conversion back to an unambiguous 
UTC timestamp after rounding then fails. We can solve this specific issue by 
improving the implementation of the temporal rounding algorithm, which is what 
ARROW-15251 ([https://github.com/apache/arrow/pull/12528]) is about.
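
As a minimal sketch of why the round trip through local time loses information (standard library only; this is not the Arrow implementation, and the variable names are purely illustrative), two distinct UTC instants can map to the same local wall-clock reading:

```python
import datetime
import zoneinfo

tz = zoneinfo.ZoneInfo("America/New_York")
utc = datetime.timezone.utc

# Two distinct UTC instants, one hour apart, around the 2013 fall-back transition.
first = datetime.datetime(2013, 11, 3, 5, 3, 14, tzinfo=utc)
second = datetime.datetime(2013, 11, 3, 6, 3, 14, tzinfo=utc)

# Convert both to New York local time.
local_first = first.astimezone(tz)
local_second = second.astimezone(tz)

# Dropping the timezone leaves the bare wall-clock reading. Both readings
# compare equal, so the wall time alone cannot be mapped back to one UTC instant.
same_wall_clock = (
    local_first.replace(tzinfo=None) == local_second.replace(tzinfo=None)
)
print(local_first, local_second, same_wall_clock)
```

Any algorithm that keeps only the local wall time during rounding has to decide which of the two instants to return to, or raise an error.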

To illustrate my first point, let me get back to your example of an ambiguous 
datetime:
{code:python}
import datetime
import zoneinfo

tz = zoneinfo.ZoneInfo(key='America/New_York')

# In the US, the 1:00am hour is ambiguous because the minute after 1:59am
# Daylight-Savings Time is 1:00am Standard Time
# however, these times exist and
date_ambig = datetime.datetime(2013, 11, 3, 1, 3, 14, tzinfo=tz)
{code}
This is in fact not an ambiguous datetime. As you show when printing this 
value, "native datetime object defaults to daylight time":
{code:python}
>>> print(date_ambig)
2013-11-03 01:03:14-04:00
{code}
but this is because the actual datetime defaults to {{fold=0}}, which 
corresponds to the offset of -04:00. This is something you control when 
_creating_ the actual datetime.datetime object, so we can explicitly construct 
the "other" value for this datetime with offset -05:00:
{code:python}
>>> date_ambig2 = datetime.datetime(2013, 11, 3, 1, 3, 14, tzinfo = tz, fold=1)
>>> print(date_ambig2)
2013-11-03 01:03:14-05:00

>>> import pyarrow as pa
>>> pa.array([date_ambig, date_ambig2],
...          pa.timestamp("us", tz="America/New_York"))
<pyarrow.lib.TimestampArray object at 0x7fa58edb79a0>
[
  2013-11-03 05:03:14.000000,
  2013-11-03 06:03:14.000000
]
{code}

So each of the two datetime.datetime values represents a specific, distinct 
moment in time in this case, and both are properly converted to UTC when 
creating the pyarrow array (the values printed above are the UTC values).

See https://peps.python.org/pep-0495/ for more details on this "fold" that was 
introduced for datetime.datetime to disambiguate local times.
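
As a rough sketch of how "fold" can be used to check whether a wall-clock time is ambiguous, following the approach described in PEP 495: compare the UTC offsets obtained with {{fold=0}} and {{fold=1}}. The helper name {{classify_wall_time}} is hypothetical and not part of any library.

```python
import datetime
import zoneinfo


def classify_wall_time(wall, tz):
    """Classify a naive wall-clock time as 'unique', 'ambiguous' or 'nonexistent'.

    Hypothetical helper for illustration only.
    """
    offset_fold0 = wall.replace(tzinfo=tz, fold=0).utcoffset()
    offset_fold1 = wall.replace(tzinfo=tz, fold=1).utcoffset()
    if offset_fold0 == offset_fold1:
        return "unique"
    # In a fold (clocks set back) the first occurrence has the larger offset;
    # in a gap (clocks set forward) it is the other way around.
    return "ambiguous" if offset_fold0 > offset_fold1 else "nonexistent"


tz = zoneinfo.ZoneInfo("America/New_York")

in_fold = classify_wall_time(datetime.datetime(2013, 11, 3, 1, 3, 14), tz)
in_gap = classify_wall_time(datetime.datetime(2013, 3, 10, 2, 30, 0), tz)
regular = classify_wall_time(datetime.datetime(2013, 6, 1, 12, 0, 0), tz)
print(in_fold, in_gap, regular)
```

Once {{fold}} is set on a timezone-aware datetime, the value is no longer ambiguous, which is why the conversion to UTC at array creation always succeeds.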

> [C++] Temporal floor/ceil/round throws exception for timestamps ambiguous due 
> to DST
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-16022
>                 URL: https://issues.apache.org/jira/browse/ARROW-16022
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Kevin Crouse
>            Priority: Major
>
> Running pyarrow.compute.floor_temporal for timestamps that exist will throw 
> exceptions if the times are ambiguous during the daylight savings time 
> transitions. 
> As the *_temporal functions do not fundamentally change the times, it does 
> not make sense that they would fail due to a timezone issue. If they must 
> fail, it should be when the pyarrow.Timestamp is created.
>  
>  
> {code:java}
> import pyarrow
> import pyarrow.compute as pc
> import datetime
> import pytz
> t = pyarrow.timestamp('s', tz='America/New_York')
> dt = datetime.datetime(2013, 11, 3, 1, 3, 14, tzinfo = 
> pytz.timezone('America/New_York'))
> # if a timestamp must be invalid, this could fail
> za = pyarrow.array([dt], t) 
> # raises an exception, even though this is conceptually an identity function
> # here
> pc.floor_temporal(za, unit = 'second') {code}
>  
> And this actually works just fine (continued from above)
> {code:java}
> pc.cast(    
>     pc.floor_temporal(        
>         pc.cast(za, pyarrow.timestamp('s', 'UTC')),         
>     unit='second'),     
>     pyarrow.timestamp('s','America/New_York')
> )
>  {code}
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
