Paul Ganssle <p.gans...@gmail.com> added the comment:

> Yes, but not within the same format. If someone were to choose the format 
> '2014-04-10T24:00:00', they would have a reasonable expectation that there is 
> only one unique string that corresponds with that datetime

That's a particularly bad example, because it denotes exactly the same 
datetime as another string in the exact same format:

  2014-04-11T00:00:00

ISO 8601 allows midnight (and only midnight) to be written as 24:00 on the 
previous day, so both strings refer to the same instant. Admittedly, that is 
the only ambiguity I know of offhand (though it's a huge spec) *for a given 
format*, but ISO 8601 also does not really have a concept of format 
specifiers, so there is no way to unambiguously specify the format you intend 
to use.
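
To make that concrete, here is a rough sketch of a pre-processing helper (the 
name parse_iso_midnight is made up, nothing like it exists in the stdlib) 
that folds the 24:00 spelling into the canonical next-day midnight:

  from datetime import datetime, timedelta

  def parse_iso_midnight(dt_str, fmt="%Y-%m-%dT%H:%M:%S"):
      # Fold the ISO 8601 "previous day + 24:00" spelling of midnight into
      # the "next day + 00:00" form, since strptime rejects hour 24.
      date_part, sep, time_part = dt_str.partition("T")
      if time_part.startswith("24:00"):
          dt = datetime.strptime(date_part + "T00" + time_part[2:], fmt)
          return dt + timedelta(days=1)
      return datetime.strptime(dt_str, fmt)

  # Both spellings denote the same instant and parse to the same datetime:
  assert (parse_iso_midnight("2014-04-10T24:00:00")
          == parse_iso_midnight("2014-04-11T00:00:00"))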

Either way, I think we can explicitly dispense with "there will be an exact 
mapping between a given (format_str, datetime_str) pair and the datetime it 
produces" as a goal here. I can't think of any good reason you'd want that 
property, nor, as far as I can see, have we ever indicated that we provide it 
(probably the opposite, since there are some formats that explicitly ignore 
whitespace).
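
As a quick illustration of the whitespace point, with the current pure-Python 
_strptime implementation two different strings already parse to the same 
datetime under the same format, because runs of whitespace in the format 
match one or more whitespace characters in the input:

  from datetime import datetime

  # Same format, two different input strings, one resulting datetime.
  fmt = "%Y %m %d"
  assert (datetime.strptime("2020 2 2", fmt)
          == datetime.strptime("2020    2\t2", fmt))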

> Okay, since it seems like I'm the only one who wants this change, I'll let it 
> go. Thanks for your input.

I wouldn't go that far. I think I am +0 or +1 on this change; I just wanted 
to be absolutely clear about *why* we're doing this. I don't want someone 
pointing at this thread in the future and saying, "Core dev says that it's a 
bug in their code if they don't follow X standard / if more than one string 
produces the same datetime / etc".

I think the strongest argument for making this or a similar change is that 
I'm fairly certain we don't have the bandwidth to handle internationalized 
dates, and I don't think we have much to gain from a half-assed version of 
that which accepts unicode transliterations of numerals and calls it a day. 
There are tons of edge cases here that could bite people, and if we aren't 
going to support this *now*, I'd rather give people an error message early in 
the process and point them at a library that is designed to handle datetime 
localization issues. If all we're going to do is switch [0-9] to \d (which 
won't work for the places where it's actually [1-9], mind you), I think 
people would get a better version of that with something like:

  def normalize_dt_str(dt_str):
      # Map non-ASCII decimal digits (full-width, Arabic-Indic, etc.) to
      # their ASCII equivalents; isdecimal() avoids characters such as
      # superscripts, which isdigit() accepts but int() cannot convert.
      return "".join(str(int(x)) if x.isdecimal() else x
                     for x in dt_str)

There are certainly more robust and/or faster versions of this, but it's 
roughly equivalent to what we'd be doing here *anyway*, and at least people 
would have to opt in to it.
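
For example, an opt-in call built on the sketch above might look like this 
(the full-width example string is purely illustrative):

  from datetime import datetime

  # The caller normalizes non-ASCII decimal digits before calling strptime.
  raw = "２０２０-０２-０２"   # "2020-02-02" written with full-width digits
  dt = datetime.strptime(normalize_dt_str(raw), "%Y-%m-%d")
  assert dt == datetime(2020, 2, 2)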

I am definitely open to us supporting non-ASCII digits in strptime if it 
would be useful at the level of support we could provide, but given that it's 
currently broken for any reasonable use case and, as far as I know, no one 
has complained, we're better off resolving the inconsistency by requiring 
ASCII digits and treating non-ASCII support as a separate feature request.

CC-ing Inada on this as our unicode guru and because he might have some 
intuition about how useful non-ASCII support would be. The only place I've 
seen non-ASCII dates is in Japanese graveyards, and those tend to use Chinese 
numerals (which don't match \d anyway), though Japanese and Korean also tend 
to make heavier use of the "full-width numerals" block, so maybe parsing 
something like "２０２０-０２-０２" is an actual pain point that would be improved 
by this change (though, again, I suspect that this is just the beginning of 
the required changes and we may never get a decent implementation that 
supports unicode numerals).
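
For what it's worth, a quick sanity check of those \d claims:

  import re

  # Full-width digits are decimal digits and match \d; Chinese numerals are
  # not decimal digits and do not match \d.
  assert re.fullmatch(r"\d", "２") is not None   # FULLWIDTH DIGIT TWO
  assert "２".isdecimal() and int("２") == 2
  assert re.fullmatch(r"\d", "二") is None       # Chinese numeral two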

----------
nosy: +inada.naoki

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue39280>
_______________________________________