New submission from Ram Rachum:
I've been doing some research into the use of `\d` in regular expressions in
CPython, and any security vulnerabilities that might happen as a result of the
fact that it accepts non-Ascii digits like ٢ and ５.
In most places in the CPython codebase, the `re.ASCII` flag is used for such
cases, thus ensuring the `re` module prohibits these non-Ascii digits.
Personally, my preference is to never use `\d` and always use `[0-9]`. I think
that it's a rule that's easier to enforce and less likely to result in a
slipup, but that's a matter of personal taste.
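To illustrate the three options discussed above, here is a minimal sketch comparing a plain `\d`, `\d` with `re.ASCII`, and an explicit `[0-9]`. The sample strings and variable names are only for illustration; they are not taken from the CPython codebase.

```python
import re

# Illustrative inputs: one all-ASCII year, one starting with a non-ASCII digit.
ASCII_YEAR = "2019"
ARABIC_YEAR = "\N{ARABIC-INDIC DIGIT TWO}019"
FULLWIDTH_FIVE = "\N{FULLWIDTH DIGIT FIVE}"

# For str patterns, \d matches any Unicode decimal digit by default.
unicode_digits = re.compile(r"\d+")
# re.ASCII restricts \d to the ASCII digits 0-9.
ascii_flag = re.compile(r"\d+", re.ASCII)
# An explicit character class does the same without relying on a flag.
explicit = re.compile(r"[0-9]+")

for sample in (ASCII_YEAR, ARABIC_YEAR, FULLWIDTH_FIVE):
    print(
        ascii(sample),
        bool(unicode_digits.fullmatch(sample)),
        bool(ascii_flag.fullmatch(sample)),
        bool(explicit.fullmatch(sample)),
    )
```

Only the first pattern accepts the non-ASCII samples; the other two reject them.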
I found a few places where we don't use the `re.ASCII` flag and we do accept
non-Ascii digits.
The first and less interesting place is platform.py, where we define patterns
used for detecting versions of PyPy and IronPython. I don't know how anyone
would exploit that, but personally I'd change that to `[0-9]` just to be safe.
I've opened bpo-39279 for that.
The more sensitive place is the `datetime` module.
Happily, the `datetime.datetime.fromisoformat` function rejects non-Ascii
digits. But the `datetime.datetime.strptime` function does not:
    from datetime import datetime
    time_format = '%Y-%m-%d'
    parse = lambda s: datetime.strptime(s, time_format)
    x = '٢019-12-22'
    y = '2019-12-22'
    assert x != y
    assert parse(x) == parse(y)
    print(parse(x))
    # Output: 2019-12-22 00:00:00
If user code were to check for uniqueness of a datetime by comparing it as a
string, an attacker could fool that logic by using a non-Ascii digit.
Two more interesting points about this:
1. If you try the same trick, but insert ٢ in the day section instead of the
year section, Python rejects it. So we definitely have inconsistent
behavior.
2. In the documentation for `strptime`, we're referencing the 1989 C standard.
Since the first version of Unicode was published in 1991, it's reasonable not
to expect the standard to support digits that were introduced in Unicode.
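One way to map out the inconsistency from point 1 is to swap the non-ASCII digit into each position in turn and record what `strptime` does. The helper name `probe` is hypothetical, and the results are deliberately not listed here, since they may differ between Python versions.

```python
from datetime import datetime

ARABIC_TWO = "\N{ARABIC-INDIC DIGIT TWO}"


def probe(text, fmt, replacement=ARABIC_TWO, target="2"):
    """Replace each `target` character in turn; report whether strptime
    still parses the variant to the same datetime as the original."""
    expected = datetime.strptime(text, fmt)
    results = {}
    for index, char in enumerate(text):
        if char != target:
            continue
        variant = text[:index] + replacement + text[index + 1:]
        try:
            results[index] = datetime.strptime(variant, fmt) == expected
        except ValueError:
            results[index] = False
    return results


for index, accepted in probe("2019-12-22", "%Y-%m-%d").items():
    print(index, "accepted" if accepted else "rejected")
```

Each printed index is the position of the replaced digit within the input string.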
If you scroll down in that documentation, you'll see that we also implement
the lesser-known ISO 8601 standard, where `%G-%V-%u` represents a year, week
number, and day of week. The `%G` is vulnerable:
    from datetime import datetime
    time_format = '%G-%V-%u'
    parse = lambda s: datetime.strptime(s, time_format)
    x = '٢019-53-4'
    y = '2019-53-4'
    assert x != y
    assert parse(x) == parse(y)
    print(parse(x))
    # Output: 2020-01-02 00:00:00
I looked at the ISO 8601:2004 document, and under the "Fundamental principles"
chapter, it says:
    This International Standard gives a set of rules for the representation of
      - time points
      - time intervals
      - recurring time intervals.
    Both accurate and approximate representations can be identified by
    means of unique and unambiguous expressions specifying the relevant dates,
    times of day and durations.
Note the "unique and unambiguous". By accepting non-Ascii digits, we're
breaking the uniqueness requirement of ISO 8601.
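A caller who needs that uniqueness today could approximate it with a round-trip check: parse the string, format it back with the same format, and accept it only if the result is identical. This is a sketch under the assumption that `strftime` produces a canonical form for the directives in use (true for `%Y-%m-%d` with four-digit years; some directives are platform-dependent). The name `is_canonical` is hypothetical.

```python
from datetime import datetime


def is_canonical(text, fmt):
    """Hypothetical check: True only if `text` parses and formatting the
    result with the same `fmt` reproduces `text` exactly."""
    try:
        parsed = datetime.strptime(text, fmt)
    except ValueError:
        return False
    return parsed.strftime(fmt) == text


print(is_canonical("2019-12-22", "%Y-%m-%d"))
print(is_canonical("\N{ARABIC-INDIC DIGIT TWO}019-12-22", "%Y-%m-%d"))
print(is_canonical("2019-1-5", "%Y-%m-%d"))
```

Besides non-ASCII digits, the same check also rejects other non-unique spellings that `strptime` tolerates, such as a month or day without zero padding.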
----------
components: Library (Lib)
messages: 359695
nosy: cool-RR
priority: normal
severity: normal
status: open
title: Don't allow datetime parsing to accept non-Ascii digits
type: security
versions: Python 3.7, Python 3.8, Python 3.9
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue39280>
_______________________________________