[issue39280] Don't allow datetime parsing to accept non-Ascii digits

Ram Rachum Thu, 09 Jan 2020 13:11:40 -0800

New submission from Ram Rachum <r...@rachum.com>:

I've been doing some research into the use of `\d` in regular expressions in
CPython, and into any security vulnerabilities that might result from the
fact that it accepts non-Ascii digits like ٢ and ５.


In most places in the CPython codebase, the `re.ASCII` flag is used for such 
cases, thus ensuring the `re` module prohibits these non-Ascii digits. 
Personally, my preference is to never use `\d` and always use `[0-9]`. I think
that it's a rule that's easier to enforce and less likely to result in a
slipup, but that's a matter of personal taste.
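
To make the difference concrete, here is a small sketch comparing the three spellings on `str` patterns. The sample strings and the `accepted_by` helper are made up for illustration; only `re.compile`, `re.ASCII` and `fullmatch` come from the standard library.

```python
import re

# "2019" written with ASCII, Arabic-Indic and fullwidth digits.
samples = ["2019", "\u0662\u0660\u0661\u0669", "\uff12\uff10\uff11\uff19"]

unicode_digits = re.compile(r"\d+")          # default for str patterns
ascii_flag = re.compile(r"\d+", re.ASCII)    # \d restricted to [0-9]
explicit_class = re.compile(r"[0-9]+")       # no flag needed


def accepted_by(pattern):
    """Return the samples that the pattern matches in full (illustrative helper)."""
    return [s for s in samples if pattern.fullmatch(s)]


print(accepted_by(unicode_digits))   # all three samples
print(accepted_by(ascii_flag))       # only the ASCII one
print(accepted_by(explicit_class))   # only the ASCII one
```

The explicit class behaves the same with or without the flag, which is why I find it harder to get wrong.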

I found a few places where we don't use the `re.ASCII` flag and we do accept 
non-Ascii digits.

The first and less interesting place is platform.py, where we define patterns 
used for detecting versions of PyPy and IronPython. I don't know how anyone 
would exploit that, but personally I'd change that to a [0-9] just to be safe. 
I've opened bpo-39279 for that. 

The more sensitive place is the `datetime` module. 

Happily, the `datetime.datetime.fromisoformat` function rejects non-Ascii 
digits. But the `datetime.datetime.strptime` function does not: 

    from datetime import datetime
    
    time_format = '%Y-%m-%d'
    parse = lambda s: datetime.strptime(s, time_format)
       
    x = '٢019-12-22'  # first character is ARABIC-INDIC DIGIT TWO (U+0662)
    y = '2019-12-22'
    assert x != y
    assert parse(x) == parse(y)
    print(parse(x))
    # Output: 2019-12-22 00:00:00

If user code were to check for uniqueness of a datetime by comparing it as a
string, an attacker could fool that logic by using a non-Ascii digit.
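
Until the parser itself is stricter, user code could guard against this before parsing. The following is only a sketch: `strict_strptime` and `ASCII_DIGITS` are hypothetical names, not part of the `datetime` module, and the check assumes that rejecting every non-Ascii decimal digit is acceptable for the application.

```python
from datetime import datetime

ASCII_DIGITS = frozenset("0123456789")


def strict_strptime(date_string, fmt):
    """Hypothetical wrapper: refuse non-Ascii decimal digits, then parse."""
    for ch in date_string:
        # str.isdecimal() is true for any Unicode decimal digit, not just 0-9.
        if ch.isdecimal() and ch not in ASCII_DIGITS:
            raise ValueError(
                f"non-ASCII digit {ch!r} in date string {date_string!r}"
            )
    return datetime.strptime(date_string, fmt)


print(strict_strptime("2019-12-22", "%Y-%m-%d"))

try:
    strict_strptime("\u0662019-12-22", "%Y-%m-%d")
except ValueError as exc:
    print("rejected:", exc)
```

Another option is to compare the parsed `datetime` objects rather than the raw strings, so that two spellings of the same date cannot pass as distinct.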

Two more interesting points about this: 

1. If you try the same trick but insert ٢ in the day section instead of the
year section, Python rejects it. So we definitely have inconsistent behavior.
2. In the documentation for `strptime`, we're referencing the 1989 C standard. 
Since the first version of Unicode was published in 1991, it's reasonable not 
to expect the standard to support digits that were introduced in Unicode.
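
One plausible explanation for the inconsistency in point 1 is that some directive patterns mix literal Ascii classes with `\d`. The sketch below uses simplified patterns that I wrote for illustration; they are an assumption about the general shape of such patterns, not a copy of what CPython uses.

```python
import re

# Illustrative patterns only (assumed shapes, not copied from CPython).
YEAR = re.compile(r"\d\d\d\d")             # every position is \d
DAY = re.compile(r"3[01]|[12]\d|0[1-9]")   # leading digit is an ASCII class

arabic_two = "\u0662"  # ARABIC-INDIC DIGIT TWO

# The year pattern accepts the non-Ascii digit in any position.
year_ok = YEAR.fullmatch(arabic_two + "019") is not None

# The day pattern rejects it as the leading digit...
day_leading_ok = DAY.fullmatch(arabic_two + "2") is not None

# ...but accepts it as the trailing digit, where \d is used.
day_trailing_ok = DAY.fullmatch("2" + arabic_two) is not None

# int() accepts any Unicode decimal digits, so a matched year converts cleanly.
year_value = int(arabic_two + "019")

print(year_ok, day_leading_ok, day_trailing_ok, year_value)
```

With patterns like these, whether a non-Ascii digit gets through depends on which position it lands in, which would match the behavior described in point 1.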

If you scroll down in that documentation, you'll see that we also implement
the lesser-known ISO 8601 standard, where `%G-%V-%u` represents a year, week
number, and day of week. The `%G` is vulnerable:
    
    from datetime import datetime
    
    time_format = '%G-%V-%u'
    parse = lambda s: datetime.strptime(s, time_format)
   
    x = '٢019-53-4'  # first character is ARABIC-INDIC DIGIT TWO (U+0662)
    y = '2019-53-4'
    assert x != y
    assert parse(x) == parse(y)
    print(parse(x))
    # Output: 2020-01-02 00:00:00

I looked at the ISO 8601:2004 document, and under the "Fundamental principles" 
chapter, it says:

    This International Standard gives a set of rules for the representation of
        time points
        time intervals
        recurring time intervals.
    Both accurate and approximate representations can be identified by
    means of unique and unambiguous expressions specifying the relevant dates,
    times of day and durations.

Note the "unique and unambiguous". By accepting non-Ascii digits, we're 
breaking the uniqueness requirement of ISO 8601.

----------
components: Library (Lib)
messages: 359695
nosy: cool-RR
priority: normal
severity: normal
status: open
title: Don't allow datetime parsing to accept non-Ascii digits
type: security
versions: Python 3.7, Python 3.8, Python 3.9

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue39280>
_______________________________________
