[issue43882] [security] urllib.parse should sanitize urls containing ASCII newline and tabs.

Daniel Watkins Thu, 06 May 2021 12:42:43 -0700


Daniel Watkins <dan...@daniel-watkins.co.uk> added the comment:


Hey folks,

Thanks for all the work on this: I really appreciate the efforts to keep Python 
as secure as possible!

This change _is_ causing us problems in the cloud-init codebase, which 
thankfully have been caught by our testing in Ubuntu's development release.  
This is in a fairly deep part of the codebase, so apologies in advance for the 
detailed description.

TL;DR: cloud-init constructs mirror URLs and then sanitises them by replacing 
invalid characters with hyphens.  With the fix for this bug, urlsplit silently 
removes (some of) those characters before we can replace them, modifying the 
output of our sanitisation code, and therefore meaning cloud-init will, albeit 
in fairly specific corner cases, configure different mirrors if run with a 
Python including this fix vs. one that precedes it.

cloud-init constructs mirror URLs based on applying cloud metadata to 
user-configured (or default) templates.  As we're responsible for constructing 
these URLs, we also sanitise them before configuring the package manager to use 
them: specifically, we urlsplit to get the hostname, IDNA-encode (to handle 
non-ASCII input), replace any invalid URL characters with a "-", and then strip 
"-" off each part of the hostname (to handle leading/trailing invalid 
characters), then recombine the URL.  The most common case for this is a cloud 
which specifies values for the variables used in the template with an 
underscore: http://my_openstack_region.cloud.archive.ubuntu.com/ubuntu causes 
Apache mirrors with the default "HTTPProtocolOptions Strict" configuration to 
reject all requests to them (as that's an invalid hostname).  In contrast, 
http://my-openstack-region.cloud.archive.ubuntu.com/ubuntu *is* accepted, so is 
preferable.  (This is important because *.cloud.archive.ubuntu.com exists so 
that l
 ocal cloud admins can DNS "hijack" subdomains of it to point at internal 
servers: even though the Ubuntu mirrors don't reject underscored domains (any 
longer), this is a landmine waiting for any admins running their own mirrors.)  
For more background, see the bug where we figured this all out: 
https://bugs.launchpad.net/cloud-init/+bug/1868232

So, more concretely: if we consider a post-templated URL of 
http://my\topenstack\tregion.mirror.internal/ubuntu, cloud-init changes from 
rewriting that to my-openstack-region.mirror.internal (on < 3.9.5) to 
myopenstackregion.mirror.internal (on 3.9.5+): if, in this notional deployment, 
an apt mirror is running at (exactly) my-openstack-region.mirror.internal, then 
new instance deployments will start failing: they won't be able to install 
packages.  This is the sort of breakage that we aim to avoid in cloud-init 
(because you just _know_ that everyone who deployed this cloud left 
NotionalCorp years ago, so fixing the configuration to remove these 
obviously-incorrect tabs is not necessarily trivial).

Given the current state of the fix here, it's not clear to me how we could 
(cleanly) achieve our desired behaviour.  We could perform replacement of these 
characters before invoking `urlsplit` but that would then substitute these 
characters outside of only the hostname: this is also a change in behaviour.  
We could substitute those characters with magic strings, perform the split, and 
then replace them in the non-hostname parts with the original character and in 
the hostname with hyphens: we've obviously left "cleanly" behind at this point. 
 Another option would be to monkeypatch _UNSAFE_URL_BYTES_TO_REMOVE to an empty 
list: again, not a solution I'd want to have to support across Python versions!

One solution that presents itself to me: add a `strip_insecure_characters: bool 
= True` parameter.  We, in cloud-init, would pass this in as `False`, knowing 
that we're going to handle those ourselves.  Of course, this does leave the 
door open for API users to keep the current insecure behaviour: if library code 
(either public or project-internal) were to default to `False`, then the 
situation is no better than today.

For our use case, at least, I think a more restricted solution would work: 
`url_replacement_char: str = ""`.  We'd call `urlsplit(..., 
url_replacement_char="-")` and the rest of our code would work as it does 
today: from its POV, there were never these invalid chars in the first place.


Thanks once again for the work (and apologies for the wall of text)!


Dan

----------
nosy: +odd_bloke -ned.deily

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43882>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43882] [security] urllib.parse should sanitize urls containing ASCII newline and tabs.

Reply via email to