[issue43882] [security] urllib.parse should sanitize urls containing ASCII newline and tabs.

Gregory P. Smith Fri, 07 May 2021 12:15:58 -0700


Gregory P. Smith <[email protected]> added the comment:


There is no less intrusive fix as far as I can see.  I believe we're down to 
either stick with what we've done, or do nothing.  It doesn't have to be the 
same choice in all release branches, being more conservative with changes the 
older the stable branch is okay.  (ie: removing this from 3.6 and 3.7 seems 
fine even if more recent ones do otherwise)

Based on my testing, raising an exception is more intrusive to existing tests 
(which we can only ever hope is representative of code) than stripping.  At 
least as exposed by running the changes through many tens of thousands of 
unittest suites at work.

ie: if we raise an exception, pandas.read_json() starts failing because that 
winds up using urlsplit in hopes of extracting the scheme and comparing that to 
known values as their method of deciding if something should be treated as a 
URL to data rather than data.  Pandas would need to be fixed.

That urlsplit() API use pattern is repeated in various other pieces of code: 
urlsplit is not expected to raise an exception.  The caller then has a 
conditional or two testing some parts of the urlsplit result to make a guess as 
to if something should be considered a URL or not.  Doing code inspection, 
pandas included, this code pretty much always then goes on to pass the original 
url value off to some other library, be it urllib, or requests, or ...).

Consequences of that code inspection finding?  With our existing character 
stripping change, new data is then allowed to pass through these urlsplit uses 
and be considered a URL.  Which leads to some code sending the url with 
embedded \r\n\t chars on to other APIs - a concern expressed a couple of times 
above.

Even though urlsplit isn't itself a validation API, it gets used as an early 
step in peoples custom identification and validation attempts.  So *any* change 
we make to it at all in any way breaks someones expectations, even if they 
probably shouldn't have had those expectations and aren't doing wise validation.

Treat this analysis as a sign that we should provide an explicit url validator 
because almost everyone is doing it some form of wrong. (issue43883)

I did wonder if Mike's suggestion of removing the characters during processing, 
but leaving them in the final result in 
https://bugs.python.org/issue43882#msg393033 is feasible as remediation for 
this?  My gut feeling is that it isn't.  It doesn't solve the problem of 
preventing the bad data from going where it shouldn't.  Even if we happen to 
parse that example differently, the unwanted characters are still retained in 
other places they don't belong.  Fundamantelly: We require people to make a 
different series of API call and choices in the end user code to **explicitly 
not use unvalidated inputs**.  Our stdlib API surface can't satisfy that today 
and use of unvalidated data in wrong places is a broad software security 
antipattern theme.

----------

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue43882>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43882] [security] urllib.parse should sanitize urls containing ASCII newline and tabs.

Reply via email to