New submission from Daniele Sluijters:

Python 2's urlparse.urlparse() and Python 3's urllib.parse.urlparse() accept 
URIs/URLs with underscores in the host/domain/subdomain. I believe this 
behaviour is incorrect.
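
A minimal sketch of the behaviour described above, using Python 3's 
urllib.parse. The hostname "my_host.example.com" is made up purely for 
illustration:

```python
from urllib.parse import urlparse

# "my_host.example.com" is a hypothetical hostname containing an underscore.
result = urlparse("http://my_host.example.com/path")

# urlparse() does not validate the host; it simply splits the URL and
# hands the netloc back, underscore and all.
print(result.hostname)
```

No exception is raised and the underscore survives in result.hostname.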

A distinction needs to be made between DNS names and Uniform Resource Locators 
and Identifiers. urlparse is supposed to deal with the latter (correct me if 
I'm wrong).

According to RFC 2181 section 11 on the syntax of DNS names, the use of the 
underscore is allowed, and it is in use around the internet, especially in TXT 
and SRV records.

However, RFC 1738 on Uniform Resource Locators, section 3.1 (and its updates), 
always defines the 'hostname' part of the URL as follows:
Such a name consists of a sequence of domain labels separated by ".",
each domain label starting and ending with an alphanumeric character
and possibly also containing "-" characters.

On top of that, RFC 2396 on URIs, section 3.2.2, states:
Hostnames take the form described in Section 3 of [RFC1034] and
Section 2.1 of [RFC1123]: a sequence of domain labels separated by
".", each domain label starting and ending with an alphanumeric
character and possibly also containing "-" characters.  

The underscore is never mentioned as a valid character, nor do any of the 
references in the RFCs mention it, as far as I've been able to see.
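
The quoted rule can be sketched as a check on top of urlparse. This is only 
an illustration under stated assumptions, not a proposed patch: the helper 
name has_strict_hostname and the _LABEL pattern are hypothetical, and the 
check covers only the "alphanumeric at both ends, hyphens in between" wording 
quoted above (it ignores the top-label rule, trailing dots, length limits and 
IPv6 literals):

```python
import re
from urllib.parse import urlparse

# One domain label: starts and ends with an alphanumeric character and may
# contain "-" in between. A single alphanumeric character is also a label.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def has_strict_hostname(url):
    """Return True if every label of the URL's host matches the quoted rule.

    Hypothetical helper for illustration; it is not part of urllib.
    """
    host = urlparse(url).hostname
    if not host:
        return False
    # Labels are separated by "."; an empty label (for example from a
    # trailing dot) fails the check in this simplified sketch.
    return all(_LABEL.fullmatch(label) for label in host.split("."))
```

With this sketch, a host such as "my_host.example.com" is rejected because 
the underscore is not in the allowed character set, while "example.com" 
passes.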

Language implementations vary:
 * Ruby URI.parse does not allow for underscores in domain labels.
 * Perl URI and URI::URL allow for underscores.
 * java.net.URI treats the underscore as an illegal character in the domain 
part.
 * org.apache.http.HttpHost since 4.2.3 treats the underscore as an illegal 
character in the domain part.

HTTP servers:
 * Apache: Seems to tolerate underscores but there's been a whole discussion 
about this on the mailing lists.
 * nginx: Matches a server_name of '_' to 'any invalid domain name'. It seems 
to accept server_names with underscores in them but the behaviour is currently 
unknown to me.

Browsers:
 * Since IE 5.5, IE cannot write cookies if the host or subdomain part 
includes an underscore.
 * Just about every other browser is fine with it.

Please note that I'm only talking about the host/domain/subdomain part of URIs 
and URLs. Something like http://en.wikipedia.org/wiki/12-hour_clock is 
perfectly valid and should parse.
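
To make the distinction concrete, here is a small sketch showing that 
urlparse already separates the host from the path, so any restriction on 
underscores could be applied to the host alone. The variable names are 
illustrative only:

```python
from urllib.parse import urlparse

parts = urlparse("http://en.wikipedia.org/wiki/12-hour_clock")

# The underscore lives in the path component, not in the host.
host_has_underscore = "_" in (parts.hostname or "")
path_has_underscore = "_" in parts.path

print(parts.hostname, host_has_underscore)
print(parts.path, path_has_underscore)
```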

----------
components: Library (Lib)
messages: 201730
nosy: daenney, orsenthil
priority: normal
severity: normal
status: open
title: urlparse accepts invalid hostnames
type: behavior
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2, Python 3.3, Python 
3.4, Python 3.5

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue19451>
_______________________________________