John Nagle wrote:
> Matt Nordhoff wrote:
>> John Nagle wrote:
>>> Here's a hostile URL that "urlparse.urlparse" seems to have mis-parsed.
>>> ====
> ...
>
>>
>> It's breaking on the first slash, which just happens to be very late in
>> the URL.
>>
>> >>> urlparse('http://example.com?blahblah=http://example.net')
>> ('http', 'example.com?blahblah=http:', '//example.net', '', '', '')
>
> That's what it seems to be doing:
>
>     sa1 = 'http://example.com?blahblah=/foo'
>     sa2 = 'http://example.com?blahblah=foo'
>     print urlparse.urlparse(sa1)
>     ('http', 'example.com?blahblah=', '/foo', '', '', '')    # WRONG
>     print urlparse.urlparse(sa2)
>     ('http', 'example.com', '', '', 'blahblah=foo', '')      # RIGHT
>
> That's wrong. RFC 3986 ("Uniform Resource Identifier (URI): Generic
> Syntax"), page 23 says
>
>    "The characters slash ("/") and question mark ("?") may represent data
>    within the query component.  Beware that some older, erroneous
>    implementations may not handle such data correctly when it is used as
>    the base URI for relative references (Section 5.1), apparently
>    because they fail to distinguish query data from path data when
>    looking for hierarchical separators."
>
> So "urlparse" is an "older, erroneous implementation". Looking
> at the code for "urlparse", it references RFC 1808 (1995), which
> was a long time ago, three revisions back.
>
> Here's the bad code:
>
> def _splitnetloc(url, start=0):
>     for c in '/?#': # the order is important!
>         delim = url.find(c, start)
>         if delim >= 0:
>             break
>     else:
>         delim = len(url)
>     return url[start:delim], url[delim:]
>
> That's just wrong. The domain ends at the first appearance of
> any character in '/?#', but that code returns the text before the
> first '/' even if there's an earlier '?'. A URL/URI doesn't
> have to have a path, even when it has query parameters.
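To make the quoted point concrete, here is a standalone copy of that loop, so the behavior can be reproduced without depending on whichever urlparse version is installed. The name old_splitnetloc is made up for this illustration, and calling it with start=2 assumes the caller has already stripped the scheme and passes the remainder beginning with "//".

```python
# Standalone copy of the quoted _splitnetloc logic. The name
# old_splitnetloc is hypothetical; it is not a library function.
def old_splitnetloc(url, start=0):
    for c in '/?#':  # '/' is tried first, so a late '/' beats an earlier '?'
        delim = url.find(c, start)
        if delim >= 0:
            break
    else:
        delim = len(url)
    return url[start:delim], url[delim:]

# Assumption: the scheme ("http:") has been removed before the call,
# and start=2 skips the leading "//".
bad = old_splitnetloc('//example.com?blahblah=/foo', 2)
good = old_splitnetloc('//example.com?blahblah=foo', 2)
print(bad)   # ('example.com?blahblah=', '/foo')
print(good)  # ('example.com', '?blahblah=foo')
```

The first result shows the query text swallowed into the network location, which matches the WRONG tuple quoted above. The second only comes out right because the query happens to contain no slash.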
"urlparse" doesn't use regular expressions. Is there some good reason for that? It would be easy to fix the code above with a regular expression to break on any char in '/?#'. But urlparse would have to import "re". Is that undesirable? John Nagle -- http://mail.python.org/mailman/listinfo/python-list