Bugs item #548176, was opened at 2002-04-24 17:36
Message generated for change (Comment added) made by jlgijsbers
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=548176&group_id=5470

Category: Python Library
Group: Python 2.4
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Markus Demleitner (msdemlei)
Assigned to: Nobody/Anonymous (nobody)
Summary: urlparse doesn't handle host?bla

Initial Comment:
The urlparse module (at least in 2.2 and 2.1, Linux) doesn't
handle URLs of the form
http://www.maerkischeallgemeine.de?loc_id=49 correctly --
everything up to the 9 ends up in the host. I didn't check
the RFC, but in the real world URLs like this do show up.
urlparse works fine when there's a trailing slash on the
host name: http://www.maerkischeallgemeine.de/?loc_id=49

Example:
<pre>
>>> import urlparse
>>> urlparse.urlparse("http://www.maerkischeallgemeine.de/?loc_id=49")
('http', 'www.maerkischeallgemeine.de', '/', '', 'loc_id=49', '')
>>> urlparse.urlparse("http://www.maerkischeallgemeine.de?loc_id=49")
('http', 'www.maerkischeallgemeine.de?loc_id=49', '', '', '', '')
</pre>

This has serious implications for urllib, since
urllib.urlopen will fail for URLs like the second one, and
with a pretty mysterious exception ("host not found") at that.

----------------------------------------------------------------------

>Comment By: Johannes Gijsbers (jlgijsbers)
Date: 2005-01-09 16:33

Message:
Logged In: YES
user_id=469548

Fixed by applying patch #712317 on maint24 and HEAD.

----------------------------------------------------------------------

Comment By: Paul Moore (pmoore)
Date: 2004-11-08 21:48

Message:
Logged In: YES
user_id=113328

This issue still exists in Python 2.3.4 and Python 2.4b2.

----------------------------------------------------------------------

Comment By: Mike Rovner (mrovner)
Date: 2004-10-23 09:44

Message:
Logged In: YES
user_id=162094

I'm sorry, I misunderstood the patch. If it accepts such a URL
and splits it at '?', that's perfectly fine. It should not reject
such a URL as malformed.

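The accept-and-split behaviour can be sketched as follows. This is a minimal illustration only, not the code from patch #712317: split_authority is a hypothetical helper written for this note, and it applies just the RFC 2396 section 3.2 rule that the authority ends at the next "/", the next "?", or the end of the URI. The final checks assume a Python 3 interpreter, where the urlparse module is available as urllib.parse.

```python
from urllib.parse import urlparse, urlsplit


def split_authority(rest):
    """Split the text following '//' into (authority, remainder).

    Hypothetical helper: the authority is cut at the first '/' or '?',
    or runs to the end of the string if neither delimiter is present.
    """
    end = len(rest)
    for delim in "/?":
        pos = rest.find(delim)
        if pos != -1:
            end = min(end, pos)
    return rest[:end], rest[end:]


# The URL from the initial comment, with the "http://" prefix removed.
authority, remainder = split_authority("www.maerkischeallgemeine.de?loc_id=49")
print(authority)   # the host only
print(remainder)   # the '?' and everything after it

# A current standard library splits the same URL at the '?'.
parts = urlsplit("http://www.maerkischeallgemeine.de?loc_id=49")
print(parts.netloc, repr(parts.path), parts.query)
```

With this rule the host and the query come out as separate components, so the query no longer ends up in the host name that is looked up.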
----------------------------------------------------------------------

Comment By: Johannes Gijsbers (jlgijsbers)
Date: 2004-10-23 09:03

Message:
Logged In: YES
user_id=469548

Somehow I think I'm missing something. Please check my line
of reasoning:

1. http://foo?bar=baz is a legal URL.
2. urlparse's 'Network location' should be the same as
<authority> from rfc2396.
3. Inside <authority> an unescaped '?' is not allowed.
Rather: <authority> is terminated by the '?'.
4. Currently the 'network location' for http://foo?bar=baz
would be 'foo?bar=baz'.
5. If 'network location' should be the same as <authority>,
it should also be terminated by the '?'.

So shouldn't urlparse.urlsplit('http://foo?bar=baz') return
('http', 'foo', '', '', 'bar=baz', ''), as patch 712317
implements?

----------------------------------------------------------------------

Comment By: Mike Rovner (mrovner)
Date: 2004-01-27 02:13

Message:
Logged In: YES
user_id=162094

According to RFC2396 (ftp://ftp.isi.edu/in-notes/rfc2396.txt)
absoluteURI (part 3 URI Syntactic Components) can be:
"""
<scheme>://<authority><path>?<query>

each of which, except <scheme>, may be absent from a
particular URI.
"""
Later on (3.2):
"""
The authority component is preceded by a double slash "//"
and is terminated by the next slash "/", question-mark "?",
or by the end of the URI.
"""

So the URL "http://server?query" is perfectly legal and should
be allowed, and patch 712317 rejected.

----------------------------------------------------------------------

Comment By: Steven Taschuk (staschuk)
Date: 2003-03-30 22:19

Message:
Logged In: YES
user_id=666873

For comparison, RFC 1738 section 3.3:

    An HTTP URL takes the form:
        http://<host>:<port>/<path>?<searchpart>
    [...] If neither <path> nor <searchpart> is present,
    the "/" may also be omitted.

... which does not outright say the '/' may *not* be omitted
if <path> is absent but <searchpart> is present (though imho
that's implied).

But even if the / may not be omitted in this case, ? is not
allowed in the authority component under either RFC 2396 or
RFC 1738, so urlparse should either treat it as a delimiter
or reject the URL as malformed. The principle of being
lenient in what you accept favours the former.

I've just submitted a patch (712317) for this.

----------------------------------------------------------------------

Comment By: Jeff Epler (jepler)
Date: 2002-11-17 17:56

Message:
Logged In: YES
user_id=2772

This actually appears to be permitted by RFC2396
[http://www.ietf.org/rfc/rfc2396.txt]. See section 3.2:

    3.2. Authority Component

    Many URI schemes include a top hierarchical element for a
    naming authority, such that the namespace defined by the
    remainder of the URI is governed by that authority. This
    authority component is typically defined by an
    Internet-based server or a scheme-specific registry of
    naming authorities.

        authority = server | reg_name

    The authority component is preceded by a double slash "//"
    and is terminated by the next slash "/", question-mark "?",
    or by the end of the URI. Within the authority component,
    the characters ";", ":", "@", "?", and "/" are reserved.

----------------------------------------------------------------------

_______________________________________________
Python-bugs-list mailing list