[issue22852] urllib.parse wrongly strips empty #fragment

2014-11-13 Thread Stian Soiland-Reyes

Stian Soiland-Reyes added the comment:

I tried to make a patch for this, but found it quite hard because 
urllib/parse.py is fairly low-level, e.g. it constantly encodes and decodes 
between bytes and strings within each URI component. Basically the code assumes 
tuples of strings, with support for both bytes and strings added later.

As you see in 

https://github.com/stain/cpython/compare/issue-2285-urllib-empty-fragment?expand=1

the patch in parse.py is small, but its effect on test_urlparse.py is a bit 
bigger, as lots of tests check that the result of urlsplit has '' instead of 
None. It is uncertain how much real-life client code also checks for '' 
directly. (if not p.fragment would of course still work, but 
if p.fragment == '' would no longer work.)

I therefore suggest an alternative to my patch above: add boolean fields such 
as has_fragment. The existing component fields can then keep their 
backwards-compatible '' and b'' values even when a component is actually 
missing, while geturl() can still reconstitute the URI according to the RFC.
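
A rough sketch of how such a flag could behave, written as a wrapper outside 
the library. FlaggedSplit and urlsplit_flagged are hypothetical names used only 
for illustration; they are not part of urllib.parse or of the patch. The sketch 
handles str input and the fragment only.

```python
from typing import NamedTuple
from urllib.parse import SplitResult, urlsplit


class FlaggedSplit(NamedTuple):
    """Hypothetical wrapper: a normal SplitResult plus a has_fragment flag."""
    parts: SplitResult
    has_fragment: bool

    def geturl(self) -> str:
        url = self.parts.geturl()
        # Re-append '#' only when the delimiter was present, the fragment is
        # empty, and geturl() did not already keep it.
        if self.has_fragment and not self.parts.fragment and not url.endswith('#'):
            url += '#'
        return url


def urlsplit_flagged(url: str) -> FlaggedSplit:
    # The first '#' in a URI always starts the fragment, so a simple
    # membership test is enough to record that the delimiter was present.
    return FlaggedSplit(urlsplit(url), '#' in url)
```

With this, the component field stays falsy for both http://example.com/file and 
http://example.com/file#, so existing checks keep working, and only the flag 
tells the two apart.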

--
hgrepos: +279

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22852
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2014-11-12 Thread Stian Soiland-Reyes

New submission from Stian Soiland-Reyes:

urllib.parse can't handle URIs with empty #fragments. The fragment is removed 
and not reconstituted.

http://tools.ietf.org/html/rfc3986#section-3.5 permits empty fragment strings:


  URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
  fragment      = *( pchar / "/" / "?" )

The RFC's component recomposition algorithm even distinguishes between a 
component being undefined and being an empty string:

http://tools.ietf.org/html/rfc3986#section-5.3


   Note that we are careful to preserve the distinction between a
   component that is undefined, meaning that its separator was not
   present in the reference, and a component that is empty, meaning that
   the separator was present and was immediately followed by the next
   component separator or the end of the reference.
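
The recomposition step in that section can be sketched in Python roughly as 
follows, using None for "undefined". The function name recompose is only 
illustrative; this is not how urllib.parse is currently implemented.

```python
def recompose(scheme, authority, path, query, fragment):
    """Sketch of RFC 3986 section 5.3 recomposition.

    None means the component is undefined (its separator was absent);
    '' means the separator was present but the component is empty.
    """
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path  # the path is always defined, though it may be empty
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result
```

Here recompose("http", "example.com", "/file", None, "") keeps the trailing 
"#", while passing None for the fragment drops it.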


This seems to be caused by missing components being represented as '' instead 
of None.

>>> import urllib.parse
>>> urllib.parse.urlparse("http://example.com/file#")
ParseResult(scheme='http', netloc='example.com', path='/file', params='', 
query='', fragment='')
>>> urllib.parse.urlunparse(urllib.parse.urlparse("http://example.com/file#"))
'http://example.com/file'

>>> urllib.parse.urlparse("http://example.com/file#").geturl()
'http://example.com/file'

>>> urllib.parse.urlparse("http://example.com/file# ").geturl()
'http://example.com/file# '

>>> urllib.parse.urlparse("http://example.com/file#nonempty").geturl()
'http://example.com/file#nonempty'

>>> urllib.parse.urlparse("http://example.com/file#").fragment
''

The suggested fix is to use None instead of '' to represent missing components, 
and to check with if fragment is not None instead of if not fragment.


The same issue applies to query and authority. E.g.

http://example.com/file? != http://example.com/file

... but be careful about the implications of

file:///file != file:/file
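
To illustrate what a None-based representation would look like for fragment, 
query and authority alike, here is a small sketch built on the regular 
expression from RFC 3986 Appendix B. The helper name split_rfc3986 is made up 
for this example, and it is not a proposed replacement for urlsplit (it does no 
validation and handles str only).

```python
import re

# Regular expression from RFC 3986, Appendix B. Optional groups that do not
# participate in the match come back as None, which is exactly the
# "undefined" versus "empty" distinction discussed here.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_rfc3986(uri):
    """Return (scheme, authority, path, query, fragment), None if undefined."""
    m = URI_RE.match(uri)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


# Empty fragment versus no fragment:
#   split_rfc3986("http://example.com/file#")[4] is ''
#   split_rfc3986("http://example.com/file")[4] is None
# Empty authority versus no authority:
#   split_rfc3986("file:///file")[1] is ''
#   split_rfc3986("file:/file")[1] is None
```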

--
components: Library (Lib)
messages: 231070
nosy: soilandreyes
priority: normal
severity: normal
status: open
title: urllib.parse wrongly strips empty #fragment
versions: Python 3.5
