[issue22852] urllib.parse wrongly strips empty #fragment, ?query, //netloc

2022-02-12 Thread Éric Araujo

Éric Araujo  added the comment:

See also #46337

--
nosy: +eric.araujo
versions: +Python 3.11 -Python 3.5

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment, ?query, //netloc

2020-06-10 Thread Open Close


Open Close  added the comment:

I found another related issue (issue37969).

I also filed one myself (issue 40938).

---

One thing against the 'has_netloc' etc. solution is that
while it guarantees round-trips (urlunsplit(urlsplit('...')) etc.),
it is conditional on 'urlunsplit' getting 'SplitResult' object.
When 'urlunsplit' gets a normal tuple, it is helpless.

--
nosy: +op368

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment, ?query, //netloc

2018-08-04 Thread Martin Panter

Martin Panter  added the comment:

I like this option. I suppose choosing which option to take is a compromise 
between compatiblity and simplicity. In the short term, the “allows_none” 
option requires user code to be updated. In the long term it may break 
compatibility. But the “has_netloc” etc option is more complex.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment, ?query, //netloc

2018-07-31 Thread Piotr Dobrogost


Change by Piotr Dobrogost :


--
nosy: +piotr.dobrogost

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment, ?query, //netloc

2018-07-31 Thread Chris Jerdonek


Chris Jerdonek  added the comment:

I just learned of this issue. Rather than adding has_netloc, etc. attributes, 
why not use None to distinguish missing values as is preferred above, but add a 
new boolean keyword argument to urlparse(), etc. to get the new behavior (e.g. 
"allow_none" to parallel "allow_fragments")?

It seems like this would be more elegant, IMO, because it would lead to the API 
we really want. For example, the ParseResult(), etc. signatures and repr() 
values would be simpler. Changing the default value of the new keyword 
arguments would also provide a clean and simple deprecation pathway in the 
future, if desired.

--
nosy: +chris.jerdonek

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment, ?query, //netloc

2015-08-02 Thread Robert Collins

Robert Collins added the comment:

See also issue 6631

--
nosy: +rbcollins

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment, ?query, //netloc

2015-05-31 Thread Berker Peksag

Changes by Berker Peksag :


--
nosy: +berker.peksag

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment, ?query, //netloc

2015-05-30 Thread Martin Panter

Martin Panter added the comment:

Anyone want to review my new patch? This is a perennial issue; see all the 
duplicates I just linked.

--
keywords: +needs review
title: urllib.parse wrongly strips empty #fragment -> urllib.parse wrongly 
strips empty #fragment, ?query, //netloc

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2015-03-22 Thread Martin Panter

Martin Panter added the comment:

Posting patch v2 with these changes:

* Split out scheme documentation fixes to Issue 23684.
* Renamed _NetlocResultMixinBase → _SplitParseBase
* Explained the default values of the flags better, and what None means
* Changed to Demian’s forward-looking “version changed” notices. I decided it 
is okay because the same information can be inferred.
* Tweaked urlunsplit() and urlunparse() descriptions

--
Added file: http://bugs.python.org/file38633/has_netloc.v2.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2015-03-16 Thread Demian Brecht

Demian Brecht added the comment:

> I cannot imagine some existing code (other than an exploit) that would be 
> broken by restoring the empty “//” component; do you have an example?

You're likely right about the usage (I can't think of a plausible use case at 
any rate).

At first read of #23505, I think I agree with the changes you've made to allow 
for successful round-tripping, but I'd think it'll have to be vetted by at 
least more than one core dev.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2015-03-16 Thread Demian Brecht

Demian Brecht added the comment:

> I avoided making them positional parameters, as they are not part of the 
> underlying tuple object.

Ignore me, I was off my face and you're absolutely correct.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2015-03-16 Thread Martin Panter

Martin Panter added the comment:

Regarding unparsing of "evil.com", see Issue 23505, where the invalid 
behaviour is pointed out as a security issue. This was one of the bugs that 
motivated me to make this patch. I cannot imagine some existing code (other 
than an exploit) that would be broken by restoring the empty “//” component; do 
you have an example?

Why do you think the asterisks (*) in the Split/ParseResult signatures are 
misleading? I am trying to document that the has_ flags are keyword-only 
parameters. I avoided making them positional parameters, as they are not part 
of the underlying tuple object.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2015-03-16 Thread Demian Brecht

Demian Brecht added the comment:

 urlsplit("evil.com").netloc
> ''
 urlsplit("evil.com").has_netloc
> True
 urlunsplit(urlsplit("evil.com"))  # Adds “//” back
> 'evil.com'

RFC 3986, section 3.3:

   If a URI contains an authority component, then the path component
   must either be empty or begin with a slash ("/") character.  If a URI
   does not contain an authority component, then the path cannot begin
   with two slash characters ("//").

Because this is a backwards incompatible behavioural change and is just as 
invalid as far as the RFC goes, I think that the current behaviour should be 
preserved. Even though it's still incorrect, it won't break existing code if 
left unchanged.

> ## _NetlocResultMixinBase abuse ##
> 
> The _NetlocResultMixinBase class is a common class used by the four result 
> classes I’m interested in. I probably should rename it to something like 
> _SplitParseMixinBase, since it is the common base to both urlsplit() and 
> urlparse() results.

I think I'm coming around to this and realizing that it's actually quite close 
to my proposal, the major difference being the additional level of hierarchy in 
mine. My issue was mostly due to the addition of the variadic signature in the 
docs (i.e. line 407 here: 
http://bugs.python.org/review/22852/diff/14176/Doc/library/urllib.parse.rst) 
which led me to believe a nonsensical signature would be valid. After looking 
at it again, __new__ is still bound to the tuple's signature, so you still get 
the following:

>>> SplitResult('scheme','authority','path','query','fragment','foo','bar','baz')
Traceback (most recent call last):
  File "", line 1, in 
  File "/Volumes/src/p/cpython/Lib/urllib/parse.py", line 137, in __new__
self = super().__new__(type, *pos, **kw)
TypeError: __new__() takes 6 positional arguments but 9 were given

So I'm less opposed to this as-is. I would like to see the "*" removed from the 
docs though as it's misleading in the context of each of (Split|Parse)Result. I 
do agree that renaming _NetlocResultMixinBase would be helpful, but it might 
also be nice (from a pedant's point of view) to remove "mixin" altogether if 
the __new__ implementation stays as-is.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2015-03-16 Thread Martin Panter

Martin Panter added the comment:

## Inferring flags ##

The whole reason for the has_netloc etc flags is that I don’t think we can 
always infer their values, so we have to explicitly remember them. Consider the 
following two URLs, which I think should both have empty “netloc” strings for 
backwards compatibility, but should be handled differently by urlunsplit():

>>> urlsplit("evil.com").netloc
''
>>> urlsplit("evil.com").has_netloc
True
>>> urlunsplit(urlsplit("evil.com"))  # Adds “//” back
'evil.com'
>>> urlsplit("/normal/path").netloc
''
>>> urlsplit("/normal/path").has_netloc
False
>>> urlunsplit(urlsplit("/normal/path"))  # Does not add “//”
'/normal/path'

## _NetlocResultMixinBase abuse ##

The _NetlocResultMixinBase class is a common class used by the four result 
classes I’m interested in. I probably should rename it to something like 
_SplitParseMixinBase, since it is the common base to both urlsplit() and 
urlparse() results.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2015-03-16 Thread Demian Brecht

Demian Brecht added the comment:

I've done an initial pass in Rietveld and left some comments, mostly around 
docs. Here are some additional questions though:

Given has_* flags can be inferred during instantiation of *Result classes, is 
there a reason to have them writable, meaning is there a reason to add them to 
the __init__ methods?

I'd also like to see this logic moved out of _NetlocResultMixinBase. I'm 
assuming it was put there for backwards compatibility which is understandable, 
but I don't think it makes sense to add such logic to a mixin who's purpose is 
additional functionality around the netloc attribute. This one's a little more 
subjective though, but here's a rough idea of what I'm thinking:

SplitResult = namedtuple('SplitResult', 'scheme netloc path query fragment')
class _SplitResultBase(SplitResult):
def __new__(cls, scheme, netloc, path, query, fragment):
inst = super().__new__(cls, scheme, netloc, path, query, fragment)
inst.has_netloc = bool(netloc)
return inst

>>> s = urlsplit('http://example.com/foo/bar/')
>>> s.has_netloc
True

This keeps backwards compatibility, but also adds the additional logic to the 
bases rather than in the mixins. I might also split out the logic into helper 
functions in order to avoid duplication between _SplitResultBase and 
_ParseResultBase.

This method also avoids the dependency on ordering of base classes as well as 
the addition of a variadic signature to (Split|Parse)Result.__init__.

Thoughts?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2015-03-12 Thread Demian Brecht

Changes by Demian Brecht :


--
stage:  -> patch review

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2015-03-12 Thread Martin Panter

Martin Panter added the comment:

There have been a few recent bug reports (Issue 23505, Issue 23636) that may be 
solved by the has_netloc proposal. So I am posting a patch implementing it. The 
changes were a bit more involved than I anticipated, but should still be usable.

I reused some of Stian’s tests, however the results are slightly different in 
my patch, matching the existing behaviour:

* Never sets netloc, query, fragment to None
* Always leaves hostname as None rather than ""
* Retains username, password and port components in netloc
* Converts hostname to lowercase

Unfortunately I discovered that you cannot add __slots__ to namedtuple() 
subclasses; see Issue 17295 and Issue 1173475. Therefore in my patch I have 
removed __slots__ from the SplitResult etc classes, so that those classes can 
gain the has_netloc etc attributes.

I chose to make the default has_netloc value based on existing urlunsplit() 
behaviour:

>>> empty_netloc = ""
>>> SplitResult("mailto", empty_netloc, "ch...@example.com", "", "").has_netloc
False
>>> SplitResult("file", empty_netloc, "/path", "", "").has_netloc
True

I found out that the “urllib.robotparser” module uses a urlunparse(urlparse()) 
combination to normalize URLs, so had to be changed. This is a backwards 
incompatibility of this proposal.

--
keywords: +patch
type:  -> enhancement
Added file: http://bugs.python.org/file38465/has_netloc.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2015-03-06 Thread Demian Brecht

Changes by Demian Brecht :


--
nosy: +demian.brecht

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2015-02-08 Thread Martin Panter

Martin Panter added the comment:

I also liked the idea of returning None to distinguish a missing URL component 
from an empty-but-present component, and it would make them more consistent 
with the “username” and “password” fields. But I agree it would break backwards 
compabitility too much. The idea of “has_fragment” etc flags might work. The 
ParseResult etc class signatures could be expanded like this:

SplitResult(scheme, netloc, path, query, fragment, *, has_netloc=None, 
has_query=None, has_fragment=None)

>>> url1 = SplitResult("http", "localhost", "/path", query="", fragment="")
>>> url1.has_netloc
True
>>> url1.has_fragment
False
>>> url2 = SplitResult("http", "localhost", "/path", query="", fragment="", 
>>> has_fragment=True)
>>> url2.has_fragment
True
>>> url2.has_query
False
>>> url2.geturl()
"http://localhost/path#";

Is it also worth adding “has_params” for urlparse()?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2014-11-13 Thread Stian Soiland-Reyes

Stian Soiland-Reyes added the comment:

I tried to make a patch for this, but I found it quite hard as the 
urllib/parse.py is fairly low-level, e.g. it is constantly encoding/decoding 
bytes and strings within each URI component. Basically the code assumes there 
are tuples of strings, with support for both bytes and strings baked in later.

As you see in 

https://github.com/stain/cpython/compare/issue-2285-urllib-empty-fragment?expand=1

the patch in parse.py is small - but the effect of that in test_urlparse.py is 
a bit bigger, as lots of test are testing for the result of urlsplit to have '' 
instead of None. It is uncertain how much real-life client code also check for 
'' directly. ("if not p.fragment" would of course still work - but "if 
p.fragment == ''" would not work anymore.

I therefore suggest an alternative to my patch above - to add some boolean 
fields like has_fragment, thus the existing component fields can keep their 
backwards compatible '' and b'' values even when a component is actually 
missing, and yet allowing geturl() to reconstitute the URI according to the RFC.

--
hgrepos: +279

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2014-11-12 Thread Martin Panter

Changes by Martin Panter :


--
nosy: +vadmium

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2014-11-12 Thread Georg Brandl

Changes by Georg Brandl :


--
nosy: +orsenthil

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22852] urllib.parse wrongly strips empty #fragment

2014-11-12 Thread Stian Soiland-Reyes

New submission from Stian Soiland-Reyes:

urllib.parse can't handle URIs with empty #fragments. The fragment is removed 
and not reconsituted.

http://tools.ietf.org/html/rfc3986#section-3.5 permits empty fragment strings:


  URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
  fragment= *( pchar / "/" / "?" )

And even specifies component recomposition to distinguish from not being 
defined and being an empty string:

http://tools.ietf.org/html/rfc3986#section-5.3


   Note that we are careful to preserve the distinction between a
   component that is undefined, meaning that its separator was not
   present in the reference, and a component that is empty, meaning that
   the separator was present and was immediately followed by the next
   component separator or the end of the reference.


This seems to be caused by missing components being represented as '' instead 
of None.

>>> import urllib.parse
>>> urllib.parse.urlparse("http://example.com/file#";)
ParseResult(scheme='http', netloc='example.com', path='/file', params='', 
query='', fragment='')
>>> urllib.parse.urlunparse(urllib.parse.urlparse("http://example.com/file#";))
'http://example.com/file'

>>> urllib.parse.urlparse("http://example.com/file#";).geturl()
'http://example.com/file'

>>> urllib.parse.urlparse("http://example.com/file# ").geturl()
'http://example.com/file# '

>>> urllib.parse.urlparse("http://example.com/file#nonempty";).geturl()
'http://example.com/file#nonempty'

>>> urllib.parse.urlparse("http://example.com/file#";).fragment
''

The suggested fix is to use None instead of '' to represent missing components, 
and to check with "if fragment is not None" instead of "if not fragment".


The same issue applies to query and authority. E.g.

http://example.com/file? != http://example.com/file

... but be careful about the implications of

file:///file != file:/file

--
components: Library (Lib)
messages: 231070
nosy: soilandreyes
priority: normal
severity: normal
status: open
title: urllib.parse wrongly strips empty #fragment
versions: Python 3.5

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com