Re: [PATCH] no_proxy domain matching

Tim Rühsen Wed, 27 Nov 2019 01:53:02 -0800

On 11/26/19 4:00 PM, Tomas Hozza wrote:
> 
> 
> On 20. 11. 2019 18:47, Tim Rühsen wrote:
>> On 20.11.19 12:41, Tomas Hozza wrote:
>>> On 7. 11. 2019 21:30, Tim Rühsen wrote:
>>>> On 07.11.19 15:21, Tomas Hozza wrote:
>>>>> Hi.
>>>>>
>>>>> In RHEL-8, we ship a wget version that suffers from bug fixed by [1]. The 
>>>>> fix resolved issue with matching subdomains when no_proxy domain 
>>>>> definition was prefixed with dot, e.q. "no_prefix=.mit.edu". As part of 
>>>>> backporting the fix to RHEL, I wanted to create an upstream test for 
>>>>> no_prefix functionality. However I found that there is still one corner 
>>>>> case, which is not handled by the current upstream code and honestly I'm 
>>>>> not sure what is the intended domain matching behavior in that case. Man 
>>>>> page is also not very specific in this regard.
>>>>>
>>>>> The corner case is as follows:
>>>>> - no_proxy=.mit.edu
>>>>> - download URL is e.g. "http://mit.edu/file1";
>>>>>
>>>>> In this case the proxy settings are used, because domains don't match due 
>>>>> to the leftmost dot in no_proxy domain definition. This is either 
>>>>> intended or corner case that was not considered. One could argue, that if 
>>>>> the no_proxy is set to ".mit.edu", then leftmost dot means that the proxy 
>>>>> settings should not apply only to subdomains of "mit.edu", but proxy 
>>>>> settings should still apply to "mit.edu" domain itself. From my point of 
>>>>> view, after reading wget man page, I don't think that the leftmost dost 
>>>>> in no_proxy definition has any special meaning.
>>>>
>>>> Hello Tomas,
>>>>
>>>> hard to decide how to handle this. I personally would like to see a
>>>> match with curl's behavior (see https://github.com/curl/curl/issues/1208).
>>>>
>>>> Given the docs from GNU emacs, you are right. "no_proxy=.mit.edu" means
>>>> "mit.edu and subdomains" are excluded from proxy settings.
>>>> (see https://www.gnu.org/software/emacs/manual/html_node/url/Proxies.html)
>>>>
>>>> The caveat with emacs' behavior is that you cannot exclude just all
>>>> subdomains of mit.edu without mit.edu itself. Effectively, that creates
>>>> a corner case that can't be handled at all. (but if curl also does it
>>>> that way, let's go for it).
>>>>
>>>> Maybe you can find out about the current no_proxy behavior of typical
>>>> and wide-spread tools (regarding leftmost dot) !? Once we have that
>>>> information, we can make a confident decision.
>>>>
>>>> Regards, Tim
>>>
>>> Hi Tim.
>>>
>>> It took me some time to go through the current situation and to be honest, 
>>> it is kind of a mess. While each tool handles the no_proxy env a little bit 
>>> differently, there are some similarities. Nevertheless I was not able to 
>>> find any standard.
>>>
>>> curl's behavior:
>>> - "no_proxy=.mit.edu"
>>>   - will match the domain and subdomains e.g. "www.mit.edu" or 
>>> "www.subdomain.mit.edu"
>>>   - will match the host "mit.edu"
>>> - "no_proxy=mit.edu"
>>>   - will match the domain and subdomains e.g. "www.mit.edu" or 
>>> "www.subdomain.mit.edu"
>>>   - will match the host "mit.edu"
>>> - downside: can not match only the host; can not match only the domain and 
>>> subdomains
>>>
>>> current wget's behavior:
>>> - "no_proxy=.mit.edu"
>>>   - will match the domain and subdomains e.g. "www.mit.edu" or 
>>> "www.subdomain.mit.edu"
>>>   - will NOT match the host "mit.edu"
>>> - "no_proxy=mit.edu"
>>>   - will match the domain and subdomains e.g. "www.mit.edu" or 
>>> "www.subdomain.mit.edu"
>>>   - will match the host "mit.edu"
>>> - downside: can not match only the host
>>>
>>> wget's behavior with proposed patch:
>>> - "no_proxy=.mit.edu"
>>>   - will match the domain and subdomains e.g. "www.mit.edu" or 
>>> "www.subdomain.mit.edu"
>>>   - will match the host "mit.edu"
>>> - "no_proxy=mit.edu"
>>>   - will match the domain and subdomains e.g. "www.mit.edu" or 
>>> "www.subdomain.mit.edu"
>>>   - will match the host "mit.edu"
>>> - downside: can not match only the host; can not match only the domain and 
>>> subdomains
>>> - it would be consistent with curl's behavior
>>>
>>> emacs's behavior:
>>> - "no_proxy=.mit.edu"
>>>   - will match the domain and subdomains e.g. "www.mit.edu" or 
>>> "www.subdomain.mit.edu"
>>>   - will match the host "mit.edu"
>>> - "no_proxy=mit.edu"
>>>   - will NOT match the domain and subdomains e.g. "www.mit.edu" or 
>>> "www.subdomain.mit.edu"
>>>   - will match the host "mit.edu"
>>> - downside: can not match only subdomains
>>>
>>> python httplib2's behavior:
>>> - "no_proxy=.mit.edu"
>>>   - will match the domain and subdomains e.g. "www.mit.edu" or 
>>> "www.subdomain.mit.edu"
>>>   - will match the host "mit.edu"
>>> - "no_proxy=mit.edu"
>>>   - will NOT match the domain and subdomains e.g. "www.mit.edu" or 
>>> "www.subdomain.mit.edu"
>>>   - will match the host "mit.edu"
>>> - downside: can not match only subdomains
>>>
>>> To sum it up. Each approach has some downsides. Given the change that I 
>>> provided, wget's behavior would be consistent with curl's behavior. However 
>>> it will have more downsides that it currently has, specifically it will 
>>> loose the ability to not to match the host, but only domain and subdomains. 
>>> Emacs's behavior is similar to Python's httplib2 behavior regarding the 
>>> leftmost dot.
>>>
>>> Honestly I have a soft preference for keeping the current wget's behavior. 
>>> But I admit that making the behavior consistent with curl's behavior makes 
>>> sense. Please let me know how you would like to proceed.
>>>
>>> To make the behavior consistent with curl, the previously attached changes 
>>> should be OK. If you find those new conditions too complicated, I can try 
>>> to rethink it, but I already tried to make it as little complicated as 
>>> possible and at the same time trying to not completely rewrite the function.
>>>
>>> If you'll decide to keep the current behavior, I'll modify the test that I 
>>> added to cope with the behavior.
>>
>> Great work, Tomas !
>>
>> Wow, didn't think it's so messed up :-(
>>
>> We should definitely document your results, e.g. in the wget manual.
>>
>> If we keep the current behavior, we could adjust it with a new option or
>> a new env variable 'WGET_NO_PROXY_MODE'. Which could take well-defined
>> values like 'curl', 'emacs', 'wget' (the default), and maybe a new one
>> ('strict') with none of the detected downsides.
>>
>> Looks a bit over-engineered, but it means that wget can easily adopt to
>> existing environments. And the code seems pretty straight forward.
>>
>> Let's see if some more opinions come in.
>>
>> Regards, Tim
> 
> Yes, 'WGET_NO_PROXY_MODE' is probably the safest option with regard to 
> backward compatibility. And if needed, the default could later change. One 
> downside of allowing values like 'curl' or 'emacs' is that these would 
> probably require also handling of asterisk "*" in hostnames, as those tools 
> do.
> 
> Nevertheless for now, I at least modified the new test to cover current wget 
> behavior. There were also minor changes to the test framework in order to 
> make the test possible. Patches are attached. Please let me know if they need 
> any changes.


Please add your test to testenv/Makefile.am (best directly before adding
$(METALINK_TESTS)). The test currently doesn't work for me, since host
`working1.localhost` can't be resolved.

Maybe you can add `testenv/certs/wgethosts`, similar as
`tests/certs/wgethosts`, but with working1.localhost and
working2.localhost included. You have to set env variable HOSTALIASES to
that file name (glibc feature).

In patch 0001 there is a typo 'retuns'.

Please don't make your lines in the commit messages longer than 79 chars.

Not sure if and when I implement WGET_NO_PROXY_MODE. We try to make no
more changes to wget 1.x except bug fixes. All new stuff should only go
into wget2.

Regards, Tim

signature.asc
Description: OpenPGP digital signature

Re: [PATCH] no_proxy domain matching

Reply via email to