Thank you, Tomas, patches applied and pushed.

Regards, Tim
On 10.07.20 15:34, Tomas Hozza wrote:
>
> On 27. 11. 2019 10:50, Tim Rühsen wrote:
>> On 11/26/19 4:00 PM, Tomas Hozza wrote:
>>>
>>> On 20. 11. 2019 18:47, Tim Rühsen wrote:
>>>> On 20.11.19 12:41, Tomas Hozza wrote:
>>>>> On 7. 11. 2019 21:30, Tim Rühsen wrote:
>>>>>> On 07.11.19 15:21, Tomas Hozza wrote:
>>>>>>> Hi.
>>>>>>>
>>>>>>> In RHEL-8, we ship a wget version that suffers from a bug fixed by [1].
>>>>>>> The fix resolved an issue with matching subdomains when the no_proxy
>>>>>>> domain definition was prefixed with a dot, e.g. "no_proxy=.mit.edu".
>>>>>>> As part of backporting the fix to RHEL, I wanted to create an upstream
>>>>>>> test for the no_proxy functionality. However, I found that there is
>>>>>>> still one corner case which is not handled by the current upstream
>>>>>>> code, and honestly I'm not sure what the intended domain matching
>>>>>>> behavior in that case is. The man page is also not very specific in
>>>>>>> this regard.
>>>>>>>
>>>>>>> The corner case is as follows:
>>>>>>> - no_proxy=.mit.edu
>>>>>>> - the download URL is e.g. "http://mit.edu/file1"
>>>>>>>
>>>>>>> In this case the proxy settings are used, because the domains don't
>>>>>>> match due to the leftmost dot in the no_proxy domain definition. This
>>>>>>> is either intended, or a corner case that was not considered. One
>>>>>>> could argue that if no_proxy is set to ".mit.edu", the leftmost dot
>>>>>>> means that only subdomains of "mit.edu" are excluded from the proxy,
>>>>>>> while the proxy settings should still apply to the "mit.edu" domain
>>>>>>> itself. From my point of view, after reading the wget man page, I
>>>>>>> don't think that the leftmost dot in the no_proxy definition has any
>>>>>>> special meaning.
>>>>>>
>>>>>> Hello Tomas,
>>>>>>
>>>>>> it is hard to decide how to handle this. I personally would like to
>>>>>> see a match with curl's behavior (see
>>>>>> https://github.com/curl/curl/issues/1208).
>>>>>>
>>>>>> Given the docs from GNU emacs, you are right: "no_proxy=.mit.edu"
>>>>>> means "mit.edu and subdomains" are excluded from proxy settings. (see
>>>>>> https://www.gnu.org/software/emacs/manual/html_node/url/Proxies.html)
>>>>>>
>>>>>> The caveat with emacs' behavior is that you cannot exclude just the
>>>>>> subdomains of mit.edu without mit.edu itself. Effectively, that
>>>>>> creates a corner case that can't be handled at all (but if curl also
>>>>>> does it that way, let's go for it).
>>>>>>
>>>>>> Maybe you can find out about the current no_proxy behavior of typical,
>>>>>> widespread tools (regarding the leftmost dot)!? Once we have that
>>>>>> information, we can make a confident decision.
>>>>>>
>>>>>> Regards, Tim
>>>>>
>>>>> Hi Tim.
>>>>>
>>>>> It took me some time to go through the current situation and, to be
>>>>> honest, it is kind of a mess. While each tool handles the no_proxy
>>>>> environment variable a little bit differently, there are some
>>>>> similarities. Nevertheless, I was not able to find any standard.
>>>>>
>>>>> curl's behavior:
>>>>> - "no_proxy=.mit.edu"
>>>>>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>>>>>     "www.subdomain.mit.edu"
>>>>>   - will match the host "mit.edu"
>>>>> - "no_proxy=mit.edu"
>>>>>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>>>>>     "www.subdomain.mit.edu"
>>>>>   - will match the host "mit.edu"
>>>>> - downside: cannot match only the host; cannot match only the domain
>>>>>   and subdomains
>>>>>
>>>>> current wget's behavior:
>>>>> - "no_proxy=.mit.edu"
>>>>>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>>>>>     "www.subdomain.mit.edu"
>>>>>   - will NOT match the host "mit.edu"
>>>>> - "no_proxy=mit.edu"
>>>>>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>>>>>     "www.subdomain.mit.edu"
>>>>>   - will match the host "mit.edu"
>>>>> - downside: cannot match only the host
>>>>>
>>>>> wget's behavior with the proposed patch:
>>>>> - "no_proxy=.mit.edu"
>>>>>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>>>>>     "www.subdomain.mit.edu"
>>>>>   - will match the host "mit.edu"
>>>>> - "no_proxy=mit.edu"
>>>>>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>>>>>     "www.subdomain.mit.edu"
>>>>>   - will match the host "mit.edu"
>>>>> - downside: cannot match only the host; cannot match only the domain
>>>>>   and subdomains
>>>>> - it would be consistent with curl's behavior
>>>>>
>>>>> emacs's behavior:
>>>>> - "no_proxy=.mit.edu"
>>>>>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>>>>>     "www.subdomain.mit.edu"
>>>>>   - will match the host "mit.edu"
>>>>> - "no_proxy=mit.edu"
>>>>>   - will NOT match the domain and subdomains, e.g. "www.mit.edu" or
>>>>>     "www.subdomain.mit.edu"
>>>>>   - will match the host "mit.edu"
>>>>> - downside: cannot match only subdomains
>>>>>
>>>>> python httplib2's behavior:
>>>>> - "no_proxy=.mit.edu"
>>>>>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>>>>>     "www.subdomain.mit.edu"
>>>>>   - will match the host "mit.edu"
>>>>> - "no_proxy=mit.edu"
>>>>>   - will NOT match the domain and subdomains, e.g. "www.mit.edu" or
>>>>>     "www.subdomain.mit.edu"
>>>>>   - will match the host "mit.edu"
>>>>> - downside: cannot match only subdomains
>>>>>
>>>>> To sum it up: each approach has some downsides. With the change I
>>>>> provided, wget's behavior would be consistent with curl's behavior.
>>>>> However, it would have more downsides than it currently has;
>>>>> specifically, it would lose the ability to match only the domain and
>>>>> subdomains without also matching the host. Emacs's behavior is similar
>>>>> to Python httplib2's behavior regarding the leftmost dot.
>>>>>
>>>>> Honestly, I have a soft preference for keeping wget's current behavior.
>>>>> But I admit that making the behavior consistent with curl makes sense.
>>>>> Please let me know how you would like to proceed.
>>>>>
>>>>> To make the behavior consistent with curl, the previously attached
>>>>> changes should be OK. If you find the new conditions too complicated, I
>>>>> can try to rethink them, but I already tried to keep them as simple as
>>>>> possible while not completely rewriting the function.
>>>>>
>>>>> If you decide to keep the current behavior, I'll modify the test that I
>>>>> added to match that behavior.
>>>>
>>>> Great work, Tomas!
>>>>
>>>> Wow, I didn't think it was so messed up :-(
>>>>
>>>> We should definitely document your results, e.g. in the wget manual.
>>>>
>>>> If we keep the current behavior, we could adjust it with a new option or
>>>> a new env variable 'WGET_NO_PROXY_MODE', which could take well-defined
>>>> values like 'curl', 'emacs', 'wget' (the default), and maybe a new one
>>>> ('strict') with none of the detected downsides.
>>>>
>>>> It looks a bit over-engineered, but it means that wget can easily adapt
>>>> to existing environments. And the code seems pretty straightforward.
>>>>
>>>> Let's see if some more opinions come in.
>>>>
>>>> Regards, Tim
>>> Yes, 'WGET_NO_PROXY_MODE' is probably the safest option with regard to
>>> backward compatibility, and if needed, the default could later change.
>>> One downside of allowing values like 'curl' or 'emacs' is that they
>>> would probably also require handling of the asterisk "*" in hostnames,
>>> as those tools do.
>>>
>>> Nevertheless, for now I have at least modified the new test to cover the
>>> current wget behavior. There were also minor changes to the test
>>> framework needed to make the test possible. Patches are attached. Please
>>> let me know if they need any changes.
>
> Hi Tim.
>
> I finally got to finishing the requested changes. The modified patches
> are attached.
>
>> Please add your test to testenv/Makefile.am (best directly before adding
>> $(METALINK_TESTS)). The test currently doesn't work for me, since the
>> host `working1.localhost` can't be resolved.
>
> I added the test to testenv/Makefile.am.
>
>> Maybe you can add `testenv/certs/wgethosts`, similar to
>> `tests/certs/wgethosts`, but with working1.localhost and
>> working2.localhost included. You have to set the env variable HOSTALIASES
>> to that file name (a glibc feature).
>
> I was surprised by this, since I was not able to reproduce the issue at
> first. However, I later found out that it magically works on Fedora
> because Fedora by default uses systemd's NSS plugin "myhostname", which
> resolves any localhost subdomain to the localhost address.
>
> I tried your suggestion; however, based on the documentation of
> HOSTALIASES and my experiments, HOSTALIASES only works for hostnames
> without a dot. So in this case, where I need to check even subdomains of
> working1.localhost (like www.working1.localhost), it does not work. It
> does not even work for working1.localhost itself. Honestly, I didn't find
> any easy way to make this work on a system that does not resolve
> arbitrary localhost subdomains to the localhost address.
>
> After some thinking, I came up with at least a workaround: skip this test
> if the system cannot resolve the necessary localhost subdomains. This is
> checked at the beginning of the test, and if the resolution fails, the
> test is skipped. However, on systems that support it (like Fedora and
> RHEL), the test runs and passes.
>
> If you have additional ideas about how to make this work everywhere, I'll
> be more than happy to change the test. Unfortunately I'm out of ideas,
> and skipping the test in some cases at least makes it possible to include
> it in the upstream repository (so that it does not unnecessarily make the
> test suite fail).
>
>> In patch 0001 there is a typo 'retuns'.
>
> Fixed.
>
>> Please don't make your lines in the commit messages longer than 79
>> chars.
>
> Fixed.
>
>> Not sure if and when I implement WGET_NO_PROXY_MODE. We try to make no
>> more changes to wget 1.x except bug fixes. All new stuff should only go
>> into wget2.
>>
>> Regards, Tim
>
> Regards,
> Tomas
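
To make the corner case discussed above concrete, here is a minimal,
illustrative C sketch of a no_proxy suffix match that treats a leading dot
as optional, i.e. the curl-compatible rule the proposed patch converges on.
This is not wget's actual no_proxy code; the function name and structure are
invented for this example, and only the host/entry values come from the
thread.

/* Minimal illustration (not wget's actual code) of a no_proxy suffix match
   that treats a leading dot in the entry as optional. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strcasecmp */

static bool
no_proxy_matches (const char *host, const char *entry)
{
  /* ".mit.edu" behaves like "mit.edu": the bare host is matched too. */
  if (*entry == '.')
    entry++;

  size_t hlen = strlen (host);
  size_t elen = strlen (entry);

  if (hlen < elen)
    return false;

  /* The entry must be a (case-insensitive) suffix of the host ... */
  if (strcasecmp (host + (hlen - elen), entry) != 0)
    return false;

  /* ... and the match must start at a label boundary, so "mit.edu"
     never matches "summit.edu". */
  return hlen == elen || host[hlen - elen - 1] == '.';
}

int
main (void)
{
  printf ("%d\n", no_proxy_matches ("mit.edu", ".mit.edu"));      /* 1: the corner case */
  printf ("%d\n", no_proxy_matches ("www.mit.edu", ".mit.edu"));  /* 1 */
  printf ("%d\n", no_proxy_matches ("summit.edu", "mit.edu"));    /* 0 */
  return 0;
}

Without the dot-skipping at the top, the first call returns 0, which
corresponds to wget's current behavior of sending "http://mit.edu/file1"
through the proxy even with no_proxy=.mit.edu.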

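The skip-if-unresolvable workaround Tomas describes amounts to a resolver
probe before the test body runs. The real check lives in the Python-based
testenv test attached to the thread; the stand-alone C sketch below only
illustrates the idea, and both the probed host name and the use of exit
code 77 (the automake convention for a skipped test) are assumptions here.

/* Sketch of the "can this system resolve localhost subdomains?" probe that
   decides whether the no_proxy test is skipped.  Illustration only; the
   real check is done in the Python test. */
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

int
main (void)
{
  struct addrinfo hints, *res = NULL;

  memset (&hints, 0, sizeof hints);
  hints.ai_family = AF_UNSPEC;
  hints.ai_socktype = SOCK_STREAM;

  /* www.working1.localhost only resolves where something like systemd's
     nss-myhostname maps arbitrary *.localhost names to loopback. */
  int rc = getaddrinfo ("www.working1.localhost", NULL, &hints, &res);
  if (rc != 0)
    {
      fprintf (stderr, "SKIP: cannot resolve localhost subdomains (%s)\n",
               gai_strerror (rc));
      return 77;   /* automake treats exit code 77 as a skipped test */
    }

  freeaddrinfo (res);
  puts ("localhost subdomains resolve; the proxy test can run");
  return 0;
}

On systems like Fedora or RHEL, where nss-myhostname resolves any
*.localhost name, the probe succeeds and the test runs; elsewhere the test
is skipped rather than reported as a failure.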