Re: [Bug-wget] Shouldn't wget strip leading spaces from a URL?

2017-06-14 Thread Dale R. Worley
L A Walsh  writes:
> Dale R. Worley wrote:
>>  But of course, no [RFC3986-conforming] URL
>>  contains an embedded space because that's what it
>>  says in RFC 3986, which is "what *defines* what a
>>  URL *is*"[sic; should read "is one definition of
> a URL.
> ---
> Right, just like speed limit signs define
> what the maximum speed is.
>
> There is the "model" and there is reality.  To believe that
> the model replaces and/or dictates reality is not
> realistic and bordering on some mental pathology.
>
> I understand what you are saying Dale.  My dad was a lawyer,
> and life would be so much easier if specs, RFCs or other
> models of reality were the only thing we had to pay attention
> to.  But... to do so generally creates various levels of
> discomfort and/or headaches.

There's a reason why the Internet has advanced on the back of thousands
of anal-retentive standards documents.

There really are situations where DWIM (Do What I Mean) design makes
life worse.  It's plausible that in a web browser it's reasonable to
allow users to type in purported URLs that are invalid, and for the
browser to make its best guess as to what the user meant.  This is
because getting the guess wrong rarely causes troubles beyond showing
the user a page that they aren't interested in; the user can just retype
the right URL and get what they wanted.

But every such slackness introduces uncertainty.  If the user types
"http://www.example.com/ " (that is, with a trailing space), should it
be handled as "http://www.example.com/%20" (assuming the user wanted to
access a file whose name is a single space, and providing the URL that
does that) or "http://www.example.com/" (assuming that the space is a
cut-and-paste error and should be ignored)?
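
To make the two readings concrete, here is a minimal sketch in Python
(not wget's own C code).  The function names and the KEEP set are
illustrative assumptions, not anything taken from wget:

```python
from urllib.parse import quote

# Characters left untouched when escaping: the RFC 3986 reserved characters
# plus "%" so that existing percent-escapes are not encoded a second time.
# (Illustrative choice; this is not wget's escaping table.)
KEEP = ":/?#[]@!$&'()*+,;=%"


def escape_reading(raw):
    """Reading 1: the space is intentional, so percent-encode it."""
    return quote(raw, safe=KEEP)


def strip_reading(raw):
    """Reading 2: the space is a cut-and-paste error, so drop it."""
    return raw.strip(" ")


typed = "http://www.example.com/ "
print(escape_reading(typed))  # http://www.example.com/%20
print(strip_reading(typed))   # http://www.example.com/
```

Both results are well-formed, yet they name different resources, and
nothing in the typed string says which one was meant.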

As long as this is being directly monitored by the user, this works
reasonably well.  But once the DWIM program starts being used as a
*part* of a system, things get hazardous.  People start building other
parts of the system assuming that the DWIM program doesn't hold them to
the rules.  And since the DWIM program's behavior in those
outside-the-box cases isn't clearly defined, there's no protection from
the situation where its guesses change, but the rest of the system
depends on *particular* guesses that it used to make.

In the particular case of wget, consider that portions of the URL that
the user enters are extracted and used in the HTTP request.  Again,
there's a strict specification of what constitutes a valid HTTP request.
If the user includes an invalid character in the URL, should wget simply
pass it through into the HTTP request, assuming that a well-built web
server will Do What the User (probably) Meant?
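
The strict alternative can be pictured as a check that refuses the input
before anything is sent.  The following is only a rough Python sketch
under stated assumptions: it tests individual characters against the set
RFC 3986 allows somewhere in a URI, which is weaker than parsing the
full grammar, and the names ALLOWED, invalid_characters and
require_valid are hypothetical, not part of wget:

```python
import string

# Characters RFC 3986 permits somewhere in a URI: unreserved characters,
# reserved characters (gen-delims and sub-delims), and "%" for escapes.
# This is a per-character test only, not a full grammar check.
ALLOWED = set(
    string.ascii_letters + string.digits + "-._~"  # unreserved
    + ":/?#[]@"                                    # gen-delims
    + "!$&'()*+,;="                                # sub-delims
    + "%"                                          # percent-escape marker
)


def invalid_characters(candidate):
    """Return (index, character) pairs for characters outside ALLOWED."""
    return [(i, ch) for i, ch in enumerate(candidate) if ch not in ALLOWED]


def require_valid(candidate):
    """Return the candidate unchanged, or raise instead of guessing."""
    bad = invalid_characters(candidate)
    if bad:
        raise ValueError(f"not a valid URL, offending characters: {bad}")
    return candidate


print(invalid_characters("http://example.com/a b"))  # [(20, ' ')]
```

A caller that wants DWIM behaviour can still layer it on top, but then
the guessing is explicit and lives in one place.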

And it should be remembered that there's a design principle of Unix
that's rarely mentioned:  People write a lot of shell scripts for Unix,
and the external interface of Unix commands is optimized for use within
shell scripts, not for being directly executed by users.  That's why
most of them provide no output whatever if their execution is
successful, and why most of them that do generate output provide no
"headers" -- that would get in the way of handing the output to another
program as input.  I've even seen an exercise in a Unix training book
asking the student to explain why the single header line in the output
of the "ps" command is undesirable.

Within that context, the point of wget is to fetch the contents of a URL
that is provided by something else that *should* know what a properly
formed URL is.

Dale



Re: [Bug-wget] Shouldn't wget strip leading spaces from a URL?

2017-06-14 Thread Tim Rühsen
On Wednesday, 14 June 2017 11:49:59 CEST L A Walsh wrote:
> Dale R. Worley wrote:
> >  But of course, no [RFC3986-conforming] URL
> >  contains an embedded space because that's what it
> >  says in RFC 3986, which is "what *defines* what a
> >  URL *is*"[sic; should read "is one definition of
> 
> a URL.
> ---
> Right, just like speed limit signs define
> what the maximum speed is.
> 
> There is the "model" and there is reality.  To believe that
> the model replaces and/or dictates reality is not
> realistic and bordering on some mental pathology.
> 
> I understand what you are saying Dale.  My dad was a lawyer,
> and life would be so much easier if specs, RFCs or other
> models of reality were the only thing we had to pay attention
> to.  But... to do so generally creates various levels of
> discomfort and/or headaches.
> 
> >  Now, someone can provide a string that contains spaces and claim
> >  it's a URL, but it isn't. The question is, What to do with it?  My
> >  preference is to barf and tell the user that what they provided
> >  wasn't a proper URL.
> 
> ---
> I.e.: not doing what you can to give them some output
> that is your _best_ _attempt_ to give them what they wanted
> (excluding dangerous interpretations).
> 
> A friendly user-interface attempts to help the user get
> what they want despite their not asking for it according to
> regulation or with poor syntax or spelling.
> 
> >  Beyond that, one might do some simple tidying up, such as removing
> >  leading and trailing spaces.  That fix, by the way, is known to be
> >  safe, *because a URL can't contain a space*, and so any trailing
> >  space can't actually be part of the URL.
> 
> 
> One might argue that leading and trailing space, since they
> are not "internal" to the URL, aren't really a part of the URL.
> 
> >  It gets uglier when there are invalid characters in the middle of
> >  the URL, because simply deleting them is unlikely to produce the
> >  results the user expected.
> 
> ---
> Yup.  Thus my original post thinking that they should be
> removed since they can't really be part of a URL and as "characters
> non gratis", should be removed before sending them to a remote
> website.

In short, there are two 'realities' here:
1. The RFC, which defines (part of) a protocol between client and server.
Clients and servers have to follow this standard; if they deviate, they are
out. This is 'reality' one.

2. User input... well, every (web) client interprets user input
differently. But every client tries hard to 'WYGIWYM' (What You Get Is What
You Mean).
Basically, the problem is solved (or should be) by browsers, so why not do as
they do? Well, we can do it similarly, but should not forget that 'wget' is a
'power user' tool while a browser is used by everyone.
People also use 'wget' for very special tasks, e.g. downloading a file whose
name consists of a single space. Wget would become useless for these people
(count me in here) if they couldn't comfortably enter a URL with a
trailing space (wget knows how to escape that, following the RFC).

Example:
  wget 'https://example.com/ '
Should wget download this space-named file or (silently) strip the
space and download index.html?
Two answers here; which one has more weight? Maybe the one that doesn't
disturb backward compatibility!?

> 
> -linda

With Best Regards, Tim




Re: [Bug-wget] Shouldn't wget strip leading spaces from a URL?

2017-06-14 Thread L A Walsh

Dale R. Worley wrote:

>  But of course, no [RFC3986-conforming] URL
>  contains an embedded space because that's what it
>  says in RFC 3986, which is "what *defines* what a
>  URL *is*"[sic; should read "is one definition of

a URL.
---
   Right, just like speed limit signs define
what the maximum speed is.

There is the "model" and there is reality.  To believe that
the model replaces and/or dictates reality is not
realistic and bordering on some mental pathology.

I understand what you are saying Dale.  My dad was a lawyer,
and life would be so much easier if specs, RFCs or other
models of reality were the only thing we had to pay attention
to.  But... to do so generally creates various levels of
discomfort and/or headaches.



>  Now, someone can provide a string that contains spaces and claim
>  it's a URL, but it isn't. The question is, What to do with it?  My
>  preference is to barf and tell the user that what they provided
>  wasn't a proper URL.

---
   I.e.: not doing what you can to give them some output
that is your _best_ _attempt_ to give them what they wanted
(excluding dangerous interpretations). 


   A friendly user-interface attempts to help the user get
what they want despite their not asking for it according to
regulation or with poor syntax or spelling.




>  Beyond that, one might do some simple tidying up, such as removing
>  leading and trailing spaces.  That fix, by the way, is known to be
>  safe, *because a URL can't contain a space*, and so any trailing
>  space can't actually be part of the URL.


   One might argue that leading and trailing space, since they
are not "internal" to the URL, aren't really a part of the URL.


>  It gets uglier when there are invalid characters in the middle of
>  the URL, because simply deleting them is unlikely to produce the
>  results the user expected.

---
   Yup.  Thus my original post thinking that they should be
removed since they can't really be part of a URL and as "characters
non gratis", should be removed before sending them to a remote
website.

-linda