Re: pathname expansion and double-quotes (was: Another $@ expansion issue)

Robert Elz Fri, 01 Mar 2019 13:30:14 -0800

    Date:        Fri, 1 Mar 2019 16:05:07 +0000
    From:        Geoff Clare <g...@opengroup.org>
    Message-ID:  <20190301160507.GA26998@lt2.masqnet>


Again, replying more or less backwards:

  | I think these issues with 2.6.2 and 2.6.3 should be handled separately
  | from the $@ one (hence the change of subject line).

That's fine, and a good idea ... and I suspect that in any case, when
we're done, there may end up being several different, but related, issues
added to mantis (another, which is a consequence of what we have been
discussing, but logically separate, I will mention just below, as it is
relevant to this point as well, and yet another, much further down.)

  | So I think rather than deleting that text, it should be changed to
  | something like:
  |
  |      If a parameter expansion occurs inside double-quotes and the word
  |      expansions being performed include pathname expansion,

The second half of that I would write as "and is in a context where
pathname expansion occurs"

This is the first probable extra issue I just referred to.   I'd like to
add something like the following in the prelude of 2.6.   (Some of this,
perhaps even most of this, is already there, and I haven't yet sought
to integrate what is new here with what already exists, so don't treat
the following as planned text, just an idea to be done correctly later.)

        The expansions defined in this section take place when, and only
        when, text elsewhere in this standard specifies that some or all of the
        word expansions occur.   That may be done by listing some subset of
        these expansions which are to occur, or simply by referring to
        them all, for example "the words are expanded" or "the expanded
        word" in which case all of the expansions defined here occur.
        In other cases, only the specified expansions occur.  Because
        the results of some expansions can vary depending upon which
        other expansions will be, or have been, performed, this standard
        uses the phrase "in a context where <i>expansion</i> occurs", or
        its negated form, which shall be interpreted to mean that
        if the named <i>expansion</i> was, or will be, performed (or not
        performed in the negated version) by explicitly, or implicitly
        by specifying all, selected by the section of this standard
        which specifies that the word expansion occurs, then the condition
        is true, and will cause whatever other processing is said to occur,
        or not occur, in that case.

        When word expansions occur, and regardless of how they are listed,
        including in which order, the expansions which are to occur,
        always occur in the order defined in this section, with any
        expansions that are not to occur simply omitted.

All that is really trying to add, which is not already there, I think,
is a definition of what "in a context where xxx occurs" means, and an
explicit statement that expansions only ever happen when something in
the standard says they do.

As usual, my actual wording will need fixing by someone else (once it
really appears ... what is above will certainly not be it!)

And actually (more "added later" words here) we probably also need to
say that whenever the standard requires parameter expansion, but not
command substitution or arithmetic expansion (eg: expanding ENV or PS1)
the implementation is permitted to perform whichever of those 3 was
not specified by the standard as an extension.    That is, a user cannot
complain that their carefully planned
        PS1='$((1 + 1)) '
is printed as   "2 " by some shell, rather than the "$((1 + 1)) " they
were expecting, because the standard only specifies parameter expansion,
not arithmetic (though for PS1 arith might have been added, but I don't
think that has happened for ENV).   I think that specifying this option
for the implementation in the 2.6 prelude makes more sense than
specifying it over and over again in the remainder of the standard.


  |      all
  |      characters that result from the parameter expansion shall be
  |      treated as literal characters for the purposes of pathname
  |      expansion.

For the rest of that, and a suggested other way of saying it, we first
need to go back to an earlier part of your message:


  | Having thought some more about this part, I think what the first bullet
  | is trying to say is the point you made when you said, for var=.? :
  |
  |     What the quotes did, was not to prevent pathname expansion from
  |     happening, but to prevent the '?' being interpreted as "match any
  |     character" but instead be "match a question mark".

Yes, I agree that is what it thinks it needs to do.

But:

  | Without a statement at this point, there is nothing in the standard
  | that says the '?' must be treated as a literal character.  There is
  | text that says for:
  |
  |     echo *".?"
  |
  | the '?' is a literal character (2.2.3 Double-Quotes), but that text
  | doesn't apply when the '?' results from the expansion rather than
  | being directly double-quoted.

I disagree with that.   The reasoning is somewhat convoluted, but it
is actually already specified.   Because it is somewhat roundabout,
this may end up being a case where we want to add some text for clarification
but if we do, I think it ought to be descriptive, rather than prescriptive,
to avoid issues that can otherwise occur when there are two different
requirements attempting to achieve the same thing, but in different
places using different language.

The reason that it isn't really required (that things just work out if
we do nothing here at all) starts with XCU 2.3 bullet point 4:

        4. If the current character is <backslash>, single-quote,
           or double-quote and it is not quoted, it shall affect
           quoting for subsequent characters up to the end of the
           quoted text. The rules for quoting are as described in
           Section 2.2. During token recognition no substitutions
           shall be actually performed, and the result token shall
           contain exactly the characters that appear in the input
           (except for <newline> joining), unmodified, including any
           embedded or enclosing quotes or substitution operators,
           between the <quotation-mark> and the end of the quoted text.
                [...]

The important part of that for the present purpose, is 
                including any embedded or enclosing quotes

That is, when we tokenise:
                 *"${var}"
what we get as the text of the word token is exactly those characters
(all of them) unchanged.

It is often hard to remember that, as no implementation I know of,
actually works like that, and we tend to think more of the way our
most familiar implementation actually works, and what it needs to do
to implement things correctly.

In any case, since we are (obviously) "in a context where parameter
expansion occurs" (sorry, couldn't resist!) we expand the ${var}
substring of that, noting along the way that this occurs inside double
quotes.   The result of the expansion (which the double quotes in this
case do not affect) is .? so the result word, where the ${var} has been
replaced by .? is:      *".?"

If this is not obvious (if you'd expect that the "${var}" is what is
being expanded, and the quotes go away in the result) consider instead
the input word:
        *"*${var}?"
which will expand to
        *"*.??"
if done as I describe above without any issues.   But the other way, what?
We cannot go arbitrarily adding quote chars to make the original be
        *"*""${var}""?"
I don't think, even though the results should be identical.   Of course,
internally, an implementation can do that if it prefers (or do something
which is logically the equivalent of that).

Next, assuming we are in a context where field splitting occurs, we
do that on the results of the expansion, but since that is quoted,
field splitting (if we're doing it) changes nothing, so we can ignore
that one for now.

Next, if we're in a context where pathname expansion occurs (which we
are assuming already for this proposed text) we do pathname expansion
on the field that field splitting produced, which here is just the
original (parameter expanded) word
        *".?"
To pathname expansion (what it sees) this is identical to a case where
that same string had been produced as a word from the tokeniser, and there
never was a parameter to expand.

Pathname expansion sees the unquoted '*' and uses that to match
anything, and the two quoted following chars, which must match literally,
and it makes no difference at all whether or not the .? came from an
expansion (of any kind) or was there originally.   Pathname expansion
never depends upon anything like that.
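
A small sketch of that behaviour, for anyone who wants to try it. The
scratch directory, the file names and the variable names are made up for
the illustration, mktemp -d is assumed to be available, and a POSIX-style
sh is assumed:

```shell
# Work in a scratch directory so the patterns only see these files.
start=$PWD
dir=$(mktemp -d) && cd "$dir" || exit 1
touch 'x.?' 'y.?' 'z.a'

var='.?'

# Quoted expansion: the '?' that comes from $var must match literally,
# so only names ending in the two characters ".?" are produced.
quoted=$(printf '%s\n' *"${var}")

# Unquoted expansion: the '?' is a pattern character, so z.a matches too.
unquoted=$(printf '%s\n' *${var})

printf 'quoted:\n%s\nunquoted:\n%s\n' "$quoted" "$unquoted"

cd "$start" && rm -rf "$dir"
```

The quoted form should list only x.? and y.?, exactly as if *".?" had been
typed directly, while the unquoted form should list all three files.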

Note that this is quite different from tilde expansion, where unquoted
input (the tilde prefix) gets replaced by (effectively) single quoted text,
so the results of the tilde expansion are not subject to field splitting,
parameter (etc) expansion, or pathname expansion.   That requires special
language added to make that happen - and also to make sure that this
"effectively quoted" gets removed again later.
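
To see the tilde behaviour in practice, something like the following can be
tried. The value given to HOME is deliberately silly and purely
hypothetical, and the variable names are only for the illustration:

```shell
# Give HOME a value containing a space and a '*', both of which would
# matter if the result of tilde expansion were treated as unquoted text.
saved_home=${HOME-}
HOME='/nonexistent dir/*'

# The tilde prefix is unquoted, but its replacement is treated as quoted,
# so neither field splitting nor pathname expansion touches it.
set -- ~
tilde_count=$#
tilde_value=$1

HOME=$saved_home
printf '%s field: %s\n' "$tilde_count" "$tilde_value"
```

The expected result is a single field holding the HOME value unchanged.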

The parameter/arith/command-sub expansions do not work like that (arith
obviously has no relationship with pathname expansion, as it only produces
digits, and possibly a sign (+ or - ... maybe only - I forget) and none of
those characters is special for pathname expansion, so it really makes no
difference whether "$((1 + 1))" produces "2" (which it does) or just 2
when pathname expansion is being considered).

[This added in final re-reading, after spell checking, of this
message]:   Hmm, actually, that's not correct, if we have the word
        [2$((3-9))]
then the arith expansion results in
        [2-6]
which when pathname expansion happens, will match any files that
exist with the names 2 3 4 5 or 6
whereas if the original word had been
        [2"$((3-9))"]
then the expansion will result in
        [2"-6"]
which, when pathname expansion happens, will match any files that
exist with the names 2 - or 6 (as the quoted - is not the special
range separator character).
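
A sketch that can be used to check this. The scratch directory and the
variable names are invented for the illustration, mktemp -d is assumed to
be available, and the C locale is forced so that the range and the sort
order of the results are predictable:

```shell
LC_ALL=C; export LC_ALL
start=$PWD
dir=$(mktemp -d) && cd "$dir" || exit 1
touch ./- 2 3 4 5 6

# Unquoted arithmetic expansion: the word becomes [2-6], where the '-'
# is the range separator.
set -- [2$((3-9))]
unquoted=$*

# Quoted arithmetic expansion: the word becomes [2"-6"], where the '-'
# is a literal member of the bracket expression.
set -- [2"$((3-9))"]
quoted=$*

printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"

cd "$start" && rm -rf "$dir"
```

The first line should show 2 3 4 5 6 and the second should show - 2 6.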

So if we're doing something for parameter expansions, and command
substitution, then we need similar text for arithmetic expansions
as well ... and in that case, rather than writing similar text 3
times, we should probably put it in the prelude instead.   So I am
going to slightly revise the text I had as a suggested alternative
to your version (somewhat lower down...) to account for that.  In
fact, I think I'll include both versions.


Why this is complicated, is because we don't actually implement anything
like what the standard says (at least the NetBSD shell doesn't - the FreeBSD
shell doesn't either, but while they have a common heritage, and in many
respects, almost identical code, in this area the way the implementations
are now done has diverged so much that they are not even similar).

In our implementation, we do need to do exactly what you said - when we
expand a parameter, if it was in double quotes, we need to mark each
character in the expansion as having been quoted (because they're new
and were not marked as part of tokenisation) if they could possibly be
magic for pathname expansion ... we don't have to worry about field
splitting, as that is also done differently.  For that we simply record
(keep a list) of the beginning/end of each section of text that resulted
from an unquoted expansion, and scan the parts of each word that results
that are contained in that list for IFS chars, and nothing else.

But that is all just our implementation, and is all needed so we can
pretend what we're doing is what the standard says we should do, rather
than what we're actually doing.

The model promoted by the standard needs none of this, as it presumes that
the actual quote characters are still there, and that (apart from field
splitting for which it requires recording which pieces of the text came
from expansions, and which did not) so anything expanded inside double
quotes is still inside double quotes, as those remain until we do quote
removal (assuming we are in a context where quote removal occurs) which
is always the last thing that happens.

This is also why I added the following in my first attempt at (part of)
some replacement text for the $@ case ...

        If the '@' being expanded occurred within double quotes,
        then the expansion of each positional parameter is placed within
        double quotes in the fields generated.  These double quotes are
        treated as copies of the original pair, and are later deleted by
        quote removal.

And while that is not nearly adequate for what it really needs to say
to get this right, it is there, as in the standard's model of how
everything works, and as currently written, if we have the word

        aaa"bbb$@xxx"yyy

and we're doing parameter expansion, and $# = 3 -- 3 because 0 and 1 are
not interesting cases for the point I am about to make, 2 is, but does
not expose all of the problem, and 4 and more are just the same as 3, but
repeated (over and over for >4).

Let's assume that $# = 3 was obtained by

        set -- P Q R

According to how 2.5.2 for @ is written, what we get from this is

        aaa"bbbP Q Rxxx"yyy

(3 fields) as the part of the word preceding the $@ is joined before the
first field produced (the P from $1) and the part of the word (as the model
in the standard has it) before $@ is literally aaa"bbb ... and the part of
the word which follows $@ is joined, after, the final positional parameter
expanded ($3 in this case, ie: R) and in the model, that is xxx"yyy

How exactly we would do field splitting on that is unclear (which part is
"within quotes" as we look at each produced field?) but it is kind of obvious
that the middle field (Q) is not quoted, so if it contains any pathname
expansion special chars (ie: * ? etc) then we would get pathname expansion
from that.

That is not what really happens (anywhere) nor what should happen.

The solution to this isn't to note that because the param expansion occurred
in double quotes, we must specially escape any pathname matching chars that
are generated, as that doesn't fix the other problems with generating things
like this, instead, we need to make $@ in double quotes, expand (in standards
model style) to

        aaa"bbbP" "Q" "Rxxx"yyy

which is not exactly what the text I wrote would do (that quoted above) which
is why it is inadequate as written, but for the model we need to end up
with something like that, and we need to also ensure, that even though the
4 (new) embedded double quote chars were produced as a result of doing
the expansion of $@, they are still subject to quote removal, even though
quote removal normally only deletes quotes that were part of the original
input, and not anything generated.   The same is true of the quotes (single
quotes this time, or possibly lots of \'s) generated around any text which
is the result of a tilde expansion - that quoting also needs to be removed
even though it wasn't original, but was generated as a result of an expansion.
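
What shells actually do with such a word can be checked with a sketch like
this one. The scratch directory, the file names and the variable names are
invented for the illustration, mktemp -d is assumed to be available, and
the middle positional parameter is set to * rather than Q so that any
unwanted pathname expansion would be visible:

```shell
start=$PWD
dir=$(mktemp -d) && cd "$dir" || exit 1
touch file1 file2          # something for an unquoted '*' to match

set -- P '*' R

# Expand the word, then capture the fields it produced.
set -- aaa"bbb$@xxx"yyy
count=$#; first=$1; middle=$2; last=$3

printf '%s fields: <%s> <%s> <%s>\n' "$count" "$first" "$middle" "$last"

cd "$start" && rm -rf "$dir"
```

The result should be three fields, aaabbbP, a literal * and Rxxxyyy, which
is consistent with every field behaving as if it were inside its own copy
of the double quotes.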

We will need another issue about that.    Incidentally, apologies for all
this digression, but just now I am using these e-mails as something of a
dump of consciousness, so I (or anyone else who can access the messages from
the list archive) can verify that every issue that is mentioned either
becomes revealed as unimportant, and not needing anything done, or is correctly
handled in what we end up with, and nothing just gets forgotten.

All this illustrates a case where the standard's model makes things much
messier than the actual implementations - in an implementation like the
original Bourne shell's, where quoting was done by simply attaching a quote
mark to every quoted character, and deleting the quoting characters themselves,
all this is much simpler - when expanding anything that was quoted (expanding
quoted @ instead of unquoted @ for example) we simply copy the quote marker
to every character expanded.   When expanding tilde (the Bourne sh did not,
of course, but it could have...) every char that results is just marked as
single quoted.   Field splitting can largely ignore quoting, as when
IFS is assigned, quote removal is done before the actual assignment to
the variable, so no variable can actually contain a quoted character.
That means that no quoted char (in a word) can possibly match a character
from IFS, and field splitting simply does the right thing without caring
about any of this.   Similarly, the unquoted * is a glob magic char,
the quoted form is not ...    Everything simply works with no complications
at all (and when "$@" is expanded with $# = 0, we simply end up with nothing
produced, so there is nothing there, nothing to carry a quote mark, and so
the end result of that is nothing - and why it was always considered a bug
in the Bourne shell that it actually produced "" - but that had nothing to
do with any quote marks, but its lack of any method to actually make a
word actually go away completely (to delete a field in modern language),
there was a word that contained the original "$@" (really double quoted $
and double quoted @) that expands to nothing, but the word remained, so it
was left with a word containing nothing, just the same as if it had been "").
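
The difference between "$@" with no positional parameters and a plain ""
is easy to observe in a current shell. The helper function name here is
made up for the illustration:

```shell
# Hypothetical helper: report how many arguments it was given.
count_args() { echo "$#"; }

set --                          # no positional parameters, so $# is 0

none=$(count_args "$@")         # "$@" should produce no fields at all
empty=$(count_args "")          # "" always produces one empty field

printf '"$@" gives %s args, "" gives %s arg\n' "$none" "$empty"
```

A shell that gets this right reports 0 for "$@" and 1 for "".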

  | So I think rather than deleting that text, it should be changed to
  | something like:
  |
  |      If a parameter expansion occurs inside double-quotes and the word
  |      expansions being performed include pathname expansion, all
  |      characters that result from the parameter expansion shall be
  |      treated as literal characters for the purposes of pathname
  |      expansion.

So, after all of that, I'd suggest that if we decide we need something
here at all - to make it more obvious to the reader what happens, it
should instead be more like

        If a parameter expansion occurs inside double-quotes in a context
        where pathname expansion occurs, note that the double-quotes remain
        in the results from the expansion until quote removal is later
        performed (in a context where quote removal occurs) and so any
        characters that result from the parameter expansion are treated
        as literal characters for the purposes of pathname expansion.

  | Note that there is similar text in 2.6.3 Command Substitution.

And the same (similar) there.

Or, as mentioned earlier, since arithmetic needs this as well, instead
put this in the 2.6 prelude, and write it more like:

        When any of parameter expansion, command substitution, or arithmetic
        expansion is performed, and occurs inside double-quotes, in a context
        where pathname expansion occurs, note that the double-quotes remain
        along with the results from the expansion until quote removal is
        later performed (in a context where quote removal occurs) and so all
        characters that result from the expansion(s) are treated as literal
        characters for the purposes of pathname expansion.

kre

ps: just for completeness, and as more of the dump of consciousness, it
is also worth remembering, that when a word like

        *".?"

is subject to pathname expansion, and assuming that there are exactly
two files in the appropriate directory that end in .?, say x.? and y.?
then the results from pathname expansion will be the two words (fields)

        x".?"  y".?"

the quotes don't participate in the pathname expansion, other than to
prevent either the . or the ? from being special chars, but they have
not yet been removed, and remain present until quote removal comes along
(the immediately following, and last, expansion) and removes them.

I don't recall any context where pathname expansion occurs which is not
also a context where quote removal occurs, and as quote removal occurs
immediately after pathname expansion, your average implementation is not
going to bother to retain those quotes (in any form) as they are just
going away anyway, there is nothing left for them to do - but when we're
working with the model of how all this works as defined in the standard,
we do need to be aware of what the model says should be happening in all
of these cases.
