Date: Fri, 1 Mar 2019 16:05:07 +0000
From: Geoff Clare <g...@opengroup.org>
Message-ID: <20190301160507.GA26998@lt2.masqnet>

Again, replying more or less backwards:

| I think these issues with 2.6.2 and 2.6.3 should be handled separately
| from the $@ one (hence the change of subject line).

That's fine, and a good idea ... and I suspect that in any case, when we're done, there may end up being several different, but related, issues added to mantis (another, which is a consequence of what we have been discussing, but logically separate, I will mention just below, as it is relevant to this point as well, and yet another, much further down.)

| So I think rather than deleting that text, it should be changed to
| something like:
|
|     If a parameter expansion occurs inside double-quotes and the word
|     expansions being performed include pathname expansion,

The second half of that I would write as "and is in a context where pathname expansion occurs"

This is the first probable extra issue I just referred to. I'd like to add something like the following in the prelude of 2.6. (Some of this, perhaps even most of this, is already there, and I haven't yet sought to integrate what is new here with what already exists, so don't treat the following as planned text, just an idea to be done correctly later.)

    The expansions defined in this section take place when, and only when, elsewhere in this standard specifies that some or all of the word expansions occur. That may be done by listing some subset of these expansions which are to occur, or simply by referring to them all, for example "the words are expanded" or "the expanded word" in which case all of the expansions defined here occur. In other cases, only the specified expansions occur.

    Because the results of some expansions can vary depending upon which other expansions will be, or have been, performed, this standard uses the phrase "in a context where <i>expansion</i> occurs", or its negated form, which shall be interpreted to mean that if the named <i>expansion</i> was, or will be, performed (or not performed in the negated version) because it was selected, explicitly or implicitly (by specifying all), by the section of this standard which specifies that the word expansion occurs, then the condition is true, and will cause whatever other processing is said to occur, or not occur, in that case.

    When word expansions occur, and regardless of how they are listed, including in which order, the expansions which are to occur always occur in the order defined in this section, with any expansions that are not to occur simply omitted.

All that is really trying to add, which is not already there, I think, is a definition of what "in a context where xxx occurs" means, and an explicit statement that expansions only ever happen when something in the standard says they do. As usual, my actual wording will need fixing by someone else (once it really appears ... what is above will certainly not be it!)

And actually (more "added later" words here) we probably also need to say that whenever the standard requires parameter expansion, but not command substitution or arithmetic expansion (eg: expanding ENV or PS1), the implementation is permitted, as an extension, to perform whichever of those 3 the standard did not specify. That is, a user cannot complain that their carefully planned

    PS1='$((1 + 1)) '

is printed as "2 " by some shell, rather than the "$((1 + 1)) " they were expecting, because the standard only specifies parameter expansion, not arithmetic (though for PS1 arith might have been added, but I don't think that has happened for ENV).

I think that specifying this option for the implementation in the 2.6 prelude makes more sense than specifying it over and over again in the remainder of the standard.

|     all
|     characters that result from the parameter expansion shall be
|     treated as literal characters for the purposes of pathname
|     expansion.

For the rest of that, and a suggested other way of saying it, we first need to go back to an earlier part of your message:

| Having thought some more about this part, I think what the first bullet
| is trying to say is the point you made when you said, for var=.? :
|
|     What the quotes did, was not to prevent pathname expansion from
|     happening, but to prevent the '?' being interpreted as "match any
|     character" but instead be "match a question mark".

Yes, I agree that is what it thinks it needs to do. But:

| Without a statement at this point, there is nothing in the standard
| that says the '?' must be treated as a literal character. There is
| text that says for:
|
|     echo *".?"
|
| the '?' is a literal character (2.2.3 Double-Quotes), but that text
| doesn't apply when the '?' results from the expansion rather than
| being directly double-quoted.

I disagree with that. The reasoning is somewhat convoluted, but it is actually already specified. Because it is somewhat roundabout, this may end up being a case where we want to add some text for clarification, but if we do, I think it ought to be descriptive, rather than prescriptive, to avoid issues that can otherwise occur when there are two different requirements attempting to achieve the same thing, but in different places using different language.

The reason that it isn't really required (that things just work out if we do nothing here at all) starts with XCU 2.3 bullet point 4:

    4. If the current character is <backslash>, single-quote, or double-quote and it is not quoted, it shall affect quoting for subsequent characters up to the end of the quoted text. The rules for quoting are as described in Section 2.2.
    During token recognition no substitutions shall be actually performed, and the result token shall contain exactly the characters that appear in the input (except for <newline> joining), unmodified, including any embedded or enclosing quotes or substitution operators, between the <quotation-mark> and the end of the quoted text. [...]

The important part of that for the present purpose, is

    including any embedded or enclosing quotes

That is, when we tokenise:

    *"${var}"

what we get as the text of the word token is exactly those characters (all of them) unchanged. It is often hard to remember that, as no implementation I know of actually works like that, and we tend to think more of the way our most familiar implementation actually works, and what it needs to do to implement things correctly.

In any case, since we are (obviously) "in a context where parameter expansion occurs" (sorry, couldn't resist!) we expand the ${var} substring of that, noting along the way that this occurs inside double quotes. The result of the expansion (which the double quotes in this case do not affect) is

    .?

so the result word, where the ${var} has been replaced by .? is:

    *".?"

If this is not obvious (if you'd expect that the "${var}" is what is being expanded, and the quotes go away in the result) consider instead the input word:

    *"*${var}?"

which will expand to

    *"*.??"

if done as I describe above without any issues. But the other way, what? We cannot go arbitrarily adding quote chars to make the original be

    *"*""${var}""?"

I don't think, even though the results should be identical. Of course, internally, an implementation can do that if it prefers (or do something which is logically the equivalent of that).

Next, assuming we are in a context where field splitting occurs, we do that on the results of the expansion, but since that is quoted, field splitting (if we're doing it) changes nothing, so we can ignore that one for now.

Next, if we're in a context where pathname expansion occurs (which we are assuming already for this proposed text) we do pathname expansion on the field that field splitting produced, which here is just the original (parameter expanded) word

    *".?"

To pathname expansion (what it sees) this is identical to a case where that same string had been produced as a word from the tokeniser, and there never was a parameter to expand. Pathname expansion sees the unquoted '*' and uses that to match anything, and the two quoted following chars, which must match literally, and it makes no difference at all whether the .? came from an expansion (of any kind) or was there originally. Pathname expansion never depends upon anything like that.

Note that this is quite different from tilde expansion, where unquoted input (the tilde prefix) gets replaced by (effectively) single quoted text, so the results of the tilde expansion are not subject to field splitting, parameter (etc) expansion, or pathname expansion. That requires special language added to make that happen - and also to make sure that this "effectively quoted" gets removed again later.

The parameter/arith/command-sub expansions do not (arith obviously has no relationship with pathname expansion, as it only produces digits, and possibly a sign (+ or - ... maybe only - I forget), and none of those characters is special for pathname expansion, so it really makes no difference whether "$((1 + 1))" produces "2" (which it does) or just 2 when pathname expansion is being considered).
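
The *"${var}" walk-through above can be checked with a small sketch. This is not from the original message: the file names x.? and y.a and the scratch directory are made up for illustration, and mktemp -d is assumed to be available (it is not required by POSIX).

```shell
# Scratch directory so the globs only see the two files created here.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1
: > 'x.?'           # name ends in a literal ".?"
: > 'y.a'           # name ends in "." followed by some other character
var='.?'

set -- *"${var}"    # expansion inside double quotes: '?' must match literally
quoted="$*"
set -- *${var}      # unquoted expansion: '?' matches any single character
unquoted="$*"

printf 'quoted:   %s\nunquoted: %s\n' "$quoted" "$unquoted"
cd / && rm -rf "$dir"
```

With only those two files present, the quoted form should select just x.? while the unquoted form should select both names.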

[This added in final re-reading, after spell checking, of this message]: Hmm, actually, that's not correct, if we have the word

    [2$((3-9))]

then the arith expansion results in

    [2-6]

which when pathname expansion happens, will match any files that exist with the names 2 3 4 5 or 6, whereas if the original word had been

    [2"$((3-9))"]

then the expansion will result in

    [2"-6"]

which, when pathname expansion happens, will match any files that exist with the names 2 - or 6 (as the quoted - is not the special range separator character).

So if we're doing something for parameter expansions, and command substitution, then we need similar text for arithmetic expansions as well ... and in that case, rather than writing similar text 3 times, we should probably put it in the prelude instead. So I am going to slightly revise the text I had as a suggested alternative to your version (somewhat lower down...) to account for that. In fact, I think I'll include both versions.

Why this is complicated, is because we don't actually implement anything like what the standard says (at least the NetBSD shell doesn't - the FreeBSD shell doesn't either, but while they have a common heritage, and in many respects, almost identical code, in this area the way the implementations are now done has diverged so much that they are not even similar).

In our implementation, we do need to do exactly what you said - when we expand a parameter, if it was in double quotes, we need to mark each character in the expansion as having been quoted (because they're new and were not marked as part of tokenisation) if they could possibly be magic for pathname expansion ... we don't have to worry about field splitting, as that is also done differently. For that we simply record (keep a list of) the beginning/end of each section of text that resulted from an unquoted expansion, and scan the parts of each word that results that are contained in that list for IFS chars, and nothing else.
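
The arithmetic case can be tried the same way. Again this is only a sketch and not part of the original message: the scratch directory and the use of mktemp -d are assumptions, and LC_ALL=C is set so that the range and the sort order of the results are predictable.

```shell
LC_ALL=C; export LC_ALL      # keep bracket ranges and glob ordering predictable
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1
for name in 2 3 4 5 6; do : > "$name"; done
: > ./-                      # a file whose name is a single hyphen

set -- [2$((3-9))]           # becomes [2-6]: the range 2 through 6
range="$*"
set -- [2"$((3-9))"]         # becomes [2"-6"]: the three characters 2, - and 6
literal="$*"

printf 'range:   %s\nliteral: %s\n' "$range" "$literal"
cd / && rm -rf "$dir"
```

The unquoted form should list 2 3 4 5 6, and the quoted form should list the hyphen, 2 and 6.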

But that is all just our implementation, and is all needed so we can pretend what we're doing is what the standard says we should do, rather than what we're actually doing.

The model promoted by the standard needs none of this, as it presumes that the actual quote characters are still there (apart from field splitting, for which it requires recording which pieces of the text came from expansions, and which did not), so anything expanded inside double quotes is still inside double quotes, as those remain until we do quote removal (assuming we are in a context where quote removal occurs) which is always the last thing that happens.

This is also why I added the following in my first attempt at (part of) some replacement text for the $@ case ...

    If the '@' being expanded occurred within double quotes, then the expansion of each positional parameter is placed within double quotes in the fields generated. These double quotes are treated as copies of the original pair, and are later deleted by quote removal.

And while that is not nearly adequate for what it really needs to say to get this right, it is there, as in the standard's model of how everything works, and as currently written, if we have the word

    aaa"bbb$@xxx"yyy

and we're doing parameter expansion, and $# = 3 -- 3 because 0 and 1 are not interesting cases for the point I am about to make, 2 is, but does not expose all of the problem, and 4 and more are just the same as 3, but repeated (over and over for >4).

Let's assume that $# = 3 was obtained by

    set -- P Q R

According to how 2.5.2 for @ is written, what we get from this is

    aaa"bbbP Q Rxxx"yyy

(3 fields) as the part of the word preceding the $@ is joined before the first field produced (the P from $1) and the part of the word (as the model in the standard has it) before $@ is literally

    aaa"bbb

...
and the part of the word which follows $@ is joined, after, the final positional parameter expanded ($3 in this case, ie: R) and in the model, that is

    xxx"yyy

How exactly we would do field splitting on that (which part is "within quotes" as we look at each produced field?) is unclear, but it is kind of obvious that the middle field (Q) is not quoted, so if it contains any pathname expansion special chars (ie: * ? etc) then we would get pathname expansion from that. That is not what really happens (anywhere) nor what should happen.

The solution to this isn't to note that because the param expansion occurred in double quotes, we must specially escape any pathname matching chars that are generated, as that doesn't fix the other problems with generating things like this. Instead, we need to make $@ in double quotes expand (in standards model style) to

    aaa"bbbP" "Q" "Rxxx"yyy

which is not exactly what the text I wrote would do (that quoted above), which is why it is inadequate as written, but for the model we need to end up with something like that, and we need to also ensure that even though the 4 (new) embedded double quote chars were produced as a result of doing the expansion of $@, they are still subject to quote removal, even though quote removal normally only deletes quotes that were part of the original input, and not anything generated.

The same is true of the quotes (single quotes this time, or possibly lots of \'s) generated around any text which is the result of a tilde expansion - that quoting also needs to be removed even though it wasn't original, but was generated as a result of an expansion. We will need another issue about that.
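
For reference, what shells are expected to do with the aaa"bbb$@xxx"yyy word can be observed with a sketch like the following. It is not from the original message; the middle parameter is changed from Q to * (an assumption made purely for illustration) to show that the middle field is not subject to pathname expansion.

```shell
set -- P '*' R               # $# = 3, with a glob character as $2
set -- aaa"bbb$@xxx"yyy      # replace the positional parameters with the result
count=$#
first="$1"
middle="$2"
last="$3"
printf '%s\n' "$count" "$first" "$middle" "$last"
```

Three fields are expected: aaabbbP, a literal *, and Rxxxyyy, whatever files exist in the current directory.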

Incidentally, apologies for all this digression, but just now I am using these e-mails as something of a dump of consciousness, so I (or anyone else who can access the messages from the list archive) can verify that every issue that is mentioned either becomes revealed as unimportant, and not needing anything done, or is correctly handled in what we end up with, and nothing just gets forgotten.

All this illustrates a case where the standard's model makes things much messier than the actual implementations - in an implementation like the original Bourne shell's, where quoting was done by simply attaching a quote mark to every quoted character, and deleting the quoting characters themselves, all this is much simpler - when expanding anything that was quoted (expanding quoted @ instead of unquoted @ for example) we simply copy the quote marker to every character expanded. When expanding tilde (the Bourne sh did not, of course, but it could have...) every char that results is just marked as single quoted.

Field splitting can largely ignore quoting, as when IFS is assigned, quote removal is done before the actual assignment to the variable, so no variable can actually contain a quoted character. That means that no quoted char (in a word) can possibly match a character from IFS, and field splitting simply does the right thing without caring about any of this. Similarly, the unquoted * is a glob magic char, the quoted form is not ...
Everything simply works with no complications at all (and when "$@" is expanded with $# = 0, we simply end up with nothing produced, so there is nothing there, nothing to carry a quote mark, and so the end result of that is nothing - and why it was always considered a bug in the Bourne shell that it actually produced "" - but that had nothing to do with any quote marks, but with its lack of any method to make a word go away completely (to delete a field in modern language): there was a word that contained the original "$@" (really double quoted $ and double quoted @) that expands to nothing, but the word remained, so it was left with a word containing nothing, just the same as if it had been "").

| So I think rather than deleting that text, it should be changed to
| something like:
|
|     If a parameter expansion occurs inside double-quotes and the word
|     expansions being performed include pathname expansion, all
|     characters that result from the parameter expansion shall be
|     treated as literal characters for the purposes of pathname
|     expansion.

So, after all of that, I'd suggest that if we decide we need something here at all - to make it more obvious to the reader what happens - it should instead be more like

    If a parameter expansion occurs inside double-quotes in a context where pathname expansion occurs, note that the double-quotes remain in the results from the expansion until quote removal is later performed (in a context where quote removal occurs) and so any characters that result from the parameter expansion are treated as literal characters for the purposes of pathname expansion.

| Note that there is similar text in 2.6.3 Command Substitution.

And the same (similar) there.

Or, as mentioned earlier, since arithmetic needs this as well, instead put this in the 2.6 prelude, and write it more like:

    When any of parameter expansion, command substitution, or arithmetic expansion is performed, and occurs inside double-quotes, in a context where pathname expansion occurs, note that the double-quotes remain along with the results from the expansion until quote removal is later performed (in a context where quote removal occurs) and so all characters that result from the expansion(s) are treated as literal characters for the purposes of pathname expansion.

kre

ps: just for completeness, and as more of the dump of consciousness, it is also worth remembering that when a word like

    *".?"

is subject to pathname expansion, and assuming that there are exactly two files in the appropriate directory that end in .?, say x.? and y.?, then the results from pathname expansion will be the two words (fields)

    x".?" y".?"

The quotes don't participate in the pathname expansion, other than to prevent either the . or the ? from being special chars, but they have not yet been removed, and remain present until quote removal comes along (the immediately following, and last, expansion) and removes them.

I don't recall any context where pathname expansion occurs which is not also a context where quote removal occurs, and as quote removal occurs immediately after pathname expansion, your average implementation is not going to bother to retain those quotes (in any form) as they are just going away anyway, there is nothing left for them to do - but when we're working with the model of how all this works as defined in the standard, we do need to be aware of what the model says should be happening in all of these cases.