Re: [ast-developers] Re: Get ~(E) fixed or remove it

Irek Szczesniak Wed, 11 Jul 2012 02:48:55 -0700

On Wed, Jul 11, 2012 at 10:23 AM, Lionel Cons
<lionelcons1...@googlemail.com> wrote:
> On 1 July 2012 22:56, Lionel Cons <lionelcons1...@googlemail.com> wrote:
>> On 27 June 2012 19:24, Glenn Fowler <g...@research.att.com> wrote:
>>>
>>> On Wed, 27 Jun 2012 18:15:06 +0200 Roland Mainz wrote:
>>>> On Wed, Jun 27, 2012 at 6:04 PM, Glenn Fowler <g...@research.att.com> 
>>>> wrote:
>>>> > On Wed, 27 Jun 2012 17:43:06 +0200 Roland Mainz wrote:
>>>> >> How can I quote '-' in a ~(Ex)-style pattern [...] so that it matches
>>>> >> a literal '-'?
>>>> >> I've tried the following pattern but the result is wrong (it should
>>>> >> match "hello-world" and "foo-bar"):
>>>> >> -- snip --
>>>> >> $ ~/bin/ksh -c 's="hello-world foo-bar" ;
>>>> >> dummy="${s//~(Ex)([_\-[:alnum:]]+)/D}" ; print -v .sh.match'
>>>> >> (
>>>> >>         (
>>>> >>                 hello
>>>> >>                 world
>>>> >>                 foo
>>>> >>                 bar
>>>> >>         )
>>>> >>         (
>>>> >>                 hello
>>>> >>                 world
>>>> >>                 foo
>>>> >>                 bar
>>>> >>         )
>>>> >> )
>>>> >> -- snip --
>>>> >> I tried to quote the '\' with a 2nd '\' without success (i.e. we get
>>>> >> the same wrong output/matches)
>>>> >> -- snip --
>>>> >> $ ~/bin/ksh -c 's="hello-world foo-bar" ;
>>>> >> dummy="${s//~(Ex)([_\\-[:alnum:]]+)/D}" ; print -v .sh.match'
>>>> >> ...
>>>> >> -- snip --
>>>> >
>>>> >> Looking via dbx/gdb at the strings passed to the regex engine it looks
>>>> >> like ksh93 is either passing no '\' to |_ast_regcomp()| (in the case
>>>> >> of "~(Ex)([_\-[:alnum:]]+)") or it passes two '\' to |_ast_regcomp()|
>>>> >> (in the case of "~(Ex)([_\\-[:alnum:]]+)") ... it looks like a bug in
>>>> >> the ksh93 quoting mechanism for ~(E) patterns... ;-(
>>>> >
>>>> >> The only working workaround I found is to use \x<hex> to avoid having
>>>> >> to use \ to quote the '-' (the output below is IMO the expected one
>>>> >> for "${s//~(Ex)([_\-[:alnum:]]+)/D}"):
>>>> >> -- snip --
>>>> >> $ ~/bin/ksh -c 's="hello-world foo-bar" ;
>>>> >> dummy="${s//~(Ex)([_\x2d[:alnum:]]+)/D}" ; print -v .sh.match'
>>>> >> (
>>>> >>         (
>>>> >>                 hello-world
>>>> >>                 foo-bar
>>>> >>         )
>>>> >>         (
>>>> >>                 hello-world
>>>> >>                 foo-bar
>>>> >>         )
>>>> >> )
>>>> >> -- snip --
>>>> >
>>>> > it's regex syntax and doesn't need a quote
>>>> > see http://pubs.opengroup.org/onlinepubs/9699919799/ section 9.3.5 item 7
>>>> > from that it looks like
>>>> > * if you want literal ']' use one of
>>>> >        []...]
>>>> >        [^]...]
>>>
>>>> I know...
>>>
>>>> > * if you want literal '-' place it last
>>>> >        [...-]
>>>
>>>> ... I didn't know that... ;-/
>>>> Thanks... :-)
>>>
>>>> ... but could you still check why ksh93 "swallows" the single '\' but
>>>> passes two '\' as "\\" to |_ast_regcomp()|, please ? Is this intended
>>>> or somehow a bug or side effect?
>>>
>>> it's a side effect of the conflict between ksh and regex quoting
>>> if a side has to win it will be ksh in that context
>>> dgk can give more detail on how tricky that part is because
>>> ksh can't be expected to know all of the intricacies of each ~(...) RE syntax
>>> at some point when an RE gets complex enough it will have to be placed in a var
>>> then referencing it as $the_re is guaranteed to get sh and RE quoting right
>>> (or at least pass whatever quoting is present down to regex)
>>
>> I don't think this is going to be useful. Either ksh can be expected
>> to know all of the egrep syntax, or it knows nothing and passes the
>> pattern through unscathed after the user has provided sufficient \
>> escapes to prevent clashes with ksh syntax.
>> The current situation of "guessing" which side - ksh or ere - will win
>> is NOT acceptable.
>>
>> Try to see it from the point of view of a POSIX standardisation
>> committee or a code generator which will generate ksh93 code. The POSIX
>> committee won't accept a situation as fuzzy as the current one, and a
>> code generator can't be expected to go through trial and error, as is
>> required right now, until a pattern fits the needs of ksh's guesswork.
>>
>> If the situation can't be improved then I'd suggest removing the
>> whole ~(E) feature. While I see how useful it is, the current
>> implementation is completely unacceptable.
>
> So what will be done here? If nothing can be done I'll post a patch to
> wrap ~(E) support in SHOPT_EXPERIMENTAL_PATTERN_MATCHING so we can
> disable this on production machines.


Let's say I'm not happy with the number of issues with ~(....) either,
but please do not take such drastic steps.
I think the problems are:
1. Lack of clear rules for how quoting in ~(....), especially ~(E), works
2. Lack of clear documentation of said rules
3. Lack of diagnostics, e.g. a script developer has no way to check
what the grep/egrep/xgrep/pcre pattern looks like when it is passed to
regcomp() (I consider neither gdb nor dbx an option here)
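
To make item 1 concrete, here is a minimal sketch of the two hints Glenn
gave earlier in the thread: put the '-' last in the bracket expression,
and keep the RE in a variable. It is only an illustration under stated
assumptions: it uses grep -E as a stand-in ERE engine, so it does not
exercise ksh93's ~(E) code path at all, and the variable names (s, re,
matches) are made up for this example.

```shell
#!/bin/sh
# Sketch only: grep -E stands in for the ERE engine; this does NOT go
# through ksh93's ~(E) pattern handling.
s='hello-world foo-bar'

# A '-' placed last in a bracket expression is literal (POSIX 9.3.5
# item 7), so no backslash is needed in the pattern.
# The single quotes mean the shell passes the text through untouched.
re='[_[:alnum:]-]+'

# Split the words onto separate lines and keep those that match the
# ERE as a whole line (-x).
matches=$(printf '%s\n' "$s" | tr ' ' '\n' | grep -E -x -e "$re")

printf '%s\n' "$matches"
```

With the input above this prints hello-world and foo-bar on separate
lines, which is the result the original ~(Ex) example was expected to
produce.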

Irek
_______________________________________________
ast-developers mailing list
ast-developers@research.att.com
https://mailman.research.att.com/mailman/listinfo/ast-developers
