[ast-developers] ([^[><]]+)+? triggers single-byte interpretation of matches (instead of single-(multibyte-)character) ...

Roland Mainz Wed, 20 Jun 2012 17:27:27 -0700

Hi!

----


Here is another issue related to using ([^[><]]+)+? in an egrep pattern.

Running the following example with ast-ksh.2012-06-12 in the
en_US.UTF-8 locale on Solaris 11/AMD64 prints single-byte values with
the high bit set (i.e. not valid UTF-8 on their own; and if you look
closer, the final "." of the input string goes missing, too):
-- snip --
$ ksh -c $'s="bye bye \u[20ac]." ;
dummy="${s//~(E)(?:([^[><]]+)+?)/dummy}" ; print -v .sh.match'
(
        (
                b
                y
                e
                ' '
                b
                y
                e
                ' '
                ??GARBAGE??
                ??GARBAGE??
                ??GARBAGE??
        )
        (
                b
                y
                e
                ' '
                b
                y
                e
                ' '
                ??GARBAGE??
                ??GARBAGE??
                ??GARBAGE??
        )
)
-- snip --

I've replaced the invalid byte sequences with the text "??GARBAGE??"
here since not all email applications will display them correctly.
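
For reference, here is a minimal sketch of why three placeholder entries
would be expected if the match is being split into single bytes. It
assumes (this is not confirmed in the output above) that the three
entries are the individual UTF-8 bytes of the euro sign U+20AC. The
variable names euro and euro_hex are only illustrative; the bytes are
written as octal escapes so the snippet does not depend on the locale
or on \u[...] support:

```shell
# U+20AC (EURO SIGN) as its UTF-8 byte sequence, given as octal escapes.
euro=$(printf '\342\202\254')

# Dump the bytes as hex and strip od's padding and line breaks.
euro_hex=$(printf '%s' "$euro" | od -An -tx1 | tr -d ' \n')

# Prints: e282ac
printf '%s\n' "$euro_hex"
```

All three bytes (0xe2, 0x82, 0xac) have the high bit set, and none of
them is a valid UTF-8 character when printed alone.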

----

Bye,
Roland

P.S.: Technically these are two bugs: 1. ([^[><]]+)+? triggers
single-byte interpretation, and 2. print -v .sh.match doesn't quote
the single-byte values as something like $'\xFF' ...

-- 
  __ .  . __
 (o.\ \/ /.o) roland.ma...@nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)

_______________________________________________
ast-developers mailing list
ast-developers@research.att.com
https://mailman.research.att.com/mailman/listinfo/ast-developers
