I had focused on the other part of this thread and forgotten about this part.

You are correct that [^...] triggers single-byte mode. The [...] parser has an optimization for single-byte mode, implemented by a 256-byte table, which is preferable to the multibyte mode, which is implemented by callout functions. The optimizer should have selected multibyte mode when ^ is used in a multibyte locale, but it didn't. It's fixed now.

Thanks.
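
For anyone following along, here is a rough Python sketch of the difference between the two modes. It is only an illustration under stated assumptions, not the AST regex code: the names build_byte_table, match_bytes and match_chars are made up for this example, and the excluded set ("[", ">" and "<") is just one reading of the bracket expression in the report. It shows why a 256-entry table indexed by byte value splits the UTF-8 euro sign into three separate "characters", while a per-character check on decoded text does not.

```python
# Hypothetical sketch, not the AST regex source: two ways to test
# membership in a negated bracket expression such as [^[><].

EXCLUDED = frozenset("[><")  # assumed contents of the bracket expression


def build_byte_table(excluded):
    """Single-byte mode: precompute one flag per possible byte value."""
    table = bytearray(256)
    for value in range(256):
        # Every byte that is not an excluded ASCII character matches,
        # including the bytes 0x80..0xFF that make up UTF-8 sequences.
        table[value] = 0 if chr(value) in excluded else 1
    return table


def match_bytes(table, data):
    """Return the units matched when each byte is treated as a character."""
    return [bytes([b]) for b in data if table[b]]


def match_chars(excluded, data, encoding="utf-8"):
    """Multibyte mode: decode first, then check each character in turn."""
    def in_class(ch):  # stands in for the per-character callout
        return ch not in excluded
    return [ch for ch in data.decode(encoding) if in_class(ch)]


text = "bye \u20ac.".encode("utf-8")  # the euro sign is E2 82 AC in UTF-8
table = build_byte_table(EXCLUDED)
by_byte = match_bytes(table, text)    # euro sign arrives as three lone bytes
by_char = match_chars(EXCLUDED, text) # euro sign arrives as one character
```

The table lookup is a single index operation per byte, which is why it is the preferred path whenever the locale really is single-byte. In a UTF-8 locale it yields pieces that are not valid characters on their own, which matches the garbage values in the report below.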

On Thu, 21 Jun 2012 02:26:49 +0200 Roland Mainz wrote:
> Here is another issue related to using ([^[><]]+)+? in an egrep pattern.
> Running the following example with ast-ksh.2012-06-12 in the
> en_US.UTF-8 locale on Solaris 11/AMD64 prints single-byte values with
> the 7th bit set (e.g. illegal in UTF-8 ; and if you look closer the
> final "." of the input string gets missing, too):
> -- snip --
> $ ksh -c $'s="bye bye \u[20ac]." ;
> dummy="${s//~(E)(?:([^[><]]+)+?)/dummy}" ; print -v .sh.match'
> (
> (
> b
> y
> e
> ' '
> b
> y
> e
> ' '
> ??GARBAGE??
> ??GARBAGE??
> ??GARBAGE??
> )
> (
> b
> y
> e
> ' '
> b
> y
> e
> ' '
> ??GARBAGE??
> ??GARBAGE??
> ??GARBAGE??
> )
> )
> -- snip --
> I've replaced the invalid byte sequences with the text "??GARBAGE??"
> here since not all email applications will view the issue.
> ----
> Bye,
> Roland
> P.S.: Technically these are two bugs: 1. ([^[><]]+)+? triggers
> single-byte interpretation and 2. that print -v .sh.match doesn't put
> the single-byte values into something like $'\xFF' ...
> --
>   __ .  . __
>  (o.\ \/ /.o) roland.ma...@nrubsig.org
>   \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
>   /O /==\ O\  TEL +49 641 3992797
>  (;O/ \/ \O;)