Bug#758105: handling bytes not part of the charset, and other garbage

Paul Eggert Thu, 11 Sep 2014 09:28:13 -0700

Vincent Lefevre wrote:

There's no reason that '.' matches something that doesn't belong to
the charset in C locale, but doesn't match in a UTF-8 locale.

In the C locale on GNU/Linux, all byte values are members of the charset. That is why it's OK for '.' to accept that byte in the C locale but reject it in a UTF-8 locale.
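A rough illustration of the distinction, in Python rather than grep itself: Python's byte-oriented regexes stand in for the C locale and a strict UTF-8 decode stands in for a UTF-8 locale. The helper name `is_valid_utf8` and the sample byte 0xE9 (ISO-8859-1 'é') are assumptions chosen for the example, not anything taken from grep's source.

```python
import re


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
        return True
    except UnicodeDecodeError:
        return False


# ISO-8859-1 encoding of 'é': a single byte, 0xE9.
LATIN1_E_ACUTE = b"\xe9"

# Byte-oriented matching (analogous to the C locale): every byte value
# counts as a character, so '.' accepts 0xE9.
byte_match = re.fullmatch(rb".", LATIN1_E_ACUTE, re.DOTALL) is not None

# In UTF-8, 0xE9 is a lead byte that needs two continuation bytes.
# On its own it is not a character at all.
utf8_valid = is_valid_utf8(LATIN1_E_ACUTE)

print(byte_match, utf8_valid)
```

The same byte is a character under one encoding and an encoding error under the other, which is the asymmetry described above.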

It's annoying that now in UTF-8, one can no longer match ISO-8859-1 text

This has been true for quite some time in 'grep', at least with the standard matchers. It may not have been true for -P but that relied on undefined behavior that could crash grep, and we can't have that.

It would make sense to add a notation to mean "match any character or invalid byte", as an extension. That'd take some work, though.
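To make the idea concrete, here is a minimal sketch of what "any character or invalid byte" could mean as a unit of matching. It is purely hypothetical: the function name `split_chars_or_invalid_bytes` and the choice to treat each undecodable byte as its own unit are assumptions for illustration, and no such notation exists in grep.

```python
def split_chars_or_invalid_bytes(data: bytes):
    """Split data into units: each unit is either one valid UTF-8
    character (is_char=True) or one byte that cannot start a valid
    character at that position (is_char=False)."""
    units = []
    i = 0
    while i < len(data):
        # A UTF-8 character is 1 to 4 bytes long; try the shortest first.
        for length in (1, 2, 3, 4):
            chunk = data[i:i + length]
            try:
                chunk.decode("utf-8")  # strict: rejects malformed input
            except UnicodeDecodeError:
                continue
            units.append((chunk, True))
            i += length
            break
        else:
            # No valid character starts here: consume one invalid byte.
            units.append((data[i:i + 1], False))
            i += 1
    return units


# 'a', a stray ISO-8859-1 byte, then a properly encoded 'é'.
sample = b"a\xe9" + "é".encode("utf-8")
units = split_chars_or_invalid_bytes(sample)
print(units)
```

A hypothetical "any character or invalid byte" token would match exactly one such unit, whereas '.' in a UTF-8 locale would match only the units flagged as characters.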



