Hi,

> Norm wrote:
> > I am not at all secure about how the standard GNU utilities will
> > handle non-ascii characters.  For example, 'wc -c', just counts
> > bytes.
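A minimal illustration of the distinction Norm raises, not from the
original mail: wc -c counts bytes while wc -m counts characters, and
the two differ for anything outside ASCII.  U+2190 LEFTWARDS ARROW (←)
is one character that UTF-8 encodes as the three bytes 342 206 220.

```shell
# Byte count is locale-independent: three bytes.
printf '\342\206\220' | wc -c
# Character count depends on the locale: 1 in a UTF-8 locale, but 3 in
# the C locale, where -m degenerates to the same answer as -c.
printf '\342\206\220' | wc -m
```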
Christian has pointed out that -c has remained bytes, with --bytes as
a synonym, because otherwise too many things would break, and that -m,
AKA --chars, has been added to count multi-byte characters.

tr(1) remains resolutely single-byte, though the documentation talks
of growing multibyte support with a -C complement option.

    $ od -c <<<←
    0000000 342 206 220  \n
    0000004
    $
    $ tr \\220 \\221 <<<←
    ↑
    $

Things like sed and grep all work in a UTF-8 world just fine, though
often a bit more slowly, Unix having moved to it some years ago.

    $ sed 'y/\220/\221/' <<<←
    ←
    $ sed y/←/x/ <<<←
    x
    $

For the odd occasion when I want to remove locale specifics, I use
~/bin/C as a shorthand.

    $ cat ~/bin/C
    #! /bin/sh

    # LC_ALL has precedence over LANG according to POSIX[1], but we may as
    # well stamp out any traces by setting LANG too.
    #
    # 1. The Open Group Base Specifications, Ch. 8 Environment Variables.

    LC_ALL=C LANG=C exec "$@"
    $
    $ C sed 'y/←/x/' <<<←
    sed: -e expression #1, char 8: strings for `y' command are different lengths
    $ C sed 'y/←/xyz/' <<<←
    xyz
    $

Ken wrote:
> But since UTF-8 has the excellent property that non-ASCII characters
> look like just 8-bit characters but won't ever be mistaken for ASCII
> (not a surprise, since it was designed by two of the original Unix
> geeks)

Ken Thompson and Rob Pike.  (Pike's not quite original, but nearly.)
Rob described its creation, sketched on a napkin in a diner, in a post
back in 2012.

    https://plus.google.com/+RobPikeTheHuman/posts/Rz1udTvtiMg

There's a comment by me there with a Google Streetview of the diner.

> I jumped whole-hog into UTF-8 a few years ago, and I haven't regretted
> it one bit.

No regrets here.  You might find iconv(1) useful to convert existing
files from one encoding to another.

Cheers,
Ralph.

_______________________________________________
Nmh-workers mailing list
Nmh-workers@nongnu.org
https://lists.nongnu.org/mailman/listinfo/nmh-workers
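To round off the iconv(1) suggestion, a sketch; the filenames are
invented and ISO-8859-1 stands in for whatever legacy encoding the
files actually use.

```shell
# Convert a Latin-1 file to UTF-8; -f names the source encoding, -t
# the target.  Write to a new file rather than clobber the input.
printf '\351' >latin1.txt     # é as the single Latin-1 byte 0351
iconv -f ISO-8859-1 -t UTF-8 latin1.txt >utf8.txt
od -c utf8.txt                # é is now the two bytes 303 251
```

iconv exits non-zero on input the target encoding can't represent;
adding -c silently drops such characters instead.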