bug#32267: dd's ucase and lcase and LC_CTYPE.

Ralph Corderoy Wed, 25 Jul 2018 01:12:33 -0700

Hi,

Of dd(1), POSIX says


    http://pubs.opengroup.org/onlinepubs/9699919799/utilities/dd.html
    lcase
        Map uppercase characters specified by the LC_CTYPE keyword
        tolower to the corresponding lowercase character.  Characters
        for which no mapping is specified shall not be modified by this
        conversion. 

and similarly for `ucase'.

But dd in coreutils 8.29-1 on Arch Linux just builds a simple 256-entry
translation table, indexed by byte value, by passing each byte through
tolower(3) or toupper(3).

http://pubs.opengroup.org/onlinepubs/9699919799/functions/tolower.html
describes tolower(3) as handling only `unsigned char' or EOF, and being
the identity function on all values where there isn't a lowercase letter
for the uppercase value.

This deviation isn't documented AFAICS.  It means ASCII and ISO-8859-1
are re-cased just fine.  UTF-8 has its ASCII subset altered, and other
bytes left alone, so the end result is valid UTF-8, but not fully
re-cased.  But charmaps like /usr/share/i18n/charmaps/CP949.gz,
https://en.wikipedia.org/wiki/Unified_Hangul_Code, have variable-length
byte sequences where 0x41, for example, can be the second byte of a
two-byte character rather than an ASCII `A', and thus shouldn't become
0x61, `a'.
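
To illustrate the effect of a byte-indexed table, here is a rough
Python sketch.  It is not dd's source: LCASE_TABLE and bytewise_lcase
are hypothetical names, and the table assumes a locale in which
tolower(3) only maps ASCII A-Z.

```python
# Hypothetical model of a byte-indexed lcase table; not dd's actual code.
# Assumption: tolower(3) maps only ASCII A-Z, leaving other bytes alone.
LCASE_TABLE = bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in range(256))


def bytewise_lcase(data: bytes) -> bytes:
    """Map every byte through the table, ignoring character boundaries."""
    return data.translate(LCASE_TABLE)


# UTF-8: only the ASCII subset changes; the result is still valid UTF-8.
utf8_in = "\u00c0B".encode("utf-8")        # U+00C0 (A with grave), then 'B'
utf8_out = bytewise_lcase(utf8_in)
print(ascii(utf8_out.decode("utf-8")))     # 'B' is lowered, U+00C0 is not

# CP949: here 0x41 is the trail byte of a two-byte character, not 'A'.
cp949_in = b"\x81\x41"
cp949_out = bytewise_lcase(cp949_in)       # trail byte becomes 0x61
print(ascii(cp949_in.decode("cp949")), ascii(cp949_out.decode("cp949")))
```

In the CP949 case the output decodes to a different Hangul syllable
from the input, although the input contained no uppercase letter.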

Aside from improving the documentation, actually fixing dd to match
POSIX would mean handling a re-cased character that occupies a different
number of bytes from the original; this is particularly noticeable if
the output file is the input file and `conv=notrunc' is given.

    $ locale | grep LC_CTYPE
    LC_CTYPE="en_GB.utf8"
    $
    $ sed 'l; s/./\u&/; l' <<<ȿ
    \310\277$
    \342\261\276$
    Ȿ
    $ sed 'l; s/./\l&/; l' <<<Ȿ
    \342\261\276$
    \310\277$
    ȿ
    $
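
The same length change can be checked outside sed.  A small Python
sketch follows; recase_utf8 is a hypothetical helper, and Python's
str.upper() and str.lower() apply Unicode case mappings, which are only
assumed here to agree with the locale's toupper/tolower for this
character.

```python
def recase_utf8(data: bytes, upper: bool) -> bytes:
    """Decode UTF-8, change case per character, and re-encode."""
    text = data.decode("utf-8")
    return (text.upper() if upper else text.lower()).encode("utf-8")


small = "\u023f".encode("utf-8")          # U+023F, bytes C8 BF
big = recase_utf8(small, upper=True)      # U+2C7E, bytes E2 B1 BE
growth = len(big) - len(small)            # the upper-cased form is longer
print(small.hex(), big.hex(), growth)
```

A character-aware conversion therefore cannot, in general, rewrite a
file in place at the same byte offsets.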

-- 
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy
