> James K. Lowden wrote:
>> > 1. "National" support. COBOL programs define the runtime encoding and
>> > collation of each string (sometimes implicitly). COBOL defines two
>> > encodings: "alphanumeric" and "national". Every alphanumeric (and
>> > national) variable and literal has a defined runtime encoding that is
>> > distinct from the compile-time and runtime locale, and from the
>> > encoding of the source code. This means
>> >
>> > MOVE 'foo' TO FOO.
>> >
>> > may involve iconv(3) and
>> >
>> > IF 'foo' = FOO
>> >
>> > is defined as true/false depending on the *characters* represented, not
>> > their encoding. That 'foo' could be CP1140 (single-byte EBCDIC) and
>> > FOO could be UTF-16.
>> > ...
>> > Conversion is a solved problem. Comparison is not.
>
> Comparison consists of two steps:
> 1) Convert both operands to Unicode. (Can be UTF-8, UTF-16, or UTF-32,
> which one does not matter.)
> 2) If a "closed world" assumption is valid:
> Compare the two Unicode strings.
> Otherwise:
> Convert the two Unicode strings to normalization form NFD, and
> compare the results.
>
> By "closed world" I mean: Unicode text exchanged between programs
> is typically assumed to be in Unicode normalization form NFC. See
> https://www.unicode.org/faq/normalization.html#2 . If this assumption
> holds, you don't need the normalization step above. Whereas if it
> does not hold, for example, because the program can read arbitrary
> text files, you need this normalization step.
>
> Paul Koning wrote:
>> Unicode comparison is addressed by the "stringprep" library.
>
> Careful: "stringprep" does extra steps, which drop characters. See
> https://datatracker.ietf.org/doc/html/rfc3454#section-3
>
>> > 2) a limited amount
>> > of Unicode evaluation is available in (IIRC) gnulib
>
> Correct. The comparison without normalization is available in
> libunistring as functions u8_cmp, u16_cmp, u32_cmp
> https://www.gnu.org/software/libunistring/manual/html_node/Comparing-Unicode-strings.html
> or u8_strcmp, u16_strcmp, u32_strcmp:
> https://www.gnu.org/software/libunistring/manual/html_node/Comparing-NUL-terminated-Unicode-strings.html
> Whereas the comparison with normalization is available as
> functions u8_normcmp, u16_normcmp, u32_normcmp:
> https://www.gnu.org/software/libunistring/manual/html_node/Normalizing-comparisons.html
>
> In Gnulib, each of these functions is available as a Gnulib module:
> https://www.gnu.org/software/gnulib/manual/html_node/How-to-use-libunistring.html
> https://www.gnu.org/software/gnulib/manual/html_node/_003cunistr_002eh_003e-modules.html
> https://www.gnu.org/software/gnulib/manual/html_node/_003cuninorm_002eh_003e-modules.html
>
> Jose Marchesi writes:
>> It would be good to avoid duplicating that code though.
>
> Especially as Unicode normalization is a rather complicated algorithm,
> that includes data tables that change with every Unicode version.
> If you duplicate that code, upgrades to newer Unicode versions (that
> are released once a year) don't come for free. Whereas if you use
> libunistring or Gnulib, they do come for free.

Of all the libunistring functions I have copied in libga68, these are
the ones I had to adapt to support strides:

  int _libga68_u32_cmp (const uint32_t *s1, size_t stride1,
                        const uint32_t *s2, size_t stride2,
                        size_t n);
  int _libga68_u32_cmp2 (const uint32_t *s1, size_t n1, size_t stride1,
                         const uint32_t *s2, size_t n2, size_t stride2);
  uint8_t *_libga68_u32_to_u8 (const uint32_t *s, size_t n, size_t stride,
                               uint8_t *resultbuf, size_t *lengthp);

Should I pursue a libunistring patch adding stride-aware extra
interfaces like these?
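
To make the question concrete, here is a minimal sketch of what
stride-aware variants of u32_cmp and u32_cmp2 might look like. It is
not the libga68 code. The sketch_ names are hypothetical, and it
assumes strides are counted in uint32_t elements rather than bytes and
that only the sign of the result is significant.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: compare N units of S1 and S2, where consecutive
   units are STRIDE1 and STRIDE2 elements apart (assumption: strides
   are counted in uint32_t elements, not bytes).
   Return a negative, zero, or positive value.  */
int
sketch_u32_cmp_stride (const uint32_t *s1, size_t stride1,
                       const uint32_t *s2, size_t stride2,
                       size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      uint32_t c1 = s1[i * stride1];
      uint32_t c2 = s2[i * stride2];
      if (c1 != c2)
        return c1 < c2 ? -1 : 1;
    }
  return 0;
}

/* Hypothetical sketch: like the above, but for strings of different
   lengths N1 and N2.  The common prefix is compared first; if it is
   equal, the shorter string sorts first.  */
int
sketch_u32_cmp2_stride (const uint32_t *s1, size_t n1, size_t stride1,
                        const uint32_t *s2, size_t n2, size_t stride2)
{
  size_t n = n1 < n2 ? n1 : n2;
  int cmp = sketch_u32_cmp_stride (s1, stride1, s2, stride2, n);
  if (cmp == 0)
    cmp = (n1 > n2) - (n1 < n2);
  return cmp;
}
```

With a stride of 1 on both operands these reduce to plain contiguous
comparisons, so a stride-aware interface could in principle sit next
to the existing ones without changing them.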