>> James K. Lowden wrote:
>>> > 1.  "National" support.  COBOL programs define the runtime encoding and
>>> > collation of each string (sometimes implicitly).  COBOL defines two
>>> > encodings: "alphanumeric" and "national".  Every alphanumeric (and
>>> > national) variable and literal has a defined runtime encoding that is
>>> > distinct from the compile-time and runtime locale, and from the
>>> > encoding of the source code.  This means
>>> >
>>> >   MOVE 'foo' TO FOO.
>>> >
>>> > may involve iconv(3) and 
>>> >
>>> >   IF 'foo' = FOO
>>> >
>>> > is defined as true/false depending on the *characters* represented, not
>>> > their encoding.  That 'foo' could be CP1140 (single-byte EBCDIC) and
>>> > FOO could be UTF-16.  
>>> > ...
>>> > Conversion is a solved problem.  Comparison is not.
>>
>> Comparison consists of two steps:
>>   1) Convert both operands to Unicode. (Can be UTF-8, UTF-16, or UTF-32,
>>      which one does not matter.)
>>   2) If a "closed world" assumption is valid:
>>        Compare the two Unicode strings.
>>      Otherwise:
>>        Convert the two Unicode strings to normalization form NFD, and
>>        compare the results.
>>
>> By "closed world" I mean: Unicode text exchanged between programs
>> is typically assumed to be in Unicode normalization form NFC. See
>> https://www.unicode.org/faq/normalization.html#2 . If this assumption
>> holds, you don't need the normalization step above. Whereas if it
>> does not hold, for example, because the program can read arbitrary
>> text files, you need this normalization step.
>>
>> Paul Koning wrote:
>>> Unicode comparison is addressed by the "stringprep" library.
>>
>> Careful: "stringprep" does extra steps, which drop characters. See
>> https://datatracker.ietf.org/doc/html/rfc3454#section-3
>>
>>> > 2) a limited amount
>>> > of Unicode evaluation is available in (IIRC) gnulib
>>
>> Correct. The comparison without normalization is available in
>> libunistring as functions u8_cmp, u16_cmp, u32_cmp
>> https://www.gnu.org/software/libunistring/manual/html_node/Comparing-Unicode-strings.html
>> or u8_strcmp, u16_strcmp, u32_strcmp:
>> https://www.gnu.org/software/libunistring/manual/html_node/Comparing-NUL-terminated-Unicode-strings.html
>> Whereas the comparison with normalization is available as
>> functions u8_normcmp, u16_normcmp, u32_normcmp:
>> https://www.gnu.org/software/libunistring/manual/html_node/Normalizing-comparisons.html
>>
>> In Gnulib, each of these functions is available as a Gnulib module:
>> https://www.gnu.org/software/gnulib/manual/html_node/How-to-use-libunistring.html
>> https://www.gnu.org/software/gnulib/manual/html_node/_003cunistr_002eh_003e-modules.html
>> https://www.gnu.org/software/gnulib/manual/html_node/_003cuninorm_002eh_003e-modules.html
>>
>> Jose Marchesi writes:
>>> It would be good to avoid duplicating that code though.
>>
>> Especially as Unicode normalization is a rather complicated algorithm,
>> that includes data tables that change with every Unicode version.
>> If you duplicate that code, upgrades to newer Unicode versions (that
>> are released once a year) don't come for free. Whereas if you use
>> libunistring or Gnulib, they do come for free.
>
> Of all the libunistring functions I have copied in libga68, these are
> the ones I had to adapt to support strides:
>
>   int _libga68_u32_cmp (const uint32_t *s1, size_t stride1,
>                         const uint32_t *s2, size_t stride2,
>                         size_t n);
>   int _libga68_u32_cmp2 (const uint32_t *s1, size_t n1, size_t stride1,
>                          const uint32_t *s2, size_t n2, size_t stride2);
>
>   uint8_t *_libga68_u32_to_u8 (const uint32_t *s, size_t n, size_t stride,
>                                uint8_t *resultbuf, size_t *lengthp);
>
> Should I pursue a libunistring patch adding stride-aware extra
> interfaces like these?

Never mind, Bruno pointed out in another email that such stride-aware
interfaces would be too specialized for a general-purpose library.
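
For anyone wondering what "stride-aware" means here, below is a minimal
sketch of the idea. It is not the libga68 code. The names
strided_u32_cmp and strided_u32_cmp2 are hypothetical, and the sketch
assumes that a stride counts uint32_t elements rather than bytes. The
return convention (negative, zero, positive) is modelled on u32_cmp and
u32_cmp2 from libunistring.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: compare N UTF-32 units read from S1 and S2,
   where consecutive units are STRIDE1 and STRIDE2 elements apart.
   Assumption: strides count uint32_t elements, not bytes.  */
static int
strided_u32_cmp (const uint32_t *s1, size_t stride1,
                 const uint32_t *s2, size_t stride2, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      uint32_t c1 = s1[i * stride1];
      uint32_t c2 = s2[i * stride2];
      if (c1 != c2)
        return c1 < c2 ? -1 : 1;
    }
  return 0;
}

/* Hypothetical sketch: lexicographic comparison of two strided
   strings of possibly different lengths N1 and N2.  When one string
   is a prefix of the other, the shorter one sorts first.  */
static int
strided_u32_cmp2 (const uint32_t *s1, size_t n1, size_t stride1,
                  const uint32_t *s2, size_t n2, size_t stride2)
{
  size_t n = n1 < n2 ? n1 : n2;
  int cmp = strided_u32_cmp (s1, stride1, s2, stride2, n);
  if (cmp != 0)
    return cmp;
  return (n1 > n2) - (n1 < n2);
}
```

With a stride of 1 on both sides this degenerates to an ordinary
element-wise comparison, which is why the plain libunistring functions
cover the common case without the extra parameters.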
