> James K. Lowden wrote:
>> > 1.  "National" support.  COBOL programs define the runtime encoding and
>> > collation of each string (sometimes implicitly).  COBOL defines two
>> > encodings: "alphanumeric" and "national".  Every alphanumeric (and
>> > national) variable and literal has a defined runtime encoding that is
>> > distinct from the compile-time and runtime locale, and from the
>> > encoding of the source code.  This means
>> >
>> >    MOVE 'foo' TO FOO.
>> >
>> > may involve iconv(3) and 
>> >
>> >    IF 'foo' = FOO
>> >
>> > is defined as true/false depending on the *characters* represented, not
>> > their encoding.  That 'foo' could be CP1140 (single-byte EBCDIC) and
>> > FOO could be UTF-16.  
>> > ...
>> > Conversion is a solved problem.  Comparison is not.
>
> Comparison consists of two steps:
>   1) Convert both operands to Unicode. (Can be UTF-8, UTF-16, or UTF-32,
>      which one does not matter.)
>   2) If a "closed world" assumption is valid:
>        Compare the two Unicode strings.
>      Otherwise:
>        Convert the two Unicode strings to normalization form NFD, and
>        compare the results.
>
> By "closed world" I mean: Unicode text exchanged between programs
> is typically assumed to be in Unicode normalization form NFC. See
> https://www.unicode.org/faq/normalization.html#2 . If this assumption
> holds, you don't need the normalization step above. Whereas if it
> does not hold, for example, because the program can read arbitrary
> text files, you need this normalization step.
>
> Paul Koning wrote:
>> Unicode comparison is addressed by the "stringprep" library.
>
> Careful: "stringprep" does extra steps, which drop characters. See
> https://datatracker.ietf.org/doc/html/rfc3454#section-3
>
>> > 2) a limited amount
>> > of Unicode evaluation is available in (IIRC) gnulib
>
> Correct. The comparison without normalization is available in
> libunistring as functions u8_cmp, u16_cmp, u32_cmp
> https://www.gnu.org/software/libunistring/manual/html_node/Comparing-Unicode-strings.html
> or u8_strcmp, u16_strcmp, u32_strcmp:
> https://www.gnu.org/software/libunistring/manual/html_node/Comparing-NUL-terminated-Unicode-strings.html
> Whereas the comparison with normalization is available as
> functions u8_normcmp, u16_normcmp, u32_normcmp:
> https://www.gnu.org/software/libunistring/manual/html_node/Normalizing-comparisons.html
>
> In Gnulib, each of these functions is available as a Gnulib module:
> https://www.gnu.org/software/gnulib/manual/html_node/How-to-use-libunistring.html
> https://www.gnu.org/software/gnulib/manual/html_node/_003cunistr_002eh_003e-modules.html
> https://www.gnu.org/software/gnulib/manual/html_node/_003cuninorm_002eh_003e-modules.html
>
> Jose Marchesi writes:
>> It would be good to avoid duplicating that code though.
>
> Especially as Unicode normalization is a rather complicated algorithm
> that includes data tables which change with every Unicode version.
> If you duplicate that code, upgrades to newer Unicode versions (which
> are released once a year) don't come for free. Whereas if you use
> libunistring or Gnulib, they do come for free.

Of all the libunistring functions I have copied into libga68, these are
the ones I had to adapt to support strides:

  int _libga68_u32_cmp (const uint32_t *s1, size_t stride1,
                        const uint32_t *s2, size_t stride2,
                        size_t n);
  int _libga68_u32_cmp2 (const uint32_t *s1, size_t n1, size_t stride1,
                         const uint32_t *s2, size_t n2, size_t stride2);

  uint8_t *_libga68_u32_to_u8 (const uint32_t *s, size_t n, size_t stride,
                               uint8_t *resultbuf, size_t *lengthp);
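
To make the idea concrete, here is a minimal sketch of what stride-aware
variants of the two comparison functions could look like.  It is not the
libga68 or libunistring code.  The names sketch_u32_cmp_strided and
sketch_u32_cmp2_strided are hypothetical, and the sketch assumes that a
stride counts uint32_t elements (not bytes), so that element i of a
string s is found at s[i * stride]:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the libga68 or libunistring implementation.
   Assumption: strides count uint32_t elements, so element I of S is
   S[I * STRIDE].  A stride of 1 is an ordinary contiguous string.  */

/* Compare the first N elements of S1 and S2.
   Return -1, 0 or 1, like a three-way comparison.  */
int
sketch_u32_cmp_strided (const uint32_t *s1, size_t stride1,
                        const uint32_t *s2, size_t stride2,
                        size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      uint32_t c1 = s1[i * stride1];
      uint32_t c2 = s2[i * stride2];
      if (c1 != c2)
        return c1 < c2 ? -1 : 1;
    }
  return 0;
}

/* Compare S1 (N1 elements) and S2 (N2 elements) lexicographically.
   When one string is a prefix of the other, the shorter one sorts
   first.  */
int
sketch_u32_cmp2_strided (const uint32_t *s1, size_t n1, size_t stride1,
                         const uint32_t *s2, size_t n2, size_t stride2)
{
  size_t n = n1 < n2 ? n1 : n2;
  int cmp = sketch_u32_cmp_strided (s1, stride1, s2, stride2, n);

  if (cmp != 0)
    return cmp;
  if (n1 < n2)
    return -1;
  return n1 > n2 ? 1 : 0;
}
```

With both strides equal to 1 these behave like a plain element-wise
comparison of contiguous strings, which is why a stride-aware interface
could in principle sit next to the existing ones rather than replace
them.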

Should I pursue a libunistring patch adding stride-aware extra
interfaces like these?
