James K. Lowden wrote:
> > 1.  "National" support.  COBOL programs define the runtime encoding and
> > collation of each string (sometimes implicitly).  COBOL defines two
> > encodings: "alphanumeric" and "national".  Every alphanumeric (and
> > national) variable and literal has a defined runtime encoding that is
> > distinct from the compile-time and runtime locale, and from the
> > encoding of the source code.  This means
> >
> >     MOVE 'foo' TO FOO.
> >
> > may involve iconv(3) and 
> >
> >     IF 'foo' = FOO
> >
> > is defined as true/false depending on the *characters* represented, not
> > their encoding.  That 'foo' could be CP1140 (single-byte EBCDIC) and
> > FOO could be UTF-16.  
> > ...
> > Conversion is a solved problem.  Comparison is not.

Comparison consists of two steps:
  1) Convert both operands to Unicode. (Can be UTF-8, UTF-16, or UTF-32,
     which one does not matter.)
  2) If a "closed world" assumption is valid:
       Compare the two Unicode strings.
     Otherwise:
       Convert the two Unicode strings to normalization form NFD, and
       compare the results.

By "closed world" I mean that every string the program compares is
already in Unicode normalization form NFC, which is what text exchanged
between programs is typically assumed to be. See
https://www.unicode.org/faq/normalization.html#2 . If this assumption
holds, you don't need the normalization step above. If it does not
hold, for example because the program can read arbitrary text files,
you need the normalization step.
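
The two steps above can be sketched roughly as follows. This is only an
illustration in Python, not what a COBOL runtime would actually do: the
function name strings_equal and the closed_world flag are made up for
this example, and Python's standard codecs and unicodedata module stand
in for iconv(3) and libunistring.

```python
import unicodedata

def strings_equal(a: bytes, a_encoding: str,
                  b: bytes, b_encoding: str,
                  closed_world: bool = False) -> bool:
    """Compare two encoded strings by the characters they represent.

    Hypothetical helper for illustration only.
    """
    # Step 1: convert both operands to Unicode.
    ua = a.decode(a_encoding)
    ub = b.decode(b_encoding)
    # Step 2: unless both inputs are known to be normalized already,
    # bring them to normalization form NFD before comparing.
    if not closed_world:
        ua = unicodedata.normalize("NFD", ua)
        ub = unicodedata.normalize("NFD", ub)
    return ua == ub

# 'foo' in single-byte EBCDIC (CP1140) versus 'foo' in UTF-16.
literal = "foo".encode("cp1140")
field = "foo".encode("utf-16-le")
print(strings_equal(literal, "cp1140", field, "utf-16-le"))

# Precomposed U+00E9 versus 'e' followed by U+0301 (combining acute).
precomposed = "\u00e9".encode("cp1140")
decomposed = "e\u0301".encode("utf-16-le")
print(strings_equal(precomposed, "cp1140", decomposed, "utf-16-le"))
print(strings_equal(precomposed, "cp1140", decomposed, "utf-16-le",
                    closed_world=True))
```

The last call shows why the "closed world" assumption matters: if one
input is not actually normalized, skipping the normalization step makes
canonically equivalent strings compare as different.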

Paul Koning wrote:
> Unicode comparison is addressed by the "stringprep" library.

Careful: "stringprep" does more than compare. It performs extra steps,
and its mapping step drops some characters. See
https://datatracker.ietf.org/doc/html/rfc3454#section-3
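
To illustrate the dropping of characters: RFC 3454 table B.1 lists
characters that are "commonly mapped to nothing", for example U+00AD
(soft hyphen). The sketch below uses the stringprep module from the
Python standard library, which exposes the RFC 3454 tables. It applies
only table B.1, not a complete stringprep profile, and the helper name
drop_mapped_to_nothing is made up for this example.

```python
import stringprep

def drop_mapped_to_nothing(text: str) -> str:
    """Remove the characters listed in RFC 3454 table B.1.

    Hypothetical helper; a real stringprep profile does more than this.
    """
    return "".join(ch for ch in text if not stringprep.in_table_b1(ch))

with_soft_hyphen = "co\u00adop"   # 'co', U+00AD SOFT HYPHEN, 'op'
plain = "coop"

# The two strings differ as sequences of characters ...
print(with_soft_hyphen == plain)
# ... but no longer differ once the B.1 characters are dropped.
print(drop_mapped_to_nothing(with_soft_hyphen) == plain)
```

Whether this behaviour is desirable for COBOL comparisons is a separate
question; the point is only that such a comparison is not a comparison
of the original characters.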

> > 2) a limited amount
> > of Unicode evaluation is available in (IIRC) gnulib

Correct. The comparison without normalization is available in
libunistring as functions u8_cmp, u16_cmp, u32_cmp
https://www.gnu.org/software/libunistring/manual/html_node/Comparing-Unicode-strings.html
or u8_strcmp, u16_strcmp, u32_strcmp:
https://www.gnu.org/software/libunistring/manual/html_node/Comparing-NUL-terminated-Unicode-strings.html
Whereas the comparison with normalization is available as
functions u8_normcmp, u16_normcmp, u32_normcmp:
https://www.gnu.org/software/libunistring/manual/html_node/Normalizing-comparisons.html

In Gnulib, each of these functions is available as a Gnulib module:
https://www.gnu.org/software/gnulib/manual/html_node/How-to-use-libunistring.html
https://www.gnu.org/software/gnulib/manual/html_node/_003cunistr_002eh_003e-modules.html
https://www.gnu.org/software/gnulib/manual/html_node/_003cuninorm_002eh_003e-modules.html

Jose Marchesi writes:
> It would be good to avoid duplicating that code though.

Especially as Unicode normalization is a rather complicated algorithm
that depends on data tables, and these tables change with every Unicode
version. If you duplicate that code, upgrades to newer Unicode versions
(which are released about once a year) don't come for free. Whereas if
you use libunistring or Gnulib, they do come for free.

Bruno


