[HACKERS] strcmp() tie-breaker for identical ICU-collated strings

Amit Khandekar Thu, 01 Jun 2017 12:00:10 -0700

While comparing two text strings in varstr_cmp(), if the *strcoll*()
call returns 0, we apply a strcmp() tie-breaker to do a binary
comparison, because strcoll() can return 0 for non-identical strings:


varstr_cmp()
{
    ...
    /*
     * In some locales strcoll() can claim that nonidentical strings are
     * equal.  Believing that would be bad news for a number of reasons,
     * so we follow Perl's lead and sort "equal" strings according to
     * strcmp().
     */
    if (result == 0)
        result = strcmp(a1p, a2p);
    ...
}
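
The tie-breaker pattern quoted above can be sketched outside PostgreSQL
roughly as follows. This is only an illustration, not the actual
varstr_cmp() code: the function name cmp_with_tiebreak() is hypothetical,
and it assumes plain libc strcoll() on NUL-terminated strings.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not PostgreSQL code): compare with strcoll() and
 * fall back to strcmp() when the collation reports the strings as equal.
 */
int
cmp_with_tiebreak(const char *a, const char *b)
{
    int result;

    assert(a != NULL && b != NULL);

    result = strcoll(a, b);
    if (result == 0)
        result = strcmp(a, b);  /* binary tie-breaker */
    return result;
}
```

With this shape, the function returns 0 only when the two strings are
byte-for-byte identical, whatever the collation says.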

But is this supposed to apply to ICU collations as well? If the
collation provider is icu, the comparison is done using
ucol_strcoll*(). I suspect that ucol_strcoll*() intentionally treats
some characters as identical, so doing strcmp() may not make
sense.

For example, if the two characters below are compared using
ucol_strcollUTF8(), it returns 0, meaning the strings are considered equal:
Greek Oxia : UTF-16 encoding : 0x1FFD
(http://www.fileformat.info/info/unicode/char/1ffd/index.htm)
Greek Tonos : UTF-16 encoding : 0x0384
(http://www.fileformat.info/info/unicode/char/0384/index.htm)

The characters are displayed like this :
postgres=# select (U&'\+001FFD') , (U&'\+000384') collate ucatest;
 ?column? | ?column?
----------+----------
 ´        | ΄
(Although the characters in this example look similar, that might not
be the reason they are treated as equal.)

Since ucol_strcoll*() returns 0, these strings always end up being
compared using strcmp(), so 1FFD > 0384 returns true:

create collation ucatest (locale = 'en_US.UTF8', provider = 'icu');

postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
 ?column?
----------
 t
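
The direction of that result follows from the byte values, which can be
checked with a small sketch. It assumes the strings reach strcmp() in
UTF-8 (as with a UTF8 database encoding); the names GREEK_OXIA,
GREEK_TONOS and binary_tiebreak() are hypothetical and used only for
illustration.

```c
#include <assert.h>
#include <string.h>

/* UTF-8 encodings of the two code points discussed in this mail. */
const char GREEK_OXIA[]  = "\xE1\xBF\xBD";  /* U+1FFD */
const char GREEK_TONOS[] = "\xCE\x84";      /* U+0384 */

/* Hypothetical helper: the binary comparison the tie-breaker performs. */
int
binary_tiebreak(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b);
}
```

Because strcmp() compares bytes as unsigned char, the first byte 0xE1 of
U+1FFD sorts after the first byte 0xCE of U+0384, so
binary_tiebreak(GREEK_OXIA, GREEK_TONOS) is positive.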

Whereas, if strcmp() is skipped for ICU collations:

    if (result == 0 && !(mylocale && mylocale->provider == COLLPROVIDER_ICU))
        result = strcmp(a1p, a2p);

... then the comparison using the ICU collation reports the strings as equal:

postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
 ?column?
----------
 f
(1 row)

postgres=# select (U&'\+001FFD') < (U&'\+000384') collate ucatest;
 ?column?
----------
 f
(1 row)

postgres=# select (U&'\+001FFD') <= (U&'\+000384') collate ucatest;
 ?column?
----------
 t
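
The modified condition can also be sketched in isolation. The types
demo_provider and demo_locale and the function demo_finish_compare() are
stand-ins invented for this example, not PostgreSQL's real definitions;
the collation result is passed in as an argument so that no ICU library
is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for illustration only; not PostgreSQL's real definitions. */
typedef enum { DEMO_PROVIDER_LIBC, DEMO_PROVIDER_ICU } demo_provider;
typedef struct { demo_provider provider; } demo_locale;

/*
 * Given the result already returned by the collation routine, apply the
 * strcmp() tie-breaker unless the locale uses the ICU provider.
 * A NULL locale is treated like the non-ICU case.
 */
int
demo_finish_compare(int coll_result, const demo_locale *loc,
                    const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    if (coll_result == 0 &&
        !(loc != NULL && loc->provider == DEMO_PROVIDER_ICU))
        coll_result = strcmp(a, b);
    return coll_result;
}
```

With an ICU locale and a collation result of 0, the function leaves the
result at 0, which corresponds to the "f", "f", "t" outputs for >, <
and <= shown above; with any other locale the byte order decides.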


I have verified that with strcoll() the comparison 1FFD > 0384 is
true. So it looks like the ICU API function ucol_strcoll()
intentionally reports these strings as equal. That is why I feel the
strcmp-if-strcoll-returns-0 logic might not be applicable to ICU. But
I may be wrong; please correct me if I am missing something.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


