On Fri, Sep 26, 2025 at 6:46 PM Robert Haas <[email protected]> wrote:
> Generally, there are two methods for performing collation-aware string > comparisons, and they are expected to produce equivalent results. For > libc collations, the first method is to use strcoll() or strcoll_l() > or similar to directly compared the strings, and the other is to use > strxfrm() or similar to convert the string to a normalize > representation which can then be compared using memcmp(). Equivalent > facilities exist for ICU; see collate_methods_icu and > collate_methods_icu_utf8 in pg_local_icu.c; one or the other of those > should be getting used here. > > Even though these two methods are expected to produce equivalent > results, if there's a bug, they might not. That bug could either exist > in our code or it could exist in ICU; I suspect the latter is more > likely, but the former is certainly possible. What I think you want to > do is try to track down two specific strings where transforming them > via strnxfrm_icu() produces results that compare in one order using > memcmp(), but passing the same strings to strncoll_icu() or > strncoll_icu_utf8() -- whichever is appropriate -- produces a > different result. If you're able to find such strings, then you can > probably rephrase those examples in terms of whatever underlying > functions strncoll_icu() and strxfrm_icu() are calling and we can > maybe report the problem to the ICU maintainers. If no such strings > seem to exist, then there's probably a problem in the PostgreSQL code > someplace, and you should try to figure out why we're getting the > wrong answer despite ICU apparently doing the right thing. > > -- > Robert Haas > EDB: http://www.enterprisedb.com Thanks Robert for your reply! I was able to create an independent C src file. With no relation to PG in it. Did a simple comparison of the two methods mentioned by Robert for '1' & 'a' as input strings. The ASCENDING SORT ORDER of them is printed. Here is how I compiled the C src code:- "gcc icu_issue_repro.c -L/usr/lib -licui18n -licuuc -licudata -o icu_issue_repro.o" Here is its output:- ------------------------------------------------------------------------------------------ Testing sort order for '1' & 'a' using ICU library with collation = 'ja-u-kr-latn-digit' With Method ucol_strcollUTF8(): SORT ORDER ASC : '1', 'a' With Method ucol_nextSortKeyPart() (i.e transform and memcmp): SORT ORDER ASC : 'a', '1' Testing Ends ------------------------------------------------------------------------------------------ We can clearly see that both methods are showing different sort order for the same input strings. So, I think this helps confirm that the problem lies in the ICU library method "ucol_strcollUTF8()". Robert, if you agree with my shared details, can you please help me with details on how I can share this issue to ICU maintainers for the correction? Thank you! PFA, the C src code. Regards, Nishant Sharma, EDB, Pune.
icu_issue_repro.c
Description: Binary data
