Re: bug in join: case comparisons don't work in multibyte locales

Bruno Haible Thu, 12 Mar 2009 04:39:37 -0700

Pádraig Brady wrote:
> Note as well as folding case I think it might
> be useful to fold other forms like:
>   Enclosed:  \u24b6 -> A
>   Stylistic: \uff21 -> A


These two transformations are already performed when you call ulc_casecmp
with the UNINORM_NFKD argument.
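
To illustrate what the NFKD step buys you, here is a minimal sketch in
Python rather than C. It uses the standard unicodedata module and
str.casefold as a stand-in; it is not ulc_casecmp itself, and the helper
names are made up for this example.

```python
import unicodedata


def nfkd_casefold(s: str) -> str:
    """Hypothetical helper: compatibility-decompose, fold case, renormalize."""
    decomposed = unicodedata.normalize("NFKD", s)
    # Case folding can produce non-normalized output, so normalize again.
    return unicodedata.normalize("NFKD", decomposed.casefold())


def caseless_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFKD + case folding."""
    return nfkd_casefold(a) == nfkd_casefold(b)


print(caseless_equal("\u24b6", "a"))   # enclosed A vs. a: True
print(caseless_equal("\uff21", "A"))   # fullwidth A vs. A: True
print(caseless_equal("\u00c0", "A"))   # A with grave vs. A: False
```

The last line shows that NFKD alone splits off the accent as a combining
character but does not drop it, so diacritics still count as a
difference.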

>   Diacritics:  À -> A

Very good point. Case-insensitive comparisons are used in contexts
where different people enter the same word / name / term. But in these
contexts, additional transformations need to be done, depending on
culture. I think Google's front end to the search engine does these
transformations. They are:
  - for French, to remove accents and diacritics,
  - for German, to transform umlauts (ü -> ue),
  - for Danish, probably to transform å -> aa,
  - and certainly much more for other languages (what is it for Chinese?)
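
A rough sketch of such language-dependent folding, in Python for brevity.
The rule tables and the lookup_key function are assumptions made up for
this example; they cover only the three cases listed above and are not
taken from any existing library.

```python
import unicodedata

# Hypothetical per-language replacement tables (illustrative, incomplete).
# Keys are lowercase precomposed characters.
REPLACEMENTS = {
    "de": {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"},  # umlauts
    "da": {"\u00e5": "aa"},                                  # a with ring
}

# Languages for which accents are simply dropped (assumption: French only).
STRIP_MARKS = {"fr"}


def strip_marks(s: str) -> str:
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lookup_key(s: str, lang: str) -> str:
    """Hypothetical helper: build a key for caseless lookup in `lang`."""
    key = unicodedata.normalize("NFC", s).casefold()
    for src, dst in REPLACEMENTS.get(lang, {}).items():
        key = key.replace(src, dst)
    if lang in STRIP_MARKS:
        key = strip_marks(key)
    return key


print(lookup_key("M\u00fcller", "de"))      # mueller
print(lookup_key("\u00c9l\u00e8ve", "fr"))  # eleve
print(lookup_key("\u00c5rhus", "da"))       # aarhus
```

Note that the same input gives a different key under different rules:
with the French rule the German name above becomes "muller", which would
not match "Mueller".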

> I.E. have more general function like:
> ulc_coll(fold={Case|Diacritics|Stylistic}, ...);

_coll or _cmp? _coll is used when people want to put lists of names in
order. The use case for ignoring diacritics is lookup, not sorting.

Also, as mentioned above, I think which parts should be folded is
locale-dependent. For French, it is OK to ignore diacritics when doing
caseless matching; for German, it is not.

Bruno


_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils
