Re: bug in join: case comparisons don't work in multibyte locales

Pádraig Brady Thu, 12 Mar 2009 05:07:56 -0700

Bruno Haible wrote:
> Pádraig Brady wrote:
>> Note that as well as folding case, I think it might
>> be useful to fold other forms like:
>>   Enclosed:  \u24b6 -> A
>>   Stylistic: \uff21 -> A
> 
> These two transformations are already executed when you use ulc_casecmp
> with the UNINORM_NFKD argument.


Ah right, they're covered by compatibility equivalence (UAX #15):
http://www.unicode.org/reports/tr15/
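
To illustrate, here is a minimal sketch in Python rather than C, using
the standard unicodedata module instead of ulc_casecmp. The helper names
fold_variants and caseless_key are made up for this example, and
caseless_key is only a rough approximation of a caseless comparison with
NFKD, not what libunistring does internally.

```python
import unicodedata


def fold_variants(s: str) -> str:
    """Normalize to NFKD, replacing compatibility characters
    (enclosed, fullwidth, ...) with their decompositions."""
    return unicodedata.normalize("NFKD", s)


def caseless_key(s: str) -> str:
    """Rough caseless comparison key: fold case, then apply NFKD."""
    return unicodedata.normalize("NFKD", s.casefold())


# Enclosed and fullwidth capital A both decompose to plain "A".
print(fold_variants("\u24b6") == "A")
print(fold_variants("\uff21") == "A")

# NFKD alone does not remove diacritics: the accent survives
# as a separate combining character.
print(fold_variants("\u00c0") == "A\u0300")
```

So the enclosed and stylistic forms compare equal to "A" after NFKD,
while the diacritics case still needs a separate step.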

> 
>>   Diacritics:  À -> A
> 
> Very good point. The case-insensitive comparisons are used in contexts
> where different people enter the same word / name / term. But in these
> contexts, additional transformations need to be done, depending on
> culture. I think Google's front end to the search engine does these
> transformations. They are:
>   - for French, to remove accents and diacritics,
>   - for German, to transform umlauts (ü -> ue),
>   - for Danish, probably to transform å -> aa,
>   - and certainly much more for other languages (what is it for Chinese?)
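
A culture-dependent lookup key along those lines might be sketched as
follows (Python, standard library only). The TRANSLIT table and the
lookup_key function are hypothetical; the table holds only the German
umlaut expansions and the Danish example from the list above, and I'm
not claiming it is complete or that any search engine works this way.

```python
import unicodedata

# Hypothetical per-language transliteration tables (illustrative only).
TRANSLIT = {
    "de": str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"}),
    "da": str.maketrans({"å": "aa"}),
}


def strip_marks(s: str) -> str:
    """Decompose to NFD and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def lookup_key(s: str, lang: str) -> str:
    """Build a caseless lookup key using the rules for `lang`."""
    # Compose first so the single-character table entries match.
    s = unicodedata.normalize("NFC", s).casefold()
    if lang in TRANSLIT:
        return s.translate(TRANSLIT[lang])
    if lang == "fr":
        return strip_marks(s)
    return s


print(lookup_key("Müller", "de") == lookup_key("Mueller", "de"))
print(lookup_key("Århus", "da") == lookup_key("Aarhus", "da"))
print(lookup_key("Élève", "fr") == lookup_key("eleve", "fr"))
```

Keys like this are only meant for matching, not for ordering.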
> 
>> I.e. have a more general function like:
>> ulc_coll(fold={Case|Diacritics|Stylistic}, ...);
> 
> _coll or _cmp? _coll is used when people want to put lists of names in
> order. The use case where diacritics are ignored is to do lookups, not for
> sorting.

Sorry, you're right: _cmp.

> Also, as mentioned above, I think which parts should be folded is locale
> dependent. For French, it is ok to ignore diacritics when doing caseless
> matching; for German, it is not.

Well, that would work if the locale database stored this info (I don't
think it does). Otherwise it would be left as an option to the user, like:
sort --fold={case,variants,diacritics,all} where "variants" corresponds to NFKD.
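
Note that --fold is only a proposal here, not an existing sort option.
As a rough sketch of what the selectable folding could mean, in Python
with the standard unicodedata module (the fold and sort_folded names are
made up for illustration):

```python
import unicodedata


def fold(s: str, options) -> str:
    """Fold `s` according to a set of option names:
    "case", "variants" (NFKD), "diacritics", or "all"."""
    opts = set(options)
    if "all" in opts:
        opts = {"case", "variants", "diacritics"}
    if "variants" in opts:
        # Compatibility decomposition: enclosed, fullwidth, ... forms.
        s = unicodedata.normalize("NFKD", s)
    if "diacritics" in opts:
        # Decompose, then drop the combining marks.
        s = "".join(
            c for c in unicodedata.normalize("NFD", s)
            if not unicodedata.combining(c)
        )
    if "case" in opts:
        s = s.casefold()
    return s


def sort_folded(lines, options):
    """Sort lines by their folded form (the sort is stable)."""
    return sorted(lines, key=lambda line: fold(line, options))


print(sort_folded(["b", "\u00c0", "a"], {"case", "diacritics"}))
```

Each option is applied independently, so "all" is just the union of
the three.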

cheers,
Pádraig.


_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils
