Bruno Haible wrote:
> Pádraig Brady wrote:
>> Note as well as folding case I think it might
>> be useful to fold other forms like:
>> Enclosed: \u24b6 -> A
>> Stylistic: \uff21 -> A
>
> These two transformations are already executed when you use ulc_casecmp
> with the UNINORM_NFKD argument.

Ah right, they're covered by compatibility equivalence:
http://www.unicode.org/reports/tr15/

>
>> Diacritics: À -> A
>
> Very good point. The case-insensitive comparisons are used in contexts
> where different people enter the same word / name / term. But in these
> contexts, additional transformations need to be done, depending on
> culture. I think Google's front end to the search engine does these
> transformations. They are:
> - for French, to remove accents and diacritics,
> - for German, to transform umlauts (ü -> ue),
> - for Danish, probably to transform å -> aa,
> - and certainly much more for other languages (what is it for Chinese)?
>
>> I.e. have a more general function like:
>> ulc_coll(fold={Case|Diacritics|Stylistic}, ...);
>
> _coll or _cmp ? _coll is used when people want to put lists of names in
> order. The use case where diacritics are ignored is to do lookups, not for
> sorting.

Sorry, you're right: _cmp.

> Also, as mentioned above, I think which parts should be folded is locale
> dependent. For French, it is ok to ignore diacritics when doing caseless
> matching; for German, it is not.

Well, that would work if the locale database stored this info
(I don't think it does). Otherwise it would be left
as an option to the user, like:

sort --fold={case,variants,diacritics,all}

where "variants" corresponds to NFKD.

cheers,
Pádraig.
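
To make the three proposed fold classes concrete, here is a rough Python sketch using the standard unicodedata module. The function name fold_key and the option names are hypothetical and simply mirror the --fold values suggested above; this is not the libunistring API or an existing sort option. It is also locale-blind, so it strips diacritics the "French" way and knows nothing about German ü -> ue or Danish å -> aa.

```python
import unicodedata

# Hypothetical option names, mirroring the proposed --fold values.
FOLD_OPTIONS = {"case", "variants", "diacritics"}


def fold_key(text, fold=("case",)):
    """Return a comparison key for text with the requested folds applied."""
    opts = set(FOLD_OPTIONS) if "all" in fold else set(fold)
    unknown = opts - FOLD_OPTIONS
    if unknown:
        raise ValueError("unknown fold option(s): %s" % sorted(unknown))

    if "variants" in opts:
        # Compatibility decomposition: U+24B6 and U+FF21 both become "A".
        text = unicodedata.normalize("NFKD", text)

    if "diacritics" in opts:
        # Decompose canonically, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))

    if "case" in opts:
        # Full Unicode case folding (more aggressive than lower()).
        text = text.casefold()

    return text
```

For a lookup, two strings would be treated as matching when their keys are equal, for example fold_key(a, ["all"]) == fold_key(b, ["all"]). Note that "variants" on its own leaves À as "A" plus a combining grave accent, so it does not fold diacritics; that needs the separate option. The sketch skips the repeated normalization passes that a strict Unicode caseless match calls for.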