On Aug 20, 2009, at 02:35, Ruotger Skupin wrote:



Complex locale aware Unicode text queries can be slow. If you find yourself spending time with such a query, you should consider some of the techniques shown in the DerivedProperty example available on ADC.
Isn't all text Unicode?

No. Not all apps are Unicode based, and many of the ones that aren't will put things on the pasteboard quite happily. The web (and thus anything copied out of a web browser) is definitely not all Unicode, especially the older pages. And even within Unicode there are multiple encoding formats (8, 16, and 32 bit). In addition to the varying encoding sizes, Unicode also has multiple ways to represent conceptual characters. Characters that have diacritics, for example, can be represented as either one unichar ('é') or two ('e' followed by a combining '´').
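
To make the composed/decomposed point concrete, here is a minimal sketch. It uses Python's standard unicodedata module purely for illustration (it is not Cocoa API, and the variable names are made up for this example); the same idea applies to any Unicode string type.

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Both render as the same conceptual character, but a naive
# element-by-element comparison sees two different strings.
raw_equal = composed == decomposed

# Normalizing both sides to the same form (NFC here) makes them equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

print(len(composed), len(decomposed), raw_equal, nfc_equal)
# prints: 1 2 False True
```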

I don't understand. This shouldn't be a special case. But I will have a look at the sample.

In my case I'd guess that at least half of the objects contain unicode strings (international names and addresses). What I want to say: write anything in German or French and you end up with Unicode.




Due to the multiplicity of representations, text comparisons in Unicode can be slow, since instead of just doing a byte by byte comparison, you end up needing to calculate character sizes, check for compositions/decompositions, check for analogues between different symbol systems used to represent a single language (e.g. kana and kanji), recognize and drop punctuation, etc. For apps that do repeated comparisons against a set of strings, it can be worth it to preprocess all strings into one canonical format to minimize the amount of work that needs to be done during a comparison (make all strings UTF-8/16/32, make all characters lowercase, strip all diacritics or ensure characters that have them are always in either their composed or decomposed forms, etc.) and then use a less expensive collation for the comparison.
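
A rough sketch of that preprocessing step, again in Python for illustration only. The search_key helper is hypothetical (it is not taken from the DerivedProperty sample), and stripping diacritics is a deliberate lossy choice that will not suit every language or app.

```python
import unicodedata

def search_key(text):
    """Hypothetical helper: reduce a string to one canonical form
    so later comparisons can be plain equality checks."""
    # Decompose so base letters and combining marks are separate code points.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (the diacritics).
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    # Fold case, then recompose whatever is left.
    return unicodedata.normalize("NFC", stripped.casefold())

names = ["Müller", "MULLER", "Rene\u0301", "René"]
keys = [search_key(name) for name in names]
print(keys)
```

The idea is to compute the key once, store it next to the original string, and run the repeated comparisons against the stored key instead of the original text.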

As a side note, if you want to use regular expressions on Unicode strings, you generally need to do the normalization anyway, since regex languages operate at the unichar level rather than at the conceptual character level.
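
A small illustration of the regex issue, using Python's re module as a stand-in for whatever regex engine you use (the pattern and strings are invented for the example). A single "." consumes one code point, so it covers the precomposed 'é' but not the two-element decomposed form.

```python
import re
import unicodedata

pattern = re.compile(r"caf.")   # "caf" plus exactly one more element

composed = "caf\u00e9"          # 4 code points
decomposed = "cafe\u0301"       # 5 code points, same conceptual text

match_composed = pattern.fullmatch(composed) is not None
match_decomposed = pattern.fullmatch(decomposed) is not None

# Normalizing to the composed form first restores the match.
normalized = unicodedata.normalize("NFC", decomposed)
match_normalized = pattern.fullmatch(normalized) is not None

print(match_composed, match_decomposed, match_normalized)
# prints: True False True
```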

+Melissa



_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
