On 4 Mar '08, at 8:55 PM, John Engelhart wrote:

It's sort of ambiguous if the /usr/lib/libicucore library is 'supported' or not. I believe the general consensus is that it's not really there for public use, hence the missing headers, but it's also not verboten.

Yeah, this is annoying. I don't know the reason for omitting the headers; Deborah Goldsmith would know (she's the ICU expert at Apple) but I don't know whether she reads this list.

The ICU Regex C API (the one I need to use for RegexKit, not the C++ one, which I haven't really looked at) is very multi-threading unfriendly. Basically, the 'compiled' regex, the string being matched, and the current match state are all wrapped up in the same opaque compiled regex pointer.

Well, I'm pretty multi-threading unfriendly myself, so that hasn't been a concern for me ;-) But seriously, IIRC there is a way to cheaply clone an ICU regex object, so you can compile it once and peel off a new copy for every string you need to match. (I wrote, but never finished, a Cocoa ICU wrapper before I left Apple, and I think that was my solution to the state problem.)

RegexKit spends considerable effort in trying to get access to the raw NSString buffer, to avoid unnecessary creation and destruction of temporary buffers to perform a match.

This is definitely a concern. I suspect it's the major reason there isn't an NSRegularExpression API yet. There's been talk of enhancing the ICU regex API to make it more flexible in how it accepts strings, but IMHO waiting for this is a case of "the best being the enemy of the good".

PCRE only works with UTF-8 encoded strings, while ICU only works in UTF-16. [...] most NSString buffers tend to be in a UTF-8 compatible form, allowing fast access by PCRE. Using ICU would require creating a temporary buffer and converting to UTF-16 for most strings (again, usage dependent), only for that buffer to be released/freed right after use.

I looked into this once. CFStrings (and NSStrings) are stored in one of two formats: (1) UTF-16, or (2) the "default C encoding". The latter varies by what your current locale is, but it defaults to ... MacRoman. [Yay for OS 9 compatibility! :P] This means that strings are *never* stored in UTF-8 form, at least not in English-speaking locales. (On the other hand, CFString is fairly smart about encodings, so if the string is all-ASCII, it realizes that's compatible with UTF-8 and can return the raw buffer if you ask for UTF-8.)

In my limited experiments, most strings I looked at were being stored in UTF-16. But it's heavily dependent on how the strings were created and what characters they contain, so YMMV.
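To make the all-ASCII point concrete, here is a small sketch in Python (not CFString; the helper name utf8_compatible and the sample strings are made up for illustration). It shows that a pure-ASCII string has the same bytes in MacRoman and UTF-8, while a single accented character makes the two encodings diverge:

```python
def utf8_compatible(raw: bytes) -> bool:
    """True if every byte is 7-bit ASCII, so the buffer is also valid UTF-8."""
    return all(b < 0x80 for b in raw)

ascii_str = "hello"
accented = "caf\u00e9"  # "cafe" with an acute accent on the e

# Pure ASCII: MacRoman and UTF-8 produce identical bytes.
same_bytes = ascii_str.encode("mac_roman") == ascii_str.encode("utf-8")

# One non-ASCII character: the encodings no longer agree.
mac = accented.encode("mac_roman")    # one byte for the accented e
utf8 = accented.encode("utf-8")       # two bytes for the accented e
utf16 = accented.encode("utf-16-le")  # two bytes for every character here

print(same_bytes, mac, utf8, len(utf16))
print(utf8_compatible(ascii_str.encode("mac_roman")), utf8_compatible(mac))
```

So a buffer stored in the 8-bit encoding can be handed out as UTF-8 only when it happens to contain nothing above 0x7F; anything else needs a conversion.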

For example, Safari AdBlock (http://safariadblock.sourceforge.net/) uses RegexKit as its regex matching engine. This involves a list of about 500 regexes (depending on which adblock lists you've subscribed to) that need to be executed for every URL.

Um, can't you merge those together into a single regex by joining them together with "or" operators? (That's a fairly typical trick that lexers use.)
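A minimal sketch of that trick, using Python's re module rather than RegexKit or PCRE. The three patterns are invented stand-ins for the real adblock list, and is_blocked is a hypothetical helper name:

```python
import re

# Hypothetical ad-blocking patterns, standing in for the ~500 real ones.
patterns = [
    r"/ads?/",
    r"doubleclick\.net",
    r"[?&]banner=",
]

# Wrap each pattern in a non-capturing group so "|" cannot split a pattern
# in the wrong place, then join everything into a single alternation.
merged = re.compile("|".join(f"(?:{p})" for p in patterns))

def is_blocked(url: str) -> bool:
    """True if any of the original patterns matches somewhere in the URL."""
    return merged.search(url) is not None

for url in ("http://example.com/ads/x.gif",
            "http://ad.doubleclick.net/foo",
            "http://example.com/index.html"):
    print(url, is_blocked(url))
```

One caveat: if the individual patterns use numbered capture groups or backreferences, joining them renumbers the groups, so this only works as-is for patterns that are pure yes/no matchers.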

My zero-order read on ICU vs. PCRE on this issue is that they are essentially equal. However, PCRE and ICU define 'word' and 'non-word' (the regex escape sequences \w and \W), and consequently the '(non-)word break' (escape sequences \b and \B), very differently. Specifically, PCRE defines word and non-word in terms of ASCII encoding ONLY, whereas ICU does not.

What you're saying is that they're essentially equal, except for non-ASCII characters :)
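The size of that exception is easy to see with any engine that can switch between the two definitions. The sketch below uses Python's re module as a stand-in (it is neither PCRE nor ICU): the re.ASCII flag gives an ASCII-only \w, and the default for str patterns is a Unicode-aware \w. The sample text is arbitrary.

```python
import re

# "naive" with a diaeresis, "cafe" with an acute accent, then three kanji.
text = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e"

# ASCII-only \w: accented letters and kanji count as non-word characters,
# so words are cut wherever one appears.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode-aware \w: the same characters are ordinary word characters.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)
print(unicode_words)
```

With the ASCII definition the first word is split in two, the second loses its last letter, and the kanji are not found at all; with the Unicode definition all three come back whole.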

ICU takes Unicode very, very seriously; that's its raison d'être. It's the International Components for Unicode. Regexes are just one of the things it does.

Translated to: a positive look-behind (the character just before this point in the regex) must be a Unicode character, and a positive look-ahead (the next character, without 'consuming' the input) must not be a Unicode character. Definitely not as elegant, but I suspect passable.

Nope. As I said, several languages (including Japanese) have word-break rules that are more complex than this. Multiple words run together without any non-word characters in between. You have to use per-language heuristics to find the breaks. (My understanding is that Thai is especially nasty, practically requiring the use of a dictionary to tweeze apart the individual words.)

And as I said, this isn't just hypothetical. It became a Priority 1, stop-the-presses bug for my project in 2005 as soon as the Japanese testers started trying out the functionality that used PCRE and discovered that it didn't work.
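The run-together problem shows up even with a Unicode-aware \w. This sketch again uses Python's re module as a stand-in, not ICU or PCRE; boundary_words is a made-up helper and the Japanese sample is just a well-known sentence made of several words:

```python
import re

english = "I am a cat"
# A Japanese sentence with the same meaning, written with no spaces.
japanese = "\u543e\u8f29\u306f\u732b\u3067\u3042\u308b"

def boundary_words(text: str) -> list:
    """Split text into 'words' using only \\b and \\w, with no dictionary."""
    return [m.group() for m in re.finditer(r"\b\w+\b", text)]

print(boundary_words(english))
print(boundary_words(japanese))
```

Every character in the Japanese sentence is a word character, so the only boundaries a character-class test can find are at the very start and the very end, and the whole sentence comes back as one "word". Finding the real breaks needs language-specific rules of the kind described above.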

—Jens
