On Thu, Jul 30, 2009 at 8:04 PM, BareFeet <list.develo...@tandb.com.au> wrote:
> Hi John and all,
>
>>> You might want to look at AGRegex which is very compact (one class) and
>>> which uses PCRE:
>>>
>>> http://colloquy.info/project/browser/trunk/Frameworks/AGRegex
>>
>> Of note, Colloquy appears to have switched to RegexKitLite itself:
>>
>> http://svn.colloquy.info/project/changeset/4301

Just to be clear, I'm the author of RegexKitLite (and RegexKit.framework). I just like to be up front about that so you can apply whatever amount of bias filtering you want to any claims or statements I make.

> I did notice that log entry, but thought it was never acted upon (ie they
> are still using AGRegex).

I can't say I did any kind of exhaustive check, but I was under the impression that they had definitely switched over. I even got a bug report from them.

> RegexKitLite looks promising. It claims to only require you to add the .h
> and .m file to your project and link to the libicucore.dylib library.
>
> The documentation notes: "Warning: Apple does not officially support
> linking to the libicucore.dylib library." In reality, how worried should I
> be about this? I am amazed that Cocoa doesn't provide regex itself. Surely
> Apple must provide or recommend something to do the job.

With regard to linking to libicucore.dylib, that's kind of a grey area. I try to be as up front as possible about that fact in the documentation. What follows is my opinion and carries no official weight. So far as I know it's an accurate representation of the facts, and I've tried to keep it objective:

The shared library that causes the controversy is /usr/lib/libicucore.dylib. I've searched the documentation and could find nothing that explicitly forbids linking against it, or against anything else in /usr/lib. If one subscribes to the common Unix traditions, the /usr/lib directory is generally considered "fair game" for linking against: it is one of the common locations for a system's publicly available shared libraries.
By placing a library in /usr/lib, one implicitly declares it "publicly available".

The next stumbling block is the need for headers. A default install of Mac OS X does not include the ICU headers one would normally need to make use of the ICU library. However, the ICU project is an open source project, so one can (easily?) assemble a suitable set of headers if one is so inclined. Not only that, but Apple provides a tarball of their branch of ICU that is used to build the binaries that are present on every Mac OS X system. Furthermore, that tarball's makefile includes a target to install the ICU headers on your system. Although they are a bit convoluted to actually get, Apple does publicly provide the headers for the ICU library. See http://www.opensource.apple.com/tarballs/ICU/ for the tarballs.

After that, the next criterion is whether or not the API is documented. It's safe to say that the ICU API is documented, although not by Apple. Apple actually refers to the ICU documentation in certain parts of its official documentation (NSPredicate, with regard to regular expressions and the MATCHES operator).

So, it comes down to a matter of opinion and a judgement call. Considering how easy it is to create a location in the file system that makes it clear that the shared libraries within are private, I'm of the opinion that the /usr/lib/libicucore.dylib file is definitely in the public category. Even private frameworks have their own slice in /System/Library/PrivateFrameworks, which makes it pretty clear that the contents within are off-limits. Even within public frameworks there is the PrivateHeaders folder for non-public API information.

Up next is whether or not the lack of headers makes the library "private". If this were a proprietary library, I'd probably lean towards "makes it private". However, it's a publicly available open-source project, so it becomes a little more grey.
The fact that Apple publicly provides everything needed to build an exact copy of the version of ICU that's shipped with the system, along with the ability to install the headers, makes it really grey. Personally, I'm inclined to say that it's in the "not private" category.

I think it's fair to say that the "undocumented API" clause doesn't apply.

Finally, I'm not aware of any official decrees that explicitly make /usr/lib/libicucore.dylib a "private API". What advice has come from Apple has been extremely ambiguous, usually with a caveat along the lines of "this may not be officially supported".

From a purely pragmatic perspective, it makes a lot of sense for Apple to provide the headers and make it an "Official, Public API". First and foremost is consistency for applications: it removes the need for every developer to duplicate work that's already been done and to fill their .app/ distribution with yet another copy of a (rather large) shared library. Another big plus is security: if a problem is found in the ICU library, Apple can provide an updated shared library with the fix, and every single application that links against it is automatically 'patched'. That's a fairly compelling reason all by itself.

Moving on to why Apple doesn't provide this functionality: well, I don't work for Apple, so this is nothing but raw speculation based on snippets of public postings. It's my understanding that one of the big stumbling blocks has been the fact that the ICU regex engine can only match text that is encoded as UTF-16. NSString (or, more correctly, CFString) keeps its internal buffer of a string's contents in either an 8-bit format or UTF-16 (the normal warnings about the internal, private details of an object's implementation apply). The 8-bit format is normally MacOSRoman, which is a superset of ASCII.
An awful lot of strings can be encoded as MacOSRoman, and doing so takes up half the space of the UTF-16 equivalent (1 byte per character vs nominally 2 bytes per character). So, there's a bit of an impedance mismatch, since the ideal situation would be for the ICU regex engine to be encoding agnostic with respect to the text it's searching.

RegexKitLite dodges this bullet by keeping a cache of the most recent UTF-16 conversion for a string, if one was even needed (if a string's backing buffer is already UTF-16 encoded, it just uses that directly). This works out well for the majority of use cases. The usual caveats regarding caching apply: caching works by exploiting the temporal locality of typical usage patterns, and usage patterns that exceed the "working set" capacity of a cache can cause a dramatic drop in performance.

> As quoted earlier:
>
>>> Unfortunately, RegexKit Lite (the stripped-down version) uses the built-in
>>> ICU library which uses a syntax quite different to the PCRE that most people
>>> are used to.
>
> At first glance through the "ICU Syntax" documentation included with
> RegexKitLite, it appears the same as what I'm used to. At least it supports
> \s for whitespace, \w for words, (?=...) for look ahead. I did, however,
> discover:
>
>> Single Quote
>> Two single quotes represent a single quote, either inside or outside
>> single quotes. Text within single quotes is not interpreted in any way,
>> except for two adjacent single quotes. It is taken as literal text; special
>> characters become non-special. These quoting conventions for ICU character
>> classes differ from those of Perl or Java. In those environments, single
>> quotes have no special meaning, and are treated like any other literal
>> character.
>
> I guess I can deal with that.
>
> Has anyone discovered any other issues (or had successes) dealing with ICU
> syntax in RegexKitLite and RegexKitLite in general?
In general, I've found the ICU and PCRE regex syntax to be essentially identical, at least for the most commonly used regex features. About the only thing I really miss from PCRE is "named capture patterns".

Off the top of my head, these are features in PCRE that aren't in ICU:

  - Named captures (and, by extension, named back-references).
  - Recursive and conditional patterns, along with pattern subroutines.
  - PCRE's more elaborate backtracking control.
  - A handful of not commonly used meta-characters (ie, \R would be nice to have, and I can't think of anything else off the top of my head).

So, essentially, some of the very advanced and not commonly used features.

Some features present in ICU and not in PCRE:

  - Vastly more sophisticated word breaking (a regex pattern option; it can do word breaking on Thai, for example). This can be a make or break feature in its own right: if you need it, you /NEED/ it.
  - More elaborate character sets (can perform basic (math) set operations on [] character sets: union, intersection, minus).
  - The \p / \P {} meta-characters can accept "more" stuff, pretty much the whole gamut of Unicode properties.

This is not meant to be an exhaustive list, but I think it covers the major items. There are also some minor idiosyncrasies, such as when some characters need to be escaped relative to the other syntax, but they're fairly rare (I can't even come up with an example on the fly, I just remember that it's popped up from time to time).

One final note: if the strings you're going to be matching are predominantly "Unicode heavy" (ie, not simple ASCII or MacOSRoman), definitely go for ICU / RegexKitLite. While the ICU regex engine is limited to only working on UTF-16 encoded strings, PCRE can only work on UTF-8 strings. NSString 'ranges' are always in UTF-16 code units, which for most languages is a 1:1 mapping of offset to character. UTF-8 uses a variable-length encoding, and converting between UTF-8 byte offsets and UTF-16 offsets is brutal.
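
To make that last point a bit more concrete, here is a minimal Python sketch of mapping a UTF-8 byte offset (the kind of offset a UTF-8 based engine reports) to a UTF-16 code unit offset (the kind an NSRange uses). It is only an illustration and is not code from RegexKitLite, PCRE or ICU: the function name utf8_to_utf16_offset is made up for this example, and it assumes the input is already valid UTF-8. The thing to notice is that every conversion has to walk the bytes from the start of the string.

```python
def utf8_to_utf16_offset(utf8: bytes, byte_offset: int) -> int:
    """Map a byte offset in valid UTF-8 data to a UTF-16 code unit offset.

    Hypothetical helper for illustration only; assumes `utf8` is well-formed.
    """
    if not 0 <= byte_offset <= len(utf8):
        raise ValueError("byte offset out of range")

    units = 0  # UTF-16 code units counted so far
    i = 0      # current position in the UTF-8 bytes
    while i < byte_offset:
        lead = utf8[i]
        if lead < 0x80:            # 1-byte sequence (ASCII)
            length, width = 1, 1
        elif lead >> 5 == 0b110:   # 2-byte sequence
            length, width = 2, 1
        elif lead >> 4 == 0b1110:  # 3-byte sequence
            length, width = 3, 1
        elif lead >> 3 == 0b11110: # 4-byte sequence: needs a surrogate pair
            length, width = 4, 2
        else:
            raise ValueError("unexpected continuation or invalid lead byte")
        i += length
        units += width

    if i != byte_offset:
        raise ValueError("byte offset falls inside a multi-byte sequence")
    return units


if __name__ == "__main__":
    # "naive" with a diaeresis, then an emoji outside the BMP, then " x".
    data = "na\u00efve \U0001F600 x".encode("utf-8")
    print(utf8_to_utf16_offset(data, 12))  # byte offset of the final "x"
```

Since each lookup is a linear scan, translating many match ranges this way gets expensive quickly unless you build and maintain some kind of offset index, which is the sort of bookkeeping you avoid entirely when the engine works on UTF-16 directly.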
_______________________________________________
Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)