On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote: > ints, or 176 bytes. Searching for membership in an inversion list is > O(N log N) (binary search). "Encoding the whole range" is a non-issue > bordering on a joke: two ints, or 8 bytes.
[Clarification from a noncombatant] You meant O(log N). I like the inversion list idea. But its speed is proportional to the toothiness of the character class, and while I have good intuition for what that means in 7-bit US-ASCII, I have no idea how bad it gets for other languages. "Vowels"? "Capital letters"? Would anyone ever want to select all Chinese characters with a particular radical? That's just lookup. We should also consider other character class operations: union, subtraction, intersection. They're pretty straightforward and fast (O(N)) for inversion lists. (Yes, all these operations can be postponed until lookup time, regardless of the underlying represention, in which case the time of union(C1,C2) is just the time of C1 + time of C2 + time of an 'or'.)