Re: on parrot strings

Jarkko Hietaniemi Fri, 18 Jan 2002 13:56:46 -0800

On Fri, Jan 18, 2002 at 01:40:26PM -0800, Steve Fink wrote:
> On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote:
> > ints, or 176 bytes. Searching for membership in an inversion list is
> > O(N log N) (binary search).  "Encoding the whole range" is a non-issue
> > bordering on a joke: two ints, or 8 bytes.
> 
> [Clarification from a noncombatant] You meant O(log N).


Duh, yes.  At least someone is awake :-)

> I like the inversion list idea. But its speed is proportional to the
> toothiness of the character class, and while I have good intuition for

Yup.

> what that means in 7-bit US-ASCII, I have no idea how bad it gets for
> other languages. "Vowels"? "Capital letters"? Would anyone ever want

As far as I can see, and guestimate (watch out for waving hands),
it would behave pretty well In Real Life.  If we are talking about the
predefined existing categories like Lu, or Greek script, or Cyrillic
block, they are pretty well localized and not scattershot.
User-specified characters are likely to be well localized to
one or few scripts.

> to select all Chinese characters with a particular radical?
> 
> That's just lookup. We should also consider other character class
> operations: union, subtraction, intersection. They're pretty
> straightforward and fast (O(N)) for inversion lists. (Yes, all these

Yes, since they are by definition sorted, merging (or negatively
merging) them is pretty simple.

> operations can be postponed until lookup time, regardless of the
> underlying represention, in which case the time of union(C1,C2) is
> just the time of C1 + time of C2 + time of an 'or'.)

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen

Re: on parrot strings

Reply via email to