Re: [Chicken-users] utf8 and string-ref performance

Peter Bex Wed, 24 Nov 2010 08:43:07 -0800

On Wed, Nov 24, 2010 at 09:33:24AM -0700, Alan Post wrote:
> I'm using irregex for character class matching.


The Irregex in experimental is reasonably fast for charsets, giving
O(log(n)) performance for charset membership checks.  If the charset
is contiguous (i.e., with no gaps) it's actually O(1).

It's much less efficient than iset on fragmented character sets, but
on huge unbroken character sets it can be faster.  Irregex stores a
vector of cons cells, each holding the start and end of one contiguous
range within the character set, whereas iset stores small bit-vectors
for subranges, stored in a btree.
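
To make the range-vector idea concrete, here is a rough sketch in
Python rather than Scheme.  It is not Irregex's actual code; the names
make_charset and charset_contains are made up for illustration.  It
only shows how a sorted list of start/end ranges gives O(log(n))
membership via binary search, and why a single unbroken range amounts
to one comparison.

```python
from bisect import bisect_right

def make_charset(ranges):
    """Build a charset from inclusive (start, end) code-point ranges.

    Hypothetical helper: sorts the ranges and merges any that overlap
    or touch, so the result has no redundant entries.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or adjacent: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    starts = [s for s, _ in merged]
    ends = [e for _, e in merged]
    return starts, ends

def charset_contains(charset, ch):
    """Return True if character ch falls inside one of the ranges."""
    starts, ends = charset
    cp = ord(ch)
    # Binary search for the last range starting at or before cp.
    i = bisect_right(starts, cp) - 1
    return i >= 0 and cp <= ends[i]

alnum = make_charset([(ord("a"), ord("z")),
                      (ord("0"), ord("9")),
                      (ord("A"), ord("Z"))])
```

With only one range in the list, the binary search has a single
element to look at, which is the O(1) case mentioned above.  A heavily
fragmented set grows this list by one entry per fragment, which is
where a bit-vector layout like iset's can be more compact.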

> It looks like I should be
> using srfi-14/utf8+iset instead.  Do those work only on the character level,
> am I missing a string version of those?

SRFI-14 is for dealing with characters.

> I see char-set-contains? for
> which I can determine whether a character is in the class, but I
> usually want to compare several characters in a row, as in I want to
> match the input until something isn't in the character class.

Then irregex might actually be the best way to go about it since that
can compile matchers for charset overlaps in alternatives in a smart way.
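
As an illustration of "match until something isn't in the class",
here is a small sketch using Python's re module as a stand-in; it is
not the irregex API, and the [a-z0-9] class and the name leading_run
are just assumptions for the example.  The point is that a compiled
"zero or more of this class" matcher consumes the whole run in one
call instead of testing one character at a time from user code.

```python
import re

# Assumed example class: lowercase ASCII letters and digits.
RUN = re.compile(r"[a-z0-9]*")

def leading_run(text, start=0):
    """Return the longest run of class members beginning at start.

    The pattern uses *, so it always matches (possibly the empty
    string) and match() never returns None here.
    """
    return RUN.match(text, start).group(0)
```

Calling leading_run("abc123 def") stops at the space, and passing a
start offset lets you resume scanning further into the same string.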

Cheers,
Peter
-- 
http://sjamaan.ath.cx
--
"The process of preparing programs for a digital computer
 is especially attractive, not only because it can be economically
 and scientifically rewarding, but also because it can be an aesthetic
 experience much like composing poetry or music."
                                                        -- Donald Knuth

