RE: on parrot strings

2002-01-18 Thread Brent Dax
Jarkko Hietaniemi: from attachment About the implementation of character classes: since the Unicode code point range is big, a single big bitmap won't work any more: firstly, it would be big. Secondly, for most cases, it would be wastefully sparse. A balanced binary tree of (begin, end) points

Re: on parrot strings

2002-01-18 Thread Bryan C. Warnock
Thanks, Jarrko. On Thursday 17 January 2002 23:21, Jarkko Hietaniemi wrote: The most important message is that give up on 8-bit bytes, already. Time to move on, chop chop. Do you think/feel/wish/demand that the textual (string) APIs should differ from the binary (byte) APIs? (Both from an

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
On Fri, Jan 18, 2002 at 04:51:07AM -0500, Bryan C. Warnock wrote: Thanks, Jarrko. On Thursday 17 January 2002 23:21, Jarkko Hietaniemi wrote: The most important message is that give up on 8-bit bytes, already. Time to move on, chop chop. Do you think/feel/wish/demand that the textual

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
Since I seem to be the main regex hacker for Parrot, I'll respond to this as best I can. Currently, we are using bitmaps for character classes. Well, sort of. A Bitmap in Parrot is defined like this: typedef struct bitmap_t { char* bmp;

RE: on parrot strings

2002-01-18 Thread Hong Zhang
(1) There are 5.125 bytes in Unicode, not four. (2) I think the above would suffer from the same problem as one common suggestion, two-level bitmaps (though I think the above would suffer less, being of finer granularity): the problem is that a lot of space is wasted, since the

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
I don't think UTF-32 will save you much. The unicode case map is variable length, combining character, canonical equivalence, and many other thing will require variable length mapping. For example, if I only want to This is true. parse /[0-9]+/, why you want to convert everything to UTF-32.

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
On Fri, Jan 18, 2002 at 11:44:00AM -0800, Hong Zhang wrote: (1) There are 5.125 bytes in Unicode, not four. (2) I think the above would suffer from the same problem as one common suggestion, two-level bitmaps (though I think the above would suffer less, being of finer

RE: on parrot strings

2002-01-18 Thread Hong Zhang
preprocessing. Another example, if I want to search for /resume/e, (equivalent matching), the regex engine can normalize the case, fully decompose input string, strip off any combining character, and do 8-bit Hmmm. The above sounds complicated not quite what I had in mind for

RE: on parrot strings

2002-01-18 Thread Hong Zhang
My proposal is we should use mix method. The Unicode standard class, such as \p{IsLu}, can be handled by a standard splitbin table. Please see Java java.lang.Character or Python unicodedata_db.h. I did measurement on it, to handle all unicode category, simple casing, and decimal digit

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
On Fri, Jan 18, 2002 at 12:20:53PM -0800, Hong Zhang wrote: My proposal is we should use mix method. The Unicode standard class, such as \p{IsLu}, can be handled by a standard splitbin table. Please see Java java.lang.Character or Python unicodedata_db.h. I did measurement on it, to

Re: on parrot strings

2002-01-18 Thread Steve Fink
On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote: ints, or 176 bytes. Searching for membership in an inversion list is O(N log N) (binary search). Encoding the whole range is a non-issue bordering on a joke: two ints, or 8 bytes. [Clarification from a noncombatant] You meant

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
On Fri, Jan 18, 2002 at 01:40:26PM -0800, Steve Fink wrote: On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote: ints, or 176 bytes. Searching for membership in an inversion list is O(N log N) (binary search). Encoding the whole range is a non-issue bordering on a joke: two

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
On Fri, Jan 18, 2002 at 01:40:26PM -0800, Steve Fink wrote: On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote: ints, or 176 bytes. Searching for membership in an inversion list is O(N log N) (binary search). Encoding the whole range is a non-issue bordering on a joke: two

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
On Fri, Jan 18, 2002 at 02:22:49PM -0800, Steve Fink wrote: On Sat, Jan 19, 2002 at 12:11:06AM +0200, Jarkko Hietaniemi wrote: Complement of an inversion list is neat: insert 0 at the beginning (and append max+1), unless there already is one, in which case delete the 0 (and shift the list

Re: on parrot strings

2002-01-18 Thread Steve Fink
On Sat, Jan 19, 2002 at 12:28:15AM +0200, Jarkko Hietaniemi wrote: On Fri, Jan 18, 2002 at 02:22:49PM -0800, Steve Fink wrote: On Sat, Jan 19, 2002 at 12:11:06AM +0200, Jarkko Hietaniemi wrote: Complement of an inversion list is neat: insert 0 at the beginning (and append max+1), unless

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
We *do* want to have (with some notation) [[:digit:]\p{FunkyLooking}aeiou except 7], right? Of course. But that is all resolvable in regex compile time. No expression tree needed. My point was that if inversion lists are insufficient for describing all the character classes we

Benchmarking regexps against perl5

2002-01-18 Thread Nicholas Clark
A thought occurred to me a few days ago: If I remember correctly, attempts to benchmark parrot's developing regular expressions against perl's regular expressions are proving disappointing. However, perl5 has the advantage of a regular expression optimiser as I understand it, or at least code to

Re: on parrot strings

2002-01-18 Thread Nicholas Clark
On Fri, Jan 18, 2002 at 05:24:00PM +0200, Jarkko Hietaniemi wrote: As for character encodings, we're forcing everything to UTF-32 in regular expressions. No exceptions. If you use a string in a regex, it'll be transcoded. I honestly can't think of a better way to guarantee efficient

Parrot strings

2002-01-18 Thread Melvin Smith
Anyone have any objection to adding a couple of calls to terminate and/or return null terminated strings from Parrot strings for places where an API expects a standard C string? I'm not sure of the preferred way to handle this. It would be nice to at least try to terminate the current string

Re: on parrot strings

2002-01-18 Thread Piers Cawley
Hong Zhang [EMAIL PROTECTED] writes: preprocessing. Another example, if I want to search for /resume/e, (equivalent matching), the regex engine can normalize the case, fully decompose input string, strip off any combining character, and do 8-bit Hmmm. The above sounds complicated not