Jarkko Hietaniemi: from attachment
About the implementation of character classes: since the Unicode code
point range is big, a single big bitmap won't work any more: firstly,
it would be big. Secondly, for most cases, it would be wastefully
sparse. A balanced binary tree of (begin, end) points
Thanks, Jarrko.
On Thursday 17 January 2002 23:21, Jarkko Hietaniemi wrote:
The most important message is that give up on 8-bit bytes, already.
Time to move on, chop chop.
Do you think/feel/wish/demand that the textual (string) APIs should differ
from the binary (byte) APIs? (Both from an
On Fri, Jan 18, 2002 at 04:51:07AM -0500, Bryan C. Warnock wrote:
Thanks, Jarrko.
On Thursday 17 January 2002 23:21, Jarkko Hietaniemi wrote:
The most important message is that give up on 8-bit bytes, already.
Time to move on, chop chop.
Do you think/feel/wish/demand that the textual
Since I seem to be the main regex hacker for Parrot, I'll respond to
this as best I can.
Currently, we are using bitmaps for character classes. Well, sort of.
A Bitmap in Parrot is defined like this:
typedef struct bitmap_t {
char* bmp;
(1) There are 5.125 bytes in Unicode, not four.
(2) I think the above would suffer from the same problem as one common
suggestion, two-level bitmaps (though I think the above would suffer
less, being of finer granularity): the problem is that a lot of
space is wasted, since the
I don't think UTF-32 will save you much. The unicode case map is variable
length, combining character, canonical equivalence, and many other thing
will require variable length mapping. For example, if I only want to
This is true.
parse /[0-9]+/, why you want to convert everything to UTF-32.
On Fri, Jan 18, 2002 at 11:44:00AM -0800, Hong Zhang wrote:
(1) There are 5.125 bytes in Unicode, not four.
(2) I think the above would suffer from the same problem as one common
suggestion, two-level bitmaps (though I think the above would suffer
less, being of finer
preprocessing. Another example, if I want to search for /resume/e,
(equivalent matching), the regex engine can normalize the case, fully
decompose input string, strip off any combining character, and do 8-bit
Hmmm. The above sounds complicated not quite what I had in mind
for
My proposal is we should use mix method. The Unicode standard class,
such as \p{IsLu}, can be handled by a standard splitbin table. Please
see Java java.lang.Character or Python unicodedata_db.h. I did
measurement on it, to handle all unicode category, simple casing,
and decimal digit
On Fri, Jan 18, 2002 at 12:20:53PM -0800, Hong Zhang wrote:
My proposal is we should use mix method. The Unicode standard class,
such as \p{IsLu}, can be handled by a standard splitbin table. Please
see Java java.lang.Character or Python unicodedata_db.h. I did
measurement on it, to
On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote:
ints, or 176 bytes. Searching for membership in an inversion list is
O(N log N) (binary search). Encoding the whole range is a non-issue
bordering on a joke: two ints, or 8 bytes.
[Clarification from a noncombatant] You meant
On Fri, Jan 18, 2002 at 01:40:26PM -0800, Steve Fink wrote:
On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote:
ints, or 176 bytes. Searching for membership in an inversion list is
O(N log N) (binary search). Encoding the whole range is a non-issue
bordering on a joke: two
On Fri, Jan 18, 2002 at 01:40:26PM -0800, Steve Fink wrote:
On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote:
ints, or 176 bytes. Searching for membership in an inversion list is
O(N log N) (binary search). Encoding the whole range is a non-issue
bordering on a joke: two
On Fri, Jan 18, 2002 at 02:22:49PM -0800, Steve Fink wrote:
On Sat, Jan 19, 2002 at 12:11:06AM +0200, Jarkko Hietaniemi wrote:
Complement of an inversion list is neat: insert 0 at the beginning
(and append max+1), unless there already is one, in which case delete
the 0 (and shift the list
On Sat, Jan 19, 2002 at 12:28:15AM +0200, Jarkko Hietaniemi wrote:
On Fri, Jan 18, 2002 at 02:22:49PM -0800, Steve Fink wrote:
On Sat, Jan 19, 2002 at 12:11:06AM +0200, Jarkko Hietaniemi wrote:
Complement of an inversion list is neat: insert 0 at the beginning
(and append max+1), unless
We *do* want to have (with some notation)
[[:digit:]\p{FunkyLooking}aeiou except 7], right?
Of course. But that is all resolvable in regex compile time.
No expression tree needed.
My point was that if inversion lists are insufficient for describing
all the character classes we
A thought occurred to me a few days ago:
If I remember correctly, attempts to benchmark parrot's developing regular
expressions against perl's regular expressions are proving disappointing.
However, perl5 has the advantage of a regular expression optimiser as I
understand it, or at least code to
On Fri, Jan 18, 2002 at 05:24:00PM +0200, Jarkko Hietaniemi wrote:
As for character encodings, we're forcing everything to UTF-32 in
regular expressions. No exceptions. If you use a string in a regex,
it'll be transcoded. I honestly can't think of a better way to
guarantee efficient
Anyone have any objection to adding a couple of calls to terminate
and/or return null terminated strings from Parrot strings for places
where an API expects a standard C string?
I'm not sure of the preferred way to handle this. It would be nice to
at least try to terminate the current string
Hong Zhang [EMAIL PROTECTED] writes:
preprocessing. Another example, if I want to search for /resume/e,
(equivalent matching), the regex engine can normalize the case, fully
decompose input string, strip off any combining character, and do 8-bit
Hmmm. The above sounds complicated not
20 matches
Mail list logo