Re: on parrot strings

2002-01-23 Thread Kai Henningsen
[EMAIL PROTECTED] (Russ Allbery) wrote on 22.01.02 in [EMAIL PROTECTED]: Kai Henningsen [EMAIL PROTECTED] writes: A case that (in a slightly different context) recently came up on alt.usage.german (I don't remember if this particular point was made, but it belongs): berliner -

Re: on parrot strings

2002-01-23 Thread Simon Cozens
On Wed, Jan 23, 2002 at 06:06:00PM +0200, Kai Henningsen wrote: People do get confused sometimes. I'm confused as to why this is still on p6i. Followups to alt.usage.german? Thanks. -- DESPAIR: It's Always Darkest Just Before it Gets Pitch Black

Re: on parrot strings

2002-01-21 Thread Dave Mitchell
Jarkko Hietaniemi [EMAIL PROTECTED] wrote: There is no string type built out of native eight-bit bytes. In the good ol'days, one could usefully use regexes on 8-bit binary data, eg open G, 'myfile.gif' or die; read G, $buf, 8192 or die; if ($buf =~ /^GIF89a\x08\x02/) { . where it was

Re: on parrot strings

2002-01-21 Thread Jarkko Hietaniemi
On Mon, Jan 21, 2002 at 04:37:46PM +, Dave Mitchell wrote: Jarkko Hietaniemi [EMAIL PROTECTED] wrote: There is no string type built out of native eight-bit bytes. In the good ol'days, one could usefully use regexes on 8-bit binary data, eg open G, 'myfile.gif' or die; read G, $buf,

Re: on parrot strings

2002-01-21 Thread Dave Mitchell
Jarkko Hietaniemi [EMAIL PROTECTED] wrote: In the good ol'days, one could usefully use regexes on 8-bit binary data, eg open G, 'myfile.gif' or die; read G, $buf, 8192 or die; if ($buf =~ /^GIF89a\x08\x02/) { . where it was clear to everyone that we are checking
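
[Editorial illustration, not part of the thread.] The Perl fragment quoted above runs a regex over raw bytes read from a file. A rough Python equivalent, operating on a bytes object rather than a character string, might look like the sketch below. The helper name `looks_like_gif89a` and the sample buffers are made up for illustration; the two bytes after the signature are simply those in the quoted pattern.

```python
import re

# Bytes pattern mirroring the quoted Perl regex /^GIF89a\x08\x02/.
# The rb"" prefix makes this a pattern over bytes, not over characters.
GIF_HEADER = re.compile(rb"^GIF89a\x08\x02")


def looks_like_gif89a(buf: bytes) -> bool:
    """Return True if buf starts with the byte sequence from the quoted regex."""
    return GIF_HEADER.match(buf) is not None


if __name__ == "__main__":
    print(looks_like_gif89a(b"GIF89a\x08\x02 trailing data"))
    print(looks_like_gif89a(b"GIF87a\x08\x02 trailing data"))
```

The point of the sketch is only that the pattern and the buffer are both byte sequences, so no character encoding enters the comparison.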

Re: on parrot strings

2002-01-21 Thread Jarkko Hietaniemi
On Mon, Jan 21, 2002 at 05:09:06PM +, Dave Mitchell wrote: Jarkko Hietaniemi [EMAIL PROTECTED] wrote: In the good ol'days, one could usefully use regexes on 8-bit binary data, eg open G, 'myfile.gif' or die; read G, $buf, 8192 or die; if ($buf =~ /^GIF89a\x08\x02/) {

RE: on parrot strings

2002-01-21 Thread Hong Zhang
But e` and e are different letters man. And re`sume` and resume are different words come to that. If the user wants something that'll match 'em both then the pattern should surely be: /r[ee`]sum[ee`]/ I disagree. The difference between 'e' and 'e`' is similar to 'c' and 'C'. The Unicode

RE: on parrot strings

2002-01-21 Thread Hong Zhang
Yes, that's somewhat problematic. Making up a byte CEF would be Wrong, though, because there is, by definition, no CCS to map, and we would be dangerously close to conflating in CES, too... ACR-CCS-CEF-CES. Read the character model. Understand the character model. Embrace the character

RE: on parrot strings

2002-01-21 Thread Garrett Goebel
From: Hong Zhang [mailto:[EMAIL PROTECTED]] But e` and e are different letters man. And re`sume` and resume are different words come to that. If the user wants something that'll match 'em both then the pattern should surely be: /r[ee`]sum[ee`]/ I disagree. The difference

RE: on parrot strings

2002-01-21 Thread Hong Zhang
But e` and e are different letters man. And re`sume` and resume are different words come to that. If the user wants something that'll match 'em both then the pattern should surely be: /r[ee`]sum[ee`]/ I disagree. The difference between 'e' and 'e`' is similar to 'c' and

Re: on parrot strings

2002-01-21 Thread Russ Allbery
Hong Zhang [EMAIL PROTECTED] writes: I disagree. The difference between 'e' and 'e`' is similar to 'c' and 'C'. No, it's not. In many languages, an accented character is a completely different letter. It's alphabetized separately, it's pronounced differently, and there are many words that

Re: on parrot strings

2002-01-21 Thread Bryan C. Warnock
On Monday 21 January 2002 16:43, Russ Allbery wrote: Changing the capitalization of C does not change the word. Er, most of the time. -- Bryan C. Warnock [EMAIL PROTECTED]

RE: on parrot strings

2002-01-21 Thread Stephen Howard
: Monday, January 21, 2002 04:10 PM Cc: [EMAIL PROTECTED] Subject: RE: on parrot strings But e` and e are different letters man. And re`sume` and resume are different words come to that. If the user wants something that'll match 'em both then the pattern should surely be: /r[ee`]sum[ee

Re: on parrot strings

2002-01-21 Thread Russ Allbery
Bryan C Warnock [EMAIL PROTECTED] writes: On Monday 21 January 2002 16:43, Russ Allbery wrote: Changing the capitalization of C does not change the word. Er, most of the time. No, pretty much all of the time. There are differences between proper nouns and common nouns, but those are

Re: on parrot strings

2002-01-21 Thread Bryan C. Warnock
On Monday 21 January 2002 17:11, Russ Allbery wrote: No, pretty much all of the time. There are differences between proper nouns and common nouns, but those are differences routinely quashed as a typesetting decision; if you write both proper nouns and common nouns in all caps as part of a

Re: on parrot strings

2002-01-19 Thread Jarkko Hietaniemi
Honour where honour is due: I've got some questions about inversion lists. Where I saw them mentioned by that name were some drafts of this: http://www.aw.com/catalog/academic/product/1,4096,0201700522,00.html The book looks really promising-- unfortunately it's not yet published. -- $jhi++;

Re: on parrot strings

2002-01-19 Thread Graham Barr
I believe IBM use inversion lists in their ICU library for sets of Unicode characters. Graham. On Sat, Jan 19, 2002 at 07:08:25PM +0200, Jarkko Hietaniemi wrote: Honour where honour is due: I've got some questions about inversion lists. Where I saw them mentioned by that name were some drafts

Re: on parrot strings

2002-01-19 Thread Simon Cozens
On Sat, Jan 19, 2002 at 07:08:25PM +0200, Jarkko Hietaniemi wrote: http://www.aw.com/catalog/academic/product/1,4096,0201700522,00.html The book looks really promising-- unfortunately it's not yet published. Isn't this, uhm, http://www.concentric.net/~rtgillam/pubs/unibook/index.html ? --

RE: on parrot strings

2002-01-18 Thread Brent Dax
Jarkko Hietaniemi: from attachment About the implementation of character classes: since the Unicode code point range is big, a single big bitmap won't work any more: firstly, it would be big. Secondly, for most cases, it would be wastefully sparse. A balanced binary tree of (begin, end) points

Re: on parrot strings

2002-01-18 Thread Bryan C. Warnock
Thanks, Jarkko. On Thursday 17 January 2002 23:21, Jarkko Hietaniemi wrote: The most important message is that give up on 8-bit bytes, already. Time to move on, chop chop. Do you think/feel/wish/demand that the textual (string) APIs should differ from the binary (byte) APIs? (Both from an

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
On Fri, Jan 18, 2002 at 04:51:07AM -0500, Bryan C. Warnock wrote: Thanks, Jarkko. On Thursday 17 January 2002 23:21, Jarkko Hietaniemi wrote: The most important message is that give up on 8-bit bytes, already. Time to move on, chop chop. Do you think/feel/wish/demand that the textual

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
Since I seem to be the main regex hacker for Parrot, I'll respond to this as best I can. Currently, we are using bitmaps for character classes. Well, sort of. A Bitmap in Parrot is defined like this: typedef struct bitmap_t { char* bmp;

RE: on parrot strings

2002-01-18 Thread Hong Zhang
(1) There are 5.125 bytes in Unicode, not four. (2) I think the above would suffer from the same problem as one common suggestion, two-level bitmaps (though I think the above would suffer less, being of finer granularity): the problem is that a lot of space is wasted, since the

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
I don't think UTF-32 will save you much. The unicode case map is variable length, combining character, canonical equivalence, and many other thing will require variable length mapping. For example, if I only want to This is true. parse /[0-9]+/, why you want to convert everything to UTF-32.

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
On Fri, Jan 18, 2002 at 11:44:00AM -0800, Hong Zhang wrote: (1) There are 5.125 bytes in Unicode, not four. (2) I think the above would suffer from the same problem as one common suggestion, two-level bitmaps (though I think the above would suffer less, being of finer

RE: on parrot strings

2002-01-18 Thread Hong Zhang
preprocessing. Another example, if I want to search for /resume/e, (equivalent matching), the regex engine can normalize the case, fully decompose input string, strip off any combining character, and do 8-bit Hmmm. The above sounds complicated not quite what I had in mind for
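
[Editorial illustration, not part of the thread.] The preprocessing described above (normalize case, fully decompose, strip combining characters) can be sketched with Python's standard `unicodedata` module. This is a minimal sketch of that idea only, not the regex engine design discussed in the thread; the function name `fold_for_matching` is made up, and using NFD plus `str.casefold` is an assumption about what "fully decompose" and "normalize the case" would mean in practice.

```python
import unicodedata


def fold_for_matching(text: str) -> str:
    """Case-fold, canonically decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    # unicodedata.combining() returns 0 for non-combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # Both the precomposed and the decomposed spelling fold to the same string.
    print(fold_for_matching("R\u00e9sum\u00e9"))
    print(fold_for_matching("Re\u0301sume\u0301"))
```

After folding, a plain search for "resume" finds both the accented and the unaccented spellings. Whether that is desirable is exactly what the thread argues about.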

RE: on parrot strings

2002-01-18 Thread Hong Zhang
My proposal is we should use mix method. The Unicode standard class, such as \p{IsLu}, can be handled by a standard splitbin table. Please see Java java.lang.Character or Python unicodedata_db.h. I did measurement on it, to handle all unicode category, simple casing, and decimal digit
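
[Editorial illustration, not part of the thread.] A split-bin table of the kind mentioned above can be pictured as a two-stage lookup: the high bits of the code point select a block, and identical blocks are stored only once. The sketch below builds such a table for general categories from Python's `unicodedata` module. The block size of 256 and the names `build_two_stage` and `two_stage_get` are assumptions made for illustration; they are not the layout used by Java or by Python's own `unicodedata_db.h`.

```python
import unicodedata

SHIFT = 8                # assumed block size: 2**8 code points per block
BLOCK = 1 << SHIFT


def build_two_stage(lookup, limit=0x110000):
    """Build (index1, blocks) so that identical blocks are stored only once."""
    index1 = []          # for each run of BLOCK code points: which block to use
    blocks = []          # the distinct blocks, each a tuple of BLOCK values
    seen = {}            # block contents -> position in blocks
    for start in range(0, limit, BLOCK):
        block = tuple(lookup(chr(cp)) for cp in range(start, start + BLOCK))
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index1.append(seen[block])
    return index1, blocks


def two_stage_get(index1, blocks, cp):
    """Look up the value for code point cp with two array accesses."""
    return blocks[index1[cp >> SHIFT]][cp & (BLOCK - 1)]


if __name__ == "__main__":
    index1, blocks = build_two_stage(unicodedata.category)
    print(two_stage_get(index1, blocks, ord("A")))
    print(len(index1), len(blocks))
```

The saving comes from the many blocks that are uniform (for example, unassigned ranges), which all collapse onto a single stored block.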

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
On Fri, Jan 18, 2002 at 12:20:53PM -0800, Hong Zhang wrote: My proposal is we should use mix method. The Unicode standard class, such as \p{IsLu}, can be handled by a standard splitbin table. Please see Java java.lang.Character or Python unicodedata_db.h. I did measurement on it, to

Re: on parrot strings

2002-01-18 Thread Steve Fink
On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote: ints, or 176 bytes. Searching for membership in an inversion list is O(N log N) (binary search). Encoding the whole range is a non-issue bordering on a joke: two ints, or 8 bytes. [Clarification from a noncombatant] You meant
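
[Editorial illustration, not part of the thread.] For readers unfamiliar with inversion lists: a set of code points is stored as a sorted list of boundaries where membership flips on and off. The sketch below assumes half-open ranges `[start, end)` and an explicit `max+1` boundary closing the last range; the name `in_inversion_list` is made up. A binary search over N boundaries takes on the order of log N comparisons.

```python
from bisect import bisect_right

MAX_PLUS_ONE = 0x110000  # one past the last Unicode code point


def in_inversion_list(inv, code_point):
    """inv = [start0, end0, start1, end1, ...], sorted, ranges half-open.

    The number of boundaries at or below code_point is odd exactly when
    code_point lies inside one of the ranges.
    """
    return bisect_right(inv, code_point) % 2 == 1


if __name__ == "__main__":
    digits = [0x30, 0x3A]               # '0'..'9'
    everything = [0, MAX_PLUS_ONE]      # the whole range: two integers
    print(in_inversion_list(digits, ord("7")))
    print(in_inversion_list(digits, ord("a")))
    print(in_inversion_list(everything, 0x10FFFF))
```

As the quoted message says, the whole code point range costs just two integers in this representation.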

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
On Fri, Jan 18, 2002 at 01:40:26PM -0800, Steve Fink wrote: On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote: ints, or 176 bytes. Searching for membership in an inversion list is O(N log N) (binary search). Encoding the whole range is a non-issue bordering on a joke: two

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
On Fri, Jan 18, 2002 at 01:40:26PM -0800, Steve Fink wrote: On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote: ints, or 176 bytes. Searching for membership in an inversion list is O(N log N) (binary search). Encoding the whole range is a non-issue bordering on a joke: two

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
On Fri, Jan 18, 2002 at 02:22:49PM -0800, Steve Fink wrote: On Sat, Jan 19, 2002 at 12:11:06AM +0200, Jarkko Hietaniemi wrote: Complement of an inversion list is neat: insert 0 at the beginning (and append max+1), unless there already is one, in which case delete the 0 (and shift the list
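
[Editorial illustration, not part of the thread.] The complement trick quoted above can be sketched as follows, assuming the list holds half-open range boundaries and uses `max+1` as an explicit closing boundary. The quoted description is cut off in the archive, so the handling of the tail here is this sketch's own convention, and the name `complement` is made up.

```python
MAX_PLUS_ONE = 0x110000  # one past the last Unicode code point


def complement(inv):
    """Return the inversion list of every code point NOT covered by inv."""
    out = list(inv)
    # Toggle the boundary at 0: remove it if present, otherwise add it.
    if out and out[0] == 0:
        out.pop(0)
    else:
        out.insert(0, 0)
    # Toggle the boundary at max+1 in the same way.
    if out and out[-1] == MAX_PLUS_ONE:
        out.pop()
    else:
        out.append(MAX_PLUS_ONE)
    return out


if __name__ == "__main__":
    digits = [0x30, 0x3A]
    not_digits = complement(digits)
    print(not_digits)
    print(complement(not_digits) == digits)
```

No range has to be recomputed: only the two ends of the list change, and complementing twice gives back the original list.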

Re: on parrot strings

2002-01-18 Thread Steve Fink
On Sat, Jan 19, 2002 at 12:28:15AM +0200, Jarkko Hietaniemi wrote: On Fri, Jan 18, 2002 at 02:22:49PM -0800, Steve Fink wrote: On Sat, Jan 19, 2002 at 12:11:06AM +0200, Jarkko Hietaniemi wrote: Complement of an inversion list is neat: insert 0 at the beginning (and append max+1), unless

Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi
We *do* want to have (with some notation) [[:digit:]\p{FunkyLooking}aeiou except 7], right? Of course. But that is all resolvable in regex compile time. No expression tree needed. My point was that if inversion lists are insufficient for describing all the character classes we

Re: on parrot strings

2002-01-18 Thread Nicholas Clark
On Fri, Jan 18, 2002 at 05:24:00PM +0200, Jarkko Hietaniemi wrote: As for character encodings, we're forcing everything to UTF-32 in regular expressions. No exceptions. If you use a string in a regex, it'll be transcoded. I honestly can't think of a better way to guarantee efficient

Re: on parrot strings

2002-01-18 Thread Piers Cawley
Hong Zhang [EMAIL PROTECTED] writes: preprocessing. Another example, if I want to search for /resume/e, (equivalent matching), the regex engine can normalize the case, fully decompose input string, strip off any combining character, and do 8-bit Hmmm. The above sounds complicated not