Re: on parrot strings

2002-01-23 Thread Simon Cozens

On Wed, Jan 23, 2002 at 06:06:00PM +0200, Kai Henningsen wrote:
> People do get confused sometimes. 

I'm confused as to why this is still on p6i. Followups to alt.usage.german?
Thanks.

-- 
DESPAIR:
It's Always Darkest Just Before it Gets Pitch Black

http://www.despair.com



Re: on parrot strings

2002-01-23 Thread Kai Henningsen

[EMAIL PROTECTED] (Russ Allbery)  wrote on 22.01.02 in 
<[EMAIL PROTECTED]>:

> Kai Henningsen <[EMAIL PROTECTED]> writes:
>
> > A case that (in a slightly different context) recently came up on
> > alt.usage.german (I don't remember if this particular point was made,
> > but it belongs):
>
> > "berliner" - attribute, belonging to, coming from, etc. Berlin.
> > "Berliner" - noun, a citizen of that city, or a jelly donut.
>
> > Two different words that only differ in capitalization.
>
> In German, do you ever write in all caps?  If so, how does it change
> things like this?

Yes (though it's usually a bad idea), and you need context to  
disambiguate, obviously. (That's where the rule about optionally  
capitalizing the sharp s as SZ instead of SS fits - how else to  
distinguish "IN MASSEN" (great amounts) from "IN MASZEN" (in moderation)?)

People do get confused sometimes. (Similarly when the capitalization at  
the start of a sentence changes one of those words.) If I had to design  
the language from scratch, that's one feature I'd try to avoid.

> You're right, I should have thought of German, where capitalization is
> used to indicate part of speech.

MfG Kai



Re: on parrot strings

2002-01-21 Thread Bryan C. Warnock

On Monday 21 January 2002 17:11, Russ Allbery wrote:
> No, pretty much all of the time.  There are differences between proper
> nouns and common nouns, but those are differences routinely quashed as a
> typesetting decision; if you write both proper nouns and common nouns in
> all caps as part of a headline, the lack of distinction is not considered
> a misspelling.  Similarly, if you capitalize the common noun because it
> occurs at the beginning of the sentence, that doesn't transform its
> meaning.

That doesn't change the fact that they are different words.  Sure, English 
is forgiving, as it's filled with heteronyms and homographs.  But it's all 
moot because regexes are character-oriented, not word-oriented.  

Given that they're character-oriented, we only need to provide character 
transformations between upper, lower, and title case.  But is that the 
dividing line?

>
> Whereas adding or removing an accent is always considered a misspelling,
> at least in some languages.  It's like adding or removing random letters
> from the word.

No, it's substituting letters in a word.  It's adding or removing random 
characters from the string representation of the word.

>
> re'sume' and resume are two different words.  It so happens that in
> English re'sume' is a variant spelling for one meaning of resume.  I don't
> believe that regexes should try to automatically pick up variant
> spellings.  Should the regex /aerie/ match /eyrie/?  That makes as much
> sense as a search for /resume/ matching /re'sume'/.

Variant spellings imply word-oriented searches.  We're talking about 
character-oriented transformations, and the question is whether or not 
there's enough justification - which I feel won't come from grammatical 
rationales, but from the 7-bit ASCII storage of words with accents - to 
provide a transformation from a base letter with accents to just the base 
letter.  
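
A minimal sketch of such a transformation, assuming ISO-8859-1 input and a
deliberately incomplete hand-rolled table (strip_accent is an illustrative
name, not an existing Parrot or Perl API):

    /* Toy sketch: fold a few ISO-8859-1 accented letters to their
     * base letter.  Illustrative and incomplete; a real version
     * would be driven by Unicode decomposition data. */
    static unsigned char
    strip_accent(unsigned char c)
    {
        switch (c) {
        case 0xE0: case 0xE1: case 0xE2: case 0xE4: return 'a';
        case 0xE8: case 0xE9: case 0xEA: case 0xEB: return 'e';
        case 0xF2: case 0xF3: case 0xF4: case 0xF6: return 'o';
        default:   return c;    /* everything else passes through */
        }
    }

Applied byte by byte, "r\xE9sum\xE9" folds to "resume".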

Do you feel that altering accented letters to better represent them within 
the facilities provided isn't done, or is wrong?  I'm not sure what 
you're typing as your example word, and whether or not it's getting munged 
in the meantime, but "résumé"  (r, e accent, s, u, m, e accent) is coming 
across "re'sume'" (r, e, apostrophe, s, u, m, e, apostrophe).  (The incoming 
message was encoded ISO-8859-1, so presumably it should have preserved 
character 233, which is what I'm sending out.)

This isn't a ridiculous question.  Personally, I don't think that we should. 
The facilities are quickly coming into place to be able to do proper 
character encodings, and I think that we should lead from the front and 
encourage folks to be proper - not only in their searches, but in their text 
production. 


-- 
Bryan C. Warnock
[EMAIL PROTECTED]



Re: on parrot strings

2002-01-21 Thread Russ Allbery

Bryan C Warnock <[EMAIL PROTECTED]> writes:
> On Monday 21 January 2002 16:43, Russ Allbery wrote:

>> Changing the capitalization of C does not change the word. 

> Er, most of the time. 

No, pretty much all of the time.  There are differences between proper
nouns and common nouns, but those are differences routinely quashed as a
typesetting decision; if you write both proper nouns and common nouns in
all caps as part of a headline, the lack of distinction is not considered
a misspelling.  Similarly, if you capitalize the common noun because it
occurs at the beginning of the sentence, that doesn't transform its
meaning.

Whereas adding or removing an accent is always considered a misspelling,
at least in some languages.  It's like adding or removing random letters
from the word.

re'sume' and resume are two different words.  It so happens that in
English re'sume' is a variant spelling for one meaning of resume.  I don't
believe that regexes should try to automatically pick up variant
spellings.  Should the regex /aerie/ match /eyrie/?  That makes as much
sense as a search for /resume/ matching /re'sume'/.

-- 
Russ Allbery ([EMAIL PROTECTED]) 



RE: on parrot strings

2002-01-21 Thread Stephen Howard

Not to get modifier-happy, but it seems like a user-oriented solution would be to let 
the user specify a modifier:

 "caseinsensitive" =~ m/CaseInsensitive/i

 "resume" =~ m/re`sume`/d (diacritic modifier?)

-Stephen

-Original Message-
From: Hong Zhang [mailto:[EMAIL PROTECTED]]
Sent: Monday, January 21, 2002 04:10 PM
Cc: [EMAIL PROTECTED]
Subject: RE: on parrot strings


> > But e` and e are different letters man. And re`sume` and resume are
> > different words come to that. If the user wants something that'll
> > match 'em both then the pattern should surely be:
> >
> >/r[ee`]sum[ee`]/
>
> I disagree. The difference between 'e' and 'e`' is similar to 'c'
> and 'C'. The Unicode compatibility equivalence has a similar effect
> too, such as "half width letter" and "full width letter".

German to English
 schon => already
 schön => nice

2 totally different words.

I am talking about similar words, whereas you are talking about different
words. I don't mind if someone can search across languages. Some Chinese
search engines can do Chinese searches using English keywords (for people
who have a Chinese viewer but no Chinese input method). Of course, no one
expects the regex engine to do that.

"re`sume`" does appear in English sentences. The "[half|full] width
letters" are in the same language.

Hong




Re: on parrot strings

2002-01-21 Thread Bryan C. Warnock

On Monday 21 January 2002 16:43, Russ Allbery wrote:
> Changing the capitalization of C does not change the word. 

Er, most of the time. 

-- 
Bryan C. Warnock
[EMAIL PROTECTED]



Re: on parrot strings

2002-01-21 Thread Russ Allbery

Hong Zhang <[EMAIL PROTECTED]> writes:

> I disagree. The difference between 'e' and 'e`' is similar to 'c'
> and 'C'.

No, it's not.

In many languages, an accented character is a completely different letter.
It's alphabetized separately, it's pronounced differently, and there are
many words that differ only in the presence of an accent.

Changing the capitalization of C does not change the word.  Adding or
removing an accent does.

> The Unicode compatibility equivalence has a similar effect too, such as
> "half width letter" and "full width letter".

You'll find that the Unicode compatibility equivalence does nothing as
ill-conceived as unifying e and e', for the very good reason that it
would be a horrible mistake.

-- 
Russ Allbery ([EMAIL PROTECTED]) 



RE: on parrot strings

2002-01-21 Thread Hong Zhang

> > But e` and e are different letters man. And re`sume` and resume are 
> > different words come to that. If the user wants something that'll 
> > match 'em both then the pattern should surely be: 
> > 
> >/r[ee`]sum[ee`]/ 
> 
> I disagree. The difference between 'e' and 'e`' is similar to 'c' 
> and 'C'. The Unicode compatibility equivalence has a similar effect
> too, such as "half width letter" and "full width letter". 

German to English 
 schon => already 
 schön => nice 

2 totally different words. 

I am talking about similar words, whereas you are talking about different
words. I don't mind if someone can search across languages. Some Chinese
search engines can do Chinese searches using English keywords (for people
who have a Chinese viewer but no Chinese input method). Of course, no one
expects the regex engine to do that.

"re`sume`" does appear in English sentences. The "[half|full] width
letters" are in the same language.

Hong



RE: on parrot strings

2002-01-21 Thread Garrett Goebel

From: Hong Zhang [mailto:[EMAIL PROTECTED]]
> 
> > But e` and e are different letters man. And re`sume` and resume are
> > different words come to that. If the user wants something that'll
> > match 'em both then the pattern should surely be:
> > 
> >/r[ee`]sum[ee`]/
> 
> I disagree. The difference between 'e' and 'e`' is similar to 'c'
> and 'C'. The Unicode compatibility equivalence has a similar effect
> too, such as "half width letter" and "full width letter".

German to English
 schon => already
 schön => nice

2 totally different words.

I'm sure there are words in some language where the difference between an 'e'
and 'e`' can be the difference between an insult and a compliment.
 



RE: on parrot strings

2002-01-21 Thread Hong Zhang

> Yes, that's somewhat problematic.  Making up "a byte CEF" would be
> Wrong, though, because there is, by definition, no CCS to map, and
> we would be dangerously close to conflating in CES, too...
> ACR-CCS-CEF-CES.  Read the character model.  Understand the character
> model.  Embrace the character model.  Be the character model.  (And
> once you're it, read the relevant Unicode, XML, and Web standards.)
> 
> To highlight the difference between opaque numbers and characters,
> the above should really be:
> 
>   if ($buf =~ /\x47\x49\x46\x38\x39\x61\x08\x02/) { ... }
> 
> I think what needs to be done is that \xHH must not be encoded as
> literals (as it is now, 'A' and \x41 are identical (in ASCII)), but
> instead as regex nodes of their own, storing the code points.  Then
> the regex engine can try both the "right/new way" (the Unicode code
> point), and the "wrong/legacy way" (the native code point).

My suggestion is to add a binary mode, such as //b. When binary mode
is in effect, only ASCII characters (0-127) still carry text properties.
\p{IsLower} will only match ASCII a to z. Code points 128-255 always have
false text properties. All code points must be between 0 and 255; regcomp()
can easily check this upon compilation.

A dedicated binary mode will simplify many issues, and the regex will
be very readable. We can make binary mode exclusive with text mode,
i.e. a regex expression must be either binary or text, but not both.
(I am not sure whether a mixed mode would really be useful.)
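
A sketch of that compile-time check, assuming the pattern's literal code
points have already been collected into an array (binary_mode_ok is a
hypothetical helper, not an existing regcomp() entry point):

    /* Sketch: a //b pattern is valid only if every literal code
     * point fits in a byte. */
    #include <stdint.h>
    #include <stddef.h>

    static int
    binary_mode_ok(const uint32_t *literals, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++)
            if (literals[i] > 0xFF)
                return 0;   /* reject at compile time */
        return 1;
    }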

Hong



RE: on parrot strings

2002-01-21 Thread Hong Zhang

> But e` and e are different letters man. And re`sume` and resume are
> different words come to that. If the user wants something that'll
> match 'em both then the pattern should surely be:
> 
>/r[ee`]sum[ee`]/

I disagree. The difference between 'e' and 'e`' is similar to 'c'
and 'C'. The Unicode compatibility equivalence has a similar effect
too, such as "half width letter" and "full width letter".

It may just be my personal preference, but I don't think it is a
good idea to push this problem onto the user of the regex.

Hong



Re: on parrot strings

2002-01-21 Thread Jarkko Hietaniemi

On Mon, Jan 21, 2002 at 05:09:06PM +, Dave Mitchell wrote:
> Jarkko Hietaniemi <[EMAIL PROTECTED]> wrote:
> > > In the good ol'days, one could usefully use regexes on 8-bit binary data,
> > > eg
> > > 
> > > open G, 'myfile.gif' or die;
> > > read G, $buf, 8192 or die;
> > > if ($buf =~ /^GIF89a\x08\x02/) {
> > > .
> > > 
> > > where it was clear to everyone that we are checking whether the first few
> > > bytes of the file contain (0x47, 0x49, ..., 0x02)
> > > 
> > > Is this sort of thing now completely dead in the Brave New World of
> > 
> > Of course not, I do not remember forbidding \xHH.  The default of
> > data coming in from filehandles could still be opaque 8-bit bytes.
> 
> Good :-)
> 
> I'm not clear though, how binary data could get passed to parrot's
> regex engine, unless there's a BINARY_8 CEF in addition to
> UNICODE_CEF_UTF_8 etc in C

Yes, that's somewhat problematic.  Making up "a byte CEF" would be
Wrong, though, because there is, by definition, no CCS to map, and
we would be dangerously close to conflating in CES, too...
ACR-CCS-CEF-CES.  Read the character model.  Understand the character
model.  Embrace the character model.  Be the character model.  (And
once you're it, read the relevant Unicode, XML, and Web standards.)

To highlight the difference between opaque numbers and characters,
the above should really be:

if ($buf =~ /\x47\x49\x46\x38\x39\x61\x08\x02/) { ... }

I think what needs to be done is that \xHH must not be encoded as
literals (as it is now, 'A' and \x41 are identical (in ASCII)), but
instead as regex nodes of their own, storing the code points.  Then
the regex engine can try both the "right/new way" (the Unicode code
point), and the "wrong/legacy way" (the native code point).
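
A hypothetical sketch of such a node; native_to_unicode stands for an
assumed platform conversion table, and none of these names exist in Parrot:

    /* Sketch: a \xHH node stores its code point rather than a
     * literal byte, so matching can try the Unicode reading first
     * and the native (e.g. EBCDIC) reading as a fallback. */
    #include <stdint.h>

    typedef struct {
        uint32_t codepoint;             /* what \xHH asked for */
    } CodePointNode;

    extern const uint32_t native_to_unicode[256];  /* assumed table */

    static int
    codepoint_node_matches(const CodePointNode *n, uint32_t input_cp)
    {
        if (input_cp == n->codepoint)   /* "right/new way" */
            return 1;
        return n->codepoint < 256       /* "wrong/legacy way" */
            && native_to_unicode[n->codepoint] == input_cp;
    }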

String literals have the same problem.  What does "foo\x41" mean?
(Here, unlike with the regular expressions, we can't "try both",
unless we integrate Damian's quantum state variables into the core :-)
We have various options: there might be a pragma to tell what CCS
"naked codepoints" are to be understood in, or the default could be
grovelled out of environment settings (both these options could affect
the regex solution, too), and so forth.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: on parrot strings

2002-01-21 Thread Dave Mitchell

Jarkko Hietaniemi <[EMAIL PROTECTED]> wrote:
> > In the good ol' days, one could usefully use regexes on 8-bit binary data,
> > eg
> > 
> > open G, 'myfile.gif' or die;
> > read G, $buf, 8192 or die;
> > if ($buf =~ /^GIF89a\x08\x02/) {
> > .
> > 
> > where it was clear to everyone that we are checking whether the first few
> > bytes of the file contain (0x47, 0x49, ..., 0x02)
> > 
> > Is this sort of thing now completely dead in the Brave New World of
> 
> Of course not, I do not remember forbidding \xHH.  The default of
> data coming in from filehandles could still be opaque 8-bit bytes.

Good :-)

I'm not clear though, how binary data could get passed to parrot's
regex engine, unless there's a BINARY_8 CEF in addition to
UNICODE_CEF_UTF_8 etc in C

???




Re: on parrot strings

2002-01-21 Thread Jarkko Hietaniemi

On Mon, Jan 21, 2002 at 04:37:46PM +, Dave Mitchell wrote:
> Jarkko Hietaniemi <[EMAIL PROTECTED]> wrote:
> > There is no string type built out of native eight-bit bytes.
> 
> In the good ol' days, one could usefully use regexes on 8-bit binary data,
> eg
> 
> open G, 'myfile.gif' or die;
> read G, $buf, 8192 or die;
> if ($buf =~ /^GIF89a\x08\x02/) {
> .
> 
> where it was clear to everyone that we are checking whether the first few
> bytes of the file contain (0x47, 0x49, ..., 0x02)
> 
> Is this sort of thing now completely dead in the Brave New World of

Of course not, I do not remember forbidding \xHH.  The default of
data coming in from filehandles could still be opaque 8-bit bytes.

> Unicode, Locales etc etc? (yes, there's always pack, but pack is so... errr
> hmm )

> Dave.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: on parrot strings

2002-01-21 Thread Dave Mitchell

Jarkko Hietaniemi <[EMAIL PROTECTED]> wrote:
> There is no string type built out of native eight-bit bytes.

In the good ol' days, one could usefully use regexes on 8-bit binary data,
eg

open G, 'myfile.gif' or die;
read G, $buf, 8192 or die;
if ($buf =~ /^GIF89a\x08\x02/) {
.

where it was clear to everyone that we are checking whether the first few
bytes of the file contain (0x47, 0x49, ..., 0x02)

Is this sort of thing now completely dead in the Brave New World of
Unicode, Locales etc etc? (yes, there's always pack, but pack is so... errr
hmm )

Dave.




Re: on parrot strings

2002-01-19 Thread Simon Cozens

On Sat, Jan 19, 2002 at 07:08:25PM +0200, Jarkko Hietaniemi wrote:
> http://www.aw.com/catalog/academic/product/1,4096,0201700522,00.html
> The book looks really promising-- unfortunately it's not yet published.

Isn't this, uhm, http://www.concentric.net/~rtgillam/pubs/unibook/index.html ?

-- 
Sigh.  I like to think it's just the Linux people who want to be on
the "leading edge" so bad they walk right off the precipice.
(Craig E. Groeschel)



Re: on parrot strings

2002-01-19 Thread Graham Barr

I believe IBM uses inversion lists in their ICU library for sets of
Unicode characters.

Graham.

On Sat, Jan 19, 2002 at 07:08:25PM +0200, Jarkko Hietaniemi wrote:
> Honour where honour is due: I've got some questions about inversion
> lists.  Where I saw them mentioned by that name was in some drafts of
> this:
> 
> http://www.aw.com/catalog/academic/product/1,4096,0201700522,00.html
> 
> The book looks really promising-- unfortunately it's not yet published.
> 
> -- 
> $jhi++; # http://www.iki.fi/jhi/
> # There is this special biologist word we use for 'stable'.
> # It is 'dead'. -- Jack Cohen



Re: on parrot strings

2002-01-19 Thread Jarkko Hietaniemi

Honour where honour is due: I've got some questions about inversion
lists.  Where I saw them mentioned by that name was in some drafts of
this:

http://www.aw.com/catalog/academic/product/1,4096,0201700522,00.html

The book looks really promising-- unfortunately it's not yet published.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: on parrot strings

2002-01-18 Thread Piers Cawley

Hong Zhang <[EMAIL PROTECTED]> writes:

>> > preprocessing. Another example, if I want to search for /resume/e,
>> > (equivalent matching), the regex engine can normalize the case, fully 
>> > decompose input string, strip off any combining character, and do 8-bit
>> 
>> Hmmm.  The above sounds complicated, and not quite what I had in mind
>> for equivalence matching: I would have just said "both the pattern
>> and the target need to be normalized, as defined by Unicode".  Then 
>> the comparison and searching reduce to the trivial cases of byte
>> equivalence and searching (of which B-M is the most popular example).
>
> You are right in some sense. But "normalized, as defined by Unicode"
> may not be simple. I looked at the Unicode regex report, TR18. It does not
> specify equivalence of "resume" vs "re`sume`", but a user may or may not
> want this kind of normalization.

But e` and e are different letters man. And re`sume` and resume are
different words come to that. If the user wants something that'll
match 'em both then the pattern should surely be:

   /r[ee`]sum[ee`]/

Of course, it might be nice to have something that lets us do

   /r\any_accented(e)sum\any_accented(e)/

(or some such, notation is terrible I know), but my point is that such
searches should be explicit.

-- 
Piers

   "It is a truth universally acknowledged that a language in
possession of a rich syntax must be in need of a rewrite."
 -- Jane Austen?




Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi

On Fri, Jan 18, 2002 at 11:40:17PM +, Nicholas Clark wrote:
> On Fri, Jan 18, 2002 at 05:24:00PM +0200, Jarkko Hietaniemi wrote:
> 
> > > As for character encodings, we're forcing everything to UTF-32 in
> > > regular expressions.  No exceptions.  If you use a string in a regex,
> > > it'll be transcoded.  I honestly can't think of a better way to
> > > guarantee efficient string indexing.
> > 
> > I'm fine with that.  The bloat is of course a shame, but as long as
> > that's not a real problem for someone, let's not worry about it too
> > much.
> 
> Forcing everything to UTF-32 in the API?

I think Brent meant UTF-32 internally for the regexen.  When you say
/A/, Parrot sees 0x00 0x00 0x00 0x41.

> To me it seems that making UTF-32 do everything correctly (something the
> real world can use while encoding-optimised versions are written) is better
> than having a snazzy four-encoding autoswitcher that is wrong and therefore
> not releasable to the world.

Now, now.

But yes, maybe selecting *one* first (and getting its implementation
right) would be good, and in that case it's either UTF-16 (which is
reasonably compact, but variable length), or UTF-32 (which is a bit
wasteful, but fixed length, and therefore easy to think in).
So I guess UTF-32 wins.

> But I don't know about how the internals of all these things work, so I
> may well be wrong on any technical detail.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: on parrot strings

2002-01-18 Thread Nicholas Clark

On Fri, Jan 18, 2002 at 05:24:00PM +0200, Jarkko Hietaniemi wrote:

> > As for character encodings, we're forcing everything to UTF-32 in
> > regular expressions.  No exceptions.  If you use a string in a regex,
> > it'll be transcoded.  I honestly can't think of a better way to
> > guarantee efficient string indexing.
> 
> I'm fine with that.  The bloat is of course a shame, but as long as
> that's not a real problem for someone, let's not worry about it too
> much.

Forcing everything to UTF-32 in the API?
Or just forcing everything to UTF-32 until perl 6.0 is released, as trying
to do UTF-8 (and UTF-16 ...) regexps now is premature optimisation?

To me it seems that making UTF-32 do everything correctly (something the
real world can use while encoding-optimised versions are written) is better
than having a snazzy four-encoding autoswitcher that is wrong and therefore
not releasable to the world.

But I don't know about how the internals of all these things work, so I
may well be wrong on any technical detail.


Nicholas Clark
-- 
ENOCHOCOLATE http://www.ccl4.org/~nick/CV.html



Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi

> > > We *do* want to have (with some notation)
> > > [[:digit:]\p{FunkyLooking}aeiou except 7], right?
> > 
> > Of course.  But that is all resolvable in regex compile time.
> > No expression tree needed.
> 
> My point was that if inversion lists are insufficient for describing
> all the character classes we might be interested in, then we'll need
> the tree. And an example of why inversion lists would be insufficient
> is if we have a character API that only allows queries of the sort "is
> this character FunkyLooking or not?", rather than "what ranges of
> characters are FunkyLooking?" (Unless you want to do "is 0
> FunkyLooking? is 1 FunkyLooking? ... is 4294967295 FunkyLooking?" at
> compile time.)

I think the answer to that dilemma is obvious: we do want an API that
tells which ranges FunkyLooking covers, and guess what: the answers
to such questions can be represented as inversion lists.

> > compile time.  (Don't say "locales" or I'll ha've have to hurt you,
> > for your own good. :-)
> 
> Was the ' in ha've unintentional, or is that an acute accent mark? :-)

I was aiming for a pirate accent.  Arr.  Discussing parrots and all.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: on parrot strings

2002-01-18 Thread Steve Fink

On Sat, Jan 19, 2002 at 12:28:15AM +0200, Jarkko Hietaniemi wrote:
> On Fri, Jan 18, 2002 at 02:22:49PM -0800, Steve Fink wrote:
> > On Sat, Jan 19, 2002 at 12:11:06AM +0200, Jarkko Hietaniemi wrote:
> > > Complement of an inversion list is neat: insert 0 at the beginning
> > > (and append max+1), unless there already is one, in which case delete
> > > the 0 (and shift the list and delete the max+1).  Again, O(N). 
> > > (One could of course have a bit for a 'negative character class',
> > > but that would in turn complicate the computations.)
> > 
> > If we have hybrid notation, we'll be stuck with not only a bit for
> > that, but also a complete expression tree for character classes.
> > (Which is necessary if we use a Unicode library that only exposes
> > property test functions, not numeric ranges.)
> > 
> > We *do* want to have (with some notation)
> > [[:digit:]\p{FunkyLooking}aeiou except 7], right?
> 
> Of course.  But that is all resolvable in regex compile time.
> No expression tree needed.

My point was that if inversion lists are insufficient for describing
all the character classes we might be interested in, then we'll need
the tree. And an example of why inversion lists would be insufficient
is if we have a character API that only allows queries of the sort "is
this character FunkyLooking or not?", rather than "what ranges of
characters are FunkyLooking?" (Unless you want to do "is 0
FunkyLooking? is 1 FunkyLooking? ... is 4294967295 FunkyLooking?" at
compile time.)

> compile time.  (Don't say "locales" or I'll ha've have to hurt you,
> for your own good. :-)

Was the ' in ha've unintentional, or is that an acute accent mark? :-)



Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi

On Fri, Jan 18, 2002 at 02:22:49PM -0800, Steve Fink wrote:
> On Sat, Jan 19, 2002 at 12:11:06AM +0200, Jarkko Hietaniemi wrote:
> > Complement of an inversion list is neat: insert 0 at the beginning
> > (and append max+1), unless there already is one, in which case delete
> > the 0 (and shift the list and delete the max+1).  Again, O(N). 
> > (One could of course have a bit for a 'negative character class',
> > but that would in turn complicate the computations.)
> 
> If we have hybrid notation, we'll be stuck with not only a bit for
> that, but also a complete expression tree for character classes.
> (Which is necessary if we use a Unicode library that only exposes
> property test functions, not numeric ranges.)
> 
> We *do* want to have (with some notation)
> [[:digit:]\p{FunkyLooking}aeiou except 7], right?

Of course.  But that is all resolvable in regex compile time.
No expression tree needed.

[[:digit:]\p{FunkyLooking}aeiou$FooBar] is an ickier case,
but even there the constant parts can be resolved in regex
compile time.  (Don't say "locales" or I'll ha've have to hurt you,
for your own good. :-)

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: on parrot strings

2002-01-18 Thread Steve Fink

On Sat, Jan 19, 2002 at 12:11:06AM +0200, Jarkko Hietaniemi wrote:
> Complement of an inversion list is neat: insert 0 at the beginning
> (and append max+1), unless there already is one, in which case delete
> the 0 (and shift the list and delete the max+1).  Again, O(N). 
> (One could of course have a bit for a 'negative character class',
> but that would in turn complicate the computations.)

If we have hybrid notation, we'll be stuck with not only a bit for
that, but also a complete expression tree for character classes.
(Which is necessary if we use a Unicode library that only exposes
property test functions, not numeric ranges.)

We *do* want to have (with some notation)
[[:digit:]\p{FunkyLooking}aeiou except 7], right?



Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi

On Fri, Jan 18, 2002 at 01:40:26PM -0800, Steve Fink wrote:
> On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote:
> > ints, or 176 bytes. Searching for membership in an inversion list is
> > O(N log N) (binary search).  "Encoding the whole range" is a non-issue
> > bordering on a joke: two ints, or 8 bytes.
> 
> [Clarification from a noncombatant] You meant O(log N).
> 
> I like the inversion list idea. But its speed is proportional to the
> toothiness of the character class, and while I have good intuition for
> what that means in 7-bit US-ASCII, I have no idea how bad it gets for
> other languages. "Vowels"? "Capital letters"? Would anyone ever want
> to select all Chinese characters with a particular radical?
> 
> That's just lookup. We should also consider other character class
> operations: union, subtraction, intersection. They're pretty

Complement of an inversion list is neat: insert 0 at the beginning
(and append max+1), unless there already is one, in which case delete
the 0 (and shift the list and delete the max+1).  Again, O(N). 
(One could of course have a bit for a 'negative character class',
but that would in turn complicate the computations.)
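
A sketch of that complement, assuming the toggle-point representation
(the array stores the code points where membership flips; the names are
illustrative, not Perl 5's or Parrot's):

    /* Sketch: complement an inversion list in place by adding or
     * removing a leading 0.  list[] must have room for one more
     * element; returns the new length.  O(N) from the memmove. */
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    static size_t
    invlist_complement(uint32_t *list, size_t n)
    {
        if (n > 0 && list[0] == 0) {    /* class starts at 0: drop it */
            memmove(list, list + 1, (n - 1) * sizeof *list);
            return n - 1;
        }
        memmove(list + 1, list, n * sizeof *list);  /* prepend 0 */
        list[0] = 0;
        return n + 1;
    }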

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi

On Fri, Jan 18, 2002 at 01:40:26PM -0800, Steve Fink wrote:
> On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote:
> > ints, or 176 bytes. Searching for membership in an inversion list is
> > O(N log N) (binary search).  "Encoding the whole range" is a non-issue
> > bordering on a joke: two ints, or 8 bytes.
> 
> [Clarification from a noncombatant] You meant O(log N).

Duh, yes.  At least someone is awake :-)

> I like the inversion list idea. But its speed is proportional to the
> toothiness of the character class, and while I have good intuition for

Yup.

> what that means in 7-bit US-ASCII, I have no idea how bad it gets for
> other languages. "Vowels"? "Capital letters"? Would anyone ever want

As far as I can see, and guestimate (watch out for waving hands),
it would behave pretty well In Real Life.  If we are talking about the
predefined existing categories like Lu, or Greek script, or Cyrillic
block, they are pretty well localized and not scattershot.
User-specified characters are likely to be well localized to
one or a few scripts.

> to select all Chinese characters with a particular radical?
> 
> That's just lookup. We should also consider other character class
> operations: union, subtraction, intersection. They're pretty
> straightforward and fast (O(N)) for inversion lists. (Yes, all these

Yes, since they are by definition sorted, merging (or negatively
merging) them is pretty simple.

> operations can be postponed until lookup time, regardless of the
> underlying represention, in which case the time of union(C1,C2) is
> just the time of C1 + time of C2 + time of an 'or'.)

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: on parrot strings

2002-01-18 Thread Steve Fink

On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote:
> ints, or 176 bytes. Searching for membership in an inversion list is
> O(N log N) (binary search).  "Encoding the whole range" is a non-issue
> bordering on a joke: two ints, or 8 bytes.

[Clarification from a noncombatant] You meant O(log N).

I like the inversion list idea. But its speed is proportional to the
toothiness of the character class, and while I have good intuition for
what that means in 7-bit US-ASCII, I have no idea how bad it gets for
other languages. "Vowels"? "Capital letters"? Would anyone ever want
to select all Chinese characters with a particular radical?

That's just lookup. We should also consider other character class
operations: union, subtraction, intersection. They're pretty
straightforward and fast (O(N)) for inversion lists. (Yes, all these
operations can be postponed until lookup time, regardless of the
underlying represention, in which case the time of union(C1,C2) is
just the time of C1 + time of C2 + time of an 'or'.)
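
A sketch of one such operation, union, under the toggle-point
representation of inversion lists (an assumed convention; intersection and
subtraction are the same loop with the OR swapped for AND or AND-NOT):

    /* Sketch: union of two inversion lists.  Walk both sorted
     * arrays, flip the per-list "inside" flag at each toggle point,
     * and emit a point whenever the OR of the flags changes.
     * O(N+M); out[] must hold up to n+m entries. */
    #include <stdint.h>
    #include <stddef.h>

    static size_t
    invlist_union(const uint32_t *a, size_t n,
                  const uint32_t *b, size_t m, uint32_t *out)
    {
        size_t i = 0, j = 0, k = 0;
        int in_a = 0, in_b = 0, in_u = 0;

        while (i < n || j < m) {
            uint32_t p;
            if (j >= m || (i < n && a[i] <= b[j])) {
                p = a[i];
                if (j < m && b[j] == p) { in_b = !in_b; j++; }
                in_a = !in_a; i++;
            } else {
                p = b[j];
                in_b = !in_b; j++;
            }
            if ((in_a || in_b) != in_u) {   /* union flipped here */
                in_u = !in_u;
                out[k++] = p;
            }
        }
        return k;
    }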



Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi

On Fri, Jan 18, 2002 at 12:20:53PM -0800, Hong Zhang wrote:
> > My proposal is that we should use a mixed method. The Unicode standard
> > classes, such as \p{IsLu}, can be handled by a standard splitbin table.
> > Please see Java java.lang.Character or Python unicodedata_db.h. I did
> > measurements on it: to handle all Unicode categories, simple casing,
> > and decimal digit values, I need about a 23KB table for Unicode 3.1
> > (0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF).
> > 
> > Don't try to compete with inversion lists on the size: their size is
> > measured in bytes.  For example "Latin script", which consists of 22
> > separate ranges sprinkled between U+0041 and U+FF5A, encodes into 44
> > ints, or 176 bytes. Searching for membership in an inversion list is
> > O(N log N) (binary search).  "Encoding the whole range" is a non-issue
> > bordering on a joke: two ints, or 8 bytes.
> 
> When I said mixed method, I did intend to include binary search. The binary
> search is a win for a sparse character class. But a bitmap is better for a
> large one.

"Better" in what sense?  Smaller?  Certainly not.  Faster?  Maybe, maybe not.
Yes, accessing the right bytes and doing the bit arithmetics is about as
fast as one can hope doing anything in CPUs.  But: the 15KB is quite a lot
of stuff to move around for, say, [0-9].  Yes, bitmaps win in pathological
cases where you, say, choose every other character of the Unicode.

I guess I agree with you that a combination of bitmaps and binary
searchable things (inversion lists or trees) is good, but I guess we
differ in that my gut feeling is that the latter should be the
default, not the bitmaps.

I also think this low-level detail should be completely hidden from,
say, the writers of the regex engine, all they should see is
code_point_in_class(cp, cc), and that the low-level "character class
engine" should dynamically pick whichever low-level implementation is
"best", and naturally that only one of the low-level implementations
is being used (for one character class) at a time: hybrids (meaning
dual book-keeping) sound to me like a fruitful breeding area for bugs.
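
A sketch of that hiding, with hypothetical names (the bitmap arm assumes
the code point is within the table; the inversion-list arm assumes the
convention that even indices begin in-class ranges):

    /* Sketch: the regex engine sees only code_point_in_class();
     * the class records which representation was picked when it
     * was built. */
    #include <stdint.h>
    #include <stddef.h>

    typedef enum { CC_BITMAP, CC_INVLIST } cc_impl;

    typedef struct {
        cc_impl         impl;
        const uint8_t  *bitmap;   /* CC_BITMAP: one bit per code point */
        const uint32_t *invlist;  /* CC_INVLIST: sorted toggle points  */
        size_t          invlen;
    } CharClass;

    static int
    code_point_in_class(uint32_t cp, const CharClass *cc)
    {
        size_t lo = 0, hi;

        if (cc->impl == CC_BITMAP)
            return (cc->bitmap[cp >> 3] >> (cp & 7)) & 1;

        for (hi = cc->invlen; lo < hi; ) {      /* binary search */
            size_t mid = (lo + hi) / 2;
            if (cc->invlist[mid] <= cp)
                lo = mid + 1;
            else
                hi = mid;
        }
        return (int)(lo & 1);   /* odd count of toggles <= cp: member */
    }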

> Python uses a two-level bitmap for the first 64K characters.

And their Unicode implementation is doing how well? :-)

> Hong

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



RE: on parrot strings

2002-01-18 Thread Hong Zhang

> > My proposal is that we should use a mixed method. The Unicode standard
> > classes, such as \p{IsLu}, can be handled by a standard splitbin table.
> > Please see Java java.lang.Character or Python unicodedata_db.h. I did
> > measurements on it: to handle all Unicode categories, simple casing,
> > and decimal digit values, I need about a 23KB table for Unicode 3.1
> > (0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF).
> 
> Don't try to compete with inversion lists on the size: their size is
> measured in bytes.  For example "Latin script", which consists of 22
> separate ranges sprinkled between U+0041 and U+FF5A, encodes into 44
> ints, or 176 bytes. Searching for membership in an inversion list is
> O(N log N) (binary search).  "Encoding the whole range" is a non-issue
> bordering on a joke: two ints, or 8 bytes.

When I said mixed method, I did intend to include binary search. Binary
search is a win for a sparse character class, but a bitmap is better for a
large one. Python uses a two-level bitmap for the first 64K characters.

Hong



RE: on parrot strings

2002-01-18 Thread Hong Zhang

> > preprocessing. Another example, if I want to search for /resume/e,
> > (equivalent matching), the regex engine can normalize the case, fully 
> > decompose input string, strip off any combining character, and do 8-bit
> 
> Hmmm.  The above sounds complicated, and not quite what I had in mind
> for equivalence matching: I would have just said "both the pattern
> and the target need to be normalized, as defined by Unicode".  Then 
> the comparison and searching reduce to the trivial cases of byte
> equivalence and searching (of which B-M is the most popular example).

You are right in some sense. But "normalized, as defined by Unicode"
may not be simple. I looked at the Unicode regex report, TR18. It does not
specify equivalence of "resume" vs "re`sume`", but a user may or may not
want this kind of normalization.

Hong



Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi

On Fri, Jan 18, 2002 at 11:44:00AM -0800, Hong Zhang wrote:
> > (1) There are 5.125 bytes in Unicode, not four.
> > (2) I think the above would suffer from the same problem as one common
> > suggestion, two-level bitmaps (though I think the above would suffer
> > less, being of finer granularity): the problem is that a lot of
> > space is wasted, since the "usage patterns" of Unicode character
> > classes tend to be rather scattered and irregular.  Yes, I see
> > that you said: "only the arrays that we actually used would be
> > allocated to save space"-- which reads to me: much complicated
> > logic both in creation and access to make the data 
> > structure *look*
> > simple.  I'm a firm believer in getting the data structures right,
> > after which the code to access them almost writes itself.
> > 
> > I would suggest the inversion lists for the first try.  As long as
> > character classes are not very dynamic once they have been created
> > (and at least traditionally that has been the case), inversion lists
> > should work reasonably well.
> 
> My proposal is that we should use a mixed method. The Unicode standard
> classes, such as \p{IsLu}, can be handled by a standard splitbin table.
> Please see Java java.lang.Character or Python unicodedata_db.h. I did
> measurements on it: to handle all Unicode categories, simple casing,
> and decimal digit values, I need about a 23KB table for Unicode 3.1
> (0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF).

Don't try to compete with inversion lists on the size: their size is
measured in bytes.  For example "Latin script", which consists of 22
separate ranges sprinkled between U+0041 and U+FF5A, encodes into 44
ints, or 176 bytes. Searching for membership in an inversion list is
O(N log N) (binary search).  "Encoding the whole range" is a non-issue
bordering on a joke: two ints, or 8 bytes.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi

> I don't think UTF-32 will save you much. The Unicode case map is variable
> length; combining characters, canonical equivalence, and many other things
> will require variable-length mapping. For example, if I only want to

This is true.

> parse /[0-9]+/, why would you want to convert everything to UTF-32? Most of
> the time, regcomp() can find out whether this regexp will need complicated
> preprocessing. Another example: if I want to search for /resume/e
> (equivalent matching), the regex engine can normalize the case, fully 
> decompose the input string, strip off any combining characters, and do 8-bit

Hmmm.  The above sounds complicated, and not quite what I had in mind
for equivalence matching: I would have just said "both the pattern
and the target need to be normalized, as defined by Unicode".  Then 
the comparison and searching reduce to the trivial cases of byte
equivalence and searching (of which B-M is the most popular example).

> Boyer-Moore search. I bet it will be simpler and faster than using UTF-32.
> (BTW, equivalent matching means matching English spelling against French
> spelling, disregarding diacritics.)
> 
> I think we should explore more choices and do some experiments.

What do you mean by *we*? :-)  I am not a p6-internals regular, nor do
I intend to be; there are only so many hours in a day.  But yes, the
sooner we get into exploration/experiment mode, the better.  The
Unicode mindset *must* be adopted sooner rather than later;
"unwriting" 8-bit-byteism out of the code later is hell.  Hopefully my
little treatise will kick Parrot more or less in the right direction.

> Hong

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



RE: on parrot strings

2002-01-18 Thread Hong Zhang

> (1) There are 5.125 bytes in Unicode, not four.
> (2) I think the above would suffer from the same problem as one common
> suggestion, two-level bitmaps (though I think the above would suffer
> less, being of finer granularity): the problem is that a lot of
> space is wasted, since the "usage patterns" of Unicode character
> classes tend to be rather scattered and irregular.  Yes, I see
> that you said: "only the arrays that we actually used would be
> allocated to save space"-- which reads to me: much complicated
> logic both in creation and access to make the data 
> structure *look*
> simple.  I'm a firm believer in getting the data structures right,
> after which the code to access them almost writes itself.
> 
> I would suggest the inversion lists for the first try.  As long as
> character classes are not very dynamic once they have been created
> (and at least traditionally that has been the case), inversion lists
> should work reasonably well.

My proposal is that we should use a mixed method. The Unicode standard
classes, such as \p{IsLu}, can be handled by a standard splitbin table.
Please see Java java.lang.Character or Python unicodedata_db.h. I did
measurements on it: to handle all Unicode categories, simple casing,
and decimal digit values, I need about a 23KB table for Unicode 3.1
(0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF).
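
A sketch of how such a splitbin lookup works, with assumed
generator-produced tables; sharing identical 128-entry blocks is what
keeps the totals in the 15-23KB range:

    /* Sketch: two-stage category lookup.  The high bits of the code
     * point pick a block, the low bits index into it. */
    #include <stdint.h>

    #define BLOCK_BITS 7    /* 128 characters per block (assumed) */

    extern const uint16_t block_index[];  /* cp >> BLOCK_BITS -> block */
    extern const uint8_t  block_data[][1 << BLOCK_BITS];

    static uint8_t
    unicode_category(uint32_t cp)
    {
        uint16_t block = block_index[cp >> BLOCK_BITS];
        return block_data[block][cp & ((1 << BLOCK_BITS) - 1)];
    }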

For a simple character class, such as [\p{IsLu}\p{InGreek}], the regex
does not need to emit an optimized bitmap. Instead, it just generates
a union: the first part uses the standard Unicode category lookup, the
second is a simple range.

If the user mandates a fast bitmap, and the character class is not
extremely complicated, we will probably need only a few KB for
each character class.

> > As for character encodings, we're forcing everything to UTF-32 in
> > regular expressions.  No exceptions.  If you use a string in a regex,
> > it'll be transcoded.  I honestly can't think of a better way to
> > guarantee efficient string indexing.

I don't think UTF-32 will save you much. The Unicode case map is variable
length; combining characters, canonical equivalence, and many other things
will require variable-length mapping. For example, if I only want to
parse /[0-9]+/, why would you want to convert everything to UTF-32? Most of
the time, regcomp() can find out whether this regexp will need complicated
preprocessing. Another example: if I want to search for /resume/e
(equivalent matching), the regex engine can normalize the case, fully
decompose the input string, strip off any combining characters, and do an
8-bit Boyer-Moore search. I bet it will be simpler and faster than using
UTF-32. (BTW, equivalent matching means matching English spelling against
French spelling, disregarding diacritics.)
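
A toy sketch of that pipeline, for ISO-8859-1 input only: the hand-rolled
fold stands in for "decompose and strip combining characters", and plain
strstr() can stand in for Boyer-Moore:

    /* Sketch: fold case and accents in place, then byte-search. */
    #include <ctype.h>
    #include <string.h>

    static void
    equiv_fold(char *s)
    {
        for (; *s; s++) {
            unsigned char c = (unsigned char)tolower((unsigned char)*s);
            if (c >= 0xE8 && c <= 0xEB)   /* e-grave .. e-diaeresis */
                c = 'e';                  /* illustrative table only */
            *s = (char)c;
        }
    }

After equiv_fold() on both the pattern and the target, "r\xE9sum\xE9"
and "resume" compare equal, so the search reduces to strstr(text, pat).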

I think we should explore more choices and do some experiments.

Hong



Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi

> Since I seem to be the main regex hacker for Parrot, I'll respond to
> this as best I can.
> 
> Currently, we are using bitmaps for character classes.  Well, sort of.
> A Bitmap in Parrot is defined like this:
> 
>   typedef struct bitmap_t {
>   char*   bmp;
>   STRING* bigchars;
>   } Bitmap;
> 
> Characters <256 are stored as a bitmap in bmp; other characters are
> stored in bigchars and linear-searched.  This is a temporary measure,

This is similar to how Perl 5 does them: the low eight bits are in a
32-byte bitmap, and the "wide characters" are stored after it (in a funky
data structure; I won't go into more detail so that people won't lose
their lunch/breakfast/meal).

> since Parrot isn't yet dealing with many characters outside of ASCII.
> Several schemes have been proposed for the final version; I'm currently
> leaning towards an array of arrays of arrays of bitmaps (one level for
> each byte of the character):
> 
>   INTVAL ch;
>   return
> bmp->bmp[FIRST_BYTE(ch)][SECOND_BYTE(ch)][THIRD_BYTE(ch)][FORTH_BYTE(ch)
> >>3] & (1<<(FORTH_BYTE(ch) & 7));

dup + dup *  ... oh, you meant FOURTH.

> Ungainly, but it works.  It would actually be a bit more
> complicated--only the arrays that we actually used would be allocated to
> save space--but you get the idea.  (However, I'm quite flexible on the
> implementation chosen.  I'll look at the ideas you propose in more
> detail; if anyone else has any suggestions, suggest them.)

Ungainly, yes.

(1) There are 5.125 bytes in Unicode, not four.
(2) I think the above would suffer from the same problem as one common
suggestion, two-level bitmaps (though I think the above would suffer
less, being of finer granularity): the problem is that a lot of
space is wasted, since the "usage patterns" of Unicode character
classes tend to be rather scattered and irregular.  Yes, I see
that you said: "only the arrays that we actually used would be
allocated to save space"-- which reads to me: much complicated
logic both in creation and access to make the data structure *look*
simple.  I'm a firm believer in getting the data structures right,
after which the code to access them almost writes itself.

I would suggest the inversion lists for the first try.  As long as
character classes are not very dynamic once they have been created
(and at least traditionally that has been the case), inversion lists
should work reasonably well.

> As for character encodings, we're forcing everything to UTF-32 in
> regular expressions.  No exceptions.  If you use a string in a regex,
> it'll be transcoded.  I honestly can't think of a better way to
> guarantee efficient string indexing.

I'm fine with that.  The bloat is of course a shame, but as long as
that's not a real problem for someone, let's not worry about it too
much.

> --Brent Dax
> [EMAIL PROTECTED]
> Parrot Configure pumpking and regex hacker
> 
>  . hawt sysadmin chx0rs
>  This is sad. I know of *a* hawt sysamin chx0r.
>  I know more than a few.
>  obra: There are two? Are you sure it's not the same one?

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: on parrot strings

2002-01-18 Thread Jarkko Hietaniemi

On Fri, Jan 18, 2002 at 04:51:07AM -0500, Bryan C. Warnock wrote:
> Thanks, Jarkko.
> 
> On Thursday 17 January 2002 23:21, Jarkko Hietaniemi wrote:
> > The most important message is: give up on 8-bit bytes, already.
> > Time to move on, chop chop.
> 
> Do you think/feel/wish/demand that the textual (string) APIs should differ 
> from the binary (byte) APIs?  (Both from an internal Parrot perspective and 
> at the language level.)

I tried to address this issue at two points in the document, "Of Bits
and Bytes", and one paragraph in "TO DO" talking about encoding
conversions and I/O.  But I guess the answer is "yes and yes"; I think
the APIs should be different.  It pains my UNIX heart, but thinking in
terms of just bytes was a convenient illusion that worked as long as we
kept ourselves to 8-bit byte character sets.  I think the illusion
works no more.

> This may be beyond the scope of the document, but do you have an opinion on 
> whether strings need to be entirely encapsulated within a single structure, 
> or whether "virtual" strings (comprising several disparate substrings) are a 
> viable addition?  
> 
>   typedef struct {
>   UINTVAL             size;
>   UINTVAL             index;
>   UINTVAL             index_offset;
>   UINTVAL             last_offset;
>   UINTVAL             size_valid:1;
>   UINTVAL             offset_valid:1;
>   UINTVAL             last_valid:1;
>   UINTVAL             continued:1;
>   PARROT_STRING       string;
>   PARROT_SIZED_STRING string_continued;
>   } PARROT_SIZED_STRING;

First off, I think virtual strings (if you define strings as "a linear
collection of characters (or bytes)") are a great idea; that's why I
suggested them a while ago even in the context of Perl 5 (though I
admit I also simply liked the proposed name: VVs...)  But I also think
they are high-level enough that they probably should not be any of the
low-level string structures.  For example: one nifty thing you can do
with virtual strings is that they can be read-only windows into another
string, and I don't think the read-onlyness flag belongs to the
low-level strings: it's something coming from above.  Similarly with
virtual strings composed of slices of several other strings:
how do you manage the book-keeping of those other strings?  Too complex:
let's keep the low-level, ummm, low-level.
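
A hypothetical sketch of such a window type, kept above the low-level
string as suggested (none of these names are Parrot's):

    /* Sketch: a virtual string as a read-only window onto another
     * string.  The flag lives here, not in the low-level string. */
    #include <stddef.h>

    typedef struct parrot_string PARROT_STRING;  /* low-level type */

    typedef struct {
        PARROT_STRING *base;         /* windowed string; not owned */
        size_t         offset;       /* window start, in characters */
        size_t         length;       /* window length, in characters */
        unsigned       read_only:1;  /* "coming from above" */
    } VirtualString;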

> This was discussed earlier mostly for alleviating some of the headaches 
> associated with variable-width encodings. 

If we keep the low-level limited to just a handful of encodings
(I proposed three), and the variable encodings well-behaved (UTF-8 as
opposed to the gnarlier ones), I don't think the burden will be too bad.

> -- 
> Bryan C. Warnock
> [EMAIL PROTECTED]

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: on parrot strings

2002-01-18 Thread Bryan C. Warnock

Thanks, Jarkko.

On Thursday 17 January 2002 23:21, Jarkko Hietaniemi wrote:
> The most important message is: give up on 8-bit bytes, already.
> Time to move on, chop chop.

Do you think/feel/wish/demand that the textual (string) APIs should differ 
from the binary (byte) APIs?  (Both from an internal Parrot perspective and 
at the language level.)

This may be beyond the scope of the document, but do you have an opinion on 
whether strings need to be entirely encapsulated within a single structure, 
or whether "virtual" strings (comprising several disparate substrings) are a 
viable addition?  

typedef struct {
    UINTVAL             size;
    UINTVAL             index;
    UINTVAL             index_offset;
    UINTVAL             last_offset;
    UINTVAL             size_valid:1;
    UINTVAL             offset_valid:1;
    UINTVAL             last_valid:1;
    UINTVAL             continued:1;
    PARROT_STRING       string;
    PARROT_SIZED_STRING string_continued;
} PARROT_SIZED_STRING;

This was discussed earlier mostly for alleviating some of the headaches 
associated with variable-width encodings. 

-- 
Bryan C. Warnock
[EMAIL PROTECTED]



RE: on parrot strings

2002-01-18 Thread Brent Dax

Jarkko Hietaniemi: 
About the implementation of character classes: since the Unicode code
point range is big, a single big bitmap won't work any more: firstly,
it would be big.  Secondly, for most cases, it would be wastefully
sparse.  A balanced binary tree of (begin, end) points of ranges is
suggested.  That would seem to give the required flexibility and
reasonable compromise between speed and space for implementing the
operations required by both the traditional regular expression
character classes (complement, case-ignorance) and the new Unicode
character class semantics (difference, intersection) (see the Unicode
Technical Report #18, I<Unicode Regular Expression Guidelines>,
http://www.unicode.org/unicode/reports/tr18/ )

Another, possible simpler way would be to use inversion lists:
1-dimensional arrays where odd (starting from zero) indices store
the beginnings of ranges belonging to the class, and even indices
store the beginnings of ranges not belonging to the class.
Note "array" instead of (a linked) "list": with an array one can do
binary search to determine membership (so an inversion list is in
effect a flattened binary tree).  Yet another way would be to use
various two-level table schemes.  The choice of the appropriate data
structure, as always, depends on the expected operational (read vs
modify) mix and the expected data distribution.
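
One illustrative way to build such an array from sorted, disjoint
[begin, end] ranges (a sketch under that assumption; the names are not
from any Parrot API):

    /* Sketch: flatten (begin, end) ranges into an inversion list.
     * A later binary search that lands on an even index means the
     * code point is in the class. */
    #include <stdint.h>
    #include <stddef.h>

    static size_t
    ranges_to_invlist(const uint32_t (*ranges)[2], size_t n,
                      uint32_t *out)
    {
        size_t i, k = 0;
        for (i = 0; i < n; i++) {
            out[k++] = ranges[i][0];      /* membership turns on */
            out[k++] = ranges[i][1] + 1;  /* and off past the end */
        }
        return k;
    }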

###

Since I seem to be the main regex hacker for Parrot, I'll respond to
this as best I can.

Currently, we are using bitmaps for character classes.  Well, sort of.
A Bitmap in Parrot is defined like this:

typedef struct bitmap_t {
char*   bmp;
STRING* bigchars;
} Bitmap;

Characters <256 are stored as a bitmap in bmp; other characters are
stored in bigchars and linear-searched.  This is a temporary measure,
since Parrot isn't yet dealing with many characters outside of ASCII.
Several schemes have been proposed for the final version; I'm currently
leaning towards an array of arrays of arrays of bitmaps (one level for
each byte of the character):

INTVAL ch;
return
bmp->bmp[FIRST_BYTE(ch)][SECOND_BYTE(ch)][THIRD_BYTE(ch)][FORTH_BYTE(ch)
>>3] & (1<<(FORTH_BYTE(ch) & 7));

Ungainly, but it works.  It would actually be a bit more
complicated--only the arrays that we actually used would be allocated to
save space--but you get the idea.  (However, I'm quite flexible on the
implementation chosen.  I'll look at the ideas you propose in more
detail; if anyone else has any suggestions, suggest them.)

As for character encodings, we're forcing everything to UTF-32 in
regular expressions.  No exceptions.  If you use a string in a regex,
it'll be transcoded.  I honestly can't think of a better way to
guarantee efficient string indexing.
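
A minimal sketch of that transcoding step, assuming well-formed UTF-8
input (a real transcoder must also validate and reject ill-formed
sequences):

    /* Sketch: decode well-formed UTF-8 into UTF-32 code points so
     * the regex engine can index characters in O(1).  out[] must
     * hold up to len entries; returns the code point count. */
    #include <stdint.h>
    #include <stddef.h>

    static size_t
    utf8_to_utf32(const unsigned char *in, size_t len, uint32_t *out)
    {
        size_t i = 0, k = 0;
        while (i < len) {
            unsigned char c = in[i];
            uint32_t cp;
            if (c < 0x80) {                       /* 1 byte  */
                cp = c;  i += 1;
            } else if (c < 0xE0) {                /* 2 bytes */
                cp = (uint32_t)(c & 0x1F) << 6 | (in[i+1] & 0x3F);
                i += 2;
            } else if (c < 0xF0) {                /* 3 bytes */
                cp = (uint32_t)(c & 0x0F) << 12
                   | (uint32_t)(in[i+1] & 0x3F) << 6
                   | (in[i+2] & 0x3F);
                i += 3;
            } else {                              /* 4 bytes */
                cp = (uint32_t)(c & 0x07) << 18
                   | (uint32_t)(in[i+1] & 0x3F) << 12
                   | (uint32_t)(in[i+2] & 0x3F) << 6
                   | (in[i+3] & 0x3F);
                i += 4;
            }
            out[k++] = cp;
        }
        return k;
    }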

--Brent Dax
[EMAIL PROTECTED]
Parrot Configure pumpking and regex hacker

 . hawt sysadmin chx0rs
 This is sad. I know of *a* hawt sysamin chx0r.
 I know more than a few.
 obra: There are two? Are you sure it's not the same one?




on parrot strings

2002-01-17 Thread Jarkko Hietaniemi

Dan suggested that I make this document public.

I've been reluctant to do so, for several reasons, the most important
being that I'm far, FAR, behind in the current Parrot developments.

Therefore, this document will probably rudely stomp right over any
recent string, encoding and regular expression work that has been
going on.  Its idea of vtables is just so much handwaving; I really
don't know how vtables really work or how they are currently
implemented in Parrot.

However, I have spent some serious thought on this, based on the
appropriate amount of fun I've had lately with the Unicode in the
upcoming Perl 5.8, and some familiarity with the Unicode standard.

The most important message is: give up on 8-bit bytes, already.
Time to move on, chop chop.

The document itself is a collection (what we experts technically call
"a mess") of theory and practice: character model and statements about
how things should be, and how things should probably be implemented.
I've asked Larry about whether the ICU license would be compatible
with whatever Parrot will have, but he's currently cruising...

The bad (or good?) news is that I don't have the time to work
on this design document much more, maintain or develop it,
and certainly not implement any of it (another reason why I
have been reluctant to publish this).  But that's how it is.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen


=head1 TITLE

Parrot String Handling

=head1 VERSION

$Id: parrotstring.pod,v 1.16 2002/01/18 03:57:07 jhi Exp $

=head2 Current

Maintainer: Some Body
Class: Internals
PDD Number: X
Version: 1.0
Status: Developing
Last Modified: 
PDD Format: 1
Language: English

=head2 History

None.  First version.  (Jarkko Hietaniemi)

=head1 CHANGES

None.  First version.

=head1 ABSTRACT

This PDD specifies how Parrot should handle characters and text,
the character encoding model, and the basic operations.

=head1 DESCRIPTION

Since discussion of character sets, encodings, and the like quite
often confuses people considerably, let's first define some terms for
our character encoding model.  The terminology follows the Unicode
Technical Report #17, I<Character Encoding Model>,
http://www.unicode.org/unicode/reports/tr17/ .  The character
encoding model consists of five layers:

=head2 Terminology

=over 4

=item I<ACR>

I<Abstract Character Repertoire> - the (unordered) set of characters
to be encoded.  Unicode defines one (open-ended) repertoire; the ISO
8859 standards are examples of (closed) repertoires.

=item I<CCS>

I<Coded Character Set> - a mapping from an abstract character
repertoire to a set of non-negative integers known as I<code points>.
This is how the Unicode character C<LATIN CAPITAL LETTER A> is mapped
to U+0041.  Alternative names for a CCS are a I<code page>, a
I<character set>, or a I<charset>.
Examples include Unicode, ISO 8859, the Windows code pages, and the
JIS X 0208.  The code point space (from zero to some positive number)
can be flat or complex (some ranges may be unused or forbidden).

=item I<CEF>

I<Character Encoding Form> - a mapping from a set of non-negative
integers (a CCS) to a set of sequences of particular I<code units> of
some specified (computer-architecture imposed) width, such as bytes.

CEFs can be fixed-width or variable-width.  Note that (especially
historically) in many cases CCS and CEF have used the same (often
byte) fixed width, so they have been practically identical.  But don't
let that fool you.

A byte sequence can be either I<assigned> (it represents a valid
and assigned character in the code point space), I<unassigned> (it
represents a valid but unused code point), or I<illegal> (for some
reason forbidden).

Examples of fixed-width CEFs include 7-bit, 8-bit, 16-bit, and 32-bit.
Note that these are independent of any particular CCS.  8-bit CEF can
be used for encoding several different CCSs, such as the ISO 8859 and
the different EBCDICs.

Examples of variable-width CEFs include UTF-8 and UTF-16.

=item I<CES>

I<Character Encoding Scheme> - a mapping from a set of sequences
of code units (from one or more CEFs) to a serialised sequence
of bytes.  Serialisation is important for transmission and storage.

It's important not to confuse a CEF and a CES.  A CEF maps code points
to code units (numbers) but a CES maps code units to serialised
sequences of bytes.  Therefore a CES must take into account
byteordering.  A CES may also include more than one CCS, and it may
include non-textual data such as shift sequences.

Examples of CESs include UTF-8 (yes, UTF-8 is both a CEF and a CES),
UTF-16BE, and UTF-16LE from Unicode, and the Asian ISO 2022 and EUC.
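
For instance, a CEF hands over the 16-bit code unit 0x0041; only the CES
decides which byte is serialised first.  A small illustrative sketch (not
part of any Parrot API):

    /* Sketch: one code unit, two serialisations. */
    void
    utf16_code_unit_to_bytes(unsigned cu, unsigned char out[2],
                             int big_endian)
    {
        if (big_endian) {            /* UTF-16BE: 0x00 0x41 */
            out[0] = (cu >> 8) & 0xFF;
            out[1] =  cu       & 0xFF;
        } else {                     /* UTF-16LE: 0x41 0x00 */
            out[0] =  cu       & 0xFF;
            out[1] = (cu >> 8) & 0xFF;
        }
    }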

=item I<TES>

I<Transfer Encoding Syntax> - a reversible transform of encoded data,
which may or may not contain textual data.  These include various
Internet or other transmission/storage transformations that either
encapsulate/protect data for transmission or compress it.

Examples include base64, uuencode, quoted-printable, zip, bzip2.

=back

In addition to the above: I<CM> - I<Character Map> - is a mapping that
combines an ACR, a CCS, a CEF, and a CES.  Character Maps are what in the
IAB architecture get the IANA charset identifiers.  TES is assumed