[EMAIL PROTECTED] (Russ Allbery) wrote on 22.01.02 in
[EMAIL PROTECTED]:
Kai Henningsen [EMAIL PROTECTED] writes:
A case that (in a slightly different context) recently came up on
alt.usage.german (I don't remember if this particular point was made,
but it belongs):
berliner -
On Wed, Jan 23, 2002 at 06:06:00PM +0200, Kai Henningsen wrote:
People do get confused sometimes.
I'm confused as to why this is still on p6i. Followups to alt.usage.german?
Thanks.
--
DESPAIR:
It's Always Darkest Just Before it Gets Pitch Black
Jarkko Hietaniemi [EMAIL PROTECTED] wrote:
There is no string type built out of native eight-bit bytes.
In the good ol'days, one could usefully use regexes on 8-bit binary data,
eg
open G, 'myfile.gif' or die;
read G, $buf, 8192 or die;
if ($buf =~ /^GIF89a\x08\x02/) {
.
where it was
On Mon, Jan 21, 2002 at 04:37:46PM +0000, Dave Mitchell wrote:
Jarkko Hietaniemi [EMAIL PROTECTED] wrote:
There is no string type built out of native eight-bit bytes.
In the good ol'days, one could usefully use regexes on 8-bit binary data,
eg
open G, 'myfile.gif' or die;
read G, $buf,
Jarkko Hietaniemi [EMAIL PROTECTED] wrote:
In the good ol'days, one could usefully use regexes on 8-bit binary data,
eg
open G, 'myfile.gif' or die;
read G, $buf, 8192 or die;
if ($buf =~ /^GIF89a\x08\x02/) {
.
where it was clear to everyone that we are checking
On Mon, Jan 21, 2002 at 05:09:06PM +0000, Dave Mitchell wrote:
Jarkko Hietaniemi [EMAIL PROTECTED] wrote:
In the good ol'days, one could usefully use regexes on 8-bit binary data,
eg
open G, 'myfile.gif' or die;
read G, $buf, 8192 or die;
if ($buf =~ /^GIF89a\x08\x02/) {
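For comparison, here is a rough sketch in C of the byte-level check the Perl snippet above performs: the first eight bytes of the buffer are compared against a fixed signature, with no character semantics involved. The function name looks_like_gif89a is made up for illustration and is not part of Parrot or Perl.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if buf starts with the same eight bytes
 * the regex /^GIF89a\x08\x02/ tests for, 0 otherwise. */
int looks_like_gif89a(const unsigned char *buf, size_t len)
{
    /* Spelled out as a byte array so no string-escape rules apply. */
    static const unsigned char sig[] = {
        'G', 'I', 'F', '8', '9', 'a', 0x08, 0x02
    };

    assert(buf != NULL || len == 0);

    /* Too short to hold the signature: cannot match. */
    if (len < sizeof sig)
        return 0;

    return memcmp(buf, sig, sizeof sig) == 0;
}
```

Every value compared here is an octet in the range 0..255, which is the property the original poster relies on when applying a regex to binary data.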
But e` and e are different letters man. And re`sume` and resume are
different words come to that. If the user wants something that'll
match 'em both then the pattern should surely be:
/r[ee`]sum[ee`]/
I disagree. The difference between 'e' and 'e`' is similar to 'c'
and 'C'. The Unicode
Yes, that's somewhat problematic. Making up a byte CEF would be
Wrong, though, because there is, by definition, no CCS to map, and
we would be dangerously close to conflating in CES, too...
ACR-CCS-CEF-CES. Read the character model. Understand the character
model. Embrace the character
From: Hong Zhang [mailto:[EMAIL PROTECTED]]
But e` and e are different letters man. And re`sume` and resume are
different words come to that. If the user wants something that'll
match 'em both then the pattern should surely be:
/r[ee`]sum[ee`]/
I disagree. The difference
But e` and e are different letters man. And re`sume` and resume are
different words come to that. If the user wants something that'll
match 'em both then the pattern should surely be:
/r[ee`]sum[ee`]/
I disagree. The difference between 'e' and 'e`' is similar to 'c'
and
Hong Zhang [EMAIL PROTECTED] writes:
I disagree. The difference between 'e' and 'e`' is similar to 'c'
and 'C'.
No, it's not.
In many languages, an accented character is a completely different letter.
It's alphabetized separately, it's pronounced differently, and there are
many words that
On Monday 21 January 2002 16:43, Russ Allbery wrote:
Changing the capitalization of C does not change the word.
Er, most of the time.
--
Bryan C. Warnock
[EMAIL PROTECTED]
Sent: Monday, January 21, 2002 04:10 PM
Cc: [EMAIL PROTECTED]
Subject: RE: on parrot strings
But e` and e are different letters man. And re`sume` and resume are
different words come to that. If the user wants something that'll
match 'em both then the pattern should surely be:
/r[ee`]sum[ee
Bryan C Warnock [EMAIL PROTECTED] writes:
On Monday 21 January 2002 16:43, Russ Allbery wrote:
Changing the capitalization of C does not change the word.
Er, most of the time.
No, pretty much all of the time. There are differences between proper
nouns and common nouns, but those are
On Monday 21 January 2002 17:11, Russ Allbery wrote:
No, pretty much all of the time. There are differences between proper
nouns and common nouns, but those are differences routinely quashed as a
typesetting decision; if you write both proper nouns and common nouns in
all caps as part of a
Honour where honour is due: I've got some questions about inversion
lists. Where I saw them mentioned by that name were some drafts of
this:
http://www.aw.com/catalog/academic/product/1,4096,0201700522,00.html
The book looks really promising-- unfortunately it's not yet published.
--
$jhi++;
I believe IBM uses inversion lists in their ICU library for sets of
Unicode characters.
Graham.
On Sat, Jan 19, 2002 at 07:08:25PM +0200, Jarkko Hietaniemi wrote:
Honour where honour is due: I've got some questions about inversion
lists. Where I saw them mentioned by that name were some drafts
On Sat, Jan 19, 2002 at 07:08:25PM +0200, Jarkko Hietaniemi wrote:
http://www.aw.com/catalog/academic/product/1,4096,0201700522,00.html
The book looks really promising-- unfortunately it's not yet published.
Isn't this, uhm, http://www.concentric.net/~rtgillam/pubs/unibook/index.html ?
--
Jarkko Hietaniemi: from attachment
About the implementation of character classes: since the Unicode code
point range is big, a single big bitmap won't work any more: firstly,
it would be big. Secondly, for most cases, it would be wastefully
sparse. A balanced binary tree of (begin, end) points
Thanks, Jarkko.
On Thursday 17 January 2002 23:21, Jarkko Hietaniemi wrote:
The most important message is: give up on 8-bit bytes, already.
Time to move on, chop chop.
Do you think/feel/wish/demand that the textual (string) APIs should differ
from the binary (byte) APIs? (Both from an
On Fri, Jan 18, 2002 at 04:51:07AM -0500, Bryan C. Warnock wrote:
Thanks, Jarkko.
On Thursday 17 January 2002 23:21, Jarkko Hietaniemi wrote:
The most important message is: give up on 8-bit bytes, already.
Time to move on, chop chop.
Do you think/feel/wish/demand that the textual
Since I seem to be the main regex hacker for Parrot, I'll respond to
this as best I can.
Currently, we are using bitmaps for character classes. Well, sort of.
A Bitmap in Parrot is defined like this:
typedef struct bitmap_t {
char* bmp;
(1) There are 5.125 bytes in Unicode, not four.
(2) I think the above would suffer from the same problem as one common
suggestion, two-level bitmaps (though I think the above would suffer
less, being of finer granularity): the problem is that a lot of
space is wasted, since the
I don't think UTF-32 will save you much. The Unicode case map is variable
length; combining characters, canonical equivalence, and many other things
will require variable-length mapping. For example, if I only want to
This is true.
parse /[0-9]+/, why would you want to convert everything to UTF-32?
On Fri, Jan 18, 2002 at 11:44:00AM -0800, Hong Zhang wrote:
(1) There are 5.125 bytes in Unicode, not four.
(2) I think the above would suffer from the same problem as one common
suggestion, two-level bitmaps (though I think the above would suffer
less, being of finer
preprocessing. Another example, if I want to search for /resume/e,
(equivalent matching), the regex engine can normalize the case, fully
decompose input string, strip off any combining character, and do 8-bit
Hmmm. The above sounds complicated not quite what I had in mind
for
My proposal is that we should use a mixed method. The standard Unicode
classes, such as \p{IsLu}, can be handled by a standard splitbin table.
Please see Java's java.lang.Character or Python's unicodedata_db.h. I
measured it: to handle all Unicode categories, simple casing,
and decimal digit
On Fri, Jan 18, 2002 at 12:20:53PM -0800, Hong Zhang wrote:
My proposal is that we should use a mixed method. The standard Unicode
classes, such as \p{IsLu}, can be handled by a standard splitbin table.
Please see Java's java.lang.Character or Python's unicodedata_db.h. I
measured it: to
On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote:
ints, or 176 bytes. Searching for membership in an inversion list is
O(N log N) (binary search). Encoding the whole range is a non-issue
bordering on a joke: two ints, or 8 bytes.
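A minimal sketch in C of the inversion-list lookup described above, for readers who have not met the structure before. The layout is an assumption: a sorted array of code points at which membership flips, where an even index opens a range and an odd index closes it. The names invlist and invlist_contains are hypothetical and not Parrot's API. A binary search over N boundaries takes O(log N) comparisons.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inversion list: bounds[] is sorted ascending.
 * { 0x30, 0x3A } encodes the half-open range [0x30, 0x3A), i.e. '0'..'9'. */
typedef struct {
    const uint32_t *bounds;
    size_t n;
} invlist;

/* Returns 1 if code point cp is in the set, 0 otherwise. */
int invlist_contains(const invlist *il, uint32_t cp)
{
    size_t lo = 0;
    size_t hi = il->n;

    assert(il->bounds != NULL || il->n == 0);

    /* Binary search for the number of boundaries that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (il->bounds[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* An odd count means cp lies after an opening boundary and
     * before the matching closing one, so it is a member. */
    return (int)(lo & 1);
}
```

Under this layout the whole code point range is the two-element list { 0, max+1 }, which is where the "two ints" figure comes from.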
[Clarification from a noncombatant] You meant
On Fri, Jan 18, 2002 at 01:40:26PM -0800, Steve Fink wrote:
On Fri, Jan 18, 2002 at 10:08:40PM +0200, Jarkko Hietaniemi wrote:
ints, or 176 bytes. Searching for membership in an inversion list is
O(N log N) (binary search). Encoding the whole range is a non-issue
bordering on a joke: two
On Fri, Jan 18, 2002 at 02:22:49PM -0800, Steve Fink wrote:
On Sat, Jan 19, 2002 at 12:11:06AM +0200, Jarkko Hietaniemi wrote:
Complement of an inversion list is neat: insert 0 at the beginning
(and append max+1), unless there already is one, in which case delete
the 0 (and shift the list
On Sat, Jan 19, 2002 at 12:28:15AM +0200, Jarkko Hietaniemi wrote:
On Fri, Jan 18, 2002 at 02:22:49PM -0800, Steve Fink wrote:
On Sat, Jan 19, 2002 at 12:11:06AM +0200, Jarkko Hietaniemi wrote:
Complement of an inversion list is neat: insert 0 at the beginning
(and append max+1), unless
We *do* want to have (with some notation)
[[:digit:]\p{FunkyLooking}aeiou except 7], right?
Of course. But that is all resolvable in regex compile time.
No expression tree needed.
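As an illustration of one such compile-time operation, here is a sketch in C of the inversion-list complement mentioned elsewhere in this thread. It is a simplification under a stated assumption: an odd-length list is treated as open-ended (its last range runs to the largest code point), so only the leading 0 has to be toggled and the "append max+1" step is not needed. The function name invlist_complement is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes the complement of in[0..n) to out and returns its length.
 * out must have room for n + 1 entries and must not overlap in.
 * Assumption: boundaries are sorted, and an odd-length list means the
 * final range is open-ended. */
size_t invlist_complement(const uint32_t *in, size_t n, uint32_t *out)
{
    assert(out != NULL && (in != NULL || n == 0));

    if (n > 0 && in[0] == 0) {
        /* The set starts at 0: drop that boundary so the complement
         * starts where the first range used to end. */
        memcpy(out, in + 1, (n - 1) * sizeof *in);
        return n - 1;
    }

    /* Otherwise prepend 0 so the complement covers everything
     * below the first original range. */
    out[0] = 0;
    if (n > 0)
        memcpy(out + 1, in, n * sizeof *in);
    return n + 1;
}
```

Complementing twice gives back the original list, and the cost is a single copy of the boundary array, which fits the point that such classes can be resolved when the regex is compiled.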
My point was that if inversion lists are insufficient for describing
all the character classes we
On Fri, Jan 18, 2002 at 05:24:00PM +0200, Jarkko Hietaniemi wrote:
As for character encodings, we're forcing everything to UTF-32 in
regular expressions. No exceptions. If you use a string in a regex,
it'll be transcoded. I honestly can't think of a better way to
guarantee efficient
Hong Zhang [EMAIL PROTECTED] writes:
preprocessing. Another example, if I want to search for /resume/e,
(equivalent matching), the regex engine can normalize the case, fully
decompose input string, strip off any combining character, and do 8-bit
Hmmm. The above sounds complicated not