Re: More character matching bits

2001-06-15 Thread Simon Cozens
On Fri, Jun 15, 2001 at 07:12:45PM -0400, Dan Sugalski wrote: > The question, then, is should ya be considered a literal number in either > of those contexts? The phrase "in those contexts" suggests that it should in some and shouldn't in others. This means that the regexp engine would need to u

Re: More character matching bits

2001-06-15 Thread Dan Sugalski
At 12:03 AM 6/16/2001 +0100, Simon Cozens wrote: >On Fri, Jun 15, 2001 at 06:58:24PM -0400, Dan Sugalski wrote: > > The kanji dictionary I have handy gives non-numeric translations for > > several of the numeric kanji, though it might be something that gets lost > > in translation. > >Ah, OK; sure

Re: More character matching bits

2001-06-15 Thread Simon Cozens
On Fri, Jun 15, 2001 at 06:58:24PM -0400, Dan Sugalski wrote: > The kanji dictionary I have handy gives non-numeric translations for > several of the numeric kanji, though it might be something that gets lost > in translation. Ah, OK; sure, there can be numerics with non-numeric meanings, but n

Re: More character matching bits

2001-06-15 Thread Dan Sugalski
At 12:29 AM 6/16/2001 +0100, Simon Cozens wrote: >On Fri, Jun 15, 2001 at 07:12:45PM -0400, Dan Sugalski wrote: > > The question, then, is should ya be considered a literal number in either > > of those contexts? > >The phrase "in those contexts" suggests that it should in some and shouldn't >in o

Re: More character matching bits

2001-06-15 Thread Dan Sugalski
At 11:28 PM 6/15/2001 +0100, Simon Cozens wrote: >On Fri, Jun 15, 2001 at 11:50:49AM -0400, Dan Sugalski wrote: > > Unless I'm missing something (Simon? Hong?) Japanese (and potentially all > > the languages that use the Han characters) can interpret a particular > > character as either a number o

Re: More character matching bits

2001-06-15 Thread Bryan C . Warnock
On Friday 15 June 2001 06:58 pm, Dan Sugalski wrote: > > > >module Locale::Hawaiian; > > > >use re 'class (\w => [aeiouâêîôûhklmnpw`])'; > > > >... > > > > > > Sure. I expect Damian will write us something that lets you specify > > > them upside-down in Klingon or something by the time this is don

Re: More character matching bits

2001-06-15 Thread Simon Cozens
On Fri, Jun 15, 2001 at 07:45:58PM -0400, Dan Sugalski wrote: > If we can't effectively do it correctly, I can live with that. I just want > the suboptimal behaviour to be on purpose (and hopefully overridable by > someone clever enough) rather than accidental. As I've intimated in the past, I

Re: More character matching bits

2001-06-15 Thread Simon Cozens
On Fri, Jun 15, 2001 at 11:50:49AM -0400, Dan Sugalski wrote: > Unless I'm missing something (Simon? Hong?) Japanese (and potentially all > the languages that use the Han characters) can interpret a particular > character as either a number or not a number, depending on context. Uh, don't think

Re: More character matching bits

2001-06-15 Thread Bart Lateur
On Fri, 15 Jun 2001 06:52:32 -0400, Bryan C. Warnock wrote: >On a side note (and this *will* sound stupid, but there is a reason I'm >asking). Why is there no logical opposite to '.'; that is, a character >which never matches another character? (Besides, of course, that it's >utterly useless

Re: More character matching bits

2001-06-15 Thread Dan Sugalski
At 06:52 AM 6/15/2001 -0400, Bryan C. Warnock wrote: >On Thursday 14 June 2001 12:01 pm, Dan Sugalski wrote: > > As I see it, locales specify: > > > >* Collating order > >* Comparison/equality specification > >* Unicode codepoint interpretation > >What do you mean by that? Unless I'm

Re: More character matching bits

2001-06-15 Thread Bryan C . Warnock
On Thursday 14 June 2001 12:01 pm, Dan Sugalski wrote: > Fancy character classes are probably enough to handle the various casing > issues and their analogs. They're probably not enough to handle things > like the Arabic tatweel, or proper word breaks in most Asian languages. > Heck, unless I'm m

Re: More character matching bits

2001-06-14 Thread Dan Sugalski
At 01:10 PM 6/14/2001 +0200, Bart Lateur wrote: >On Wed, 13 Jun 2001 13:39:16 -0400, Dan Sugalski wrote: > > >> > Something that should be part of the core? I'll leave > >> >that for you to decide. > >> > >>Most definitely NOT. > > > >Most definitely sort of. > > > >>There is no reason to put fucn

Re: More character matching bits

2001-06-14 Thread Bryan C . Warnock
On Thursday 14 June 2001 07:10 am, Bart Lateur wrote: > If you're saying that the perl core should include hooks into the regex > engine for custom character classes, I agree. But nothing more. > Currently, Perl5 provides a hook for "use locale;", but I wish there was > something more general tha

Re: More character matching bits

2001-06-14 Thread Bart Lateur
On Wed, 13 Jun 2001 13:39:16 -0400, Dan Sugalski wrote: >> > Something that should be part of the core? I'll leave >> >that for you to decide. >> >>Most definitely NOT. > >Most definitely sort of. > >>There is no reason to put functionality for free matching of Japanese >>characters into the basi

Re: More character matching bits

2001-06-13 Thread Dan Sugalski
At 05:15 PM 6/13/2001 +0200, Bart Lateur wrote: >On Wed, 13 Jun 2001 01:22:32 +0100, Simon Cozens wrote: > > > Something that should be part of the core? I'll leave > >that for you to decide. > >Most definitely NOT. Most definitely sort of. >There is no reason to put functionality for free match

Re: More character matching bits

2001-06-13 Thread Bart Lateur
On Wed, 13 Jun 2001 01:22:32 +0100, Simon Cozens wrote: > Something that should be part of the core? I'll leave >that for you to decide. Most definitely NOT. There is no reason to put functionality for free matching of Japanese characters into the basic perl executable. There were already voice

Re: More character matching bits

2001-06-12 Thread Jarkko Hietaniemi
For reference, here's how Perl 5.8 will define \p{IsFoo} character classes: # 005F: SPACING UNDERSCORE ['IsWord', '$cat =~ /^[LMN]/ or $code eq "005F"', ''], ['IsAlnum', '$cat =~ /^[LMN]/',''], ['IsAlpha', '$cat =~ /^[LM]/', ''], # 0009: HORIZONTAL TABULATION #
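
The three definitions visible above test the Unicode general category of a character. A minimal sketch of the same predicates, written in Python only for illustration and assuming `$cat` corresponds to the general category reported by `unicodedata.category` (the function names are hypothetical, not part of Perl):

```python
import unicodedata

def is_alnum(ch: str) -> bool:
    # IsAlnum: general category starts with L (letter), M (mark) or N (number)
    return unicodedata.category(ch)[0] in "LMN"

def is_alpha(ch: str) -> bool:
    # IsAlpha: general category starts with L (letter) or M (mark)
    return unicodedata.category(ch)[0] in "LM"

def is_word(ch: str) -> bool:
    # IsWord: same as IsAlnum, plus the underscore at code point 005F
    return is_alnum(ch) or ord(ch) == 0x005F
```

Under these definitions a kana such as U+30AD counts as a word character because its category is a letter category, with no script-specific rule involved.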

Re: More character matching bits

2001-06-12 Thread Bryan C . Warnock
On Wednesday 13 June 2001 12:23 am, Jarkko Hietaniemi wrote: > > RE Feature Override Create New > > > > switches 'i' only yes > > anchors no no > > (I would call them assertions.) Bzzt. > Another gig for Bean. > >

Re: More character matching bits

2001-06-12 Thread Jarkko Hietaniemi
> RE Feature Override Create New > > switches 'i' only yes > anchors no no (I would call them assertions.) Bzzt. > - Anchors. ^,$,\A,\Z,\z,\b, \G. Since the definition of a line (see 'm' > and 's' above) isn't

Re: More character matching bits

2001-06-12 Thread Bryan C . Warnock
On Tuesday 12 June 2001 10:58 pm, Bryan C. Warnock wrote: > On Tuesday 12 June 2001 09:16 pm, Simon Cozens wrote: > > On Tue, Jun 12, 2001 at 05:41:40PM -0700, Hong Zhang wrote: > > > We should let external collator to handle all these fancy features. > > > > Phew, I've been saying this all along.

Re: More character matching bits

2001-06-12 Thread Jarkko Hietaniemi
> I think, following my line of thought, that [a-\N{KATAKANA LETTER KI}] > should be equivalent to [\x{0061}-\x{30AD}], which would match any of I think it should be an error. If you mean the code points write the code points. Mixing symbolic names (KATAKANA LETTER KI) and native characters (th
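
A minimal sketch of the code-point reading of that range, in Python purely for illustration. It assumes the class is interpreted strictly by code point, as the quoted proposal suggests, and says nothing about what Perl itself does; `in_range` is a hypothetical helper.

```python
import re
import unicodedata

# Resolve the symbolic name to its character; KATAKANA LETTER KI is U+30AD.
ki = unicodedata.lookup("KATAKANA LETTER KI")

# Equivalent of [\x{0061}-\x{30AD}]: every code point from 'a' up to KI.
code_point_range = re.compile("[a-" + ki + "]")

def in_range(ch: str) -> bool:
    # True if the single character falls inside the code-point range
    return code_point_range.fullmatch(ch) is not None
```

Read this way the range covers everything between the two code points, including accented Latin letters, Greek, Cyrillic and all of Hiragana, which shows how broad such a mixed range becomes.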

Re: More character matching bits

2001-06-12 Thread Buddha Buck
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes: > > Perl came from ASCII-centric roots, so it's likely that most of our > > biases are ASCII-centric. And for a couple of reasons, it's going to > > be hard to deal with that: > > > > 1. Backwards compatibility with existing Perl practice, > > > >

Re: More character matching bits

2001-06-12 Thread Bryan C . Warnock
On Tuesday 12 June 2001 11:06 pm, Jarkko Hietaniemi wrote: > > I. Make ranges work on Unicode code-points (if they don't already). > > U, yes, they do, if you by code-point ranges mean \x{...}-\x{...} > but in general I would like to discourage the use of ranges. What do > you think [a-\N{KAT

Re: More character matching bits

2001-06-12 Thread Jarkko Hietaniemi
> Perl came from ASCII-centric roots, so it's likely that most of our > biases are ASCII-centric. And for a couple of reasons, it's going to > be hard to deal with that: > > 1. Backwards compatibility with existing Perl practice, > > and > > 2. To do language-neutral right is -really- hard; lo

Re: More character matching bits

2001-06-12 Thread Bryan C . Warnock
On Tuesday 12 June 2001 09:16 pm, Simon Cozens wrote: > On Tue, Jun 12, 2001 at 05:41:40PM -0700, Hong Zhang wrote: > > We should let external collator to handle all these fancy features. > > Phew, I've been saying this all along. :) I think we've *all* been saying that. We just need to determin

Re: More character matching bits

2001-06-12 Thread Buddha Buck
Dan Sugalski <[EMAIL PROTECTED]> writes: > > We probably also ought to answer the question "How accommodating to > non-latin writing systems are we going to be?" It's an uncomfortable > question, but one that needs asking. Answering by Larry, probably, but > definitely asking. Perl's not real

Re: More character matching bits

2001-06-12 Thread Simon Cozens
We've pretty much run this subthread out of Perl content by now, so it ought to stop here, and I should start exercising some of that "restraint" thing. (Does it grow if you exercise it?) So Damien, we can take it to private mail or to sci.lang.japan or something, but if you promise to stop diggi

Re: More character matching bits

2001-06-12 Thread Simon Cozens
On Tue, Jun 12, 2001 at 06:45:31PM -0700, Damien Neil wrote: > > Hrm, no, not usually; furigana are almost always hiragana, and > > learner's textbooks - bah, they're not real Japanese. :) > > I believe you are confused; *cough*. I believe I am not. But who am I? Let's ask Kenkyusha - admittedly

Re: More character matching bits

2001-06-12 Thread Damien Neil
On Wed, Jun 13, 2001 at 02:15:16AM +0100, Simon Cozens wrote: > Or we could keep it out of core. It's up to you, really. No, it isn't. It's up to Larry, or to whoever gets the regex pumpkin. I'm withdrawing from this discussion: My intent was to clarify exactly why someone might want to treat K

RE: More character matching bits

2001-06-12 Thread Grant Mongardi
On Tue, Jun 12, 2001 at 06:44:02PM -0400, Dan Sugalski wrote: > We probably also ought to answer the question "How accommodating to > non-latin writing systems are we going to be?" What if Perl 6 simply reserved tags for extensions? This could assume processing similar to Perl 5 for compatibility

Re: More character matching bits

2001-06-12 Thread Simon Cozens
On Tue, Jun 12, 2001 at 05:41:40PM -0700, Hong Zhang wrote: > We should let external collator to handle all these fancy features. Phew, I've been saying this all along. :) > Please note regex is O(n) at best, adding an external collator > will make it O(2n). While this is very true, I think con

Re: More character matching bits

2001-06-12 Thread Simon Cozens
On Tue, Jun 12, 2001 at 05:40:32PM -0700, Damien Neil wrote: > The ability to match Hiragana as Katakana and vice-versa is almost > identical conceptually to the ability to perform case insensitive > matches on English text. I am going to choose not to disagree with you on this, but... > > What

Re: More character matching bits

2001-06-12 Thread Jarkko Hietaniemi
On Tue, Jun 12, 2001 at 05:41:40PM -0700, Hong Zhang wrote: > > We should let external collator to handle all these fancy features. > People can always normalize/canonicalize/do-whatever-you-want > and send the result text/binary to regex. All the features we > argue about here can be easily done

Re: More character matching bits

2001-06-12 Thread Damien Neil
On Wed, Jun 13, 2001 at 01:22:32AM +0100, Simon Cozens wrote: > I'd say it was about as useful as providing a regexp option to translate > the search term into French and try that instead.[1] Handy, possibly. > Essential? No. Something that should be part of the core? I'll leave > that for you to

RE: More character matching bits

2001-06-12 Thread Hong Zhang
We should let external collator to handle all these fancy features. People can always normalize/canonicalize/do-whatever-you-want and send the result text/binary to regex. All the features we argue about here can be easily done by a customized collator. Do NOT expect the Perl regex be a linguist
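
The "normalize first, then hand the text to the regex" approach can be sketched as follows. This is a Python illustration under the assumption that Unicode NFC normalization stands in for the external step; `normalized_search` is a hypothetical helper, not an existing Perl or Python API.

```python
import re
import unicodedata

def normalized_search(pattern: str, text: str, form: str = "NFC"):
    # Normalize both sides outside the regex engine, then run a plain search.
    # The regex engine itself needs no knowledge of the normalization.
    return re.search(unicodedata.normalize(form, pattern),
                     unicodedata.normalize(form, text))
```

For example, a precomposed U+00E9 in the pattern would then match a text that spells the same letter as `e` followed by a combining acute accent (U+0301), at the cost of one extra pass over the input.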

Re: More character matching bits

2001-06-12 Thread Simon Cozens
On Tue, Jun 12, 2001 at 05:03:17PM -0700, Damien Neil wrote: > I can say that I feel that providing a mechanism for Hiragana > characters to match Katakana and vice-versa is about as useful for a > person doing Japanese text processing as case-insensitive matching is > for a person working with En

Re: More character matching bits

2001-06-12 Thread Damien Neil
On Tue, Jun 12, 2001 at 06:44:02PM -0400, Dan Sugalski wrote: > While that's true, KATAKANA LETTER A and HIRAGANA LETTER A are also > referring to distinct things. (Though arguably not as distinct as either > with LATIN CAPITAL A) If we do one, why not the other? I'm perfectly happy > with an a

Re: More character matching bits

2001-06-12 Thread Dan Sugalski
At 03:12 PM 6/11/2001 -0700, Damien Neil wrote: >On Mon, Jun 11, 2001 at 05:03:26PM -0400, Dan Sugalski wrote: > > I don't think just /i should do that, as it seems rather extreme. (If you > > took that argument, it would seem to follow that KATAKANA LETTER A matches > > LATIN CAPITAL A, and I don

Re: More character matching bits

2001-06-11 Thread Damien Neil
On Mon, Jun 11, 2001 at 05:03:26PM -0400, Dan Sugalski wrote: > I don't think just /i should do that, as it seems rather extreme. (If you > took that argument, it would seem to follow that KATAKANA LETTER A matches > LATIN CAPITAL A, and I don't think we want to go there) The actual > perl-leve

Re: More character matching bits

2001-06-11 Thread Bryan C . Warnock
On Monday 11 June 2001 04:54 pm, Dan Sugalski wrote: > >Would it, or should it, be possible to tell m// to treat Katakana > >characters as the same as hiragana characters, in much the same way as > >m//i treats UPPERCASE the same as lowercase? Canonicalization won't get > >you that. > > Yup, that
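
One way to picture the requested behaviour is to fold both the pattern and the text to a single kana script before matching, much as case-insensitive matching folds case. The sketch below is Python for illustration only; it assumes the fixed offset of 0x60 between the Katakana block (U+30A1 to U+30F6) and the corresponding Hiragana letters, and `fold_kana` and `kana_insensitive_search` are hypothetical names.

```python
import re

KATAKANA_FIRST = 0x30A1   # KATAKANA LETTER SMALL A
KATAKANA_LAST = 0x30F6    # KATAKANA LETTER SMALL KE
KANA_OFFSET = 0x60        # distance from a katakana letter to its hiragana twin

def fold_kana(text: str) -> str:
    # Map each katakana letter to hiragana; leave everything else untouched.
    return "".join(
        chr(ord(c) - KANA_OFFSET) if KATAKANA_FIRST <= ord(c) <= KATAKANA_LAST else c
        for c in text
    )

def kana_insensitive_search(literal: str, text: str):
    # Fold both sides, then search for the literal string.
    return re.search(re.escape(fold_kana(literal)), fold_kana(text))
```

Folding outside the engine keeps the matcher unchanged, but it only handles literal text; a real modifier would also have to deal with character classes and ranges.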

Re: More character matching bits

2001-06-11 Thread Dan Sugalski
At 01:52 PM 6/11/2001 -0700, Damien Neil wrote: >In Japanese, ka and KA are two ways of writing the same syllable, in >much the same way that LATIN CAPITAL LETTER A and LATIN SMALL LETTER A >are. (Perhaps this is an argument for the /i modifier to apply to >more than just case?) I don't think ju

Re: More character matching bits

2001-06-11 Thread Damien Neil
On Mon, Jun 11, 2001 at 01:14:37PM -0700, Russ Allbery wrote: > Dan Sugalski <[EMAIL PROTECTED]> writes: > > I don't think canonicalization should do this. (I really hope not) This > > isn't really a canonicalization matter--words written with one character > > set aren't (AFAIK) the same as words

Re: More character matching bits

2001-06-11 Thread Dan Sugalski
At 04:46 PM 6/11/2001 -0400, Buddha Buck wrote: >At 01:14 PM 06-11-2001 -0700, Russ Allbery wrote: >>Dan Sugalski <[EMAIL PROTECTED]> writes: >> > At 01:05 PM 6/11/2001 -0700, Russ Allbery wrote: >> >> Dan Sugalski <[EMAIL PROTECTED]> writes: >> >> >>> Should perl's regexes and other character com

Re: More character matching bits

2001-06-11 Thread Jarkko Hietaniemi
On Mon, Jun 11, 2001 at 01:05:43PM -0700, Russ Allbery wrote: > Dan Sugalski <[EMAIL PROTECTED]> writes: > > > Should perl's regexes and other character comparison bits have an option > > to consider different characters for the same thing as identical beasts? > > I'm thinking in particular of t

Re: More character matching bits

2001-06-11 Thread Buddha Buck
At 01:14 PM 06-11-2001 -0700, Russ Allbery wrote: >Dan Sugalski <[EMAIL PROTECTED]> writes: > > At 01:05 PM 6/11/2001 -0700, Russ Allbery wrote: > >> Dan Sugalski <[EMAIL PROTECTED]> writes: > > >>> Should perl's regexes and other character comparison bits have an > >>> option to consider differen

Re: More character matching bits

2001-06-11 Thread Russ Allbery
Dan Sugalski <[EMAIL PROTECTED]> writes: > At 01:05 PM 6/11/2001 -0700, Russ Allbery wrote: >> Dan Sugalski <[EMAIL PROTECTED]> writes: >>> Should perl's regexes and other character comparison bits have an >>> option to consider different characters for the same thing as >>> identical beasts? I'

Re: More character matching bits

2001-06-11 Thread Dan Sugalski
At 01:05 PM 6/11/2001 -0700, Russ Allbery wrote: >Dan Sugalski <[EMAIL PROTECTED]> writes: > > > Should perl's regexes and other character comparison bits have an option > > to consider different characters for the same thing as identical beasts? > > I'm thinking in particular of the Katakana/Hira

Re: More character matching bits

2001-06-11 Thread Russ Allbery
Dan Sugalski <[EMAIL PROTECTED]> writes: > Should perl's regexes and other character comparison bits have an option > to consider different characters for the same thing as identical beasts? > I'm thinking in particular of the Katakana/Hiragana bits of japanese, > but other languages may have th

More character matching bits

2001-06-11 Thread Dan Sugalski
(I really need to pick up a printed copy of the 3.1 standard and set aside a day and a bottle of aspirin, but until then...) Should perl's regexes and other character comparison bits have an option to consider different characters for the same thing as identical beasts? I'm thinking in particu