Re: Unicode handling

2001-03-27 Thread Larry Wall
Dan Sugalski writes: : I'm not sure that raw's the right word, given that the data is really : Unicode. It's not raw in the sense that a JPEG image or executable is raw data. I'm suggesting it might be raw in that very sense, and simultaneously be perfectly valid "internal" Unicode. Otherwise y

Re: Unicode handling

2001-03-27 Thread Dan Sugalski
At 11:06 AM 3/27/2001 -0800, Larry Wall wrote: >Dan Sugalski writes: >: At 07:21 AM 3/27/2001 -0800, Larry Wall wrote: >: >Dan Sugalski writes: >: >Assume that in practice most of the normalization will be done by the >: >input disciplines. Then we might have a pragma that says to try to >: >enfo

Re: Unicode handling

2001-03-27 Thread Dan Sugalski
At 01:09 PM 3/27/2001 -0800, Hong Zhang wrote: > > The only problem with that is it means we'll be potentially altering the > > data as it comes in, which leads back to the problem of input and output > > files not matching for simple filter programs. (Plus it means we spend CPU > > cycles alterin

Re: Unicode handling

2001-03-27 Thread Hong Zhang
> The only problem with that is it means we'll be potentially altering the > data as it comes in, which leads back to the problem of input and output > files not matching for simple filter programs. (Plus it means we spend CPU > cycles altering data that we might not actually need to) > I don't t

Re: Unicode handling

2001-03-27 Thread Larry Wall
Dan Sugalski writes: : At 07:21 AM 3/27/2001 -0800, Larry Wall wrote: : >Dan Sugalski writes: : >Assume that in practice most of the normalization will be done by the : >input disciplines. Then we might have a pragma that says to try to : >enforce level 1, level 2, level 3 if your data doesn't ma

Re: Unicode handling

2001-03-27 Thread Damien Neil
On Tue, Mar 27, 2001 at 12:38:23PM -0500, Dan Sugalski wrote: > I'm afraid this isn't what I'd normally think of--ord to me returns the > integer value of the first code point in the string. That does mean that A > is different for ASCII and EBCDIC, but that's just One Of Those Things. My perso

Re: Unicode handling

2001-03-27 Thread Dan Sugalski
At 08:37 PM 3/26/2001 +, [EMAIL PROTECTED] wrote: >Damien Neil <[EMAIL PROTECTED]> writes: > >> >So $c = chr(ord($c)) could change $c? That seems odd. > >> > >> It changes its _representation_ (e.g. from 0x45,ASCII to 0xC1,EBCDIC) > >> but not its "fundamental" 'LATIN CAPITAL LETTER A'-ness.

Re: Unicode handling

2001-03-27 Thread Dan Sugalski
At 07:21 AM 3/27/2001 -0800, Larry Wall wrote: >Dan Sugalski writes: >: Fair enough. I think there are some cases where there's a base/combining >: pair of codepoints that don't map to a single combined-character code >: point. Not matching on a glyph boundary could make things really odd, but >:

Re: Unicode handling

2001-03-27 Thread Larry Wall
Garrett Goebel writes: : Someone please clue me in. A pointer to an RFC which defines the use of : colons in Perl6 among other things would help. Heh. If you read the RFCs, you'll discover one of the basic rules of language redesign: everybody wants the colon. And it never seems to occur to peo

Re: Unicode handling

2001-03-27 Thread Larry Wall
Dan Sugalski writes: : Fair enough. I think there are some cases where there's a base/combining : pair of codepoints that don't map to a single combined-character code : point. Not matching on a glyph boundary could make things really odd, but : I'd hate to have the checking code on by default,

RE: Unicode handling

2001-03-26 Thread Garrett Goebel
From: Damien Neil [mailto:[EMAIL PROTECTED]] > On Mon, Mar 26, 2001 at 08:37:05PM +, [EMAIL PROTECTED] wrote: > > > > > > If ord is dependent on the encoding of the string it gets, as Dan > > > was saying, then ord($e) is 0x81, > > > > It could still be 0x81 (from ebcdic) with the encodin

Re: Unicode handling

2001-03-26 Thread Damien Neil
On Mon, Mar 26, 2001 at 08:37:05PM +, [EMAIL PROTECTED] wrote: > >If ord is dependent on the encoding of the string it gets, as Dan > >was saying, then ord($e) is 0x81, > > It could still be 0x81 (from ebcdic) with the encoding carried > along with the _number_ if we thought that worth t

Re: Unicode handling

2001-03-26 Thread nick
Damien Neil <[EMAIL PROTECTED]> writes: >> >So $c = chr(ord($c)) could change $c? That seems odd. >> >> It changes its _representation_ (e.g. from 0x45,ASCII to 0xC1,EBCDIC) >> but not its "fundamental" 'LATIN CAPITAL LETTER A'-ness. >> Then of course someone will want it to be the number 0x45 a

Re: Unicode handling

2001-03-26 Thread Damien Neil
On Mon, Mar 26, 2001 at 06:16:00PM +, [EMAIL PROTECTED] wrote: > Damien Neil <[EMAIL PROTECTED]> writes: > >On Mon, Mar 26, 2001 at 11:32:46AM -0500, Dan Sugalski wrote: > >> At 05:09 PM 3/23/2001 -0800, Damien Neil wrote: > >> >So the results of ord are dependent on a global setting for "curr

Re: Unicode handling

2001-03-26 Thread nick
Dan Sugalski <[EMAIL PROTECTED]> writes: > >For length, I'd as soon it returned the number of code points, but glyphs >and bytes are also valid return values. And that may be where it belongs - at the language level chars($s) == 120 bytes($s) == 480 glyphs($s) == 360 length($
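A minimal sketch of the three counts being distinguished here, assuming present-day Perl 5 rather than anything decided in this thread. The names chars(), bytes() and glyphs() in the message are proposals; the sketch substitutes length() for code points, length() of the UTF-8 encoded form for bytes, and the regex escape \X (extended grapheme cluster) as a stand-in for "glyph".

```perl
use strict;
use warnings;
use Encode qw(encode);

# "e" followed by COMBINING ACUTE ACCENT: two code points, one visible glyph.
my $s = "e\x{301}";

my $chars  = length($s);                    # code points
my $bytes  = length(encode('UTF-8', $s));   # bytes, for one chosen encoding (UTF-8)
my $glyphs = () = $s =~ /\X/g;              # grapheme clusters, used here as "glyphs"

print "chars=$chars bytes=$bytes glyphs=$glyphs\n";
```

The byte count depends on the encoding chosen, which is one reason a single length() answer is hard to pin down.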

Re: Unicode handling

2001-03-26 Thread nick
Damien Neil <[EMAIL PROTECTED]> writes: >On Mon, Mar 26, 2001 at 11:32:46AM -0500, Dan Sugalski wrote: >> At 05:09 PM 3/23/2001 -0800, Damien Neil wrote: >> >So the results of ord are dependent on a global setting for "current >> >character set" or some such, not on the encoding of the string that

Re: Unicode handling

2001-03-26 Thread Dan Sugalski
At 02:52 AM 3/25/2001 -0500, Philip Newton wrote: >On Fri, 23 Mar 2001, Dan Sugalski wrote: > > > At 02:31 PM 3/23/2001 -0500, Bryan C. Warnock wrote: > > >On Friday 23 March 2001 14:18, Dan Sugalski wrote: > > > > At 01:30 PM 3/22/2001 -0800, Hong Zhang wrote: > > > > >We need the character equiv

Re: Unicode handling

2001-03-26 Thread nick
Dan Sugalski <[EMAIL PROTECTED]> writes: >>This the main pain with 5.7.*'s EBCDIC scheme - making >> >>ord('A') == 193 >> >>true :-/ > >That would be true if EBCDIC was the default encoding, otherwise false. But what about our $var; { use encoding 'US-ascii'; $var = 'A'; } {use Encoding 'i

Re: Unicode handling

2001-03-26 Thread Dan Sugalski
At 04:34 PM 3/24/2001 -0800, Dave Storrs wrote: > I'll just toss my 0.01 cents in...my thought here is that this >thread has now tied up a lot of cycles from a lot of very smart, very >experienced people without resulting in an answer that is clearly The >Right Thing. Whatever we do, ther

Re: Unicode handling

2001-03-26 Thread Damien Neil
On Mon, Mar 26, 2001 at 11:32:46AM -0500, Dan Sugalski wrote: > At 05:09 PM 3/23/2001 -0800, Damien Neil wrote: > >So the results of ord are dependent on a global setting for "current > >character set" or some such, not on the encoding of the string that > >is passed to it? > > Nope, ord is depen

Re: Unicode handling

2001-03-26 Thread Dan Sugalski
At 05:45 PM 3/26/2001 +, [EMAIL PROTECTED] wrote: >Dan Sugalski <[EMAIL PROTECTED]> writes: > >At 05:09 PM 3/23/2001 -0800, Damien Neil wrote: > >>So the results of ord are dependent on a global setting for "current > >>character set" or some such, not on the encoding of the string that > >>is

RE: Unicode handling

2001-03-26 Thread Dan Sugalski
At 11:42 AM 3/26/2001 -0600, Garrett Goebel wrote: >From: Dan Sugalski [mailto:[EMAIL PROTECTED]] > > At 05:09 PM 3/23/2001 -0800, Damien Neil wrote: > > > So the results of ord are dependent on a global setting for > > > "current character set" or some such, not on the encoding > > > of the strin

Re: Unicode handling

2001-03-26 Thread nick
Dan Sugalski <[EMAIL PROTECTED]> writes: >At 05:09 PM 3/23/2001 -0800, Damien Neil wrote: >>So the results of ord are dependent on a global setting for "current >>character set" or some such, not on the encoding of the string that >>is passed to it? > >Nope, ord is dependent on the string it gets,

RE: Unicode handling

2001-03-26 Thread Garrett Goebel
From: Dan Sugalski [mailto:[EMAIL PROTECTED]] > At 05:09 PM 3/23/2001 -0800, Damien Neil wrote: > > So the results of ord are dependent on a global setting for > > "current character set" or some such, not on the encoding > > of the string that is passed to it? > > Nope, ord is dependent on the s

Re: Unicode handling

2001-03-26 Thread Dan Sugalski
At 05:09 PM 3/23/2001 -0800, Damien Neil wrote: >So the results of ord are dependent on a global setting for "current >character set" or some such, not on the encoding of the string that >is passed to it? Nope, ord is dependent on the string it gets, as those strings know what their encoding is.
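To make the ASCII/EBCDIC point concrete, here is a minimal sketch in present-day Perl 5. It is an illustration under stated assumptions, not the design proposed in this thread: Perl 5 strings do not carry an encoding tag, so the sketch encodes 'A' into bytes with the core Encode module and takes ord of the resulting byte. 'ascii' and 'cp37' (an EBCDIC code page) are Encode's own encoding names.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode the same character under two encodings and look at the resulting byte.
# This only simulates "a string that knows its encoding"; it is not the
# tagged-string model discussed in the thread.
my $ascii_a  = ord(encode('ascii', 'A'));   # ASCII 'A' is 0x41
my $ebcdic_a = ord(encode('cp37',  'A'));   # EBCDIC (code page 37) 'A' is 0xC1, i.e. 193

printf "ascii=0x%02X ebcdic=0x%02X\n", $ascii_a, $ebcdic_a;
```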

RE: Unicode handling

2001-03-26 Thread Dan Sugalski
At 09:09 AM 3/26/2001 -0600, Garrett Goebel wrote: >Someone please clue me in. A pointer to an RFC which defines the use of >colons in Perl6 among other things would help. > >Why not have subsequent uses of : on the same variable name perform a cast? >Or perhaps better returned the casted value?

RE: Unicode handling

2001-03-26 Thread Garrett Goebel
From: Dan Sugalski [mailto:[EMAIL PROTECTED]] > At 11:09 PM 3/23/2001 +, Simon Cozens wrote: > > > > For instance, chr() will produce Unicode codepoints. But > > you can pretend that they're ASCII codepoints, it's only > > the EBCDIC folk that'll get hurt. I hope and suspect > > there'll be an

Re: Unicode handling

2001-03-26 Thread Brad Hughes
Simon Cozens wrote:
[...]
> I'm just not sure it's fair on Old World hackers. Will there be a way to stop
> Perl upgrading stuff to Unicode on the way in?
and I'm probably not the only Old World hacker that would prefer a build option to simply eliminate Unicode support altogether...

Re: Unicode handling

2001-03-24 Thread Philip Newton
On Fri, 23 Mar 2001, Dan Sugalski wrote: > At 02:31 PM 3/23/2001 -0500, Bryan C. Warnock wrote: > >On Friday 23 March 2001 14:18, Dan Sugalski wrote: > > > At 01:30 PM 3/22/2001 -0800, Hong Zhang wrote: > > > >We need the character equivalence construct, such as [[=a=]], which > > > >matches "a",

Re: Unicode handling

2001-03-24 Thread Dave Storrs
On Fri, 23 Mar 2001, Dan Sugalski wrote: > At 11:41 PM 3/22/2001 +, Nicholas Clark wrote: > >On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote: > > hadn't thought of. If we do, then something as simple as this: > >while () { > $count++ if /bar/; > print OU

Re: Unicode handling

2001-03-24 Thread nick
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes: >On Fri, Mar 23, 2001 at 02:50:05PM -0500, Dan Sugalski wrote: >> At 02:27 PM 3/23/2001 -0500, Uri Guttman wrote: >> > > "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes: >> > >> > DS> U doesn't really signal "glyph" to me, but we are sort of limi

Re: Unicode handling

2001-03-23 Thread Damien Neil
On Fri, Mar 23, 2001 at 06:31:13PM -0500, Dan Sugalski wrote: > >Err, perhaps I'm being dumb here - but surely $foo and $bar aren't > >typed strings, they're just numbers (or strings which match /^\d+$/) ??? > > D'oh! Too much blood in my caffeine stream. Yeah, I was thinking of ord. > > chr will

Re: Unicode handling

2001-03-23 Thread Damien Neil
On Fri, Mar 23, 2001 at 06:16:58PM -0500, Dan Sugalski wrote: > At 11:09 PM 3/23/2001 +, Simon Cozens wrote: > >For instance, chr() will produce Unicode codepoints. But you can pretend that > >they're ASCII codepoints, it's only the EBCDIC folk that'll get hurt. I hope > >and suspect there'll

Re: Unicode handling

2001-03-23 Thread Dan Sugalski
At 11:26 PM 3/23/2001 +, Dave Mitchell wrote: >Dan Sugalski <[EMAIL PROTECTED]> doodled: > > At 11:09 PM 3/23/2001 +, Simon Cozens wrote: > > >For instance, chr() will produce Unicode codepoints. But you can > pretend that > > >they're ASCII codepoints, it's only the EBCDIC folk that'll g

Re: Unicode handling

2001-03-23 Thread Dave Mitchell
Dan Sugalski <[EMAIL PROTECTED]> doodled: > At 11:09 PM 3/23/2001 +, Simon Cozens wrote: > >For instance, chr() will produce Unicode codepoints. But you can pretend that > >they're ASCII codepoints, it's only the EBCDIC folk that'll get hurt. I hope > >and suspect there'll be an equivalent of

Re: Unicode handling

2001-03-23 Thread Dan Sugalski
At 11:09 PM 3/23/2001 +, Simon Cozens wrote: >For instance, chr() will produce Unicode codepoints. But you can pretend that >they're ASCII codepoints, it's only the EBCDIC folk that'll get hurt. I hope >and suspect there'll be an equivalent of "use bytes" which makes chr(256) >either blow up o

Re: Unicode handling

2001-03-23 Thread Simon Cozens
On Fri, Mar 23, 2001 at 03:15:41PM -0800, Brad Hughes wrote: > Simon Cozens wrote: > [...] > > I'm just not sure it's fair on Old World hackers. Will there be a way to stop > > Perl upgrading stuff to Unicode on the way in? > > and I'm probably not the only Old World hacker that would > prefe

Re: Unicode handling

2001-03-23 Thread Simon Cozens
On Fri, Mar 23, 2001 at 05:56:19PM -0500, Dan Sugalski wrote: > Nah, they only apply to data that perl's tagged as Unicode, either because > its input stream is marked that way or because the program explicitly > converted the data. Oh, colour me dull. I read 4) Data converted to Unicode (

Re: Unicode handling

2001-03-23 Thread Dan Sugalski
At 10:48 PM 3/23/2001 +, Simon Cozens wrote: >On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote: > > Yes, I realize that point 5 may result in someone getting a meaningless > > Unicode string. Too bad--it is *not* the place of a programming > language to > > enforce validity on dat

Re: Unicode handling

2001-03-23 Thread Simon Cozens
On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote: > Yes, I realize that point 5 may result in someone getting a meaningless > Unicode string. Too bad--it is *not* the place of a programming language to > enforce validity on data. That's the programmer's job. But points 4 and 5 do en

Re: Unicode handling

2001-03-23 Thread Dan Sugalski
At 01:07 PM 3/23/2001 -0800, Larry Wall wrote: >Jarkko Hietaniemi writes: >: *cough* \C *is* taken. >: >: > >also \U has a meaning in double quotish strings. >: >: "\Uindeed." > >Bear in mind we are redesigning the language. If there's a botch we >can think about fixing it. > >Though maybe not on

Re: Unicode handling

2001-03-23 Thread Larry Wall
Jarkko Hietaniemi writes:
: *cough* \C *is* taken.
:
: > >also \U has a meaning in double quotish strings.
:
: "\Uindeed."
Bear in mind we are redesigning the language. If there's a botch we can think about fixing it.
Though maybe not on -internals... :-)
Larry

Re: Unicode handling

2001-03-23 Thread Dan Sugalski
At 08:14 PM 3/23/2001 +, Nicholas Clark wrote: >On Fri, Mar 23, 2001 at 03:08:35PM -0500, Dan Sugalski wrote: > > I'm half tempted, since this is a Unicode-only feature, to use a non-ASCII > > character. > > > > \SMILEY FACE, perhaps? > >that makes it kind of hard to edit perl scripts that use

Re: Unicode handling

2001-03-23 Thread Nicholas Clark
On Fri, Mar 23, 2001 at 03:08:35PM -0500, Dan Sugalski wrote: > I'm half tempted, since this is a Unicode-only feature, to use a non-ASCII > character. > > \SMILEY FACE, perhaps? that makes it kind of hard to edit perl scripts that use this feature on any good old fashioned 8 bit xterm. Let alo

Re: Unicode handling

2001-03-23 Thread Bryan C. Warnock
On Friday 23 March 2001 14:48, you wrote > In Unicode, there's theoretically no locale. Theoretically... Well, yes, but Unicode makes no pretenses about encoding the world's languages - just the various symbols used by the world's languages. If you want to orient Perl so that it remains(?) data-

Re: Unicode handling

2001-03-23 Thread Dan Sugalski
At 02:06 PM 3/23/2001 -0600, Jarkko Hietaniemi wrote: >On Fri, Mar 23, 2001 at 02:50:05PM -0500, Dan Sugalski wrote: > > At 02:27 PM 3/23/2001 -0500, Uri Guttman wrote: > > > > "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes: > > > > > > DS> U doesn't really signal "glyph" to me, but we are

Re: Unicode handling

2001-03-23 Thread Dan Sugalski
At 11:52 AM 3/23/2001 -0800, Hong Zhang wrote: > > >I recommend to use 'u' flag, which indicates all operations are performed > > >against unicode grapheme/glyph. By default re is performed on codepoint. > > > > U doesn't really signal "glyph" to me, but we are sort of limited in what > > we have

Re: Unicode handling

2001-03-23 Thread Jarkko Hietaniemi
On Fri, Mar 23, 2001 at 02:50:05PM -0500, Dan Sugalski wrote: > At 02:27 PM 3/23/2001 -0500, Uri Guttman wrote: > > > "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes: > > > > DS> U doesn't really signal "glyph" to me, but we are sort of limited > > DS> in what we have left. We still need a

Re: Unicode handling

2001-03-23 Thread Hong Zhang
> > >We need the character equivalence construct, such as [[=a=]], which > > >matches "a", "A ACUTE". > > > > Yeah, we really need a big list of these. PDD anyone? > > > > But surely this is a locale issue, and not an encoding one? Not every > language recognizes the same character equivalences

Re: Unicode handling

2001-03-23 Thread Hong Zhang
> >I recommend to use 'u' flag, which indicates all operations are performed > >against unicode grapheme/glyph. By default re is performed on codepoint. > > U doesn't really signal "glyph" to me, but we are sort of limited in what > we have left. We still need a zero-width assertion for glyph boun

Re: Unicode handling

2001-03-23 Thread Dan Sugalski
At 02:31 PM 3/23/2001 -0500, Bryan C. Warnock wrote: >On Friday 23 March 2001 14:18, Dan Sugalski wrote: > > At 01:30 PM 3/22/2001 -0800, Hong Zhang wrote: > > > > 6) There will be a glyph boundary/non-glyph boundary pair of regex > > > > characters to match the word/non-word boundary ones we alre

Re: Unicode handling

2001-03-23 Thread Dan Sugalski
At 02:27 PM 3/23/2001 -0500, Uri Guttman wrote: > > "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes: > > DS> U doesn't really signal "glyph" to me, but we are sort of limited > DS> in what we have left. We still need a zero-width assertion for > DS> glyph boundary within regexes themselv

Re: Unicode handling

2001-03-23 Thread Bryan C. Warnock
On Friday 23 March 2001 14:18, Dan Sugalski wrote: > At 01:30 PM 3/22/2001 -0800, Hong Zhang wrote: > > > 6) There will be a glyph boundary/non-glyph boundary pair of regex > > > characters to match the word/non-word boundary ones we already have. > > > >(While > > > > > I'd personally like \g and

Re: Unicode handling

2001-03-23 Thread Uri Guttman
> "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes: DS> U doesn't really signal "glyph" to me, but we are sort of limited DS> in what we have left. We still need a zero-width assertion for DS> glyph boundary within regexes themselves. how about \C? it doesn't seem to be taken and would

Re: Unicode handling

2001-03-23 Thread Dan Sugalski
At 01:30 PM 3/22/2001 -0800, Hong Zhang wrote: > > 6) There will be a glyph boundary/non-glyph boundary pair of regex > > characters to match the word/non-word boundary ones we already have. >(While > > I'd personally like \g and \G, that won't work as \G is already taken) > > > > I also realize t

Re: Unicode handling

2001-03-23 Thread Dan Sugalski
At 10:56 AM 3/23/2001 -0800, Damien Neil wrote: >On Fri, Mar 23, 2001 at 12:38:04PM -0500, Dan Sugalski wrote: > >while () { > > $count++ if /bar/; > > print OUT $_; > >} > >I would find it surprising for this to have different output >than input. Other people's milage m

RE: Unicode handling

2001-03-23 Thread Dan Sugalski
At 01:26 PM 3/23/2001 -0500, NeonEdge wrote: >Dan Sugalski wrote: > >If we do, then something as simple as this: > > > > while () { > > $count++ if /bar/; > > print OUT $_; > > } > > > >would potentially result in the output file being rather different from the > >input file. E

RE: Unicode handling

2001-03-23 Thread Dan Sugalski
At 11:05 AM 3/23/2001 -0600, Garrett Goebel wrote: >From: Nicholas Clark [mailto:[EMAIL PROTECTED]] > > > > On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote: > > > 1) All Unicode data perl does regular expressions against > > >will be in Normalization Form C, except for... > > > 2

Re: Unicode handling

2001-03-23 Thread Damien Neil
On Fri, Mar 23, 2001 at 12:38:04PM -0500, Dan Sugalski wrote: >while () { > $count++ if /bar/; > print OUT $_; >} I would find it surprising for this to have different output than input. Other people's milage may vary. In general, however, I think I would prefer to be

RE: Unicode handling

2001-03-23 Thread NeonEdge
Dan Sugalski wrote: >If we do, then something as simple as this: > > while () { > $count++ if /bar/; > print OUT $_; > } > >would potentially result in the output file being rather different from the >input file. Equivalent, yes, but different. Whether that's bad or not is an >
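A minimal sketch of the "equivalent, yes, but different" concern, assuming present-day Perl 5 with its core Encode and Unicode::Normalize modules and assuming UTF-8 input. The input bytes are made up for illustration. Normalizing on the way in and writing the result back out yields the same text but not the same bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# Hypothetical input line: "cafe" + COMBINING ACUTE ACCENT, as UTF-8 bytes.
my $in_bytes = "cafe\xCC\x81\n";

# What a normalizing input discipline would do: decode, then compose to NFC.
my $text = NFC(decode('UTF-8', $in_bytes));

# Writing the line back out re-encodes the composed form.
my $out_bytes = encode('UTF-8', $text);

my $same = $in_bytes eq $out_bytes ? "identical" : "different";
print length($in_bytes), " bytes in, ", length($out_bytes), " bytes out, $same\n";
```

A simple filter that only counts matches would therefore still change the file it copies.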

Re: Unicode handling

2001-03-23 Thread Dan Sugalski
At 11:41 PM 3/22/2001 +, Nicholas Clark wrote: >On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote: > > 1) All Unicode data perl does regular expressions against will be in > > Normalization Form C, except for... > > 2) Regexes tagged to run against a decomposed form will instead be

RE: Unicode handling

2001-03-23 Thread Garrett Goebel
From: Nicholas Clark [mailto:[EMAIL PROTECTED]] > > On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote: > > 1) All Unicode data perl does regular expressions against > >will be in Normalization Form C, except for... > > 2) Regexes tagged to run against a decomposed form will > >

Re: Unicode handling

2001-03-22 Thread Nicholas Clark
On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote: > 1) All Unicode data perl does regular expressions against will be in > Normalization Form C, except for... > 2) Regexes tagged to run against a decomposed form will instead be run > against data in Normalization Form D. (What the ta

Re: Unicode handling

2001-03-22 Thread Hong Zhang
> 6) There will be a glyph boundary/non-glyph boundary pair of regex > characters to match the word/non-word boundary ones we already have. (While > I'd personally like \g and \G, that won't work as \G is already taken) > > I also realize that the decomposition flag on regexes would mean that > s/

Unicode handling

2001-03-22 Thread Dan Sugalski
At the moment, I'm not particularly inclined to argue unicode. Short of Larry handing down an edict and invoking Rule #1, the following rules will be in effect: 1) All Unicode data perl does regular expressions against will be in Normalization Form C, except for... 2) Regexes tagged to run aga
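A minimal sketch of why rules 1 and 2 matter for matching, assuming present-day Perl 5 and its core Unicode::Normalize module (NFC and NFD are that module's functions, not the tagging syntax left open above). A pattern written with composed characters fails against decomposed data until the data is brought to Normalization Form C, and a pattern written against the decomposed form needs the data in Form D.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# "resume" with acute accents, typed as base letters plus combining accents.
my $decomposed = "re\x{301}sume\x{301}";

# Pattern written with precomposed U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
my $composed_pat = qr/r\x{e9}sum\x{e9}/;

my $raw_match = $decomposed      =~ $composed_pat ? 1 : 0;   # data left as it arrived
my $nfc_match = NFC($decomposed) =~ $composed_pat ? 1 : 0;   # data composed first

# A pattern aimed at the decomposed form matches once the data is in NFD.
my $nfd_match = NFD("r\x{e9}sum\x{e9}") =~ /e\x{301}/ ? 1 : 0;

print "raw=$raw_match nfc=$nfc_match nfd=$nfd_match\n";
```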