[PATCH] languages/PIR fix string encoding, hex and binary numbers

2007-02-08 Thread Klaas-Jan Stol
hi
attached a patch for languages/PIR fixing:
* added optional "utf8:" encoding specifier (according to docs/imcc/syntax.pod)
* fixed support for binary and hex. numbers
* added test for these changes.
regards, klaas-jan
Index: languages/PIR/lib/pir.pg

Re: [perl #34285] [BUG] Is string encoding written to PBC file

2005-03-01 Thread Leopold Toetsch
Bernhard Schmalhofer <[EMAIL PROTECTED]> wrote: > with 'make testr' I get a single test failure. My guess is that the > string encoding is not properly written to the dumped PBC file: Fixed. leo

Re: [perl #34285] [BUG] Is string encoding written to PBC file

2005-02-28 Thread Leopold Toetsch
Bernhard Schmalhofer <[EMAIL PROTECTED]> wrote: > Hi, > with 'make testr' I get a single test failure. My guess is that the string > encoding is > not properly written to the dumped PBC file: Yep. That's still missing. Thanks, leo

[perl #34285] [BUG] Is string encoding written to PBC file

2005-02-28 Thread via RT
s is that the string encoding is not properly written to the dumped PBC file:
[EMAIL PROTECTED]:~/devel/Parrot/cvs/parrot> cat t/op/string_cs_2.pasm
    set S0, ascii:"ok 1\n"
    charset I0, S0
    charsetname S1, I0
    print S1
    print "\n"
    end
[EMAIL PROTECTED]:~/dev

Re: string encoding

2001-03-24 Thread nick
Dan Sugalski <[EMAIL PROTECTED]> writes: >> >substr($foo, 233253, 14) >> > >> > is going to cost significantly more with variable sized characters than >> > fixed sized ones. >> >>I don't believe so. > >Then you would be incorrect. To find the character at position 233253 in a >variable-lengt

Re: string encoding

2001-02-17 Thread Tom Lord
On the subject of Unicode string processing... I'm not a perl internals hacker and more of a passive reader of these lists than an active contributor. With that caveat, may I humbly point out a design document for what I think is a clean C library supporting the use of mixed encoding forms. I

Re: string encoding

2001-02-16 Thread Simon Cozens
On Fri, Feb 16, 2001 at 04:51:14PM -0800, Hong Zhang wrote: > Yes and no. You can use for eq(), but not for cmp(). On little endian > machine, the memcmp() will first compare the least significant byte, not > most. We'd use a custom cmp in the vtable in that case anyway. -- I often think I'd g

Re: string encoding

2001-02-16 Thread Dan Sugalski
At 06:47 PM 2/16/2001 -0800, Hong Zhang wrote: >I like to wrap up my argument. > >I recommend to use UTF-8 as the sole string encoding. >If we end up with multiple encodings, there is absolutely >no point for this argument. Um, I hate to point this out, but perl isn't go

Re: string encoding

2001-02-16 Thread Hong Zhang
I like to wrap up my argument. I recommend to use UTF-8 as the sole string encoding. If we end up with multiple encodings, there is absolutely no point for this argument. Benefits of UTF-8 is more compact, less encoding conversion, more friendly to C API. UTF-16 is variable length encoding too

Re: string encoding

2001-02-16 Thread Hong Zhang
> > I think you already mixed the codepoint vs character. What you will get is > > 10th codepoint, not 10th character. > > I think you're confused. Codepoints *are* characters. Combining characters are > taken care of as per the RFC. If you define that way, I can agree with it. Since you still ha
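Editorial illustration (not part of the original message): the codepoint-versus-character distinction argued over here can be seen with a combining sequence. This is a minimal Python sketch; the variable names are made up for the example, and only the standard `unicodedata` module is used.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-visible character,
# but two code points.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Indexing by code point can land on the combining mark rather than on the
# next user-visible character.
text = decomposed + "x"
second_code_point = text[1]  # the combining accent, not "x"

print(len(decomposed), len(composed))
```

Whether substr-style indexing should count code points or composed characters is exactly what the thread disputes; the sketch only shows that the two counts can differ.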

Re: string encoding

2001-02-16 Thread Hong Zhang
> On Fri, Feb 16, 2001 at 02:39:10PM -0800, Hong Zhang wrote: > > But you can not use memcmp() to compare binary order of two UTF-32 > > strings on little endian machines, even both strings are using > > the same endian. > > Yes, you can. Yes and no. You can use for eq(), but not for cmp(). On

Re: string encoding

2001-02-16 Thread Simon Cozens
On Fri, Feb 16, 2001 at 02:25:59PM -0800, Hong Zhang wrote: > I think you already mixed the codepoint vs character. What you will get is > 10th codepoint, not 10th character. I think you're confused. Codepoints *are* characters. Combining characters are taken care of as per the RFC. > The UTF-32

Re: string encoding

2001-02-16 Thread Simon Cozens
On Fri, Feb 16, 2001 at 02:39:10PM -0800, Hong Zhang wrote: > But you can not use memcmp() to compare binary order of two UTF-32 > strings on little endian machines, even both strings are using > the same endian. Yes, you can. > BTW, with UTF-8, you never worry about endian issue. *cough*. I kn

Re: string encoding

2001-02-16 Thread Simon Cozens
On Fri, Feb 16, 2001 at 01:20:26PM -0800, Hong Zhang wrote: > But the memcmp() can not be used for UTF-32 string comparison, because > of endian issue. What endian issue? If you have two differently-endian strings being compared at the C level, you have *far* bigger design problems than the choi

Re: string encoding

2001-02-16 Thread Hong Zhang
> > I have already given the counter argument. The codepoint position is useless > > in many cases. They should be deprecated. > > Uh? That doesn't make sense. Codepoint position is *exactly* what people > expect when they use substr. When I say > > $a = substr($b,10); > > I want the 10th char

Re: string encoding

2001-02-16 Thread Hong Zhang
> > But the memcmp() can not be used for UTF-32 string comparison, because > > of endian issue. > > What endian issue? If you have two differently-endian strings being > compared at the C level, you have *far* bigger design problems > than the choice of UTF. My argument was: You can use memcmp(

Re: string encoding

2001-02-16 Thread Simon Cozens
Moved to -unicode, because that's what it's *for*. On Fri, Feb 16, 2001 at 01:17:03PM -0800, Hong Zhang wrote: > > substr's already been mentioned. > > I have already given the counter argument. The codepoint position is useless > in many cases. They should be deprecated. Uh? That doesn't make

Re: string encoding

2001-02-16 Thread Bryan C . Warnock
On Friday 16 February 2001 16:20, Hong Zhang wrote: > > And address arithmetic and mem(cmp|cpy) is faster than array iteration. > > Ha Ha Ha. You must be kidding. > > The mem(cmp|cpy) work just fine on UTF-8 string comparison and copy. > But the memcmp() can not be used for UTF-32 string compari

Re: string encoding

2001-02-16 Thread Hong Zhang
> >Did it buy you much? I don't believe so. Can you give some examples why > >random character access is so important? Most people are processing text > >linearly. > > Most, but not all. And as this is the internals list, we have to deal with > all. We can't choose a convenient subset and ignore t

Re: string encoding

2001-02-16 Thread Hong Zhang
> Then you would be incorrect. To find the character at position 233253 in a > variable-length encoding requires scanning the string from the beginning, > and has a rather significant potential cost. You've got a test for every > character up to that point with a potential branch or two on each on
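Editorial illustration (not part of the original message): the cost difference described in the quoted text can be sketched as follows. The function names `utf8_offset` and `utf32_offset` are hypothetical, and the UTF-8 scan assumes well-formed input.

```python
def utf8_offset(buf: bytes, n: int) -> int:
    """Byte offset of the n-th code point (0-based) in well-formed UTF-8.

    Every byte before the target has to be inspected, so the cost grows
    with n.
    """
    count = 0
    for offset, byte in enumerate(buf):
        # Continuation bytes look like 10xxxxxx; anything else starts a
        # new code point.
        if byte & 0xC0 != 0x80:
            if count == n:
                return offset
            count += 1
    raise IndexError("code point index out of range")


def utf32_offset(n: int) -> int:
    """Byte offset of the n-th code point in UTF-32: plain arithmetic."""
    return 4 * n


sample = "a\u00f1\u20acz".encode("utf-8")  # 1 + 2 + 3 + 1 bytes
print(utf8_offset(sample, 3), utf32_offset(3))
```

The footnote quoted elsewhere in the thread mentions caching the sequential scan; nothing here attempts that.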

Re: string encoding

2001-02-16 Thread Hong Zhang
> And address arithmetic and mem(cmp|cpy) is faster than array iteration. Ha Ha Ha. You must be kidding. The mem(cmp|cpy) work just fine on UTF-8 string comparison and copy. But the memcmp() can not be used for UTF-32 string comparison, because of endian issue. Hong
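Editorial illustration (not part of the original message): the eq-versus-cmp point about memcmp() and byte order can be checked with a small sketch. Python's `bytes` comparison is lexicographic over unsigned bytes, which stands in for memcmp() on equal-length buffers; `byte_order_cmp` is a hypothetical helper name.

```python
def byte_order_cmp(a: str, b: str, encoding: str) -> int:
    """Compare the encoded bytes of a and b the way memcmp() would.

    Returns -1, 0 or 1.
    """
    ea, eb = a.encode(encoding), b.encode(encoding)
    return (ea > eb) - (ea < eb)


hi, lo = "\u0100", "\u00ff"  # U+0100 is the larger code point

results = {
    enc: byte_order_cmp(hi, lo, enc)
    for enc in ("utf-8", "utf-32-be", "utf-32-le")
}
print(results)
```

Under these assumptions UTF-8 and big-endian UTF-32 order the pair by code point, while little-endian UTF-32 compares the least significant byte first and reverses it. Equality testing is unaffected in all three, since equal strings have equal bytes.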

Re: string encoding

2001-02-16 Thread Hong Zhang
> substr's already been mentioned. I have already given the counter argument. The codepoint position is useless in many cases. They should be deprecated. > Regular expressions. Perl does rather a lot of them. We've already found from > Perl 5 development that they get nasty when variable length

Re: string encoding

2001-02-16 Thread Dan Sugalski
At 12:20 PM 2/16/2001 -0800, Hong Zhang wrote: > > >People in Japan/China/Korea have been using multi-byte encoding for > > >long time. I personally have used it for more 10 years. I never feel > > >much of the "pain". Do you think I are using my computer with O(n) > > >while you are using it with

Re: string encoding

2001-02-16 Thread Dan Sugalski
At 12:32 PM 2/16/2001 -0800, Hong Zhang wrote: > > > What do you mean? Have you seen people using multi-byte encoding > > > in Japan/China/Korea? > > > > You're talking to the wrong person. Japanese data handling is my graduate > > dissertation. :) > > > > The Unified Hangul/Kanji/Ha'nzi' Characte

Re: string encoding

2001-02-16 Thread Bryan C . Warnock
On Friday 16 February 2001 15:35, Simon Cozens wrote: > On Fri, Feb 16, 2001 at 12:32:10PM -0800, Hong Zhang wrote: > > Did it buy you much? I don't believe so. Can you give some examples why > > random character access is so important? > > substr's already been mentioned. > > Regular expression

Re: string encoding

2001-02-16 Thread Simon Cozens
On Fri, Feb 16, 2001 at 12:32:10PM -0800, Hong Zhang wrote: > Did it buy you much? I don't believe so. Can you give some examples why > random character access is so important? substr's already been mentioned. Regular expressions. Perl does rather a lot of them. We've already found from Perl 5 d

Re: string encoding

2001-02-16 Thread Hong Zhang
> > What do you mean? Have you seen people using multi-byte encoding > > in Japan/China/Korea? > > You're talking to the wrong person. Japanese data handling is my graduate > dissertation. :) > > The Unified Hangul/Kanji/Ha'nzi' Characters in Unicode (so-called "Unihan") > occupy one and only one

Re: string encoding

2001-02-16 Thread Hong Zhang
> >People in Japan/China/Korea have been using multi-byte encoding for > >long time. I personally have used it for more 10 years. I never feel > >much of the "pain". Do you think I are using my computer with O(n) > >while you are using it with O(1)? There are 100 million people using > >variable-l

Re: string encoding

2001-02-16 Thread Branden
Dan Sugalski wrote: > At 05:09 PM 2/15/2001 -0800, Hong Zhang wrote: > >People in Japan/China/Korea have been using multi-byte encoding for > >long time. I personally have used it for more 10 years. I never feel > >much of the "pain". Do you think I are using my computer with O(n) > >while you are

Re: string encoding

2001-02-16 Thread Simon Cozens
On Fri, Feb 16, 2001 at 10:24:51AM -0300, Branden wrote: > Yes, for UTF-16 it is. For UTF-32 it isn't Yes, it damned well is. You're confusing "codepoint" with "number of bytes in representation". -- I would imagine most of the readers of this group would support abortion as long as fifty or s

Re: string encoding

2001-02-16 Thread Simon Cozens
On Fri, Feb 16, 2001 at 12:26:43PM +, Simon Cozens wrote: > On Fri, Feb 16, 2001 at 10:24:51AM -0300, Branden wrote: > > Yes, for UTF-16 it is. For UTF-32 it isn't > > Yes, it damned well is. I mean, no, it damned well isn't. But you probably guessed that. > You're confusing "codepoint" wit

Re: string encoding

2001-02-16 Thread Branden
Simon Cozens wrote: > On Thu, Feb 15, 2001 at 03:59:54PM -0800, Hong Zhang wrote: > > The concept of characters have nothing to do with codepoints. > > Many characters are composed by more than one codepoints. > > This isn't true. > Yes, for UTF-16 it is. For UTF-32 it isn't, but unless you want

Re: string encoding

2001-02-16 Thread Simon Cozens
On Thu, Feb 15, 2001 at 04:55:00PM -0800, Hong Zhang wrote: > > On Thu, Feb 15, 2001 at 03:59:54PM -0800, Hong Zhang wrote: > > > The concept of characters have nothing to do with codepoints. > > > Many characters are composed by more than one codepoints. > > > > This isn't true. > > What do you

Re: string encoding

2001-02-16 Thread Simon Cozens
On Thu, Feb 15, 2001 at 05:09:45PM -0800, Hong Zhang wrote: > People in Japan/China/Korea have been using multi-byte encoding for > long time. I personally have used it for more 10 years. And now you have a chance to not do so. Isn't that *nice*? -- Term, holidays, term, holidays, till we leav

Re: string encoding

2001-02-15 Thread Dan Sugalski
At 05:09 PM 2/15/2001 -0800, Hong Zhang wrote: > > ...and because of this you can't randomly access the string, you are > > reduced to sequential access (*). And here I thought we could have > > left tape drives to the last millennium. > > > > (*) Yes, of course you could cache your sequential ac

Re: string encoding

2001-02-15 Thread Hong Zhang
> ...and because of this you can't randomly access the string, you are > reduced to sequential access (*). And here I thought we could have > left tape drives to the last millennium. > > (*) Yes, of course you could cache your sequential access so you only > need to do it once, and build balance

Re: string encoding

2001-02-15 Thread Hong Zhang
> On Thu, Feb 15, 2001 at 03:59:54PM -0800, Hong Zhang wrote: > > The concept of characters have nothing to do with codepoints. > > Many characters are composed by more than one codepoints. > > This isn't true. What do you mean? Have you seen people using multi-byte encoding in Japan/China/Korea

Re: string encoding

2001-02-15 Thread Simon Cozens
On Thu, Feb 15, 2001 at 03:59:54PM -0800, Hong Zhang wrote: > The concept of characters have nothing to do with codepoints. > Many characters are composed by more than one codepoints. This isn't true. -- * DrForr digs around for a fresh IV drip bag and proceeds to hook up. Coffee port. Firewa

Re: string encoding

2001-02-15 Thread Jarkko Hietaniemi
On Thu, Feb 15, 2001 at 11:16:29PM +, Simon Cozens wrote: > On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote: > > Personally I like the UTF-8 encoding. The solution to the > > variable length can be handled by a special (virtual) > > function like > > I'm expecting that the virtual,

Re: string encoding

2001-02-15 Thread Hong Zhang
> On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote: > > Personally I like the UTF-8 encoding. The solution to the > > variable length can be handled by a special (virtual) > > function like > > I'm expecting that the virtual, internal representation will not > be in a UTF but will simpl

Re: string encoding

2001-02-15 Thread Simon Cozens
On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote: > Personally I like the UTF-8 encoding. The solution to the > variable length can be handled by a special (virtual) > function like I'm expecting that the virtual, internal representation will not be in a UTF but will simply be an array

string encoding

2001-02-15 Thread Hong Zhang
Hi, All, I want to give some of my thoughts about string encoding. Personally I like the UTF-8 encoding. The solution to the variable length can be handled by a special (virtual) function like
    class String {
        virtual UV iterate(/*inout*/ int* index);
    };
So in typical string iteration, the
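Editorial illustration (not part of the original message): one way an `iterate`-style accessor over UTF-8 could behave, sketched in Python. Since Python has no in/out parameters, this version returns the next index alongside the code point; that signature is an assumption, and the sketch trusts its input to be well-formed UTF-8.

```python
from typing import Tuple


def iterate(buf: bytes, index: int) -> Tuple[int, int]:
    """Decode the code point starting at byte offset `index`.

    Returns (code point, offset of the next code point). Continuation
    bytes are not validated.
    """
    first = buf[index]
    if first < 0x80:                 # 0xxxxxxx: single byte
        return first, index + 1
    if first >> 5 == 0b110:          # 110xxxxx: two bytes
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:       # 1110xxxx: three bytes
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:      # 11110xxx: four bytes
        length, cp = 4, first & 0x07
    else:
        raise ValueError("index does not point at a lead byte")
    for byte in buf[index + 1:index + length]:
        cp = (cp << 6) | (byte & 0x3F)   # append 6 payload bits
    return cp, index + length


def code_points(buf: bytes) -> list:
    """Walk the whole buffer with iterate(), collecting code points."""
    out, i = [], 0
    while i < len(buf):
        cp, i = iterate(buf, i)
        out.append(cp)
    return out
```

A linear walk like `code_points` touches each byte once, which is the "typical string iteration" case the message refers to.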