hi
attached a patch for languages/PIR fixing:
* added optional "utf8:" encoding specifier (according to
docs/imcc/syntax.pod)
* fixed support for binary and hex. numbers
* added test for these changes.
regards,
klaas-jan
Index: languages/PIR/lib/pir.pg
Bernhard Schmalhofer <[EMAIL PROTECTED]> wrote:
> with 'make testr' I get a single test failure. My guess is that the
> string encoding is not properly written to the dumped PBC file:
Fixed.
leo
Bernhard Schmalhofer <[EMAIL PROTECTED]> wrote:
> Hi,
> with 'make testr' I get a single test failure. My guess is that the string
> encoding is not properly written to the dumped PBC file:
Yep. That's still missing.
Thanks,
leo
My guess is that the string encoding is
not properly written to the dumped PBC file:
[EMAIL PROTECTED]:~/devel/Parrot/cvs/parrot> cat t/op/string_cs_2.pasm
set S0, ascii:"ok 1\n"
charset I0, S0
charsetname S1, I0
print S1
print "\n"
end
[EMAIL PROTECTED]:~/dev
Dan Sugalski <[EMAIL PROTECTED]> writes:
>> >substr($foo, 233253, 14)
>> >
>> > is going to cost significantly more with variable sized characters than
>> > fixed sized ones.
>>
>>I don't believe so.
>
>Then you would be incorrect. To find the character at position 233253 in a
>variable-lengt
On the subject of Unicode string processing...
I'm not a perl internals hacker and more of a passive reader of these
lists than an active contributor.
With that caveat, may I humbly point out a design document for
what I think is a clean C library supporting the use of mixed
encoding forms. I
On Fri, Feb 16, 2001 at 04:51:14PM -0800, Hong Zhang wrote:
> Yes and no. You can use it for eq(), but not for cmp(). On a little-endian
> machine, memcmp() will compare the least significant byte first, not the
> most significant.
We'd use a custom cmp in the vtable in that case anyway.
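
To illustrate the eq()/cmp() distinction quoted above, here is a minimal C sketch. It is not code from the thread. It serializes code points to UTF-32LE bytes by hand so the result does not depend on the host's byte order. The names encode_utf32le, memcmp_utf32le_sign, and cmp_codepoints are hypothetical; cmp_codepoints merely stands in for the kind of custom cmp a vtable entry might provide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CPS 8 /* arbitrary small limit for this sketch */

/* Write n code points as UTF-32LE bytes (least significant byte first). */
void encode_utf32le(const uint32_t *cps, size_t n, unsigned char *out)
{
    assert(cps != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[4 * i + 0] = (unsigned char)(cps[i] & 0xFF);
        out[4 * i + 1] = (unsigned char)((cps[i] >> 8) & 0xFF);
        out[4 * i + 2] = (unsigned char)((cps[i] >> 16) & 0xFF);
        out[4 * i + 3] = (unsigned char)((cps[i] >> 24) & 0xFF);
    }
}

/* Sign (-1, 0, 1) of memcmp() over the UTF-32LE bytes of two strings. */
int memcmp_utf32le_sign(const uint32_t *a, const uint32_t *b, size_t n)
{
    unsigned char ba[4 * MAX_CPS], bb[4 * MAX_CPS];
    assert(n <= MAX_CPS);
    encode_utf32le(a, n, ba);
    encode_utf32le(b, n, bb);
    int r = memcmp(ba, bb, 4 * n);
    return (r > 0) - (r < 0);
}

/* Hypothetical custom comparison: order by code point value. */
int cmp_codepoints(const uint32_t *a, const uint32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return 0;
}
```

For U+0100 against U+00FF, the bytewise comparison sees 0x00 against 0xFF in the first byte and reports "less than", while cmp_codepoints orders U+0100 after U+00FF. Identical strings still have identical bytes, which is why memcmp() is enough for eq().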
--
I often think I'd g
At 06:47 PM 2/16/2001 -0800, Hong Zhang wrote:
>I'd like to wrap up my argument.
>
>I recommend using UTF-8 as the sole string encoding.
>If we end up with multiple encodings, there is absolutely
>no point to this argument.
Um, I hate to point this out, but perl isn't go
I'd like to wrap up my argument.
I recommend using UTF-8 as the sole string encoding.
If we end up with multiple encodings, there is absolutely
no point to this argument.
The benefits of UTF-8 are that it is more compact, needs less encoding
conversion, and is friendlier to C APIs. UTF-16 is a variable-length encoding
too
> > I think you have already mixed up codepoint vs. character. What you will get is
> > the 10th codepoint, not the 10th character.
>
> I think you're confused. Codepoints *are* characters. Combining characters are
> taken care of as per the RFC.
If you define it that way, I can agree with it. Since you still ha
> On Fri, Feb 16, 2001 at 02:39:10PM -0800, Hong Zhang wrote:
> > But you cannot use memcmp() to compare the binary order of two UTF-32
> > strings on little-endian machines, even if both strings use
> > the same endianness.
>
> Yes, you can.
Yes and no. You can use it for eq(), but not for cmp(). On
On Fri, Feb 16, 2001 at 02:25:59PM -0800, Hong Zhang wrote:
> I think you have already mixed up codepoint vs. character. What you will get is
> the 10th codepoint, not the 10th character.
I think you're confused. Codepoints *are* characters. Combining characters are
taken care of as per the RFC.
> The UTF-32
On Fri, Feb 16, 2001 at 02:39:10PM -0800, Hong Zhang wrote:
> But you cannot use memcmp() to compare the binary order of two UTF-32
> strings on little-endian machines, even if both strings use
> the same endianness.
Yes, you can.
> BTW, with UTF-8, you never worry about endian issue.
*cough*. I kn
On Fri, Feb 16, 2001 at 01:20:26PM -0800, Hong Zhang wrote:
> But memcmp() cannot be used for UTF-32 string comparison, because
> of the endian issue.
What endian issue? If you have two differently-endian strings being
compared at the C level, you have *far* bigger design problems
than the choi
> > I have already given the counter argument. The codepoint position is useless
> > in many cases. They should be deprecated.
>
> Uh? That doesn't make sense. Codepoint position is *exactly* what people
> expect when they use substr. When I say
>
> $a = substr($b,10);
>
> I want the 10th char
> > But memcmp() cannot be used for UTF-32 string comparison, because
> > of the endian issue.
>
> What endian issue? If you have two differently-endian strings being
> compared at the C level, you have *far* bigger design problems
> than the choice of UTF.
My argument was:
You can use memcmp(
Moved to -unicode, because that's what it's *for*.
On Fri, Feb 16, 2001 at 01:17:03PM -0800, Hong Zhang wrote:
> > substr's already been mentioned.
>
> I have already given the counter argument. The codepoint position is useless
> in many cases. They should be deprecated.
Uh? That doesn't make
On Friday 16 February 2001 16:20, Hong Zhang wrote:
> > And address arithmetic and mem(cmp|cpy) is faster than array iteration.
>
> Ha Ha Ha. You must be kidding.
>
> mem(cmp|cpy) work just fine for UTF-8 string comparison and copy.
> But memcmp() cannot be used for UTF-32 string compari
> >Did it buy you much? I don't believe so. Can you give some examples why
> >random character access is so important? Most people are processing text
> >linearly.
>
> Most, but not all. And as this is the internals list, we have to deal with
> all. We can't choose a convenient subset and ignore t
> Then you would be incorrect. To find the character at position 233253 in a
> variable-length encoding requires scanning the string from the beginning,
> and has a rather significant potential cost. You've got a test for every
> character up to that point with a potential branch or two on each on
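
The scan described in the quoted text can be sketched in C. This is an illustration under stated assumptions, not Perl's implementation: the input is assumed to be well-formed, NUL-terminated UTF-8, and the function names utf8_offset_of and utf32_offset_of are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset of code point number n (0-based) in a well-formed,
 * NUL-terminated UTF-8 string, or (size_t)-1 if the string is too short.
 * Every byte before the target is inspected, so the cost grows with n. */
size_t utf8_offset_of(const char *s, size_t n)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t seen = 0;
    for (size_t i = 0; p[i] != '\0'; i++) {
        /* Bytes of the form 10xxxxxx continue a sequence; any other
         * byte starts a new code point. */
        if ((p[i] & 0xC0) != 0x80) {
            if (seen == n)
                return i;
            seen++;
        }
    }
    return (size_t)-1;
}

/* With a fixed-width encoding the offset is plain arithmetic. */
size_t utf32_offset_of(size_t n)
{
    return n * sizeof(uint32_t);
}
```

The per-byte test and branch in utf8_offset_of is the cost the quoted message refers to; utf32_offset_of does no scanning, at the price of four bytes per code point.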
> And address arithmetic and mem(cmp|cpy) is faster than array iteration.
Ha Ha Ha. You must be kidding.
mem(cmp|cpy) work just fine for UTF-8 string comparison and copy.
But memcmp() cannot be used for UTF-32 string comparison, because
of the endian issue.
Hong
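
A small C sketch of the UTF-8 half of this claim: bytewise comparison of UTF-8 text agrees with code point order, because lead bytes grow with the code point value. This is not code from the thread. The helper names utf8_encode and utf8_memcmp_sign are hypothetical, and the encoder assumes valid, non-surrogate code points up to U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode one code point as UTF-8; returns the number of bytes (1 to 4). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Sign (-1, 0, 1) of a memcmp()-style comparison of the UTF-8 encodings
 * of two single code points; the shorter one wins a tie on the prefix. */
int utf8_memcmp_sign(uint32_t a, uint32_t b)
{
    unsigned char ba[4] = {0}, bb[4] = {0};
    size_t la = utf8_encode(a, ba);
    size_t lb = utf8_encode(b, bb);
    int r = memcmp(ba, bb, la < lb ? la : lb);
    if (r == 0)
        r = (la > lb) - (la < lb);
    return (r > 0) - (r < 0);
}
```

U+00FF encodes as C3 BF and U+0100 as C4 80, so the bytewise result matches the code point order, and no byte-order question arises because UTF-8 is defined as a byte sequence.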
> substr's already been mentioned.
I have already given the counter argument. The codepoint position is useless
in many cases. They should be deprecated.
> Regular expressions. Perl does rather a lot of them. We've already found from
> Perl 5 development that they get nasty when variable length
At 12:20 PM 2/16/2001 -0800, Hong Zhang wrote:
> > >People in Japan/China/Korea have been using multi-byte encoding for
> > >a long time. I personally have used it for more than 10 years. I have never
> > >felt much of the "pain". Do you think I am using my computer with O(n)
> > >while you are using it with
At 12:32 PM 2/16/2001 -0800, Hong Zhang wrote:
> > > What do you mean? Have you seen people using multi-byte encoding
> > > in Japan/China/Korea?
> >
> > You're talking to the wrong person. Japanese data handling is my graduate
> > dissertation. :)
> >
> > The Unified Hangul/Kanji/Ha'nzi' Characte
On Friday 16 February 2001 15:35, Simon Cozens wrote:
> On Fri, Feb 16, 2001 at 12:32:10PM -0800, Hong Zhang wrote:
> > Did it buy you much? I don't believe so. Can you give some examples why
> > random character access is so important?
>
> substr's already been mentioned.
>
> Regular expression
On Fri, Feb 16, 2001 at 12:32:10PM -0800, Hong Zhang wrote:
> Did it buy you much? I don't believe so. Can you give some examples why
> random character access is so important?
substr's already been mentioned.
Regular expressions. Perl does rather a lot of them. We've already found from
Perl 5 d
> > What do you mean? Have you seen people using multi-byte encoding
> > in Japan/China/Korea?
>
> You're talking to the wrong person. Japanese data handling is my graduate
> dissertation. :)
>
> The Unified Hangul/Kanji/Ha'nzi' Characters in Unicode (so-called "Unihan")
> occupy one and only one
> >People in Japan/China/Korea have been using multi-byte encoding for
> >a long time. I personally have used it for more than 10 years. I have never
> >felt much of the "pain". Do you think I am using my computer with O(n)
> >while you are using it with O(1)? There are 100 million people using
> >variable-l
Dan Sugalski wrote:
> At 05:09 PM 2/15/2001 -0800, Hong Zhang wrote:
> >People in Japan/China/Korea have been using multi-byte encoding for
> >a long time. I personally have used it for more than 10 years. I have never
> >felt much of the "pain". Do you think I am using my computer with O(n)
> >while you are
On Fri, Feb 16, 2001 at 10:24:51AM -0300, Branden wrote:
> Yes, for UTF-16 it is. For UTF-32 it isn't
Yes, it damned well is.
You're confusing "codepoint" with "number of bytes in representation".
--
I would imagine most of the readers of this group would support abortion
as long as fifty or s
On Fri, Feb 16, 2001 at 12:26:43PM +, Simon Cozens wrote:
> On Fri, Feb 16, 2001 at 10:24:51AM -0300, Branden wrote:
> > Yes, for UTF-16 it is. For UTF-32 it isn't
>
> Yes, it damned well is.
I mean, no, it damned well isn't. But you probably guessed that.
> You're confusing "codepoint" wit
Simon Cozens wrote:
> On Thu, Feb 15, 2001 at 03:59:54PM -0800, Hong Zhang wrote:
> > The concept of characters has nothing to do with codepoints.
> > Many characters are composed of more than one codepoint.
>
> This isn't true.
>
Yes, for UTF-16 it is. For UTF-32 it isn't, but unless you want
On Thu, Feb 15, 2001 at 04:55:00PM -0800, Hong Zhang wrote:
> > On Thu, Feb 15, 2001 at 03:59:54PM -0800, Hong Zhang wrote:
> > > The concept of characters has nothing to do with codepoints.
> > > Many characters are composed of more than one codepoint.
> >
> > This isn't true.
>
> What do you
On Thu, Feb 15, 2001 at 05:09:45PM -0800, Hong Zhang wrote:
> People in Japan/China/Korea have been using multi-byte encoding for
> a long time. I personally have used it for more than 10 years.
And now you have a chance to not do so. Isn't that *nice*?
--
Term, holidays, term, holidays, till we leav
At 05:09 PM 2/15/2001 -0800, Hong Zhang wrote:
> > ...and because of this you can't randomly access the string, you are
> > reduced to sequential access (*). And here I thought we could have
> > left tape drives to the last millennium.
> >
> > (*) Yes, of course you could cache your sequential ac
> ...and because of this you can't randomly access the string, you are
> reduced to sequential access (*). And here I thought we could have
> left tape drives to the last millennium.
>
> (*) Yes, of course you could cache your sequential access so you only
> need to do it once, and build balance
> On Thu, Feb 15, 2001 at 03:59:54PM -0800, Hong Zhang wrote:
> > The concept of characters has nothing to do with codepoints.
> > Many characters are composed of more than one codepoint.
>
> This isn't true.
What do you mean? Have you seen people using multi-byte encoding
in Japan/China/Korea
On Thu, Feb 15, 2001 at 03:59:54PM -0800, Hong Zhang wrote:
> The concept of characters has nothing to do with codepoints.
> Many characters are composed of more than one codepoint.
This isn't true.
--
* DrForr digs around for a fresh IV drip bag and proceeds to hook up.
Coffee port.
Firewa
On Thu, Feb 15, 2001 at 11:16:29PM +, Simon Cozens wrote:
> On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote:
> > Personally I like the UTF-8 encoding. The solution to the
> > variable length can be handled by a special (virtual)
> > function like
>
> I'm expecting that the virtual,
> On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote:
> > Personally I like the UTF-8 encoding. The solution to the
> > variable length can be handled by a special (virtual)
> > function like
>
> I'm expecting that the virtual, internal representation will not
> be in a UTF but will simpl
On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote:
> Personally I like the UTF-8 encoding. The solution to the
> variable length can be handled by a special (virtual)
> function like
I'm expecting that the virtual, internal representation will not
be in a UTF but will simply be an array
Hi, All,
I want to give some of my thoughts about string encoding.
Personally I like the UTF-8 encoding. The solution to the
variable length can be handled by a special (virtual)
function like
class String {
    virtual UV iterate(/*inout*/ int* index);
};
So in typical string iteration, the
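
One possible reading of the iterate() idea, written as a plain C function rather than a virtual method. This is a sketch under stated assumptions, not the author's design: the input is well-formed UTF-8, *index sits on a code point boundary inside the buffer, uint32_t stands in for UV, and size_t replaces the int index. The names utf8_iterate and utf8_count are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the code point starting at s[*index] and advance *index past it. */
uint32_t utf8_iterate(const unsigned char *s, size_t *index)
{
    assert(s != NULL && index != NULL);
    size_t i = *index;
    unsigned char lead = s[i++];
    uint32_t cp;
    size_t extra; /* number of continuation bytes that follow */

    if (lead < 0x80) {
        cp = lead;
        extra = 0;
    } else if ((lead & 0xE0) == 0xC0) {
        cp = lead & 0x1F;
        extra = 1;
    } else if ((lead & 0xF0) == 0xE0) {
        cp = lead & 0x0F;
        extra = 2;
    } else {
        cp = lead & 0x07;
        extra = 3;
    }
    while (extra-- > 0)
        cp = (cp << 6) | (uint32_t)(s[i++] & 0x3F);

    *index = i;
    return cp;
}

/* Typical linear iteration: count the code points in len bytes. */
size_t utf8_count(const unsigned char *s, size_t len)
{
    size_t index = 0, count = 0;
    while (index < len) {
        (void)utf8_iterate(s, &index);
        count++;
    }
    return count;
}
```

With this shape the caller never computes byte offsets itself; it passes the same index back in on each step, so a linear walk over the string costs one decode per code point.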