Re: String rationale

2001-11-01 Thread Tom Hughes

In message [EMAIL PROTECTED]
Simon Cozens [EMAIL PROTECTED] wrote:

 On Sat, Oct 27, 2001 at 04:23:48PM +0100, Tom Hughes wrote:
  The encoding_lookup() and chartype_lookup() routines will obviously
  need to load the relevant libraries on the fly when we have support
  for that.
 
 Could you try rewriting them using an enum, like the vtable stuff and
 the original string encoding stuff does?

The intention is that when an encoding or character type is loaded it
will be allocated a unique ID number that can be used internally to
refer to it, but that number will only be valid for the duration of
that instance of parrot rather than being persistent. That's certainly
the way Dan described it happening in his rationale, which is what my
code is based on.

Allocating them globally is not possible if we're going to allow people
to add arbitrary encodings and character sets - as things stand, adding
the foo encoding will be as simple as adding foo.so to the encodings
directory.
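
To make that concrete, here's a minimal sketch of a lookup-by-name that
hands out run-time IDs on first use. The names and layout are assumed
purely for illustration, not lifted from the real code:

    #include <stdlib.h>
    #include <string.h>

    typedef struct encoding {
        const char      *name;
        int              id;      /* valid for this run of parrot only */
        struct encoding *next;
    } ENCODING;

    static ENCODING *known_encodings  = NULL;
    static int       next_encoding_id = 0;

    ENCODING *
    encoding_lookup(const char *name)
    {
        ENCODING *e;

        for (e = known_encodings; e; e = e->next)
            if (strcmp(e->name, name) == 0)
                return e;

        /* Not loaded yet: this is where encodings/<name>.so would be
           loaded on the fly. Then register it under the next free ID. */
        e = malloc(sizeof(*e));
        e->name = strdup(name);
        e->id   = next_encoding_id++;
        e->next = known_encodings;
        known_encodings = e;
        return e;
    }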

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu




Re: String rationale

2001-10-30 Thread Dan Sugalski

At 07:16 PM 10/29/2001 -0500, James Mastros wrote:
Yeah.  But that's a convention thing, I think.  I also think that most
people won't go to the bother of writing conversion functions that they
don't have to.  What we need to worry about is both, say, big5 and shiftjis
writing both of the conversions.  And it shouldn't come up all that much,
because Unicode is /supposed to be/ lossless for most things.

Supposed to be, yep. Whether it *is* or not is another issue entirely. :)

  I suspect that the encode and decode methods in the encoding vtable
  are enough for doing chr/ord aren't they?
Hmm... come to think of it, yes.  chr will always create a utf32-encoded
string with the given charset number (or unicode for the two-arg version),
ord will return the codepoint within the current charset.

Erk. No. chr should give you a string in the encoding you've selected, or 
the default encoding if you've not selected one. That may not be (probably 
won't be) UTF32.

(This, BTW, means that only encodings that feel like it have to provide
either, but all encodings must be able to convert to utf32.)

More or less, yep. Everyone has to go to UTF32. Direct encoding to encoding 
is optional. Encouraged in those cases where it's either quicker or less 
uncertain.

Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: String rationale

2001-10-29 Thread Tom Hughes

In message [EMAIL PROTECTED]
  Dan Sugalski [EMAIL PROTECTED] wrote:

 At 04:23 PM 10/27/2001 +0100, Tom Hughes wrote:

 Attached is my first pass at this - it's not fully ready yet but
 is something for people to cast an eye over before I spend lots of
 time going down the wrong path ;-)
 
 It looks pretty good on first glance.

I've done a bit more work now, and the latest version is attached.

This version can do transcoding. The intention is that there will be
some sort of cache in chartype_lookup_transcoder to avoid repeating
the expensive lookups by name too much.
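
As a rough illustration only - the type names and the helper here are
assumptions, not the posted code - the cache could be as simple as
remembering the last few (from, to) pairs:

    typedef struct parrot_string STRING;     /* opaque for this sketch */
    typedef struct chartype      CHARTYPE;   /* opaque for this sketch */
    typedef STRING *(*TRANSCODER)(STRING *src);

    /* Slow path, resolved by name; assumed to exist elsewhere. */
    extern TRANSCODER transcoder_lookup_by_name(const CHARTYPE *from,
                                                const CHARTYPE *to);

    #define TRANSCODE_CACHE_SIZE 8

    static struct {
        const CHARTYPE *from;
        const CHARTYPE *to;
        TRANSCODER      fn;
    } cache[TRANSCODE_CACHE_SIZE];

    static int cache_next = 0;

    TRANSCODER
    chartype_lookup_transcoder(const CHARTYPE *from, const CHARTYPE *to)
    {
        int i;

        for (i = 0; i < TRANSCODE_CACHE_SIZE; i++)
            if (cache[i].from == from && cache[i].to == to)
                return cache[i].fn;

        /* Cache miss: do the expensive lookup once and remember it. */
        cache[cache_next].from = from;
        cache[cache_next].to   = to;
        cache[cache_next].fn   = transcoder_lookup_by_name(from, to);
        i = cache_next;
        cache_next = (cache_next + 1) % TRANSCODE_CACHE_SIZE;
        return cache[i].fn;
    }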

One interesting question is who is responsible for transcoding
from character set A to character set B - is it A or B? and how
about the other way?

My code currently allows either set to provide the transform on the
grounds that otherwise the unicode module would have to either know
how to convert to everything else or from everything else.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/


# This is a patch for parrot to update it to parrot-ns
# 
# To apply this patch:
# STEP 1: Chdir to the source directory.
# STEP 2: Run the 'applypatch' program with this patch file as input.
#
# If you do not have 'applypatch', it is part of the 'makepatch' package
# that you can fetch from the Comprehensive Perl Archive Network:
# http://www.perl.com/CPAN/authors/Johan_Vromans/makepatch-x.y.tar.gz
# In the above URL, 'x' should be 2 or higher.
#
# To apply this patch without the use of 'applypatch':
# STEP 1: Chdir to the source directory.
# If you have a decent Bourne-type shell:
# STEP 2: Run the shell with this file as input.
# If you don't have such a shell, you may need to manually create/delete
# the files/directories as shown below.
# STEP 3: Run the 'patch' program with this file as input.
#
# These are the commands needed to create/delete files/directories:
#
mkdir 'chartypes'
chmod 0755 'chartypes'
mkdir 'encodings'
chmod 0755 'encodings'
rm -f 'transcode.c'
rm -f 'strutf8.c'
rm -f 'strutf32.c'
rm -f 'strutf16.c'
rm -f 'strnative.c'
rm -f 'include/parrot/transcode.h'
rm -f 'include/parrot/strutf8.h'
rm -f 'include/parrot/strutf32.h'
rm -f 'include/parrot/strutf16.h'
rm -f 'include/parrot/strnative.h'
touch 'chartype.c'
chmod 0644 'chartype.c'
touch 'chartypes/unicode.c'
chmod 0644 'chartypes/unicode.c'
touch 'chartypes/usascii.c'
chmod 0644 'chartypes/usascii.c'
touch 'encoding.c'
chmod 0644 'encoding.c'
touch 'encodings/singlebyte.c'
chmod 0644 'encodings/singlebyte.c'
touch 'encodings/utf16.c'
chmod 0644 'encodings/utf16.c'
touch 'encodings/utf32.c'
chmod 0644 'encodings/utf32.c'
touch 'encodings/utf8.c'
chmod 0644 'encodings/utf8.c'
touch 'include/parrot/chartype.h'
chmod 0644 'include/parrot/chartype.h'
touch 'include/parrot/encoding.h'
chmod 0644 'include/parrot/encoding.h'
#
# This command terminates the shell and need not be executed manually.
exit
#
 End of Preamble 

 Patch data follows 
diff -c 'parrot/MANIFEST' 'parrot-ns/MANIFEST'
Index: ./MANIFEST
*** ./MANIFEST  Sun Oct 28 17:11:21 2001
--- ./MANIFEST  Sun Oct 28 17:11:07 2001
***
*** 1,5 
--- 1,8 
  assemble.pl
  ChangeLog
+ chartype.c
+ chartypes/unicode.c
+ chartypes/usascii.c
  classes/genclass.pl
  classes/intclass.c
  classes/scalarclass.c
***
*** 15,20 
--- 18,28 
  docs/parrotbyte.pod
  docs/strings.pod
  docs/vtables.pod
+ encoding.c
+ encodings/singlebyte.c
+ encodings/utf8.c
+ encodings/utf16.c
+ encodings/utf32.c
  examples/assembly/bsr.pasm
  examples/assembly/call.pasm
  examples/assembly/euclid.pasm
***
*** 30,35 
--- 38,45 
  global_setup.c
  hints/mswin32.pl
  hints/vms.pl
+ include/parrot/chartype.h
+ include/parrot/encoding.h
  include/parrot/events.h
  include/parrot/exceptions.h
  include/parrot/global_setup.h
***
*** 46,56 
  include/parrot/runops_cores.h
  include/parrot/stacks.h
  include/parrot/string.h
- include/parrot/strnative.h
- include/parrot/strutf16.h
- include/parrot/strutf32.h
- include/parrot/strutf8.h
- include/parrot/transcode.h
  include/parrot/trace.h
  include/parrot/unicode.h
  interpreter.c
--- 56,61 
***
*** 108,117 
  runops_cores.c
  stacks.c
  string.c
- strnative.c
- strutf16.c
- strutf32.c
- strutf8.c
  test_c.in
  test_main.c
  Test/More.pm
--- 113,118 
***
*** 129,135 
  t/op/time.t
  t/op/trans.t
  trace.c
- transcode.c
  Types_pm.in
  vtable_h.pl
  vtable.tbl
--- 130,135 
diff -c 'parrot/Makefile.in' 'parrot-ns/Makefile.in'
Index: ./Makefile.in
*** ./Makefile.in   Wed Oct 24 19:23:47 2001
--- ./Makefile.in   Sat Oct 27 15:02:45 2001
***
*** 11,19 
  $(INC)/pmc.h $(INC)/resources.h
  
  O_FILES = global_setup$(O) interpreter$(O) parrot$(O) register$(O) \
! core_ops$(O) memory$(O) packfile$(O) stacks$(O) string$(O) strnative$(O) \
! strutf8$(O) strutf16$(O) strutf32$(O) transcode$(O) runops_cores$(O) \
! trace$(O) vtable_ops$(O) 

RE: String rationale

2001-10-29 Thread Stephen Howard

You might consider requiring all character sets be able to convert to Unicode, and
otherwise each set only has to know how to convert other character sets to its own
set.

-Original Message-
From: Tom Hughes [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 29, 2001 02:31 PM
To: [EMAIL PROTECTED]
Subject: Re: String rationale


In message [EMAIL PROTECTED]
  Dan Sugalski [EMAIL PROTECTED] wrote:

 At 04:23 PM 10/27/2001 +0100, Tom Hughes wrote:

 Attached is my first pass at this - it's not fully ready yet but
 is something for people to cast an eye over before I spend lots of
 time going down the wrong path ;-)

 It looks pretty good on first glance.

I've done a bit more work now, and the latest version is attached.

This version can do transcoding. The intention is that there will be
some sort of cache in chartype_lookup_transcoder to avoid repeating
the expensive lookups by name too much.

One interesting question is who is responsible for transcoding
from character set A to character set B - is it A or B? and how
about the other way?

My code currently allows either set to provide the transform on the
grounds that otherwise the unicode module would have to either know
how to convert to everything else or from everything else.

Tom

--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/




RE: String rationale

2001-10-29 Thread Dan Sugalski

At 02:52 PM 10/29/2001 -0500, Stephen Howard wrote:
You might consider requiring all character sets be able to convert to Unicode,

That's already a requirement. All character sets must be able to go to or 
come from Unicode. They can do others if they want, but it's not required. 
(And we'll have to figure out how to allow that reasonably efficiently)

Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




RE: String rationale

2001-10-29 Thread Stephen Howard

right.  I had just keyed in on this from Tom's message:

My code currently allows either set to provide the transform on the
grounds that otherwise the unicode module would have to either know
how to convert to everything else or from everything else.

...which seemed to posit that the Unicode module could be responsible for all the
transcodings to and from its own character set, which seemed backwards to me.

-Stephen

-Original Message-
From: Dan Sugalski [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 29, 2001 02:43 PM
To: Stephen Howard; Tom Hughes; [EMAIL PROTECTED]
Subject: RE: String rationale


At 02:52 PM 10/29/2001 -0500, Stephen Howard wrote:
You might consider requiring all character sets be able to convert to Unicode,

That's already a requirement. All character sets must be able to go to or
come from Unicode. They can do others if they want, but it's not required.
(And we'll have to figure out how to allow that reasonably efficiently)

Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk





RE: String rationale

2001-10-29 Thread Tom Hughes

In message [EMAIL PROTECTED]
  Stephen Howard [EMAIL PROTECTED] wrote:

 right.  I had just keyed in on this from Tom's message:
 
 My code currently allows either set to provide the transform on the
 grounds that otherwise the unicode module would have to either know
 how to convert to everything else or from everything else.
 
 ...which seemed to posit that the Unicode module could be responsible for
 all the transcodings to and from its own character set, which seemed
 backwards to me.

I was only positing it long enough to acknowledge that such a rule
was untenable.

What it comes down to is that there are three possible rules, namely:

  1. Each character set defines transforms from itself to other
 character sets.

  2. Each character set defines transforms to itself from other
 character sets.

  3. Each character set defines transforms both from itself to
 other character sets and from other character sets to itself.

We have established that the first two will not work because of the
unicode problem.

That leaves the third, which is what I have implemented. When looking to
transcode from A to B it will first ask A if it can transcode to B, and
if that fails then it will ask B if it can transcode from A.

That way each character set can manage its own translations both to
and from Unicode, as we require.
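
Roughly, and with names assumed just for the sake of the sketch rather
than taken from the patch, the lookup goes like this:

    typedef struct parrot_string STRING;
    typedef struct chartype      CHARTYPE;
    typedef STRING *(*TRANSCODER)(STRING *src);

    /* Assumed accessors: ask a set for its "to other" or "from other"
       transform; each returns NULL if the set doesn't provide one. */
    extern TRANSCODER chartype_transcode_to(const CHARTYPE *set,
                                            const CHARTYPE *other);
    extern TRANSCODER chartype_transcode_from(const CHARTYPE *set,
                                              const CHARTYPE *other);

    TRANSCODER
    find_transcoder(const CHARTYPE *a, const CHARTYPE *b)
    {
        TRANSCODER t = chartype_transcode_to(a, b);   /* can A go to B?     */
        if (t == NULL)
            t = chartype_transcode_from(b, a);        /* can B come from A? */
        return t;   /* NULL means no direct path between A and B */
    }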

The problem it raises is: who is responsible for transcoding from ASCII to
Latin-1, and back again? If we're not careful both ends will implement
both translations and we will end up with duplication.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/




Re: String rationale

2001-10-29 Thread James Mastros

On Mon, Oct 29, 2001 at 08:32:16PM +, Tom Hughes wrote:
 We have established that the first two will not work because of the
 unicode problem.
Hm.  I think instead of requiring Unicode to support everything, we should
require Unicode to support /nothing/.  If A and B have no mutual transcoding
function, we should use Unicode as an intermediary.  (This means that
charsets that are lossy to Unicode need to transcode to each other directly,
like Far Eastern sets.  (And Klingon, but that can't transcode to anything.))

This still makes Unicode a special case, but not a terrible one.  (In fact,
Unicode can be treated like any other charset, except when we want to
transcode between mutually incompatible sets, since we always try both A->B
and A<-B.)

(Notational note: A->B means that A is implementing a transcoding from itself
to B.  A<-B means that A is implementing a transcoding from B to A.)

 That leaves the third, which is what I have implemented. When looking to
 transcode from A to B it will first ask A if can it transcode to B and
 if that fails then it will ask B if it can transcode from A.
I propose another variant on this:
If that fails, it asks A to transcode to Unicode, and B to transcode from
Unicode.  (Not Unicode to transcode to B; Unicode implements no transcodings.)

 The problem it raises is: who is responsible for transcoding from ASCII to
 Latin-1, and back again? If we're not careful both ends will implement
 both translations and we will end up with duplication.
1) Neither.  Each must support transcoding to and from Unicode.
2) But either can support converting directly if it wants.

I also think that, for efficiency, we might want a "7-bit chars match ASCII"
flag, since most character sets do, and that means that we don't have to deal
with the overhead for strings that fit in 7 bits.  This smells of premature
optimization, though, so somebody just file this away in their heads for
future reference.

That would also mean that neither is responsible for converting between
Latin-1 and ASCII, because the core will do it most of the time, and the
rest of the time it isn't possible.

Hm.  But it isn't possible _losslessly_, though it is possible lossily.
IMHO, there should be two ways to transcode, or the transcoding function
should flag it to its caller somehow.

(Sorry for the train-of-thought, but I think it's decently clear.)

(BTW, for those paying attention, I'm waiting on this discussion for my
chr/ord patch, since I want them in terms of charsets, not encodings.)

   -=- James Mastros



Re: String rationale

2001-10-29 Thread Tom Hughes

In message [EMAIL PROTECTED]
  James Mastros [EMAIL PROTECTED] wrote:

  That leaves the third, which is what I have implemented. When looking to
  transcode from A to B it will first ask A if can it transcode to B and
  if that fails then it will ask B if it can transcode from A.
 I propose another variant on this:
 If that fails, it asks A to transcode to Unicode, and B to transcode from
 Unicode.  (Not Unicode to transcode to B; Unicode implements no transcodings.)

My code does that, though at a slightly higher level. If you look
at string_transcode() you will see that if it can't find a direct
mapping it will go via Unicode. If C had closures then I'd have
buried that down in the chartype_lookup_transcoder() layer, but it
doesn't, so I couldn't ;-)
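
Sketched out with assumed names and a cut-down string layout (not the
code as posted), that fallback is something like:

    typedef struct chartype CHARTYPE;
    typedef struct parrot_string {
        const CHARTYPE *type;     /* character set of the string */
        /* buffer fields omitted for this sketch */
    } STRING;
    typedef STRING *(*TRANSCODER)(STRING *src);

    extern const CHARTYPE *unicode_chartype;   /* the mandatory pivot */
    extern TRANSCODER find_transcoder(const CHARTYPE *from, const CHARTYPE *to);

    STRING *
    string_transcode(STRING *src, const CHARTYPE *to)
    {
        TRANSCODER direct = find_transcoder(src->type, to);
        if (direct != NULL)
            return direct(src);

        /* No direct mapping: go via Unicode, accepting possible loss. */
        {
            TRANSCODER up   = find_transcoder(src->type, unicode_chartype);
            TRANSCODER down = find_transcoder(unicode_chartype, to);
            if (up != NULL && down != NULL)
                return down(up(src));
        }
        return NULL;   /* caller reports the failure */
    }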

  The problem it raises is: who is responsible for transcoding from ASCII to
  Latin-1, and back again? If we're not careful both ends will implement
  both translations and we will end up with duplication.
 1) Neither.  Each must support transcoding to and from Unicode.

Absolutely.

 2) But either can support converting directly if it wants.

The danger is that everybody tries to be clever and support direct
conversion to and from as many other character sets as possible, which
leads to lots of duplication.

 I also think that, for efficiency, we might want a "7-bit chars match ASCII"
 flag, since most character sets do, and that means that we don't have to deal
 with the overhead for strings that fit in 7 bits.  This smells of premature
 optimization, though, so somebody just file this away in their heads for
 future reference.

I have already been thinking about this, although it does get more
complicated as you have to consider the encoding as well: if you
have a single-byte encoded ASCII string then transcoding to a single-byte
encoded Latin-1 string is a no-op, but that may not be true for
other encodings, if such a thing even makes sense for those character types.
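
Purely as an illustration - the struct layout and names here are
assumptions, not the patch - the single-byte case could be expressed
along these lines:

    typedef struct chartype  CHARTYPE;   /* opaque here */
    typedef struct encoding  ENCODING;   /* opaque here */

    typedef struct parrot_string {
        const ENCODING *encoding;   /* byte layout, e.g. singlebyte, utf8 */
        const CHARTYPE *type;       /* character set, e.g. usascii, latin1 */
        /* buffer fields omitted */
    } STRING;

    extern const ENCODING *singlebyte_encoding;

    /* ASCII -> Latin-1 when both are single-byte: nothing to rewrite,
       just relabel the character set. */
    STRING *
    transcode_usascii_to_latin1(STRING *src, const CHARTYPE *latin1)
    {
        if (src->encoding == singlebyte_encoding) {
            src->type = latin1;     /* every ASCII byte is valid Latin-1 */
            return src;
        }
        return NULL;   /* other encodings need a real conversion pass */
    }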

 (BTW, for those paying attention, I'm waiting on this discussion for my
 chr/ord patch, since I want them in terms of charsets, not encodings.)

I suspect that the encode and decode methods in the encoding vtable
are enough for doing chr/ord aren't they?

Surely chr() is just encoding the argument in the chosen encoding (which
can be the default encoding for the char type if you want) and then setting
the type and encoding of the resulting string appropriately.

Equally ord() is decoding the first character of the string to get a
number.
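
As a sketch, assuming (for illustration only) an encoding vtable with
encode and decode entries and a simplified string structure, that might
look like:

    typedef unsigned int UINTVAL;
    typedef struct chartype CHARTYPE;    /* opaque here */

    typedef struct encoding {
        void   *(*encode)(void *ptr, UINTVAL codepoint); /* write one code point,
                                                            return the next ptr */
        UINTVAL (*decode)(const void *ptr);              /* read one code point */
        UINTVAL   max_bytes;                             /* worst case per point */
    } ENCODING;

    typedef struct parrot_string {
        void           *bufstart;
        UINTVAL         bufused;
        UINTVAL         strlen;
        const ENCODING *encoding;
        const CHARTYPE *type;
    } STRING;

    /* Assumed constructor: an empty string with room for 'bytes' bytes. */
    extern STRING *string_make_empty(UINTVAL bytes, const ENCODING *enc,
                                     const CHARTYPE *type);

    STRING *
    string_chr(const ENCODING *enc, const CHARTYPE *type, UINTVAL codepoint)
    {
        STRING *s   = string_make_empty(enc->max_bytes, enc, type);
        void   *end = enc->encode(s->bufstart, codepoint);

        s->bufused = (UINTVAL)((char *)end - (char *)s->bufstart);
        s->strlen  = 1;
        return s;
    }

    UINTVAL
    string_ord(const STRING *s)
    {
        return s->encoding->decode(s->bufstart);
    }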

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/




Re: String rationale

2001-10-29 Thread James Mastros

On Mon, Oct 29, 2001 at 11:20:47PM +, Tom Hughes wrote:
  2) But either can support converting directly if it wants.
 The danger is that everybody tries to be clever and support direct
 conversion to and from as many other character sets as possible, which
 leads to lots of duplication.
Yeah.  But that's a convention thing, I think.  I also think that most
people won't go to the bother of writing conversion functions that they
don't have to.  What we need to worry about is both, say, big5 and shiftjis
writing both of the conversions.  And it shouldn't come up all that much,
because Unicode is /supposed to be/ lossless for most things.

 I have already been thinking about this although it does get more
 complicated as you have to consider the encoding as well - if you
 have a single byte encoded ASCII string then transcoding to a single
 byte encoded Latin-1 string is a no-op, but that may not be true for
 other encodings if such a thing makes sense for those character types.
Hm.  All the encodings I can think of (which is rather limited -- the UTFs),
you can scan for units (i.e. ints of the proper size) > 0x7f, and if you don't
find any, it's 7-bit, and you can just change the charset marker without
doing any work.

In any case, it's up to the encoding to tell if we've got a pure 7-bit
string.  If that's complicated for it, it can just always return FALSE.
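
For a byte-oriented buffer that check is just a scan over the units
(wider encodings would scan their own unit size); something like this
illustrative helper:

    #include <stddef.h>

    /* Return 1 if every unit fits in 7 bits, so the string can be
       relabelled ASCII-compatible without rewriting the buffer. */
    int
    buffer_is_7bit(const unsigned char *buf, size_t len)
    {
        size_t i;

        for (i = 0; i < len; i++)
            if (buf[i] > 0x7f)
                return 0;
        return 1;
    }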

 I suspect that the encode and decode methods in the encoding vtable
 are enough for doing chr/ord aren't they?
Hmm... come to think of it, yes.  chr will always create a utf32-encoded
string with the given charset number (or unicode for the two-arg version),
ord will return the codepoint within the current charset.

(This, BTW, means that only encodings that feel like it have to provide
either, but all encodings must be able to convert to utf32.)

Powers-that-be (I'm looking at you, Dan), is that good?

   -=- James Mastros



Re: String rationale

2001-10-27 Thread Tom Hughes

In message [EMAIL PROTECTED]
  Tom Hughes [EMAIL PROTECTED] wrote:

 Other than that it looked quite good and I'll probably start looking at
 bending the existing code into the new model over the weekend.

Attached is my first pass at this - it's not fully ready yet but
is something for people to cast an eye over before I spend lots of
time going down the wrong path ;-)

The encoding_lookup() and chartype_lookup() routines will obviously
need to load the relevant libraries on the fly when we have support
for that.

The packfile stuff is just a hack to make it work for now. Presumably
we will have to modify the byte code format to record the string types
as names or something so we can look them up properly?

String comparison is not language sensitive here - as before it just
compares based on character values.

Other than that I think it's aiming in the right direction and it does
pass all the tests... Please correct me if I'm wrong.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/


# This is a patch for parrot to update it to parrot-ns
# 
# To apply this patch:
# STEP 1: Chdir to the source directory.
# STEP 2: Run the 'applypatch' program with this patch file as input.
#
# If you do not have 'applypatch', it is part of the 'makepatch' package
# that you can fetch from the Comprehensive Perl Archive Network:
# http://www.perl.com/CPAN/authors/Johan_Vromans/makepatch-x.y.tar.gz
# In the above URL, 'x' should be 2 or higher.
#
# To apply this patch without the use of 'applypatch':
# STEP 1: Chdir to the source directory.
# If you have a decent Bourne-type shell:
# STEP 2: Run the shell with this file as input.
# If you don't have such a shell, you may need to manually create/delete
# the files/directories as shown below.
# STEP 3: Run the 'patch' program with this file as input.
#
# These are the commands needed to create/delete files/directories:
#
mkdir 'chartypes'
chmod 0755 'chartypes'
mkdir 'encodings'
chmod 0755 'encodings'
rm -f 'transcode.c'
rm -f 'strutf8.c'
rm -f 'strutf32.c'
rm -f 'strutf16.c'
rm -f 'strnative.c'
rm -f 'include/parrot/transcode.h'
rm -f 'include/parrot/strutf8.h'
rm -f 'include/parrot/strutf32.h'
rm -f 'include/parrot/strutf16.h'
rm -f 'include/parrot/strnative.h'
touch 'chartype.c'
chmod 0644 'chartype.c'
touch 'chartypes/unicode.c'
chmod 0644 'chartypes/unicode.c'
touch 'chartypes/usascii.c'
chmod 0644 'chartypes/usascii.c'
touch 'encoding.c'
chmod 0644 'encoding.c'
touch 'encodings/singlebyte.c'
chmod 0644 'encodings/singlebyte.c'
touch 'encodings/utf16.c'
chmod 0644 'encodings/utf16.c'
touch 'encodings/utf32.c'
chmod 0644 'encodings/utf32.c'
touch 'encodings/utf8.c'
chmod 0644 'encodings/utf8.c'
touch 'include/parrot/chartype.h'
chmod 0644 'include/parrot/chartype.h'
touch 'include/parrot/encoding.h'
chmod 0644 'include/parrot/encoding.h'
#
# This command terminates the shell and need not be executed manually.
exit
#
 End of Preamble 

 Patch data follows 
diff -c 'parrot/MANIFEST' 'parrot-ns/MANIFEST'
Index: ./MANIFEST
*** ./MANIFEST  Wed Oct 24 22:16:51 2001
--- ./MANIFEST  Sat Oct 27 14:59:43 2001
***
*** 1,5 
--- 1,8 
  assemble.pl
  ChangeLog
+ chartype.c
+ chartypes/unicode.c
+ chartypes/usascii.c
  classes/genclass.pl
  classes/intclass.c
  config_h.in
***
*** 14,19 
--- 17,27 
  docs/parrotbyte.pod
  docs/strings.pod
  docs/vtables.pod
+ encoding.c
+ encodings/singlebyte.c
+ encodings/utf8.c
+ encodings/utf16.c
+ encodings/utf32.c
  examples/assembly/bsr.pasm
  examples/assembly/call.pasm
  examples/assembly/euclid.pasm
***
*** 29,34 
--- 37,44 
  global_setup.c
  hints/mswin32.pl
  hints/vms.pl
+ include/parrot/chartype.h
+ include/parrot/encoding.h
  include/parrot/events.h
  include/parrot/exceptions.h
  include/parrot/global_setup.h
***
*** 45,55 
  include/parrot/runops_cores.h
  include/parrot/stacks.h
  include/parrot/string.h
- include/parrot/strnative.h
- include/parrot/strutf16.h
- include/parrot/strutf32.h
- include/parrot/strutf8.h
- include/parrot/transcode.h
  include/parrot/trace.h
  include/parrot/unicode.h
  interpreter.c
--- 55,60 
***
*** 107,116 
  runops_cores.c
  stacks.c
  string.c
- strnative.c
- strutf16.c
- strutf32.c
- strutf8.c
  test_c.in
  test_main.c
  Test/More.pm
--- 112,117 
***
*** 128,134 
  t/op/time.t
  t/op/trans.t
  trace.c
- transcode.c
  Types_pm.in
  vtable_h.pl
  vtable.tbl
--- 129,134 
diff -c 'parrot/Makefile.in' 'parrot-ns/Makefile.in'
Index: ./Makefile.in
*** ./Makefile.in   Wed Oct 24 19:23:47 2001
--- ./Makefile.in   Sat Oct 27 15:02:45 2001
***
*** 11,19 
  $(INC)/pmc.h $(INC)/resources.h
  
  O_FILES = global_setup$(O) interpreter$(O) parrot$(O) register$(O) \
! core_ops$(O) memory$(O) packfile$(O) stacks$(O) string$(O) strnative$(O) \
! strutf8$(O) strutf16$(O) strutf32$(O) transcode$(O) runops_cores$(O) \
! trace$(O) vtable_ops$(O) 

Re: String rationale

2001-10-27 Thread Tom Hughes

In message [EMAIL PROTECTED]
  Tom Hughes [EMAIL PROTECTED] wrote:

 Attached is my first pass at this - it's not fully ready yet but
 is something for people to cast an eye over before I spend lots of
 time going down the wrong path ;-)

Before anybody else spots it, let me just add what I forgot to mention
in my original post, which is that transcoding isn't implemented yet
as I'm still thinking about the best way to do it. There is a hook
in place ready for it, though.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/




Re: String rationale

2001-10-27 Thread Dan Sugalski

At 04:23 PM 10/27/2001 +0100, Tom Hughes wrote:
In message [EMAIL PROTECTED]
   Tom Hughes [EMAIL PROTECTED] wrote:

  Other than that it looked quite good and I'll probably start looking at
  bending the existing code into the new model over the weekend.

Attached is my first pass at this - it's not fully ready yet but
is something for people to cast an eye over before I spend lots of
time going down the wrong path ;-)

It looks pretty good on first glance.

The packfile stuff is just a hack to make it work for now. Presumably
we will have to modify the byte code format to record the string types
as names or something so we can look them up properly?

Yup. I think tagging the strings with a few type integers and a set of 
name-type tables in the bytecode are going to be needed for this.

String comparison is not language sensitive here - as before it just
compares based on character values.

I'm still unsure as to how to properly handle locale-aware comparison, 
which is an interesting problem in and of itself. Luckily we just need to 
make the facilities for it, and someone else handles the policy. :)

Other than that I think it's aiming in the right direction and it does
pass all the tests... Please correct me if I'm wrong.

Let me mull it over a bit. I think I'm going to commit it, but a second 
think on it won't hurt.

Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




String rationale

2001-10-25 Thread Dan Sugalski

'Kay, here's the string background info I promised. If things are missing
or unclear, let me know and I'll fix it up.


==Cut here with a very sharp knife===
=head1 TITLE

A parrot string backgrounder

=head1 Overview

Strings, in parrot, are compartmentalized, the same way so much else
in Parrot is compartmentalized. There's no single 'blessed' string
encoding--the closest we come is Unicode, and only as an encoding of
last resort. (Unicode's not a good interchange format, as it loses
information)

=head2 From the Outside

On the outside, the interpreter considers strings to be a sort of
black box. The only bits of the interpreter that much care about the
string data are the regex engine parts, and those only operate on
fixed-sized data.

The interpreter can only peek inside a string if that string is of
fixed length, and the interpreter doesn't actually care about the
character set the data is in. All character sets must provide a way to
transcode to Unicode, and all character encodings must provide a way
to turn their characters into fixed-sized entities. (The size may be
8, 16, or 32 bits as need be for the character set)

Character sets may provide a way to transcode to non-Unicode sets, for
example from EBCDIC to ASCII, but this is optional. If none is
provided a transcoding from one set to another will use Unicode as an
intermediate form, complete with potential data loss.

All character sets must provide the character lists the regular
expression engine needs for the base character classes. (space, word,
and digit characters) This permits the regular expression code to
operate on the contents of a string without needing to know its actual
character set.
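
A minimal sketch, assuming a vtable shape invented here purely for
illustration (the real interface is still in flux), of how a character
set could hand those classes to the regex engine:

    typedef unsigned int UINTVAL;
    typedef int (*char_class_fn)(UINTVAL codepoint);

    typedef struct chartype {
        const char    *name;
        char_class_fn  is_space;      /* \s for the regex engine */
        char_class_fn  is_wordchar;   /* \w */
        char_class_fn  is_digit;      /* \d */
    } CHARTYPE;

    /* How a US-ASCII character set might fill in the slots. */
    static int usascii_is_digit(UINTVAL c) { return c >= '0' && c <= '9'; }
    static int usascii_is_space(UINTVAL c) {
        return c == ' ' || c == '\t' || c == '\n'
            || c == '\r' || c == '\f' || c == '\v';
    }
    static int usascii_is_wordchar(UINTVAL c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || usascii_is_digit(c) || c == '_';
    }

    static const CHARTYPE usascii_chartype = {
        "usascii", usascii_is_space, usascii_is_wordchar, usascii_is_digit
    };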

=head2 From the Inside

=head2 Technical details

The base string structure looks like:

    struct parrot_string {
        void *bufstart;
        INTVAL buflen;
        INTVAL bufused;
        INTVAL flags;
        INTVAL strlen;
        STRING_VTABLE* encoding;
        INTVAL type;
        INTVAL language;
    };
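
For illustration only, the encoding half of that vtable might carry
slots along these lines - this is an assumption, not a committed
interface:

    typedef unsigned int UINTVAL;

    typedef struct string_vtable {
        const char  *name;                                  /* "utf8", "utf32", ... */
        UINTVAL    (*decode)(const void *ptr);              /* code point at ptr    */
        void      *(*encode)(void *ptr, UINTVAL codepoint); /* write one code point */
        const void *(*skip_forward)(const void *ptr, UINTVAL n); /* advance n points */
        UINTVAL      max_bytes;                             /* worst case per point */
    } STRING_VTABLE;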


=head2 Fields

=over 4

=item bufstart

Where the string buffer starts

=item buflen

How big the buffer is

=item bufused

How much of the buffer's used

=item flags

A variety of flags. Low 16 bits reserved to Parrot, the rest are free
for the string encoding library to use

=item strlen

How long the string is in code points. (Note that, for encodings that
are more than 8 bits per code point, or of variable length, this will
B<not> be the same as the buffer used.)

=item encoding

Pointer to the library that handles the string encoding. Encoding is
basically how the stream of bytes pointed to by C<bufstart> can be
turned into a stream of 32-bit codepoints. Examples include UTF-8, Big
5, or Shift JIS. Unicode, Ascii, or EBCDIC are B<not> encodings.first

=item type

What the character set or type of data is encoded in the buffer. This
includes things like ASCII, EBCDIC, Unicode, Chinese Traditional,
Chinese Simplified, or Shift-JIS. (And yes, I know the latter's a
combination of type and encoding. I'll update the doc as soon as I can
reasonably separate the two)

=item language

The language the string is in. This is essential for proper sorting,
if a sort function wants to be language-aware. Just an encoding/type
is insufficient for proper sorting--for example knowing a string is
UTF-32/Unicode doesn't tell you how the data should be ordered. This
is especially important for those languages that overlap in the
Unicode code space. Japanese and Chinese, for example, share many of
the Unicode code points but sort those code points differently.

=back

Libraries for processing character sets and encodings are shareable
libraries, and may be loaded on demand. They are looked up and
referenced by name. An identifying number is given to them at load
time and shouldn't be used outside the currently running
process. (EBCDIC might be character set 3 in one run and set 7 in
another)

The native encoding and character set are I<never> considered a 'real'
encoding or character set. They just specify what the default is if
nothing else is specified, but when bytecode is frozen to disk the
actual encoding or set name will be used instead.

Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: String rationale

2001-10-25 Thread Sam Tregar

On Thu, 25 Oct 2001, Dan Sugalski wrote:

 The only bits of the interpreter that much care about the
 string data are the regex engine parts, and those only operate on
 fixed-sized data.

Care to elaborate?  I thought the mandate from Larry was to have regexes
compile down to a stream of string ops.  Doesn't that mean it should work
regardless of the encoding of the string?

 The interpreter can only peek inside a string if that string is of
 fixed length, and the interpreter doesn't actually care about the
 character set the data is in.

Why is this necessary at all?  Wouldn't it be preferable to have all
access go through the String vtable regardless of the encoding?

 =item encoding

 Pointer to the library that handles the string encoding. Encoding is
 basically how the stream of bytes pointed to by C<bufstart> can be
 turned into a stream of 32-bit codepoints. Examples include UTF-8, Big
 5, or Shift JIS. Unicode, Ascii, or EBCDIC are B<not> encodings.first

.first?

Aside from the above, this was a nice refresher.

-sam




Re: String rationale

2001-10-25 Thread Dan Sugalski

At 12:19 PM 10/25/2001 -0400, Sam Tregar wrote:
On Thu, 25 Oct 2001, Dan Sugalski wrote:

  The only bits of the interpreter that much care about the
  string data are the regex engine parts, and those only operate on
  fixed-sized data.

Care to elaborate?  I thought the mandate from Larry was to have regexes
compile down to a stream of string ops.  Doesn't that mean it should work
regardless of the encoding of the string?

Since the encoding just determines how the abstract code point numbers are 
represented in bytes, I'm OK with requiring strings we process internally 
to be in a fixed-size version.

And regexes will be done with a stream of parrot opcodes, presuming that's
not too slow. There'll be ops to reference the code point at position X in
a string and check to see if it's in a list of other code points and
suchlike things. Basically we'll peek under the covers, but only for
fixed-length strings.
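
Something like this, say - not a real parrot op, just the shape of the
cheat for a fixed 32-bit encoding:

    #include <stddef.h>

    typedef unsigned int UINTVAL;

    /* For a fixed-width 32-bit encoding, "code point at position X"
       is a plain array index - no vtable call needed. */
    UINTVAL
    codepoint_at(const UINTVAL *buf, size_t pos)
    {
        return buf[pos];
    }

    /* Is the code point one of a small list (a character class)? */
    int
    codepoint_in_list(UINTVAL cp, const UINTVAL *list, size_t n)
    {
        size_t i;

        for (i = 0; i < n; i++)
            if (list[i] == cp)
                return 1;
        return 0;
    }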

  The interpreter can only peek inside a string if that string is of
  fixed length, and the interpreter doesn't actually care about the
  character set the data is in.

Why is this necessary at all?  Wouldn't it be preferable to have all
access go through the String vtable regardless of the encoding?

Speed. We're going to take something of a hit decomposing to ops as it 
is--if we can safely cheat, I'm OK with mandating it to be required. :)

  =item encoding
 
 Pointer to the library that handles the string encoding. Encoding is
 basically how the stream of bytes pointed to by C<bufstart> can be
 turned into a stream of 32-bit codepoints. Examples include UTF-8, Big
 5, or Shift JIS. Unicode, Ascii, or EBCDIC are B<not> encodings.first

.first?

Trailing buffer gook.

Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: String rationale

2001-10-25 Thread Tom Hughes

In message [EMAIL PROTECTED]
  Dan Sugalski [EMAIL PROTECTED] wrote:

 =item type
 
 What the character set or type of data is encoded in the buffer. This
 includes things like ASCII, EBCDIC, Unicode, Chinese Traditional,
 Chinese Simplified, or Shift-JIS. (And yes, I know the latter's a
 combination of type and encoding. I'll update the doc as soon as I can
 reasonablty separate the two)

Isn't this going to need to be a vtable pointer like encoding is? Only
some things (like character classification and at least some transcoding
tasks) will be character set based rather than encoding based.

Other than that it looked quite good and I'll probably start looking at
bending the existing code into the new model over the weekend.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/




Re: String rationale

2001-10-25 Thread Dan Sugalski

At 11:59 PM 10/25/2001 +0100, Tom Hughes wrote:
In message [EMAIL PROTECTED]
   Dan Sugalski [EMAIL PROTECTED] wrote:

  =item type
 
  What the character set or type of data is encoded in the buffer. This
  includes things like ASCII, EBCDIC, Unicode, Chinese Traditional,
  Chinese Simplified, or Shift-JIS. (And yes, I know the latter's a
  combination of type and encoding. I'll update the doc as soon as I can
  reasonablty separate the two)

Isn't this going to need to be a vtable pointer like encoding is?

Yup. I'd intended it to be an index into a table of character set 
functions. Jarkko has convinced me that it's better to have it as a vtable 
pointer, but I haven't had a chance to update the docs yet.


Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk