Re: String rationale

2001-11-01 Thread Tom Hughes

In message <[EMAIL PROTECTED]>
Simon Cozens <[EMAIL PROTECTED]> wrote:

> As things stand, that won't work, because you're doing a string lookup in one
> of the core functions, and you still need some way of registering incoming
> stuff. With an enum, you can keep hold of a fake encoding_max, and hand
> encoding_max++ to the initialisation function for each encoding.

Well there won't be any point in it being an enum rather than an 
integer unless some of them are going to be preallocated. I'm not
sure if the encoding and character types will need to know their
own index numbers but if we do then they can be told at initialisation
time, yes.

I absolutely intend that the current hard coded strings in the core
will go away in due course though. When you look up an encoding or
character type by name it will first check a hash table or something
to see if it is already loaded and if not it will look for it on disk
and load it in, allocate it a number, and add it to the hash table
for future reference.

Hence the current strcmp junk in the lookup functions will go away.
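
The lookup described above could be sketched roughly as follows. This is only an illustration under assumptions: encoding_lookup_id(), load_from_disk() and the bucket layout are hypothetical names, the loader is a stub that always succeeds, and the loads counter exists purely to make the caching visible.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 16

typedef struct entry {
    char         *name;
    int           id;      /* valid only for this run of the interpreter */
    struct entry *next;
} entry;

static entry *buckets[NBUCKETS];
static int next_id = 0;
static int loads = 0;      /* counts simulated disk loads (illustration only) */

static unsigned hash_name(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Hypothetical stand-in for dynamic loading; every name "loads" here. */
static int load_from_disk(const char *name)
{
    (void)name;
    loads++;
    return 1;
}

/* Return the runtime ID for name, loading and registering it on first use.
 * Returns -1 if it cannot be loaded or memory runs out. */
int encoding_lookup_id(const char *name)
{
    unsigned h;
    entry *e;

    assert(name != NULL);
    h = hash_name(name);

    for (e = buckets[h]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->id;              /* already loaded */

    if (!load_from_disk(name))
        return -1;

    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->name = malloc(strlen(name) + 1);
    if (e->name == NULL) {
        free(e);
        return -1;
    }
    strcpy(e->name, name);
    e->id = next_id++;                 /* allocate the next free number */
    e->next = buckets[h];
    buckets[h] = e;
    return e->id;
}
```

A name comparison survives inside the bucket walk, but it only runs against the few entries sharing a hash bucket rather than against every known name in turn.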

In much the same way the byte code will have some sort of table of
names which it will look up as it is loaded rather than the current
hard coding of name to number mappings in the byte code.

So all I need now to make all this work is hash tables and dynamic
code loading ;-) Any volunteers...

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu




Re: String rationale

2001-11-01 Thread Simon Cozens

On Thu, Nov 01, 2001 at 02:18:17PM +, Tom Hughes wrote:
> > Could you try rewriting them using an enum, like the vtable stuff and
> > the original string encoding stuff does?
> 
> Allocating them globally is not possible if we're going to allow people
> to add arbitrary encodings and character sets - as things stand adding
> the foo encoding will be as simple as adding foo.so to the encodings
> directory.

As things stand, that won't work, because you're doing a string lookup in one
of the core functions, and you still need some way of registering incoming
stuff. With an enum, you can keep hold of a fake encoding_max, and hand
encoding_max++ to the initialisation function for each encoding.
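
A minimal sketch of that scheme, assuming a fixed-size table: the enum preallocates the built-in encodings and its last member marks the first free slot. The names enum_encoding_*, encoding_register() and MAX_ENCODINGS are hypothetical and not taken from the Parrot source.

```c
#include <assert.h>
#include <stddef.h>

/* Preallocated IDs for the built-in encodings. enum_encoding_max is the
 * "fake" entry marking where dynamically registered IDs start. */
enum {
    enum_encoding_singlebyte,
    enum_encoding_utf8,
    enum_encoding_utf16,
    enum_encoding_utf32,
    enum_encoding_max
};

#define MAX_ENCODINGS 32

typedef struct {
    const char *name;
    int         index;    /* told to the encoding at initialisation time */
} ENCODING;

static ENCODING *encoding_table[MAX_ENCODINGS];
static int encoding_next = enum_encoding_max;

/* Register an incoming encoding; returns its ID, or -1 if the table is full. */
int encoding_register(ENCODING *enc)
{
    assert(enc != NULL);
    if (encoding_next >= MAX_ENCODINGS)
        return -1;
    enc->index = encoding_next++;
    encoding_table[enc->index] = enc;
    return enc->index;
}
```

With four built-ins, the first encoding registered at run time would receive ID 4 and the next one ID 5.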

-- 
Relf Test Passed.



Re: String rationale

2001-11-01 Thread Tom Hughes

In message <[EMAIL PROTECTED]>
Simon Cozens <[EMAIL PROTECTED]> wrote:

> On Sat, Oct 27, 2001 at 04:23:48PM +0100, Tom Hughes wrote:
> > The encoding_lookup() and chartype_lookup() routines will obviously
> > need to load the relevant libraries on the fly when we have support
> > for that.
> 
> Could you try rewriting them using an enum, like the vtable stuff and
> the original string encoding stuff does?

The intention is that when an encoding or character type is loaded it
will be allocated a unique ID number that can be used internally to
refer to it, but that the number will only be valid for the duration of
that instance of parrot rather than being persistent. That's certainly
the way Dan described it happening in his rationale which is what my
code is based on.

Allocating them globally is not possible if we're going to allow people
to add arbitrary encodings and character sets - as things stand adding
the foo encoding will be as simple as adding foo.so to the encodings
directory.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu




Re: String rationale

2001-11-01 Thread Simon Cozens

On Sat, Oct 27, 2001 at 04:23:48PM +0100, Tom Hughes wrote:
> The encoding_lookup() and chartype_lookup() routines will obviously
> need to load the relevant libraries on the fly when we have support
> for that.

Could you try rewriting them using an enum, like the vtable stuff and
the original string encoding stuff does?

-- 
An algorithm must be seen to be believed.
-- D.E. Knuth



Re: String rationale

2001-10-31 Thread Tom Hughes

In message <[EMAIL PROTECTED]>
  Tom Hughes <[EMAIL PROTECTED]> wrote:

> In message <[EMAIL PROTECTED]>
>   Dan Sugalski <[EMAIL PROTECTED]> wrote:
> 
> > At 04:23 PM 10/27/2001 +0100, Tom Hughes wrote:
> >
> > >Attached is my first pass at this - it's not fully ready yet but
> > >is something for people to cast an eye over before I spend lots of
> > >time going down the wrong path ;-)
> > 
> > It looks pretty good on first glance.
> 
> I've done a bit more work now, and the latest version is attached.

Unless anybody has objections I plan to commit this work shortly...

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/




Re: String rationale

2001-10-30 Thread Dan Sugalski

At 07:16 PM 10/29/2001 -0500, James Mastros wrote:
>Yeah.  But that's a convention thing, I think.  I also think that most
>people won't go to the bother of writing conversion functions that they
>don't have to.  What we need to worry about is both, say, big5 and shiftjis
>writing both of the conversions.  And it shouldn't come up all that much,
>because Unicode is /supposed to be/ lossless for most things.

Supposed to be, yep. Whether it *is* or not is another issue entirely. :)

> > I suspect that the encode and decode methods in the encoding vtable
> > are enough for doing chr/ord aren't they?
>Hmm... come to think of it, yes.  chr will always create a utf32-encoded
>string with the given charset number (or unicode for the two-arg version),
>ord will return the codepoint within the current charset.

Erk. No. chr should give you a string in the encoding you've selected, or 
the default encoding if you've not selected one. That may not be (probably 
won't be) UTF32.

>(This, BTW, means that only encodings that feel like it have to provide
>either, but all encodings must be able to convert to utf32.)

More or less, yep. Everyone has to go to UTF32. Direct encoding to encoding 
is optional. Encouraged in those cases where it's either quicker or less 
uncertain.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: String rationale

2001-10-30 Thread Tom Hughes

In message <[EMAIL PROTECTED]>
James Mastros <[EMAIL PROTECTED]> wrote:

> On Mon, Oct 29, 2001 at 11:20:47PM +, Tom Hughes wrote:
> 
> > I suspect that the encode and decode methods in the encoding vtable
> > are enough for doing chr/ord aren't they?
>
> Hmm... come to think of it, yes.  chr will always create a utf32-encoded
> string with the given charset number (or unicode for the two-arg version),
> ord will return the codepoint within the current charset.

I hope it will create a string with the given charset number and
using the default encoding for that charset.

Asking for an ASCII character and getting it UTF-32 encoded would
be more than a little bizarre. If I say chr(65,ASCII) then I would
expect to get a single byte encoded string...

> (This, BTW, means that only encodings that feel like it have to provide
> either, but all encodings must be able to convert to utf32.)

The way I've written it, any encoding can convert to any encoding
at all, because there is no conversion at that level. I just decode
a character from the source, transcode it at the character level, and
then encode it to the destination.

If an encoding cannot handle the full range of character values for
a character set then you will get an exception when it tries to encode
an out of range character.
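
The decode, transcode, encode loop described above might look something like this. It is a sketch, not the attached code: the two-function ENCODING struct, string_convert() and CONVERT_ERROR are hypothetical, input is assumed to be well formed, and an error return stands in for the exception.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long codepoint;

#define MAX_CHAR_BYTES 4
#define CONVERT_ERROR  ((size_t)-1)

/* Hypothetical cut-down encoding vtable. decode reads one character and
 * returns the bytes consumed; encode writes one character and returns the
 * bytes written, or 0 if the value is out of range for the encoding. */
typedef struct {
    size_t (*decode)(const unsigned char *src, codepoint *out);
    size_t (*encode)(unsigned char *dst, codepoint c);
} ENCODING;

/* Character-level mapping between two character sets. */
typedef codepoint (*transcoder)(codepoint c);

static size_t singlebyte_decode(const unsigned char *src, codepoint *out)
{
    *out = src[0];
    return 1;
}

static size_t singlebyte_encode(unsigned char *dst, codepoint c)
{
    if (c > 0xFF)
        return 0;                      /* does not fit in one byte */
    dst[0] = (unsigned char)c;
    return 1;
}

/* Big-endian UTF-32, so the sketch does not depend on host byte order. */
static size_t utf32be_decode(const unsigned char *src, codepoint *out)
{
    *out = ((codepoint)src[0] << 24) | ((codepoint)src[1] << 16)
         | ((codepoint)src[2] << 8)  |  (codepoint)src[3];
    return 4;
}

static size_t utf32be_encode(unsigned char *dst, codepoint c)
{
    if (c > 0x10FFFF)
        return 0;
    dst[0] = (unsigned char)(c >> 24);
    dst[1] = (unsigned char)(c >> 16);
    dst[2] = (unsigned char)(c >> 8);
    dst[3] = (unsigned char)c;
    return 4;
}

static const ENCODING singlebyte = { singlebyte_decode, singlebyte_encode };
static const ENCODING utf32be    = { utf32be_decode, utf32be_encode };

/* Convert src to dst one character at a time. tc may be NULL when the
 * character set does not change. Returns the bytes written, or
 * CONVERT_ERROR where the real code would raise an exception. */
size_t string_convert(const ENCODING *from, const ENCODING *to, transcoder tc,
                      const unsigned char *src, size_t srclen,
                      unsigned char *dst, size_t dstcap)
{
    size_t in = 0, out = 0;

    assert(from != NULL && to != NULL && src != NULL && dst != NULL);
    while (in < srclen) {
        codepoint c;
        size_t n;

        in += from->decode(src + in, &c);   /* 1. decode from the source  */
        if (tc != NULL)
            c = tc(c);                      /* 2. transcode the character */
        if (dstcap - out < MAX_CHAR_BYTES)
            return CONVERT_ERROR;           /* destination buffer too small */
        n = to->encode(dst + out, c);       /* 3. encode to the destination */
        if (n == 0)
            return CONVERT_ERROR;           /* out of range character */
        out += n;
    }
    return out;
}
```

Because no encoding ever talks to another encoding directly, adding a new one means writing a single decode/encode pair rather than one converter per existing encoding.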

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu




Re: String rationale

2001-10-29 Thread James Mastros

On Mon, Oct 29, 2001 at 11:20:47PM +, Tom Hughes wrote:
> > 2) But either can support converting directly if it wants.
> The danger is that everybody tries to be clever and support direct
> conversion to and from as many other character sets as possible, which
> leads to lots of duplication.
Yeah.  But that's a convention thing, I think.  I also think that most
people won't go to the bother of writing conversion functions that they
don't have to.  What we need to worry about is both, say, big5 and shiftjis
writing both of the conversions.  And it shouldn't come up all that much,
because Unicode is /supposed to be/ lossless for most things.

> I have already been thinking about this although it does get more
> complicated as you have to consider the encoding as well - if you
> have a single byte encoded ASCII string then transcoding to a single
> byte encoded Latin-1 string is a no-op, but that may not be true for
> other encodings if such a thing makes sense for those character types.
Hm.  All the encodings I can think of (which is rather limited -- the UTFs),
you can scan for units (IE ints of the proper size) > 0x7f, and if you don't
find any, it's 7bit, and you can just change the charset marker without
doing any work.

In any case, it's up to the encoding to tell if we've got a pure 7bit
string.  If that's complicated for it, it can just always return FALSE.
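
For the UTFs the scan is only a few lines. The following is a sketch with hypothetical function names; it relies on the fact that in UTF-8 and UTF-16 every code unit belonging to a non-ASCII character is greater than 0x7f.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every UTF-8 code unit is 7-bit, 0 otherwise. */
int utf8_is_7bit(const uint8_t *units, size_t n)
{
    size_t i;

    assert(units != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (units[i] > 0x7f)
            return 0;
    return 1;
}

/* Same check for UTF-16 code units. */
int utf16_is_7bit(const uint16_t *units, size_t n)
{
    size_t i;

    assert(units != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (units[i] > 0x7f)
            return 0;
    return 1;
}

/* An encoding for which the test is awkward can simply decline. */
int opaque_is_7bit(const void *units, size_t n)
{
    (void)units;
    (void)n;
    return 0;
}
```

Always answering FALSE is safe because the caller then just takes the ordinary transcoding path.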

> I suspect that the encode and decode methods in the encoding vtable
> are enough for doing chr/ord aren't they?
Hmm... come to think of it, yes.  chr will always create a utf32-encoded
string with the given charset number (or unicode for the two-arg version),
ord will return the codepoint within the current charset.

(This, BTW, means that only encodings that feel like it have to provide
either, but all encodings must be able to convert to utf32.)

Powers-that-be (I'm looking at you, Dan), is that good?

   -=- James Mastros



Re: String rationale

2001-10-29 Thread Tom Hughes

In message <[EMAIL PROTECTED]>
  James Mastros <[EMAIL PROTECTED]> wrote:

> > That leaves the third, which is what I have implemented. When looking to
> > transcode from A to B it will first ask A if it can transcode to B and
> > if that fails then it will ask B if it can transcode from A.
> I propose another variant on this:
> If that fails, it asks A to transcode to Unicode, and B to transcode from
> Unicode.  (Not Unicode to transcode to B; Unicode implements no transcodings.)

My code does that, though at a slightly higher level. If you look
at string_transcode() you will see that if it can't find a direct
mapping it will go via unicode. If C had closures then I'd have
buried that down in the chartype_lookup_transcoder() layer, but it
doesn't so I couldn't ;-)
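
To illustrate the lookup order (this is not the attached string_transcode(), just a sketch with hypothetical names and toy character sets whose Unicode mappings are the identity):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned long codepoint;
typedef codepoint (*transcoder)(codepoint c);

/* Hypothetical character type: each side may offer mappings in both
 * directions, looked up by the other set's name. */
typedef struct {
    const char *name;
    transcoder (*to)(const char *other);     /* self -> other */
    transcoder (*from)(const char *other);   /* other -> self */
} CHARTYPE;

static codepoint identity(codepoint c) { return c; }

static transcoder only_unicode(const char *other)
{
    return strcmp(other, "unicode") == 0 ? identity : NULL;
}

static transcoder no_routes(const char *other)
{
    (void)other;
    return NULL;
}

static const CHARTYPE usascii = { "usascii", only_unicode, only_unicode };
static const CHARTYPE latin1  = { "latin1",  only_unicode, only_unicode };
static const CHARTYPE klingon = { "klingon", no_routes,    no_routes };

/* First ask A whether it can go to B, then ask B whether it can come from A. */
transcoder lookup_direct(const CHARTYPE *a, const CHARTYPE *b)
{
    transcoder t = a->to(b->name);

    if (t == NULL)
        t = b->from(a->name);
    return t;
}

/* Transcode one character from a to b, going via Unicode when there is no
 * direct mapping. Returns 0 on success, -1 if no route exists. */
int transcode_char(const CHARTYPE *a, const CHARTYPE *b,
                   codepoint in, codepoint *out)
{
    transcoder direct, to_uni, from_uni;

    assert(a != NULL && b != NULL && out != NULL);
    direct = lookup_direct(a, b);
    if (direct != NULL) {
        *out = direct(in);
        return 0;
    }
    to_uni   = a->to("unicode");
    from_uni = b->from("unicode");
    if (to_uni == NULL || from_uni == NULL)
        return -1;
    *out = from_uni(to_uni(in));
    return 0;
}
```

The fallback has to live in the caller because a plain C function pointer cannot capture the two halves of the route; returning a single composed transcoder from the lookup layer would need a closure or an extra context argument.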

> > The problem it raises is, who is responsible for transcoding from ASCII to
> > Latin-1? and back again? If we're not careful both ends will implement
> > both translations and we will have effective duplication.
> 1) Neither.  Each must support transcoding to and from Unicode.

Absolutely.

> 2) But either can support converting directly if it wants.

The danger is that everybody tries to be clever and support direct
conversion to and from as many other character sets as possible, which
leads to lots of duplication.

> I also think that, for efficiency, we might want a "7-bit chars match ASCII"
> flag, since most character sets do, and that means that we don't have to deal
> with the overhead for strings that fit in 7 bits.  This smells of premature
> optimization, though, so somebody just file this away in their heads for
> future reference.

I have already been thinking about this although it does get more
complicated as you have to consider the encoding as well - if you
have a single byte encoded ASCII string then transcoding to a single
byte encoded Latin-1 string is a no-op, but that may not be true for
other encodings if such a thing makes sense for those character types.

> (BTW, for those paying attention, I'm waiting on this discussion for my
> chr/ord patch, since I want them in terms of charsets, not encodings.)

I suspect that the encode and decode methods in the encoding vtable
are enough for doing chr/ord aren't they?

Surely chr() is just encoding the argument in the chosen encoding (which
can be the default encoding for the char type if you want) and then setting
the type and encoding of the resulting string appropriately.

Equally ord() is decoding the first character of the string to get a
number.
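
In code that might come out as below. This is a hedged sketch: the STRING, CHARTYPE and ENCODING layouts and the names string_chr() and string_ord() are assumptions made for illustration, and only a single-byte encoding is provided.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long codepoint;

typedef struct {
    const char *name;
    size_t (*decode)(const unsigned char *src, codepoint *out);
    size_t (*encode)(unsigned char *dst, codepoint c);  /* 0 = out of range */
} ENCODING;

typedef struct {
    const char     *name;
    const ENCODING *default_encoding;
} CHARTYPE;

typedef struct {
    const CHARTYPE *type;
    const ENCODING *encoding;
    unsigned char   buf[4];
    size_t          len;          /* bytes used in buf */
} STRING;

static size_t singlebyte_decode(const unsigned char *src, codepoint *out)
{
    *out = src[0];
    return 1;
}

static size_t singlebyte_encode(unsigned char *dst, codepoint c)
{
    if (c > 0xFF)
        return 0;
    dst[0] = (unsigned char)c;
    return 1;
}

static const ENCODING singlebyte = { "singlebyte", singlebyte_decode,
                                     singlebyte_encode };
static const CHARTYPE usascii    = { "usascii", &singlebyte };

/* chr: encode c in the chosen encoding (the char type's default when enc is
 * NULL), then set the type and encoding of the result.
 * Returns 0 on success, -1 if c cannot be encoded. */
int string_chr(STRING *s, codepoint c, const CHARTYPE *type,
               const ENCODING *enc)
{
    assert(s != NULL && type != NULL);
    if (enc == NULL)
        enc = type->default_encoding;
    s->len = enc->encode(s->buf, c);
    if (s->len == 0)
        return -1;
    s->type = type;
    s->encoding = enc;
    return 0;
}

/* ord: decode the first character of a non-empty string. */
codepoint string_ord(const STRING *s)
{
    codepoint c;

    assert(s != NULL && s->len > 0);
    s->encoding->decode(s->buf, &c);
    return c;
}
```

Under these assumptions chr(65) in usascii yields a one-byte string, and ord of that string gives 65 back. Checking that a value is actually valid for the character set, as opposed to merely encodable, would be the char type's job and is left out here.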

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/




Re: String rationale

2001-10-29 Thread James Mastros

On Mon, Oct 29, 2001 at 08:32:16PM +, Tom Hughes wrote:
> We have established that the first two will not work because of the
> unicode problem.
Hm.  I think instead of requiring Unicode to support everything, we should
require Unicode to support /nothing/.  If A and B have no mutual transcoding
function, we should use Unicode as a intermediary.  (This means that
charsets that are lossy to unicode need to transcode to each other directly,
like Far Eastern sets.  (And Klingon, but that can't transcode to anything.))

This still makes Unicode a special case, but not a terrible one.  (In fact,
unicode can be treated like any other charset, except when we want to
transcode between mutually incompatible sets, since we always try both A->B
and A<-B.)

(Notational note: A->B means that A is implementing a transcoding from itself
to B.  A<-B means that A is implementing a transcoding from B to A.)

> That leaves the third, which is what I have implemented. When looking to
> transcode from A to B it will first ask A if it can transcode to B and
> if that fails then it will ask B if it can transcode from A.
I propose another variant on this:
If that fails, it asks A to transcode to Unicode, and B to transcode from
Unicode.  (Not Unicode to transcode to B; Unicode implements no transcodings.)

> The problem it raises is, who is responsible for transcoding from ASCII to
> Latin-1? and back again? If we're not careful both ends will implement
> both translations and we will have effective duplication.
1) Neither.  Each must support transcoding to and from Unicode.
2) But either can support converting directly if it wants.

I also think that, for efficiency, we might want a "7-bit chars match ASCII"
flag, since most character sets do, and that means that we don't have to deal
with the overhead for strings that fit in 7 bits.  This smells of premature
optimization, though, so somebody just file this away in their heads for
future reference.

That would also mean that neither is responsible for converting between
Latin-1 and ASCII, because core will do it, most of the time, and the rest
of the time, it isn't possible.

Hm.  But it isn't possible _losslessly_, though it is possible lossily.
IMHO, there should be two ways to transcode, or the transcoding function
should flag to its caller somehow.

(Sorry for the train-of-thought, but I think it's decently clear.)

(BTW, for those paying attention, I'm waiting on this discussion for my
chr/ord patch, since I want them in terms of charsets, not encodings.)

   -=- James Mastros



RE: String rationale

2001-10-29 Thread Tom Hughes

In message <[EMAIL PROTECTED]>
  "Stephen Howard" <[EMAIL PROTECTED]> wrote:

> right.  I had just keyed in on this from Tom's message:
> 
> "My code currently allows either set to provide the transform on the
> grounds that otherwise the unicode module would have to either know
> how to convert to everything else or from everything else."
> 
> ...which seemed to posit that Unicode module could be responsible for
> all the transcodings to and from its own character set, which seemed
> backwards to me.

I was only positing it long enough to acknowledge that such a rule
was untenable.

What it comes down to is that there are three possible rules, namely:

  1. Each character set defines transforms from itself to other
 character sets.

  2. Each character set defines transforms to itself from other
 character sets.

  3. Each character set defines transforms both from itself to
 other character sets and from other character sets to itself.

We have established that the first two will not work because of the
unicode problem.

That leaves the third, which is what I have implemented. When looking to
transcode from A to B it will first ask A if it can transcode to B and
if that fails then it will ask B if it can transcode from A.

That way each character set can manage its own translations both to
and from unicode as we require.

The problem it raises is, who is responsible for transcoding from ASCII to
Latin-1? and back again? If we're not careful both ends will implement
both translations and we will have effective duplication.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/




RE: String rationale

2001-10-29 Thread Stephen Howard

right.  I had just keyed in on this from Tom's message:

"My code currently allows either set to provide the transform on the
grounds that otherwise the unicode module would have to either know
how to convert to everything else or from everything else."

...which seemed to posit that Unicode module could be responsible for all the 
transcodings to and from its own character set, which
seemed backwards to me.

-Stephen

-----Original Message-----
From: Dan Sugalski [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 29, 2001 02:43 PM
To: Stephen Howard; Tom Hughes; [EMAIL PROTECTED]
Subject: RE: String rationale


At 02:52 PM 10/29/2001 -0500, Stephen Howard wrote:
>You might consider requiring all character sets be able to convert to Unicode,

That's already a requirement. All character sets must be able to go to or
come from Unicode. They can do others if they want, but it's not required.
(And we'll have to figure out how to allow that reasonably efficiently)

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk





RE: String rationale

2001-10-29 Thread Dan Sugalski

At 02:52 PM 10/29/2001 -0500, Stephen Howard wrote:
>You might consider requiring all character sets be able to convert to Unicode,

That's already a requirement. All character sets must be able to go to or 
come from Unicode. They can do others if they want, but it's not required. 
(And we'll have to figure out how to allow that reasonably efficiently)

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




RE: String rationale

2001-10-29 Thread Stephen Howard

You might consider requiring all character sets be able to convert to Unicode, and 
otherwise only have to know how to convert other
character sets to its own set.

-----Original Message-----
From: Tom Hughes [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 29, 2001 02:31 PM
To: [EMAIL PROTECTED]
Subject: Re: String rationale


In message <[EMAIL PROTECTED]>
  Dan Sugalski <[EMAIL PROTECTED]> wrote:

> At 04:23 PM 10/27/2001 +0100, Tom Hughes wrote:
>
> >Attached is my first pass at this - it's not fully ready yet but
> >is something for people to cast an eye over before I spend lots of
> >time going down the wrong path ;-)
>
> It looks pretty good on first glance.

I've done a bit more work now, and the latest version is attached.

This version can do transcoding. The intention is that there will be
some sort of cache in chartype_lookup_transcoder to avoid repeating
the expensive lookups by name too much.

One interesting question is who is responsible for transcoding
from character set A to character set B - is it A or B? and how
about the other way?

My code currently allows either set to provide the transform on the
grounds that otherwise the unicode module would have to either know
how to convert to everything else or from everything else.

Tom

--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/




Re: String rationale

2001-10-29 Thread Tom Hughes

In message <[EMAIL PROTECTED]>
  Dan Sugalski <[EMAIL PROTECTED]> wrote:

> At 04:23 PM 10/27/2001 +0100, Tom Hughes wrote:
>
> >Attached is my first pass at this - it's not fully ready yet but
> >is something for people to cast an eye over before I spend lots of
> >time going down the wrong path ;-)
> 
> It looks pretty good on first glance.

I've done a bit more work now, and the latest version is attached.

This version can do transcoding. The intention is that there will be
some sort of cache in chartype_lookup_transcoder to avoid repeating
the expensive lookups by name too much.

One interesting question is who is responsible for transcoding
from character set A to character set B - is it A or B? and how
about the other way?

My code currently allows either set to provide the transform on the
grounds that otherwise the unicode module would have to either know
how to convert to everything else or from everything else.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/


# This is a patch for parrot to update it to parrot-ns
# 
# To apply this patch:
# STEP 1: Chdir to the source directory.
# STEP 2: Run the 'applypatch' program with this patch file as input.
#
# If you do not have 'applypatch', it is part of the 'makepatch' package
# that you can fetch from the Comprehensive Perl Archive Network:
# http://www.perl.com/CPAN/authors/Johan_Vromans/makepatch-x.y.tar.gz
# In the above URL, 'x' should be 2 or higher.
#
# To apply this patch without the use of 'applypatch':
# STEP 1: Chdir to the source directory.
# If you have a decent Bourne-type shell:
# STEP 2: Run the shell with this file as input.
# If you don't have such a shell, you may need to manually create/delete
# the files/directories as shown below.
# STEP 3: Run the 'patch' program with this file as input.
#
# These are the commands needed to create/delete files/directories:
#
mkdir 'chartypes'
chmod 0755 'chartypes'
mkdir 'encodings'
chmod 0755 'encodings'
rm -f 'transcode.c'
rm -f 'strutf8.c'
rm -f 'strutf32.c'
rm -f 'strutf16.c'
rm -f 'strnative.c'
rm -f 'include/parrot/transcode.h'
rm -f 'include/parrot/strutf8.h'
rm -f 'include/parrot/strutf32.h'
rm -f 'include/parrot/strutf16.h'
rm -f 'include/parrot/strnative.h'
touch 'chartype.c'
chmod 0644 'chartype.c'
touch 'chartypes/unicode.c'
chmod 0644 'chartypes/unicode.c'
touch 'chartypes/usascii.c'
chmod 0644 'chartypes/usascii.c'
touch 'encoding.c'
chmod 0644 'encoding.c'
touch 'encodings/singlebyte.c'
chmod 0644 'encodings/singlebyte.c'
touch 'encodings/utf16.c'
chmod 0644 'encodings/utf16.c'
touch 'encodings/utf32.c'
chmod 0644 'encodings/utf32.c'
touch 'encodings/utf8.c'
chmod 0644 'encodings/utf8.c'
touch 'include/parrot/chartype.h'
chmod 0644 'include/parrot/chartype.h'
touch 'include/parrot/encoding.h'
chmod 0644 'include/parrot/encoding.h'
#
# This command terminates the shell and need not be executed manually.
exit
#
#### End of Preamble ####

#### Patch data follows ####
diff -c 'parrot/MANIFEST' 'parrot-ns/MANIFEST'
Index: ./MANIFEST
*** ./MANIFEST  Sun Oct 28 17:11:21 2001
--- ./MANIFEST  Sun Oct 28 17:11:07 2001
***************
*** 1,5 ****
--- 1,8 ----
  assemble.pl
  ChangeLog
+ chartype.c
+ chartypes/unicode.c
+ chartypes/usascii.c
  classes/genclass.pl
  classes/intclass.c
  classes/scalarclass.c
***************
*** 15,20 ****
--- 18,28 ----
  docs/parrotbyte.pod
  docs/strings.pod
  docs/vtables.pod
+ encoding.c
+ encodings/singlebyte.c
+ encodings/utf8.c
+ encodings/utf16.c
+ encodings/utf32.c
  examples/assembly/bsr.pasm
  examples/assembly/call.pasm
  examples/assembly/euclid.pasm
***************
*** 30,35 ****
--- 38,45 ----
  global_setup.c
  hints/mswin32.pl
  hints/vms.pl
+ include/parrot/chartype.h
+ include/parrot/encoding.h
  include/parrot/events.h
  include/parrot/exceptions.h
  include/parrot/global_setup.h
***************
*** 46,56 ****
  include/parrot/runops_cores.h
  include/parrot/stacks.h
  include/parrot/string.h
- include/parrot/strnative.h
- include/parrot/strutf16.h
- include/parrot/strutf32.h
- include/parrot/strutf8.h
- include/parrot/transcode.h
  include/parrot/trace.h
  include/parrot/unicode.h
  interpreter.c
--- 56,61 ----
***************
*** 108,117 ****
  runops_cores.c
  stacks.c
  string.c
- strnative.c
- strutf16.c
- strutf32.c
- strutf8.c
  test_c.in
  test_main.c
  Test/More.pm
--- 113,118 ----
***************
*** 129,135 ****
  t/op/time.t
  t/op/trans.t
  trace.c
- transcode.c
  Types_pm.in
  vtable_h.pl
  vtable.tbl
--- 130,135 ----
diff -c 'parrot/Makefile.in' 'parrot-ns/Makefile.in'
Index: ./Makefile.in
*** ./Makefile.in   Wed Oct 24 19:23:47 2001
--- ./Makefile.in   Sat Oct 27 15:02:45 2001
***************
*** 11,19 ****
  $(INC)/pmc.h $(INC)/resources.h
  
  O_FILES = global_setup$(O) interpreter$(O) parrot$(O) register$(O) \
! core_ops$(O) memory$(O) packfile$(O) stacks$(O) string$(O) strnative$(O) \
! strutf8$(O) strutf16$(O) strutf32$(O) transcode$(O) runops_cores$(O) \
! trace$(O) vtable_ops$(O) cla

Re: String rationale

2001-10-27 Thread Dan Sugalski

At 04:23 PM 10/27/2001 +0100, Tom Hughes wrote:
>In message <[EMAIL PROTECTED]>
>   Tom Hughes <[EMAIL PROTECTED]> wrote:
>
> > Other than that it looked quite good and I'll probably start looking at
> > bending the existing code into the new model over the weekend.
>
>Attached is my first pass at this - it's not fully ready yet but
>is something for people to cast an eye over before I spend lots of
>time going down the wrong path ;-)

It looks pretty good on first glance.

>The packfile stuff is just a hack to make it work for now. Presumably
>we will have to modify the byte code format to record the string types
>as names or something so we can look them up properly?

Yup. I think tagging the strings with a few type integers and a set of 
name->type tables in the bytecode are going to be needed for this.
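
One way to picture the name->type tables: the bytecode carries a list of type names, strings in the file refer to entries in that list by index, and the loader resolves each name once to whatever ID this run of the interpreter allocated. The sketch below assumes that layout; resolve_type_table() and toy_lookup() are hypothetical names, and the file format itself is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a lookup-by-name routine. The IDs are whatever this
 * run happened to allocate: here unicode is 0 and usascii is 1. */
static int toy_lookup(const char *name)
{
    static const char *const loaded[] = { "unicode", "usascii" };
    size_t i;

    for (i = 0; i < sizeof loaded / sizeof loaded[0]; i++)
        if (strcmp(loaded[i], name) == 0)
            return (int)i;
    return -1;
}

/* names[] is the table read from the bytecode. On success map[i] holds the
 * runtime ID for file-local index i. Returns 0, or -1 at the first name
 * that cannot be resolved. */
int resolve_type_table(const char *const *names, size_t n, int *map,
                       int (*lookup)(const char *))
{
    size_t i;

    assert(names != NULL && map != NULL && lookup != NULL);
    for (i = 0; i < n; i++) {
        map[i] = lookup(names[i]);
        if (map[i] < 0)
            return -1;
    }
    return 0;
}
```

After loading, a string tagged with file-local type 0 would be given runtime type map[0], so the numbers stored in the file never need to agree with the numbers inside any particular interpreter.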

>String comparison is not language sensitive here - as before it just
>compares based on character values.

I'm still unsure as to how to properly handle locale-aware comparison, 
which is an interesting problem in and of itself. Luckily we just need to 
make the facilities for it, and someone else handles the policy. :)

>Other than that I think it's aiming in the right direction and it does
>pass all the tests... Please correct me if I'm wrong.

Let me mull it over a bit. I think I'm going to commit it, but a second 
think on it won't hurt.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: String rationale

2001-10-27 Thread Tom Hughes

In message <[EMAIL PROTECTED]>
  Tom Hughes <[EMAIL PROTECTED]> wrote:

> Attached is my first pass at this - it's not fully ready yet but
> is something for people to cast an eye over before I spend lots of
> time going down the wrong path ;-)

Before anybody else spots it, let me just add what I forgot to mention
in my original post, which is that transcoding isn't implemented yet
as I'm still thinking about the best way to do it. There is a hook
in place ready for it though.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/




Re: String rationale

2001-10-27 Thread Tom Hughes

In message <[EMAIL PROTECTED]>
  Tom Hughes <[EMAIL PROTECTED]> wrote:

> Other than that it looked quite good and I'll probably start looking at
> bending the existing code into the new model over the weekend.

Attached is my first pass at this - it's not fully ready yet but
is something for people to cast an eye over before I spend lots of
time going down the wrong path ;-)

The encoding_lookup() and chartype_lookup() routines will obviously
need to load the relevant libraries on the fly when we have support
for that.

The packfile stuff is just a hack to make it work for now. Presumably
we will have to modify the byte code format to record the string types
as names or something so we can look them up properly?

String comparison is not language sensitive here - as before it just
compares based on character values.

Other than that I think it's aiming in the right direction and it does
pass all the tests... Please correct me if I'm wrong.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/


# This is a patch for parrot to update it to parrot-ns
# 
# To apply this patch:
# STEP 1: Chdir to the source directory.
# STEP 2: Run the 'applypatch' program with this patch file as input.
#
# If you do not have 'applypatch', it is part of the 'makepatch' package
# that you can fetch from the Comprehensive Perl Archive Network:
# http://www.perl.com/CPAN/authors/Johan_Vromans/makepatch-x.y.tar.gz
# In the above URL, 'x' should be 2 or higher.
#
# To apply this patch without the use of 'applypatch':
# STEP 1: Chdir to the source directory.
# If you have a decent Bourne-type shell:
# STEP 2: Run the shell with this file as input.
# If you don't have such a shell, you may need to manually create/delete
# the files/directories as shown below.
# STEP 3: Run the 'patch' program with this file as input.
#
# These are the commands needed to create/delete files/directories:
#
mkdir 'chartypes'
chmod 0755 'chartypes'
mkdir 'encodings'
chmod 0755 'encodings'
rm -f 'transcode.c'
rm -f 'strutf8.c'
rm -f 'strutf32.c'
rm -f 'strutf16.c'
rm -f 'strnative.c'
rm -f 'include/parrot/transcode.h'
rm -f 'include/parrot/strutf8.h'
rm -f 'include/parrot/strutf32.h'
rm -f 'include/parrot/strutf16.h'
rm -f 'include/parrot/strnative.h'
touch 'chartype.c'
chmod 0644 'chartype.c'
touch 'chartypes/unicode.c'
chmod 0644 'chartypes/unicode.c'
touch 'chartypes/usascii.c'
chmod 0644 'chartypes/usascii.c'
touch 'encoding.c'
chmod 0644 'encoding.c'
touch 'encodings/singlebyte.c'
chmod 0644 'encodings/singlebyte.c'
touch 'encodings/utf16.c'
chmod 0644 'encodings/utf16.c'
touch 'encodings/utf32.c'
chmod 0644 'encodings/utf32.c'
touch 'encodings/utf8.c'
chmod 0644 'encodings/utf8.c'
touch 'include/parrot/chartype.h'
chmod 0644 'include/parrot/chartype.h'
touch 'include/parrot/encoding.h'
chmod 0644 'include/parrot/encoding.h'
#
# This command terminates the shell and need not be executed manually.
exit
#
#### End of Preamble ####

#### Patch data follows ####
diff -c 'parrot/MANIFEST' 'parrot-ns/MANIFEST'
Index: ./MANIFEST
*** ./MANIFEST  Wed Oct 24 22:16:51 2001
--- ./MANIFEST  Sat Oct 27 14:59:43 2001
***************
*** 1,5 ****
--- 1,8 ----
  assemble.pl
  ChangeLog
+ chartype.c
+ chartypes/unicode.c
+ chartypes/usascii.c
  classes/genclass.pl
  classes/intclass.c
  config_h.in
***************
*** 14,19 ****
--- 17,27 ----
  docs/parrotbyte.pod
  docs/strings.pod
  docs/vtables.pod
+ encoding.c
+ encodings/singlebyte.c
+ encodings/utf8.c
+ encodings/utf16.c
+ encodings/utf32.c
  examples/assembly/bsr.pasm
  examples/assembly/call.pasm
  examples/assembly/euclid.pasm
***************
*** 29,34 ****
--- 37,44 ----
  global_setup.c
  hints/mswin32.pl
  hints/vms.pl
+ include/parrot/chartype.h
+ include/parrot/encoding.h
  include/parrot/events.h
  include/parrot/exceptions.h
  include/parrot/global_setup.h
***************
*** 45,55 ****
  include/parrot/runops_cores.h
  include/parrot/stacks.h
  include/parrot/string.h
- include/parrot/strnative.h
- include/parrot/strutf16.h
- include/parrot/strutf32.h
- include/parrot/strutf8.h
- include/parrot/transcode.h
  include/parrot/trace.h
  include/parrot/unicode.h
  interpreter.c
--- 55,60 ----
***************
*** 107,116 ****
  runops_cores.c
  stacks.c
  string.c
- strnative.c
- strutf16.c
- strutf32.c
- strutf8.c
  test_c.in
  test_main.c
  Test/More.pm
--- 112,117 ----
***************
*** 128,134 ****
  t/op/time.t
  t/op/trans.t
  trace.c
- transcode.c
  Types_pm.in
  vtable_h.pl
  vtable.tbl
--- 129,134 ----
diff -c 'parrot/Makefile.in' 'parrot-ns/Makefile.in'
Index: ./Makefile.in
*** ./Makefile.in   Wed Oct 24 19:23:47 2001
--- ./Makefile.in   Sat Oct 27 15:02:45 2001
***************
*** 11,19 ****
  $(INC)/pmc.h $(INC)/resources.h
  
  O_FILES = global_setup$(O) interpreter$(O) parrot$(O) register$(O) \
! core_ops$(O) memory$(O) packfile$(O) stacks$(O) string$(O) strnative$(O) \
! strutf8$(O) strutf16$(O) strutf32$(O) transcode$(O) runops_cores$(O) \
! trace$(O) vtable_op

Re: String rationale

2001-10-25 Thread Dan Sugalski

At 11:59 PM 10/25/2001 +0100, Tom Hughes wrote:
>In message <[EMAIL PROTECTED]>
>   Dan Sugalski <[EMAIL PROTECTED]> wrote:
>
> > =item type
> >
> > What the character set or type of data is encoded in the buffer. This
> > includes things like ASCII, EBCDIC, Unicode, Chinese Traditional,
> > Chinese Simplified, or Shift-JIS. (And yes, I know the latter's a
> > combination of type and encoding. I'll update the doc as soon as I can
> > reasonably separate the two)
>
>Isn't this going to need to be a vtable pointer like encoding is?

Yup. I'd intended it to be an index into a table of character set 
functions. Jarkko has convinced me that it's better to have it as a vtable 
pointer, but I haven't had a chance to update the docs yet.
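
A minimal sketch of the vtable-pointer variant, for illustration only. Every name here (CHARTYPE, ENCODING, STRING and their fields) is an assumption and is not taken from the Parrot headers; the point is just that the string carries a pointer to its character set functions, so no global index has to be allocated.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical character set vtable: operations that depend on the
 * character set rather than on the byte encoding. */
typedef struct chartype {
    const char *name;
    int (*is_digit)(uint32_t codepoint);
} CHARTYPE;

/* Hypothetical encoding vtable: describes how bytes map to code points. */
typedef struct encoding {
    const char *name;
    size_t max_bytes;            /* widest character, in bytes */
} ENCODING;

/* Hypothetical string header carrying both vtable pointers. */
typedef struct string {
    const void *bufstart;
    size_t buflen;
    const ENCODING *encoding;
    const CHARTYPE *type;
} STRING;

/* ASCII digit test, used as the sample character set function. */
static int ascii_is_digit(uint32_t cp) {
    return cp >= '0' && cp <= '9';
}

static const CHARTYPE ascii_chartype = { "ascii", ascii_is_digit };
static const ENCODING singlebyte_encoding = { "singlebyte", 1 };

/* Classification dispatches through the string's own type pointer
 * instead of indexing a global table of character sets. */
static int string_is_digit(const STRING *s, uint32_t cp) {
    return s->type->is_digit(cp);
}
```

With an index, the same call would have to go through a global table lookup first; with the pointer, a newly loaded character set only needs to hand out the address of its own table.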


Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: String rationale

2001-10-25 Thread Tom Hughes

In message <[EMAIL PROTECTED]>
  Dan Sugalski <[EMAIL PROTECTED]> wrote:

> =item type
> 
> What the character set or type of data is encoded in the buffer. This
> includes things like ASCII, EBCDIC, Unicode, Chinese Traditional,
> Chinese Simplified, or Shift-JIS. (And yes, I know the latter's a
> combination of type and encoding. I'll update the doc as soon as I can
> reasonably separate the two)

Isn't this going to need to be a vtable pointer like encoding is? Only
some things (like character classification and at least some transcoding
tasks) will be character set based rather than encoding based.

Other than that it looked quite good and I'll probably start looking at
bending the existing code into the new model over the weekend.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/




Re: String rationale

2001-10-25 Thread Dan Sugalski

At 12:19 PM 10/25/2001 -0400, Sam Tregar wrote:
>On Thu, 25 Oct 2001, Dan Sugalski wrote:
>
> > The only bits of the interpreter that much care about the
> > string data are the regex engine parts, and those only operate on
> > fixed-sized data.
>
>Care to elaborate?  I thought the mandate from Larry was to have regexes
>compile down to a stream of string ops.  Doesn't that mean it should work
>regardless of the encoding of the string?

Since the encoding just determines how the abstract code point numbers are 
represented in bytes, I'm OK with requiring strings we process internally 
to be in a fixed-size version.

And regexes will be done with a stream of parrot opcodes, presuming that's 
not too slow. There'll be ops to reference the code point at position X in 
a string and check to see if it's in a list of other code points and 
suchlike things. Basically we'll peek under the covers, but only for 
fixed-length strings.
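
A rough sketch of what two such ops could boil down to, assuming a fixed-width buffer of 32-bit code points. The names (FIXED_STRING, codepoint_at, codepoint_in_list, match_class_at) are hypothetical and only illustrate the idea, not any actual opcode.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width string: every code point occupies one
 * array slot, so position X is a direct index. */
typedef struct fixed_string {
    const uint32_t *codepoints;
    size_t length;               /* in code points */
} FIXED_STRING;

/* Fetch the code point at position pos.
 * Returns 1 and stores it in *out if pos is in range, else 0. */
static int codepoint_at(const FIXED_STRING *s, size_t pos, uint32_t *out) {
    if (pos >= s->length)
        return 0;
    *out = s->codepoints[pos];
    return 1;
}

/* Check whether cp appears in a list of n code points. */
static int codepoint_in_list(uint32_t cp, const uint32_t *list, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (list[i] == cp)
            return 1;
    }
    return 0;
}

/* One "match a character class at position pos" step. */
static int match_class_at(const FIXED_STRING *s, size_t pos,
                          const uint32_t *list, size_t n) {
    uint32_t cp;
    return codepoint_at(s, pos, &cp) && codepoint_in_list(cp, list, n);
}
```

None of this needs to know the character set: it only compares code point numbers, which is why the fixed-width requirement is enough.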

> > The interpreter can only peek inside a string if that string is of
> > fixed length, and the interpreter doesn't actually care about the
> > character set the data is in.
>
>Why is this necessary at all?  Wouldn't it be preferable to have all
>access go through the String vtable regardless of the encoding?

Speed. We're going to take something of a hit decomposing to ops as it 
is--if we can safely cheat, I'm OK with making it a requirement. :)

> > =item encoding
> >
> > Pointer to the library that handles the string encoding. Encoding is
> > basically how the stream of bytes pointed to by C can be
> > turned into a stream of 32-bit codepoints. Examples include UTF-8, Big
> > 5, or Shift JIS. Unicode, Ascii, or EBCDIC are B<not> encodings.first
>
>.first?

Trailing buffer gook.
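
To illustrate what the encoding item means by turning a stream of bytes into a stream of 32-bit code points, here is a minimal UTF-8 sketch. The function names are made up for the example, and the decoder is deliberately simplified: it checks lead and continuation bytes but does not reject overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from buf (len bytes available).
 * Stores the code point in *out and returns the number of bytes
 * consumed, or 0 if the input is truncated or malformed. */
static size_t utf8_decode_one(const unsigned char *buf, size_t len,
                              uint32_t *out) {
    if (len == 0)
        return 0;

    unsigned char b0 = buf[0];
    size_t need;
    uint32_t cp;

    if (b0 < 0x80)                { need = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { need = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { need = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { need = 4; cp = b0 & 0x07; }
    else return 0;                /* stray continuation or invalid lead */

    if (len < need)
        return 0;

    /* Each continuation byte contributes its low six bits. */
    for (size_t i = 1; i < need; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (uint32_t)(buf[i] & 0x3F);
    }

    *out = cp;
    return need;
}

/* Expand a UTF-8 buffer into 32-bit code points.
 * Returns the number written, or (size_t)-1 on malformed input
 * or if dst (capacity cap) is too small. */
static size_t utf8_to_codepoints(const unsigned char *buf, size_t len,
                                 uint32_t *dst, size_t cap) {
    size_t n = 0;
    while (len > 0) {
        uint32_t cp;
        size_t used = utf8_decode_one(buf, len, &cp);
        if (used == 0 || n == cap)
            return (size_t)-1;
        dst[n++] = cp;
        buf += used;
        len -= used;
    }
    return n;
}
```

The output array is exactly the kind of fixed-size representation the interpreter could index directly.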

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: String rationale

2001-10-25 Thread Sam Tregar

On Thu, 25 Oct 2001, Dan Sugalski wrote:

> The only bits of the interpreter that much care about the
> string data are the regex engine parts, and those only operate on
> fixed-sized data.

Care to elaborate?  I thought the mandate from Larry was to have regexes
compile down to a stream of string ops.  Doesn't that mean it should work
regardless of the encoding of the string?

> The interpreter can only peek inside a string if that string is of
> fixed length, and the interpreter doesn't actually care about the
> character set the data is in.

Why is this necessary at all?  Wouldn't it be preferable to have all
access go through the String vtable regardless of the encoding?

> =item encoding
>
> Pointer to the library that handles the string encoding. Encoding is
> basically how the stream of bytes pointed to by C can be
> turned into a stream of 32-bit codepoints. Examples include UTF-8, Big
> 5, or Shift JIS. Unicode, Ascii, or EBCDIC are B<not> encodings.first

.first?

Aside from the above, this was a nice refresher.

-sam