Re: String rationale
In message <[EMAIL PROTECTED]>
Simon Cozens <[EMAIL PROTECTED]> wrote:

> As things stand, that won't work, because you're doing a string lookup in one
> of the core functions, and you still need some way of registering incoming
> stuff. With an enum, you can keep hold of a fake encoding_max, and hand
> encoding_max++ to the initialisation function for each encoding.

Well there won't be any point in it being an enum rather than an integer unless some of them are going to be preallocated.

I'm not sure if the encoding and character types will need to know their own index numbers, but if they do then they can be told at initialisation time, yes.

I absolutely intend that the current hard coded strings in the core will go away in due course though. When you look up an encoding or character type by name it will first check a hash table or something to see if it is already loaded, and if not it will look for it on disk and load it in, allocate it a number, and add it to the hash table for future reference.

Hence the current strcmp junk in the lookup functions will go away. In much the same way the byte code will have some sort of table of names which it will look up as it is loaded, rather than the current hard coding of name to number mappings in the byte code.

So all I need now to make all this work is hash tables and dynamic code loading ;-) Any volunteers...

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu
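The lookup-then-allocate scheme described in this message could be sketched roughly as follows. This is only an illustration, not code from the patch: the name `encoding_lookup_index`, the fixed-size table and its capacity are assumptions, a linear scan stands in for the hash table, and the step that would find and load a library from disk is reduced to a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENCODINGS 32        /* hypothetical capacity for this sketch */

typedef struct {
    const char *name;           /* e.g. "utf8"; caller keeps the storage alive */
    int         index;          /* ID valid only for this run of the interpreter */
} encoding_entry;

static encoding_entry registry[MAX_ENCODINGS];
static int encoding_max = 0;    /* next ID to hand out */

/* Return the ID for `name`, allocating one on first use.
 * Returns -1 when the table is full. (Hypothetical helper.) */
int encoding_lookup_index(const char *name)
{
    int i;

    assert(name != NULL);

    /* Stand-in for the hash table lookup. */
    for (i = 0; i < encoding_max; i++) {
        if (strcmp(registry[i].name, name) == 0)
            return registry[i].index;
    }

    if (encoding_max == MAX_ENCODINGS)
        return -1;

    /* A real implementation would locate and load the library here
     * and pass the new ID to its initialisation function. */
    registry[encoding_max].name  = name;
    registry[encoding_max].index = encoding_max;
    return encoding_max++;
}
```

Looking up the same name twice returns the same number, and the numbers mean nothing outside the running process, which matches the "only valid for the duration of that instance" intent stated elsewhere in the thread.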
Re: String rationale
On Thu, Nov 01, 2001 at 02:18:17PM +, Tom Hughes wrote:
> > Could you try rewriting them using an enum, like the vtable stuff and
> > the original string encoding stuff does?
>
> Allocating them globally is not possible if we're going to allow people
> to add arbitrary encodings and character sets - as things stand adding
> the foo encoding will be as simple as adding foo.so to the encodings
> directory.

As things stand, that won't work, because you're doing a string lookup in one of the core functions, and you still need some way of registering incoming stuff. With an enum, you can keep hold of a fake encoding_max, and hand encoding_max++ to the initialisation function for each encoding.

-- 
Relf Test Passed.
Re: String rationale
In message <[EMAIL PROTECTED]>
Simon Cozens <[EMAIL PROTECTED]> wrote:

> On Sat, Oct 27, 2001 at 04:23:48PM +0100, Tom Hughes wrote:
> > The encoding_lookup() and chartype_lookup() routines will obviously
> > need to load the relevant libraries on the fly when we have support
> > for that.
>
> Could you try rewriting them using an enum, like the vtable stuff and
> the original string encoding stuff does?

The intention is that when an encoding or character type is loaded it will be allocated a unique ID number that can be used internally to refer to it, but that the number will only be valid for the duration of that instance of parrot rather than being persistent.

That's certainly the way Dan described it happening in his rationale, which is what my code is based on.

Allocating them globally is not possible if we're going to allow people to add arbitrary encodings and character sets - as things stand adding the foo encoding will be as simple as adding foo.so to the encodings directory.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu
Re: String rationale
On Sat, Oct 27, 2001 at 04:23:48PM +0100, Tom Hughes wrote:
> The encoding_lookup() and chartype_lookup() routines will obviously
> need to load the relevant libraries on the fly when we have support
> for that.

Could you try rewriting them using an enum, like the vtable stuff and the original string encoding stuff does?

-- 
An algorithm must be seen to be believed. -- D.E. Knuth
Re: String rationale
In message <[EMAIL PROTECTED]>
Tom Hughes <[EMAIL PROTECTED]> wrote:

> In message <[EMAIL PROTECTED]>
> Dan Sugalski <[EMAIL PROTECTED]> wrote:
>
> > At 04:23 PM 10/27/2001 +0100, Tom Hughes wrote:
> >
> > >Attached is my first pass at this - it's not fully ready yet but
> > >is something for people to cast an eye over before I spend lots of
> > >time going down the wrong path ;-)
> >
> > It looks pretty good on first glance.
>
> I've done a bit more work now, and the latest version is attached.

Unless anybody has objections I plan to commit this work shortly...

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
Re: String rationale
At 07:16 PM 10/29/2001 -0500, James Mastros wrote:
>Yeah. But that's a convention thing, I think. I also think that most
>people won't go to the bother of writing conversion functions that they
>don't have to. What we need to worry about is both, say, big5 and shiftjis
>writing both of the conversions. And it shouldn't come up all that much,
>because Unicode is /supposed to be/ lossless for most things.

Supposed to be, yep. Whether it *is* or not is another issue entirely. :)

> > I suspect that the encode and decode methods in the encoding vtable
> > are enough for doing chr/ord aren't they?
>Hmm... come to think of it, yes. chr will always create a utf32-encoded
>string with the given charset number (or unicode for the two-arg version),
>ord will return the codepoint within the current charset.

Erk. No. chr should give you a string in the encoding you've selected, or the default encoding if you've not selected one. That may not be (probably won't be) UTF32.

>(This, BTW, means that only encodings that feel like it have to provide
>either, but all encodings must be able to convert to utf32.)

More or less, yep. Everyone has to go to UTF32. Direct encoding to encoding is optional. Encouraged in those cases where it's either quicker or less uncertain.

Dan

--"it's like this"---
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                      have teddy bears and even
                                      teddy bears get drunk
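For illustration, chr and ord built on nothing but an encoding's encode and decode entries, and producing a string in the selected encoding rather than always UTF-32, might look something like the sketch below. All of the names (`encoding`, `small_string`, `sketch_chr`, `sketch_ord`, `singlebyte`) and the struct layout are assumptions made up for the example; they are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t codepoint;

/* Hypothetical two-entry encoding vtable. */
typedef struct {
    size_t    (*encode)(unsigned char *dst, codepoint c); /* bytes written, 0 on failure */
    codepoint (*decode)(const unsigned char *src);        /* first character of src */
} encoding;

/* Hypothetical one-character string: type, encoding and a tiny buffer. */
typedef struct {
    const encoding *enc;
    int             chartype;
    unsigned char   buf[4];
    size_t          len;
} small_string;

static size_t singlebyte_encode(unsigned char *dst, codepoint c)
{
    if (c > 0xFF)
        return 0;               /* cannot be represented in one byte */
    dst[0] = (unsigned char)c;
    return 1;
}

static codepoint singlebyte_decode(const unsigned char *src)
{
    return src[0];
}

const encoding singlebyte = { singlebyte_encode, singlebyte_decode };

/* chr: encode c in the chosen encoding and tag the result.
 * Returns 0 on success, -1 if the encoding cannot hold c. */
int sketch_chr(codepoint c, int chartype, const encoding *enc, small_string *out)
{
    size_t n = enc->encode(out->buf, c);

    if (n == 0)
        return -1;
    out->enc      = enc;
    out->chartype = chartype;
    out->len      = n;
    return 0;
}

/* ord: decode the first character of the string. */
codepoint sketch_ord(const small_string *s)
{
    assert(s->len > 0);
    return s->enc->decode(s->buf);
}
```

With these definitions, asking for character 65 in the single-byte encoding yields a one-byte string, and `sketch_ord` on that string gives 65 back.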
Re: String rationale
In message <[EMAIL PROTECTED]>
James Mastros <[EMAIL PROTECTED]> wrote:

> On Mon, Oct 29, 2001 at 11:20:47PM +, Tom Hughes wrote:
>
> > I suspect that the encode and decode methods in the encoding vtable
> > are enough for doing chr/ord aren't they?
>
> Hmm... come to think of it, yes. chr will always create a utf32-encoded
> string with the given charset number (or unicode for the two-arg version),
> ord will return the codepoint within the current charset.

I hope it will create a string with the given charset number and using the default encoding for that charset. Asking for an ASCII character and getting it UTF-32 encoded would be more than a little bizarre.

If I say chr(65,ASCII) then I would expect to get a single byte encoded string...

> (This, BTW, means that only encodings that feel like it have to provide
> either, but all encodings must be able to convert to utf32.)

The way I've written it, any encoding can convert to any encoding at all, because there is no conversion at that level. I just decode a character from the source, transcode it at the character level, and then encode it to the destination.

If an encoding cannot handle the full range of character values for a character set then you will get an exception when it tries to encode an out of range character.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu
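The decode, transcode, encode loop described above can be sketched as below. This is a guess at the shape, not the patch's code: the `encoding` struct, the `convert` function, the two sample encodings and the `CONVERT_FAILED` return value are all hypothetical, and where the message speaks of an exception the sketch simply returns an error value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t codepoint;

#define MAX_CHAR_BYTES 4
#define CONVERT_FAILED ((size_t)-1)

/* Hypothetical encoding vtable: both calls return a byte count, 0 on failure. */
typedef struct {
    size_t (*decode)(const unsigned char *src, size_t avail, codepoint *out);
    size_t (*encode)(unsigned char *dst, codepoint c);
} encoding;

static size_t sb_decode(const unsigned char *src, size_t avail, codepoint *out)
{
    if (avail < 1) return 0;
    *out = src[0];
    return 1;
}

static size_t sb_encode(unsigned char *dst, codepoint c)
{
    if (c > 0xFF) return 0;             /* out of range for one byte */
    dst[0] = (unsigned char)c;
    return 1;
}

static size_t u32le_decode(const unsigned char *src, size_t avail, codepoint *out)
{
    if (avail < 4) return 0;
    *out = (codepoint)src[0] | ((codepoint)src[1] << 8) |
           ((codepoint)src[2] << 16) | ((codepoint)src[3] << 24);
    return 4;
}

static size_t u32le_encode(unsigned char *dst, codepoint c)
{
    dst[0] = (unsigned char)(c & 0xFF);
    dst[1] = (unsigned char)((c >> 8) & 0xFF);
    dst[2] = (unsigned char)((c >> 16) & 0xFF);
    dst[3] = (unsigned char)((c >> 24) & 0xFF);
    return 4;
}

const encoding singlebyte = { sb_decode, sb_encode };
const encoding utf32le    = { u32le_decode, u32le_encode };

/* Decode each character from src, optionally map it at the character
 * level, then encode it into dst. Returns the number of bytes written,
 * or CONVERT_FAILED on bad input, an unencodable character, or a full
 * destination buffer. */
size_t convert(const encoding *from, const unsigned char *src, size_t src_len,
               codepoint (*transcode)(codepoint),
               const encoding *to, unsigned char *dst, size_t dst_cap)
{
    size_t in = 0, out = 0;

    while (in < src_len) {
        unsigned char tmp[MAX_CHAR_BYTES];
        codepoint c;
        size_t made;
        size_t used = from->decode(src + in, src_len - in, &c);

        if (used == 0)
            return CONVERT_FAILED;
        if (transcode != NULL)
            c = transcode(c);           /* character-level mapping */
        made = to->encode(tmp, c);
        if (made == 0 || made > dst_cap - out)
            return CONVERT_FAILED;
        memcpy(dst + out, tmp, made);
        in  += used;
        out += made;
    }
    return out;
}
```

Because the loop only ever talks to one decoder and one encoder, no pair of encodings needs to know about each other, which is the property the message is pointing at.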
Re: String rationale
On Mon, Oct 29, 2001 at 11:20:47PM +, Tom Hughes wrote:
> > 2) But either can support converting directly if it wants.
> The danger is that everybody tries to be clever and support direct
> conversion to and from as many other character sets as possible, which
> leads to lots of duplication.

Yeah. But that's a convention thing, I think. I also think that most people won't go to the bother of writing conversion functions that they don't have to. What we need to worry about is both, say, big5 and shiftjis writing both of the conversions. And it shouldn't come up all that much, because Unicode is /supposed to be/ lossless for most things.

> I have already been thinking about this although it does get more
> complicated as you have to consider the encoding as well - if you
> have a single byte encoded ASCII string then transcoding to a single
> byte encoded Latin-1 string is a no-op, but that may not be true for
> other encodings if such a thing makes sense for those character types.

Hm. For all the encodings I can think of (which is rather limited -- the UTFs), you can scan for units (i.e. ints of the proper size) > 0x7f, and if you don't find any, it's 7bit, and you can just change the charset marker without doing any work. In any case, it's up to the encoding to tell if we've got a pure 7bit string. If that's complicated for it, it can just always return FALSE.

> I suspect that the encode and decode methods in the encoding vtable
> are enough for doing chr/ord aren't they?

Hmm... come to think of it, yes. chr will always create a utf32-encoded string with the given charset number (or unicode for the two-arg version), ord will return the codepoint within the current charset.

(This, BTW, means that only encodings that feel like it have to provide either, but all encodings must be able to convert to utf32.)

Powers-that-be (I'm looking at you, Dan), is that good?

-=- James Mastros
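The scan for units above 0x7f is short enough to sketch directly. The function names below are invented for the example; one checker is shown for byte-sized units and one for 32-bit units, on the assumption that each encoding would supply its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if every byte-sized unit is <= 0x7f, else 0. */
int is_7bit_bytes(const unsigned char *units, size_t count)
{
    size_t i;

    assert(units != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (units[i] > 0x7f)
            return 0;
    }
    return 1;
}

/* Same check for 32-bit units, e.g. a UTF-32 buffer. */
int is_7bit_utf32(const uint32_t *units, size_t count)
{
    size_t i;

    assert(units != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (units[i] > 0x7f)
            return 0;
    }
    return 1;
}
```

An encoding for which the test is awkward could, as suggested above, supply a checker that always returns 0 and so never takes the shortcut.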
Re: String rationale
In message <[EMAIL PROTECTED]>
James Mastros <[EMAIL PROTECTED]> wrote:

> > That leaves the third, which is what I have implemented. When looking to
> > transcode from A to B it will first ask A if it can transcode to B and
> > if that fails then it will ask B if it can transcode from A.
> I propose another variant on this:
> If that fails, it asks A to transcode to Unicode, and B to transcode from
> Unicode. (Not Unicode to transcode to B; Unicode implements no transcodings.)

My code does that, though at a slightly higher level. If you look at string_transcode() you will see that if it can't find a direct mapping it will go via unicode.

If C had closures then I'd have buried that down in the chartype_lookup_transcoder() layer, but it doesn't so I couldn't ;-)

> > The problem it raises is, who is responsible for transcoding from ASCII to
> > Latin-1? and back again? If we're not careful both ends will implement
> > both translations and we will have effective duplication.
> 1) Neither. Each must support transcoding to and from Unicode.

Absolutely.

> 2) But either can support converting directly if it wants.

The danger is that everybody tries to be clever and support direct conversion to and from as many other character sets as possible, which leads to lots of duplication.

> I also think that, for efficiency, we might want a "7-bit chars match ASCII"
> flag, since most character sets do, and that means that we don't have to deal
> with the overhead for strings that fit in 7 bits. This smells of premature
> optimization, though, so somebody just file this away in their heads for
> future reference.

I have already been thinking about this although it does get more complicated as you have to consider the encoding as well - if you have a single byte encoded ASCII string then transcoding to a single byte encoded Latin-1 string is a no-op, but that may not be true for other encodings if such a thing makes sense for those character types.

> (BTW, for those paying attention, I'm waiting on this discussion for my
> chr/ord patch, since I want them in terms of charsets, not encodings.)

I suspect that the encode and decode methods in the encoding vtable are enough for doing chr/ord aren't they?

Surely chr() is just encoding the argument in the chosen encoding (which can be the default encoding for the char type if you want) and then setting the type and encoding of the resulting string appropriately.

Equally ord() is decoding the first character of the string to get a number.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
Re: String rationale
On Mon, Oct 29, 2001 at 08:32:16PM +, Tom Hughes wrote:
> We have established that the first two will not work because of the
> unicode problem.

Hm. I think instead of requiring Unicode to support everything, we should require Unicode to support /nothing/. If A and B have no mutual transcoding function, we should use Unicode as an intermediary. (This means that charsets that are lossy to unicode need to transcode to each other directly, like Far Eastern sets. (And Klingon, but that can't transcode to anything.))

This still makes Unicode a special case, but not a terrible one. (In fact, unicode can be treated like any other charset, except when we want to transcode between mutually incompatible sets, since we always try both A->B and A<-B.) (Notational note: A->B means that A is implementing a transcoding from itself to B. A<-B means that A is implementing a transcoding from B to A.)

> That leaves the third, which is what I have implemented. When looking to
> transcode from A to B it will first ask A if it can transcode to B and
> if that fails then it will ask B if it can transcode from A.

I propose another variant on this: If that fails, it asks A to transcode to Unicode, and B to transcode from Unicode. (Not Unicode to transcode to B; Unicode implements no transcodings.)

> The problem it raises is, who is responsible for transcoding from ASCII to
> Latin-1? and back again? If we're not careful both ends will implement
> both translations and we will have effective duplication.

1) Neither. Each must support transcoding to and from Unicode.
2) But either can support converting directly if it wants.

I also think that, for efficiency, we might want a "7-bit chars match ASCII" flag, since most character sets do, and that means that we don't have to deal with the overhead for strings that fit in 7 bits. This smells of premature optimization, though, so somebody just file this away in their heads for future reference.

That would also mean that neither is responsible for converting between Latin-1 and ASCII, because core will do it, most of the time, and the rest of the time, it isn't possible. Hm. But it isn't possible _losslessly_, though it is possible lossily. IMHO, there should be two ways to transcode, or the transcoding function should flag to its caller somehow. (Sorry for the train-of-thought, but I think it's decently clear.)

(BTW, for those paying attention, I'm waiting on this discussion for my chr/ord patch, since I want them in terms of charsets, not encodings.)

-=- James Mastros
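The lookup order proposed here (ask A for A->B, then ask B for A<-B, and only then pivot through Unicode) could be sketched as follows. Everything in the block is hypothetical: the `chartype` struct, the `CT_*` IDs, `lookup_direct` and `transcode_char` are invented for the example, the sample mappings are plain identity functions, and range or lossiness checks are left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t codepoint;
typedef codepoint (*transcoder)(codepoint);

enum { CT_UNICODE, CT_ASCII, CT_LATIN1, CT_COUNT };   /* hypothetical IDs */

typedef struct {
    transcoder to[CT_COUNT];     /* this set -> other set, NULL if absent */
    transcoder from[CT_COUNT];   /* other set -> this set, NULL if absent */
} chartype;

/* Identity mapping; range checks are omitted in this sketch. */
static codepoint same_value(codepoint c) { return c; }

/* Unicode's entry is left zeroed: it implements no transcodings itself. */
static const chartype chartypes[CT_COUNT] = {
    [CT_ASCII]  = { .to   = { [CT_UNICODE] = same_value },
                    .from = { [CT_UNICODE] = same_value } },
    [CT_LATIN1] = { .to   = { [CT_UNICODE] = same_value },
                    .from = { [CT_UNICODE] = same_value,
                              [CT_ASCII]   = same_value } },
};

/* Ask a whether it can go to b; failing that, ask b whether it can come from a. */
static transcoder lookup_direct(int a, int b)
{
    if (chartypes[a].to[b] != NULL)
        return chartypes[a].to[b];
    return chartypes[b].from[a];             /* may be NULL */
}

/* Transcode one character from set a to set b.
 * Returns 0 on success, -1 if no route exists. */
int transcode_char(int a, int b, codepoint in, codepoint *out)
{
    transcoder direct, up, down;

    assert(a >= 0 && a < CT_COUNT && b >= 0 && b < CT_COUNT);

    if (a == b) {
        *out = in;
        return 0;
    }
    direct = lookup_direct(a, b);
    if (direct != NULL) {
        *out = direct(in);
        return 0;
    }
    /* No direct route: go via Unicode, done at the call site. */
    up   = lookup_direct(a, CT_UNICODE);
    down = lookup_direct(CT_UNICODE, b);
    if (up == NULL || down == NULL)
        return -1;
    *out = down(up(in));
    return 0;
}
```

In this table Latin-1 offers a direct "from ASCII" mapping, so ASCII to Latin-1 takes the direct route, while Latin-1 to ASCII has no direct entry on either side and falls back to the Unicode pivot.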
RE: String rationale
In message <[EMAIL PROTECTED]>
"Stephen Howard" <[EMAIL PROTECTED]> wrote:

> right. I had just keyed in on this from Tom's message:
>
> "My code currently allows either set to provide the transform on the
> grounds that otherwise the unicode module would have to either know
> how to convert to everything else or from everything else."
>
> ...which seemed to posit that Unicode module could be responsible for
> all the transcodings to and from its own character set, which seemed
> backwards to me.

I was only positing it long enough to acknowledge that such a rule was untenable.

What it comes down to is that there are three possible rules, namely:

1. Each character set defines transforms from itself to other character sets.

2. Each character set defines transforms to itself from other character sets.

3. Each character set defines transforms both from itself to other character sets and from other character sets to itself.

We have established that the first two will not work because of the unicode problem.

That leaves the third, which is what I have implemented. When looking to transcode from A to B it will first ask A if it can transcode to B and if that fails then it will ask B if it can transcode from A.

That way each character set can manage its own translations both to and from unicode as we require.

The problem it raises is, who is responsible for transcoding from ASCII to Latin-1? and back again? If we're not careful both ends will implement both translations and we will have effective duplication.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
RE: String rationale
right. I had just keyed in on this from Tom's message:

"My code currently allows either set to provide the transform on the grounds that otherwise the unicode module would have to either know how to convert to everything else or from everything else."

...which seemed to posit that Unicode module could be responsible for all the transcodings to and from its own character set, which seemed backwards to me.

-Stephen

-Original Message-
From: Dan Sugalski [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 29, 2001 02:43 PM
To: Stephen Howard; Tom Hughes; [EMAIL PROTECTED]
Subject: RE: String rationale

At 02:52 PM 10/29/2001 -0500, Stephen Howard wrote:
>You might consider requiring all character sets be able to convert to Unicode,

That's already a requirement. All character sets must be able to go to or come from Unicode. They can do others if they want, but it's not required. (And we'll have to figure out how to allow that reasonably efficiently)

Dan

--"it's like this"---
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                      have teddy bears and even
                                      teddy bears get drunk
RE: String rationale
At 02:52 PM 10/29/2001 -0500, Stephen Howard wrote:
>You might consider requiring all character sets be able to convert to Unicode,

That's already a requirement. All character sets must be able to go to or come from Unicode. They can do others if they want, but it's not required. (And we'll have to figure out how to allow that reasonably efficiently)

Dan

--"it's like this"---
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                      have teddy bears and even
                                      teddy bears get drunk
RE: String rationale
You might consider requiring all character sets be able to convert to Unicode, and otherwise only have to know how to convert other character sets to its own set.

-Original Message-
From: Tom Hughes [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 29, 2001 02:31 PM
To: [EMAIL PROTECTED]
Subject: Re: String rationale

In message <[EMAIL PROTECTED]>
Dan Sugalski <[EMAIL PROTECTED]> wrote:

> At 04:23 PM 10/27/2001 +0100, Tom Hughes wrote:
>
> >Attached is my first pass at this - it's not fully ready yet but
> >is something for people to cast an eye over before I spend lots of
> >time going down the wrong path ;-)
>
> It looks pretty good on first glance.

I've done a bit more work now, and the latest version is attached.

This version can do transcoding. The intention is that there will be some sort of cache in chartype_lookup_transcoder to avoid repeating the expensive lookups by name too much.

One interesting question is who is responsible for transcoding from character set A to character set B - is it A or B? and how about the other way?

My code currently allows either set to provide the transform on the grounds that otherwise the unicode module would have to either know how to convert to everything else or from everything else.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
Re: String rationale
In message <[EMAIL PROTECTED]>
Dan Sugalski <[EMAIL PROTECTED]> wrote:

> At 04:23 PM 10/27/2001 +0100, Tom Hughes wrote:
>
> >Attached is my first pass at this - it's not fully ready yet but
> >is something for people to cast an eye over before I spend lots of
> >time going down the wrong path ;-)
>
> It looks pretty good on first glance.

I've done a bit more work now, and the latest version is attached.

This version can do transcoding. The intention is that there will be some sort of cache in chartype_lookup_transcoder to avoid repeating the expensive lookups by name too much.

One interesting question is who is responsible for transcoding from character set A to character set B - is it A or B? and how about the other way?

My code currently allows either set to provide the transform on the grounds that otherwise the unicode module would have to either know how to convert to everything else or from everything else.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/

# This is a patch for parrot to update it to parrot-ns
#
# To apply this patch:
# STEP 1: Chdir to the source directory.
# STEP 2: Run the 'applypatch' program with this patch file as input.
#
# If you do not have 'applypatch', it is part of the 'makepatch' package
# that you can fetch from the Comprehensive Perl Archive Network:
# http://www.perl.com/CPAN/authors/Johan_Vromans/makepatch-x.y.tar.gz
# In the above URL, 'x' should be 2 or higher.
#
# To apply this patch without the use of 'applypatch':
# STEP 1: Chdir to the source directory.
# If you have a decent Bourne-type shell:
# STEP 2: Run the shell with this file as input.
# If you don't have such a shell, you may need to manually create/delete
# the files/directories as shown below.
# STEP 3: Run the 'patch' program with this file as input.
#
# These are the commands needed to create/delete files/directories:
#
mkdir 'chartypes'
chmod 0755 'chartypes'
mkdir 'encodings'
chmod 0755 'encodings'
rm -f 'transcode.c'
rm -f 'strutf8.c'
rm -f 'strutf32.c'
rm -f 'strutf16.c'
rm -f 'strnative.c'
rm -f 'include/parrot/transcode.h'
rm -f 'include/parrot/strutf8.h'
rm -f 'include/parrot/strutf32.h'
rm -f 'include/parrot/strutf16.h'
rm -f 'include/parrot/strnative.h'
touch 'chartype.c'
chmod 0644 'chartype.c'
touch 'chartypes/unicode.c'
chmod 0644 'chartypes/unicode.c'
touch 'chartypes/usascii.c'
chmod 0644 'chartypes/usascii.c'
touch 'encoding.c'
chmod 0644 'encoding.c'
touch 'encodings/singlebyte.c'
chmod 0644 'encodings/singlebyte.c'
touch 'encodings/utf16.c'
chmod 0644 'encodings/utf16.c'
touch 'encodings/utf32.c'
chmod 0644 'encodings/utf32.c'
touch 'encodings/utf8.c'
chmod 0644 'encodings/utf8.c'
touch 'include/parrot/chartype.h'
chmod 0644 'include/parrot/chartype.h'
touch 'include/parrot/encoding.h'
chmod 0644 'include/parrot/encoding.h'
#
# This command terminates the shell and need not be executed manually.
exit # End of Preamble Patch data follows diff -c 'parrot/MANIFEST' 'parrot-ns/MANIFEST' Index: ./MANIFEST *** ./MANIFEST Sun Oct 28 17:11:21 2001 --- ./MANIFEST Sun Oct 28 17:11:07 2001 *** *** 1,5 --- 1,8 assemble.pl ChangeLog + chartype.c + chartypes/unicode.c + chartypes/usascii.c classes/genclass.pl classes/intclass.c classes/scalarclass.c *** *** 15,20 --- 18,28 docs/parrotbyte.pod docs/strings.pod docs/vtables.pod + encoding.c + encodings/singlebyte.c + encodings/utf8.c + encodings/utf16.c + encodings/utf32.c examples/assembly/bsr.pasm examples/assembly/call.pasm examples/assembly/euclid.pasm *** *** 30,35 --- 38,45 global_setup.c hints/mswin32.pl hints/vms.pl + include/parrot/chartype.h + include/parrot/encoding.h include/parrot/events.h include/parrot/exceptions.h include/parrot/global_setup.h *** *** 46,56 include/parrot/runops_cores.h include/parrot/stacks.h include/parrot/string.h - include/parrot/strnative.h - include/parrot/strutf16.h - include/parrot/strutf32.h - include/parrot/strutf8.h - include/parrot/transcode.h include/parrot/trace.h include/parrot/unicode.h interpreter.c --- 56,61 *** *** 108,117 runops_cores.c stacks.c string.c - strnative.c - strutf16.c - strutf32.c - strutf8.c test_c.in test_main.c Test/More.pm --- 113,118 *** *** 129,135 t/op/time.t t/op/trans.t trace.c - transcode.c Types_pm.in vtable_h.pl vtable.tbl --- 130,135 diff -c 'parrot/Makefile.in' 'parrot-ns/Makefile.in' Index: ./Makefile.in *** ./Makefile.in Wed Oct 24 19:23:47 2001 --- ./Makefile.in Sat Oct 27 15:02:45 2001 *** *** 11,19 $(INC)/pmc.h $(INC)/resources.h O_FILES = global_setup$(O) interpreter$(O) parrot$(O) register$(O) \ ! core_ops$(O) memory$(O) packfile$(O) stacks$(O) string$(O) strnative$(O) \ ! strutf8$(O) strutf16$(O) strutf32$(O) transcode$(O) runops_cores$(O) \ ! trace$(O) vtable_ops$(O) cla
Re: String rationale
At 04:23 PM 10/27/2001 +0100, Tom Hughes wrote:
>In message <[EMAIL PROTECTED]>
> Tom Hughes <[EMAIL PROTECTED]> wrote:
>
> > Other than that it looked quite good and I'll probably start looking at
> > bending the existing code into the new model over the weekend.
>
>Attached is my first pass at this - it's not fully ready yet but
>is something for people to cast an eye over before I spend lots of
>time going down the wrong path ;-)

It looks pretty good on first glance.

>The packfile stuff is just a hack to make it work for now. Presumably
>we will have to modify the byte code format to record the string types
>as names or something so we can look them up properly?

Yup. I think tagging the strings with a few type integers and a set of name->type tables in the bytecode are going to be needed for this.

>String comparison is not language sensitive here - as before it just
>compares based on character values.

I'm still unsure as to how to properly handle locale-aware comparison, which is an interesting problem in and of itself. Luckily we just need to make the facilities for it, and someone else handles the policy. :)

>Other than that I think it's aiming in the right direction and it does
>pass all the tests... Please correct me if I'm wrong.

Let me mull it over a bit. I think I'm going to commit it, but a second think on it won't hurt.

Dan

--"it's like this"---
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                      have teddy bears and even
                                      teddy bears get drunk
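One way to read "a few type integers and a set of name->type tables" is that the bytecode tags each string with a small file-local index, and the loader resolves the file's table of names to runtime IDs once, at load time. The sketch below shows only that resolution step. It is speculative: the function `resolve_type_table`, the `lookup_fn` callback and the demo lookup with its ID values are all invented, and nothing here reflects an actual bytecode format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: map a type name to a runtime ID, or -1 if unknown. */
typedef int (*lookup_fn)(const char *name);

/* Resolve a file's name table: runtime_ids[i] receives the runtime ID
 * for names[i]. Returns 0 on success, -1 if any name is unknown. */
int resolve_type_table(const char *const *names, size_t count,
                       lookup_fn lookup, int *runtime_ids)
{
    size_t i;

    assert(names != NULL && lookup != NULL && runtime_ids != NULL);
    for (i = 0; i < count; i++) {
        int id = lookup(names[i]);

        if (id < 0)
            return -1;
        runtime_ids[i] = id;
    }
    return 0;
}

/* Demo lookup standing in for a chartype lookup by name;
 * the ID values are arbitrary and exist only for this example. */
int demo_chartype_lookup(const char *name)
{
    if (strcmp(name, "usascii") == 0) return 0;
    if (strcmp(name, "unicode") == 0) return 1;
    return -1;
}
```

A string tagged with file-local index 0 would then be given whatever runtime ID ended up in `runtime_ids[0]`, so the numbers stored in the file never need to agree with the numbers allocated in a particular run.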
Re: String rationale
In message <[EMAIL PROTECTED]>
Tom Hughes <[EMAIL PROTECTED]> wrote:

> Attached is my first pass at this - it's not fully ready yet but
> is something for people to cast an eye over before I spend lots of
> time going down the wrong path ;-)

Before anybody else spots it, let me just add what I forgot to mention in my original post, which is that transcoding isn't implemented yet as I'm still thinking about the best way to do it. There is a hook in place ready for it though.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
Re: String rationale
In message <[EMAIL PROTECTED]>
Tom Hughes <[EMAIL PROTECTED]> wrote:

> Other than that it looked quite good and I'll probably start looking at
> bending the existing code into the new model over the weekend.

Attached is my first pass at this - it's not fully ready yet but is something for people to cast an eye over before I spend lots of time going down the wrong path ;-)

The encoding_lookup() and chartype_lookup() routines will obviously need to load the relevant libraries on the fly when we have support for that.

The packfile stuff is just a hack to make it work for now. Presumably we will have to modify the byte code format to record the string types as names or something so we can look them up properly?

String comparison is not language sensitive here - as before it just compares based on character values.

Other than that I think it's aiming in the right direction and it does pass all the tests... Please correct me if I'm wrong.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/

# This is a patch for parrot to update it to parrot-ns
#
# To apply this patch:
# STEP 1: Chdir to the source directory.
# STEP 2: Run the 'applypatch' program with this patch file as input.
#
# If you do not have 'applypatch', it is part of the 'makepatch' package
# that you can fetch from the Comprehensive Perl Archive Network:
# http://www.perl.com/CPAN/authors/Johan_Vromans/makepatch-x.y.tar.gz
# In the above URL, 'x' should be 2 or higher.
#
# To apply this patch without the use of 'applypatch':
# STEP 1: Chdir to the source directory.
# If you have a decent Bourne-type shell:
# STEP 2: Run the shell with this file as input.
# If you don't have such a shell, you may need to manually create/delete
# the files/directories as shown below.
# STEP 3: Run the 'patch' program with this file as input.
#
# These are the commands needed to create/delete files/directories:
#
mkdir 'chartypes'
chmod 0755 'chartypes'
mkdir 'encodings'
chmod 0755 'encodings'
rm -f 'transcode.c'
rm -f 'strutf8.c'
rm -f 'strutf32.c'
rm -f 'strutf16.c'
rm -f 'strnative.c'
rm -f 'include/parrot/transcode.h'
rm -f 'include/parrot/strutf8.h'
rm -f 'include/parrot/strutf32.h'
rm -f 'include/parrot/strutf16.h'
rm -f 'include/parrot/strnative.h'
touch 'chartype.c'
chmod 0644 'chartype.c'
touch 'chartypes/unicode.c'
chmod 0644 'chartypes/unicode.c'
touch 'chartypes/usascii.c'
chmod 0644 'chartypes/usascii.c'
touch 'encoding.c'
chmod 0644 'encoding.c'
touch 'encodings/singlebyte.c'
chmod 0644 'encodings/singlebyte.c'
touch 'encodings/utf16.c'
chmod 0644 'encodings/utf16.c'
touch 'encodings/utf32.c'
chmod 0644 'encodings/utf32.c'
touch 'encodings/utf8.c'
chmod 0644 'encodings/utf8.c'
touch 'include/parrot/chartype.h'
chmod 0644 'include/parrot/chartype.h'
touch 'include/parrot/encoding.h'
chmod 0644 'include/parrot/encoding.h'
#
# This command terminates the shell and need not be executed manually.
exit # End of Preamble Patch data follows diff -c 'parrot/MANIFEST' 'parrot-ns/MANIFEST' Index: ./MANIFEST *** ./MANIFEST Wed Oct 24 22:16:51 2001 --- ./MANIFEST Sat Oct 27 14:59:43 2001 *** *** 1,5 --- 1,8 assemble.pl ChangeLog + chartype.c + chartypes/unicode.c + chartypes/usascii.c classes/genclass.pl classes/intclass.c config_h.in *** *** 14,19 --- 17,27 docs/parrotbyte.pod docs/strings.pod docs/vtables.pod + encoding.c + encodings/singlebyte.c + encodings/utf8.c + encodings/utf16.c + encodings/utf32.c examples/assembly/bsr.pasm examples/assembly/call.pasm examples/assembly/euclid.pasm *** *** 29,34 --- 37,44 global_setup.c hints/mswin32.pl hints/vms.pl + include/parrot/chartype.h + include/parrot/encoding.h include/parrot/events.h include/parrot/exceptions.h include/parrot/global_setup.h *** *** 45,55 include/parrot/runops_cores.h include/parrot/stacks.h include/parrot/string.h - include/parrot/strnative.h - include/parrot/strutf16.h - include/parrot/strutf32.h - include/parrot/strutf8.h - include/parrot/transcode.h include/parrot/trace.h include/parrot/unicode.h interpreter.c --- 55,60 *** *** 107,116 runops_cores.c stacks.c string.c - strnative.c - strutf16.c - strutf32.c - strutf8.c test_c.in test_main.c Test/More.pm --- 112,117 *** *** 128,134 t/op/time.t t/op/trans.t trace.c - transcode.c Types_pm.in vtable_h.pl vtable.tbl --- 129,134 diff -c 'parrot/Makefile.in' 'parrot-ns/Makefile.in' Index: ./Makefile.in *** ./Makefile.in Wed Oct 24 19:23:47 2001 --- ./Makefile.in Sat Oct 27 15:02:45 2001 *** *** 11,19 $(INC)/pmc.h $(INC)/resources.h O_FILES = global_setup$(O) interpreter$(O) parrot$(O) register$(O) \ ! core_ops$(O) memory$(O) packfile$(O) stacks$(O) string$(O) strnative$(O) \ ! strutf8$(O) strutf16$(O) strutf32$(O) transcode$(O) runops_cores$(O) \ ! trace$(O) vtable_op
Re: String rationale
At 11:59 PM 10/25/2001 +0100, Tom Hughes wrote:
>In message <[EMAIL PROTECTED]>
> Dan Sugalski <[EMAIL PROTECTED]> wrote:
>
> > =item type
> >
> > What the character set or type of data is encoded in the buffer. This
> > includes things like ASCII, EBCDIC, Unicode, Chinese Traditional,
> > Chinese Simplified, or Shift-JIS. (And yes, I know the latter's a
> > combination of type and encoding. I'll update the doc as soon as I can
> > reasonably separate the two)
>
>Isn't this going to need to be a vtable pointer like encoding is?

Yup. I'd intended it to be an index into a table of character set functions. Jarkko has convinced me that it's better to have it as a vtable pointer, but I haven't had a chance to update the docs yet.

Dan

--"it's like this"---
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                      have teddy bears and even
                                      teddy bears get drunk
Re: String rationale
In message <[EMAIL PROTECTED]>
Dan Sugalski <[EMAIL PROTECTED]> wrote:

> =item type
>
> What the character set or type of data is encoded in the buffer. This
> includes things like ASCII, EBCDIC, Unicode, Chinese Traditional,
> Chinese Simplified, or Shift-JIS. (And yes, I know the latter's a
> combination of type and encoding. I'll update the doc as soon as I can
> reasonably separate the two)

Isn't this going to need to be a vtable pointer like encoding is? Only some things (like character classification and at least some transcoding tasks) will be character set based rather than encoding based.

Other than that it looked quite good and I'll probably start looking at bending the existing code into the new model over the weekend.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
Re: String rationale
At 12:19 PM 10/25/2001 -0400, Sam Tregar wrote:
>On Thu, 25 Oct 2001, Dan Sugalski wrote:
>
> > The only bits of the interpreter that much care about the
> > string data are the regex engine parts, and those only operate on
> > fixed-sized data.
>
>Care to elaborate? I thought the mandate from Larry was to have regexes
>compile down to a stream of string ops. Doesn't that mean it should work
>regardless of the encoding of the string?

Since the encoding just determines how the abstract code point numbers are represented in bytes, I'm OK with requiring strings we process internally to be in a fixed-size version.

And regexes will be done with a stream of parrot opcodes, presuming that's not too slow. There'll be ops to reference the code point at position X in a string and check to see if it's in a list of other code points and suchlike things. Basically we'll peek under the covers, but only for fixed-length strings.

> > The interpreter can only peek inside a string if that string is of
> > fixed length, and the interpreter doesn't actually care about the
> > character set the data is in.
>
>Why is this necessary at all? Wouldn't it be preferable to have all
>access go through the String vtable regardless of the encoding?

Speed. We're going to take something of a hit decomposing to ops as it is--if we can safely cheat, I'm OK with mandating it to be required. :)

> > =item encoding
> >
> > Pointer to the library that handles the string encoding. Encoding is
> > basically how the stream of bytes pointed to by C can be
> > turned into a stream of 32-bit codepoints. Examples include UTF-8, Big
> > 5, or Shift JIS. Unicode, Ascii, or EBCDIC are B<not> encodings.first
>
>.first?

Trailing buffer gook.

Dan

--"it's like this"---
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                      have teddy bears and even
                                      teddy bears get drunk
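The two operations mentioned for regexes (fetch the code point at position X, test whether it is in a list of code points) are cheap exactly because the string is fixed-width: position X is a plain array index. A rough sketch, with invented function names and a 32-bit unit size chosen only for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code point at position pos of a fixed-width (32-bit unit) string.
 * With fixed-size units this is a direct index, with no scanning. */
uint32_t codepoint_at(const uint32_t *units, size_t len, size_t pos)
{
    assert(units != NULL && pos < len);
    return units[pos];
}

/* Return 1 if c appears in list, else 0. */
int codepoint_in_list(uint32_t c, const uint32_t *list, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (list[i] == c)
            return 1;
    }
    return 0;
}
```

For a variable-width encoding such as UTF-8 the first function would have to walk the buffer from the start, which is the cost the fixed-size requirement avoids.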
Re: String rationale
On Thu, 25 Oct 2001, Dan Sugalski wrote:

> The only bits of the interpreter that much care about the
> string data are the regex engine parts, and those only operate on
> fixed-sized data.

Care to elaborate? I thought the mandate from Larry was to have regexes compile down to a stream of string ops. Doesn't that mean it should work regardless of the encoding of the string?

> The interpreter can only peek inside a string if that string is of
> fixed length, and the interpreter doesn't actually care about the
> character set the data is in.

Why is this necessary at all? Wouldn't it be preferable to have all access go through the String vtable regardless of the encoding?

> =item encoding
>
> Pointer to the library that handles the string encoding. Encoding is
> basically how the stream of bytes pointed to by C can be
> turned into a stream of 32-bit codepoints. Examples include UTF-8, Big
> 5, or Shift JIS. Unicode, Ascii, or EBCDIC are B<not> encodings.first

.first?

Aside from the above, this was a nice refresher.

-sam