Re: String rationale
In message [EMAIL PROTECTED] Simon Cozens [EMAIL PROTECTED] wrote:

On Sat, Oct 27, 2001 at 04:23:48PM +0100, Tom Hughes wrote:

The encoding_lookup() and chartype_lookup() routines will obviously need to load the relevant libraries on the fly when we have support for that.

Could you try rewriting them using an enum, like the vtable stuff and the original string encoding stuff does?

The intention is that when an encoding or character type is loaded it will be allocated a unique ID number that can be used internally to refer to it, but that the number will only be valid for the duration of that instance of parrot rather than being persistent. That's certainly the way Dan described it happening in his rationale, which is what my code is based on.

Allocating them globally is not possible if we're going to allow people to add arbitrary encodings and character sets - as things stand adding the foo encoding will be as simple as adding foo.so to the encodings directory.

Tom

--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu
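For anyone following along, here is a minimal C sketch of the per-process ID allocation described above. The table size and the name encoding_lookup_id() are assumptions made for illustration; they are not taken from the patch, and the real routines would also load the library.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical registry: the names and the fixed table size are
 * illustrative only, not Parrot's actual API. */
#define MAX_ENCODINGS 32

static const char *encoding_names[MAX_ENCODINGS];
static int encoding_count = 0;

/* Return the ID for `name`, registering it on first use.
 * IDs are handed out in load order, so they are only meaningful
 * inside the current process. Returns -1 if the table is full.
 * The caller must keep `name` alive; the table stores the pointer. */
int encoding_lookup_id(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < encoding_count; i++) {
        if (strcmp(encoding_names[i], name) == 0)
            return i;
    }
    if (encoding_count == MAX_ENCODINGS)
        return -1;
    /* A real loader would open something like encodings/<name>.so here. */
    encoding_names[encoding_count] = name;
    return encoding_count++;
}
```

Because the ID depends on load order, anything written to disk would have to record the name rather than the number.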
Re: String rationale
At 07:16 PM 10/29/2001 -0500, James Mastros wrote:

Yeah. But that's a convention thing, I think. I also think that most people won't go to the bother of writing conversion functions that they don't have to. What we need to worry about is both, say, big5 and shiftjis writing both of the conversions. And it shouldn't come up all that much, because Unicode is /supposed to be/ lossless for most things.

Supposed to be, yep. Whether it *is* or not is another issue entirely. :)

I suspect that the encode and decode methods in the encoding vtable are enough for doing chr/ord, aren't they?

Hmm... come to think of it, yes. chr will always create a utf32-encoded string with the given charset number (or unicode for the two-arg version), ord will return the codepoint within the current charset.

Erk. No. chr should give you a string in the encoding you've selected, or the default encoding if you've not selected one. That may not be (probably won't be) UTF32.

(This, BTW, means that only encodings that feel like it have to provide either, but all encodings must be able to convert to utf32.)

More or less, yep. Everyone has to go to UTF32. Direct encoding to encoding is optional. Encouraged in those cases where it's either quicker or less uncertain.

Dan

--it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: String rationale
In message [EMAIL PROTECTED] Dan Sugalski [EMAIL PROTECTED] wrote:

At 04:23 PM 10/27/2001 +0100, Tom Hughes wrote:

Attached is my first pass at this - it's not fully ready yet but is something for people to cast an eye over before I spend lots of time going down the wrong path ;-)

It looks pretty good on first glance.

I've done a bit more work now, and the latest version is attached. This version can do transcoding. The intention is that there will be some sort of cache in chartype_lookup_transcoder to avoid repeating the expensive lookups by name too much.

One interesting question is who is responsible for transcoding from character set A to character set B - is it A or B? and how about the other way? My code currently allows either set to provide the transform on the grounds that otherwise the unicode module would have to either know how to convert to everything else or from everything else.

Tom

--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/

# This is a patch for parrot to update it to parrot-ns
#
# To apply this patch:
# STEP 1: Chdir to the source directory.
# STEP 2: Run the 'applypatch' program with this patch file as input.
#
# If you do not have 'applypatch', it is part of the 'makepatch' package
# that you can fetch from the Comprehensive Perl Archive Network:
# http://www.perl.com/CPAN/authors/Johan_Vromans/makepatch-x.y.tar.gz
# In the above URL, 'x' should be 2 or higher.
#
# To apply this patch without the use of 'applypatch':
# STEP 1: Chdir to the source directory.
# If you have a decent Bourne-type shell:
# STEP 2: Run the shell with this file as input.
# If you don't have such a shell, you may need to manually create/delete
# the files/directories as shown below.
# STEP 3: Run the 'patch' program with this file as input.
#
# These are the commands needed to create/delete files/directories:
#
mkdir 'chartypes'
chmod 0755 'chartypes'
mkdir 'encodings'
chmod 0755 'encodings'
rm -f 'transcode.c'
rm -f 'strutf8.c'
rm -f 'strutf32.c'
rm -f 'strutf16.c'
rm -f 'strnative.c'
rm -f 'include/parrot/transcode.h'
rm -f 'include/parrot/strutf8.h'
rm -f 'include/parrot/strutf32.h'
rm -f 'include/parrot/strutf16.h'
rm -f 'include/parrot/strnative.h'
touch 'chartype.c'
chmod 0644 'chartype.c'
touch 'chartypes/unicode.c'
chmod 0644 'chartypes/unicode.c'
touch 'chartypes/usascii.c'
chmod 0644 'chartypes/usascii.c'
touch 'encoding.c'
chmod 0644 'encoding.c'
touch 'encodings/singlebyte.c'
chmod 0644 'encodings/singlebyte.c'
touch 'encodings/utf16.c'
chmod 0644 'encodings/utf16.c'
touch 'encodings/utf32.c'
chmod 0644 'encodings/utf32.c'
touch 'encodings/utf8.c'
chmod 0644 'encodings/utf8.c'
touch 'include/parrot/chartype.h'
chmod 0644 'include/parrot/chartype.h'
touch 'include/parrot/encoding.h'
chmod 0644 'include/parrot/encoding.h'
#
# This command terminates the shell and need not be executed manually.
exit
# End of Preamble
Patch data follows
diff -c 'parrot/MANIFEST' 'parrot-ns/MANIFEST'
Index: ./MANIFEST
*** ./MANIFEST Sun Oct 28 17:11:21 2001
--- ./MANIFEST Sun Oct 28 17:11:07 2001
***************
*** 1,5 ****
--- 1,8 ----
  assemble.pl
  ChangeLog
+ chartype.c
+ chartypes/unicode.c
+ chartypes/usascii.c
  classes/genclass.pl
  classes/intclass.c
  classes/scalarclass.c
***************
*** 15,20 ****
--- 18,28 ----
  docs/parrotbyte.pod
  docs/strings.pod
  docs/vtables.pod
+ encoding.c
+ encodings/singlebyte.c
+ encodings/utf8.c
+ encodings/utf16.c
+ encodings/utf32.c
  examples/assembly/bsr.pasm
  examples/assembly/call.pasm
  examples/assembly/euclid.pasm
***************
*** 30,35 ****
--- 38,45 ----
  global_setup.c
  hints/mswin32.pl
  hints/vms.pl
+ include/parrot/chartype.h
+ include/parrot/encoding.h
  include/parrot/events.h
  include/parrot/exceptions.h
  include/parrot/global_setup.h
***************
*** 46,56 ****
  include/parrot/runops_cores.h
  include/parrot/stacks.h
  include/parrot/string.h
- include/parrot/strnative.h
- include/parrot/strutf16.h
- include/parrot/strutf32.h
- include/parrot/strutf8.h
- include/parrot/transcode.h
  include/parrot/trace.h
  include/parrot/unicode.h
  interpreter.c
--- 56,61 ----
***************
*** 108,117 ****
  runops_cores.c
  stacks.c
  string.c
- strnative.c
- strutf16.c
- strutf32.c
- strutf8.c
  test_c.in
  test_main.c
  Test/More.pm
--- 113,118 ----
***************
*** 129,135 ****
  t/op/time.t
  t/op/trans.t
  trace.c
- transcode.c
  Types_pm.in
  vtable_h.pl
  vtable.tbl
--- 130,135 ----
diff -c 'parrot/Makefile.in' 'parrot-ns/Makefile.in'
Index: ./Makefile.in
*** ./Makefile.in Wed Oct 24 19:23:47 2001
--- ./Makefile.in Sat Oct 27 15:02:45 2001
***************
*** 11,19 ****
  $(INC)/pmc.h $(INC)/resources.h
  O_FILES = global_setup$(O) interpreter$(O) parrot$(O) register$(O) \
! core_ops$(O) memory$(O) packfile$(O) stacks$(O) string$(O) strnative$(O) \
! strutf8$(O) strutf16$(O) strutf32$(O) transcode$(O) runops_cores$(O) \
! trace$(O) vtable_ops$(O)
RE: String rationale
You might consider requiring that all character sets be able to convert to Unicode, and otherwise only have to know how to convert other character sets to their own set.

-Original Message-
From: Tom Hughes [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 29, 2001 02:31 PM
To: [EMAIL PROTECTED]
Subject: Re: String rationale

In message [EMAIL PROTECTED] Dan Sugalski [EMAIL PROTECTED] wrote:

At 04:23 PM 10/27/2001 +0100, Tom Hughes wrote:

Attached is my first pass at this - it's not fully ready yet but is something for people to cast an eye over before I spend lots of time going down the wrong path ;-)

It looks pretty good on first glance.

I've done a bit more work now, and the latest version is attached. This version can do transcoding. The intention is that there will be some sort of cache in chartype_lookup_transcoder to avoid repeating the expensive lookups by name too much.

One interesting question is who is responsible for transcoding from character set A to character set B - is it A or B? and how about the other way? My code currently allows either set to provide the transform on the grounds that otherwise the unicode module would have to either know how to convert to everything else or from everything else.

Tom

--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
RE: String rationale
At 02:52 PM 10/29/2001 -0500, Stephen Howard wrote: You might consider requiring all character sets be able to convert to Unicode, That's already a requirement. All character sets must be able to go to or come from Unicode. They can do others if they want, but it's not required. (And we'll have to figure out how to allow that reasonably efficiently) Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
RE: String rationale
Right. I had just keyed in on this from Tom's message:

My code currently allows either set to provide the transform on the grounds that otherwise the unicode module would have to either know how to convert to everything else or from everything else.

...which seemed to posit that the Unicode module could be responsible for all the transcodings to and from its own character set, which seemed backwards to me.

-Stephen

-Original Message-
From: Dan Sugalski [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 29, 2001 02:43 PM
To: Stephen Howard; Tom Hughes; [EMAIL PROTECTED]
Subject: RE: String rationale

At 02:52 PM 10/29/2001 -0500, Stephen Howard wrote:

You might consider requiring all character sets be able to convert to Unicode,

That's already a requirement. All character sets must be able to go to or come from Unicode. They can do others if they want, but it's not required. (And we'll have to figure out how to allow that reasonably efficiently)

Dan

--it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
RE: String rationale
In message [EMAIL PROTECTED] Stephen Howard [EMAIL PROTECTED] wrote:

right. I had just keyed in on this from Tom's message:

My code currently allows either set to provide the transform on the grounds that otherwise the unicode module would have to either know how to convert to everything else or from everything else.

...which seemed to posit that Unicode module could be responsible for all the transcodings to and from its own character set, which seemed backwards to me.

I was only positing it long enough to acknowledge that such a rule was untenable.

What it comes down to is that there are three possible rules, namely:

1. Each character set defines transforms from itself to other character sets.

2. Each character set defines transforms to itself from other character sets.

3. Each character set defines transforms both from itself to other character sets and from other character sets to itself.

We have established that the first two will not work because of the unicode problem.

That leaves the third, which is what I have implemented. When looking to transcode from A to B it will first ask A if it can transcode to B and if that fails then it will ask B if it can transcode from A.

That way each character set can manage its own translations both to and from unicode as we require.

The problem it raises is, who is responsible for transcoding from ASCII to Latin-1? and back again? If we're not careful both ends will implement both translations and we will have effective duplication.

Tom

--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
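The lookup order Tom describes (ask A for a transform to B, then ask B for a transform from A) could be sketched in C roughly as follows. The chartype struct, its two hooks, and the toy usascii/latin1 sets are assumptions invented for this illustration and are not the layout used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types: a transcoder maps one code point to another. */
typedef int (*transcoder_fn)(int codepoint);

typedef struct chartype {
    const char *name;
    /* Each hook returns NULL when the set offers no such transform. */
    transcoder_fn (*transcode_to)(const char *to_name);
    transcoder_fn (*transcode_from)(const char *from_name);
} chartype;

/* Rule 3 from the message: first ask the source set, then the target set. */
transcoder_fn lookup_transcoder(const chartype *from, const chartype *to)
{
    assert(from != NULL && to != NULL);
    transcoder_fn fn = from->transcode_to(to->name);
    if (fn == NULL)
        fn = to->transcode_from(from->name);
    return fn;
}

/* Toy character sets: only latin1 knows how to import usascii. */
static int identity(int cp) { return cp; }

static transcoder_fn no_transform(const char *name)
{
    (void)name;
    return NULL;
}

static transcoder_fn latin1_from(const char *from_name)
{
    return strcmp(from_name, "usascii") == 0 ? identity : NULL;
}

const chartype usascii = { "usascii", no_transform, no_transform };
const chartype latin1  = { "latin1",  no_transform, latin1_from };
```

With only one side registering the ASCII to Latin-1 transform, the lookup still succeeds, which is the property that avoids forcing both ends to implement it.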
Re: String rationale
On Mon, Oct 29, 2001 at 08:32:16PM +, Tom Hughes wrote:

We have established that the first two will not work because of the unicode problem.

Hm. I think instead of requiring Unicode to support everything, we should require Unicode to support /nothing/. If A and B have no mutual transcoding function, we should use Unicode as an intermediary. (This means that charsets that are lossy to unicode need to transcode to each other directly, like Far Eastern sets. (And Klingon, but that can't transcode to anything.))

This still makes Unicode a special case, but not a terrible one. (In fact, unicode can be treated like any other charset, except when we want to transcode between mutually incompatible sets, since we always try both A->B and A<-B.)

(Notational note: A->B means that A is implementing a transcoding from itself to B. A<-B means that A is implementing a transcoding from B to A.)

That leaves the third, which is what I have implemented. When looking to transcode from A to B it will first ask A if it can transcode to B and if that fails then it will ask B if it can transcode from A.

I propose another variant on this: If that fails, it asks A to transcode to Unicode, and B to transcode from Unicode. (Not Unicode to transcode to B; Unicode implements no transcodings.)

The problem it raises is, who is responsible for transcoding from ASCII to Latin-1? and back again? If we're not careful both ends will implement both translations and we will have effective duplication.

1) Neither. Each must support transcoding to and from Unicode.

2) But either can support converting directly if it wants.

I also think that, for efficiency, we might want a "7-bit chars match ASCII" flag, since most character sets do, and that means that we don't have to deal with the overhead for strings that fit in 7 bits. This smells of premature optimization, though, so somebody just file this away in their heads for future reference.

That would also mean that neither is responsible for converting between Latin-1 and ASCII, because core will do it, most of the time, and the rest of the time, it isn't possible.

Hm. But it isn't possible _losslessly_, though it is possible lossily. IMHO, there should be two ways to transcode, or the transcoding function should flag to its caller somehow.

(Sorry for the train-of-thought, but I think it's decently clear.)

(BTW, for those paying attention, I'm waiting on this discussion for my chr/ord patch, since I want them in terms of charsets, not encodings.)

-=- James Mastros
Re: String rationale
In message [EMAIL PROTECTED] James Mastros [EMAIL PROTECTED] wrote:

That leaves the third, which is what I have implemented. When looking to transcode from A to B it will first ask A if it can transcode to B and if that fails then it will ask B if it can transcode from A.

I propose another variant on this: If that fails, it asks A to transcode to Unicode, and B to transcode from Unicode. (Not Unicode to transcode to B; Unicode implements no transcodings.)

My code does that, though at a slightly higher level. If you look at string_transcode() you will see that if it can't find a direct mapping it will go via unicode. If C had closures then I'd have buried that down in the chartype_lookup_transcoder() layer, but it doesn't so I couldn't ;-)

The problem it raises is, who is responsible for transcoding from ASCII to Latin-1? and back again? If we're not careful both ends will implement both translations and we will have effective duplication.

1) Neither. Each must support transcoding to and from Unicode.

Absolutely.

2) But either can support converting directly if it wants.

The danger is that everybody tries to be clever and support direct conversion to and from as many other character sets as possible, which leads to lots of duplication.

I also think that, for efficiency, we might want a "7-bit chars match ASCII" flag, since most character sets do, and that means that we don't have to deal with the overhead for strings that fit in 7 bits. This smells of premature optimization, though, so somebody just file this away in their heads for future reference.

I have already been thinking about this although it does get more complicated as you have to consider the encoding as well - if you have a single byte encoded ASCII string then transcoding to a single byte encoded Latin-1 string is a no-op, but that may not be true for other encodings if such a thing makes sense for those character types.

(BTW, for those paying attention, I'm waiting on this discussion for my chr/ord patch, since I want them in terms of charsets, not encodings.)

I suspect that the encode and decode methods in the encoding vtable are enough for doing chr/ord, aren't they?

Surely chr() is just encoding the argument in the chosen encoding (which can be the default encoding for the char type if you want) and then setting the type and encoding of the resulting string appropriately.

Equally ord() is decoding the first character of the string to get a number.

Tom

--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
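To make the chr()/ord() point concrete, here is a rough C sketch in which both are thin wrappers over an encoding's encode and decode hooks. The encoding struct, the toy_string type, and the function names are all hypothetical and are not the vtable from the patch; the UTF-32 variant simply stores code points in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical encoding vtable with just the two hooks discussed. */
typedef struct encoding {
    const char *name;
    size_t   (*encode)(unsigned char *buf, uint32_t cp); /* bytes written */
    uint32_t (*decode)(const unsigned char *buf);        /* first character */
} encoding;

static size_t sb_encode(unsigned char *buf, uint32_t cp)
{
    assert(cp <= 0xFF);              /* single-byte sets hold 0..255 only */
    buf[0] = (unsigned char)cp;
    return 1;
}

static uint32_t sb_decode(const unsigned char *buf) { return buf[0]; }

static size_t u32_encode(unsigned char *buf, uint32_t cp)
{
    memcpy(buf, &cp, sizeof cp);     /* host byte order, for simplicity */
    return sizeof cp;
}

static uint32_t u32_decode(const unsigned char *buf)
{
    uint32_t cp;
    memcpy(&cp, buf, sizeof cp);
    return cp;
}

const encoding enc_singlebyte = { "singlebyte", sb_encode, sb_decode };
const encoding enc_utf32      = { "utf32",      u32_encode, u32_decode };

/* A one-character string: just enough state to show the idea. */
typedef struct toy_string {
    unsigned char   buf[8];
    size_t          bufused;
    const encoding *enc;
} toy_string;

/* chr: encode the number in the chosen encoding and tag the result. */
toy_string toy_chr(uint32_t cp, const encoding *enc)
{
    toy_string s;
    memset(&s, 0, sizeof s);
    s.enc = enc;
    s.bufused = enc->encode(s.buf, cp);
    return s;
}

/* ord: decode the first character of the string to get a number. */
uint32_t toy_ord(const toy_string *s)
{
    assert(s != NULL && s->bufused > 0);
    return s->enc->decode(s->buf);
}
```

Nothing here touches the character set, which is why the encoding hooks alone appear to be sufficient for these two operations.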
Re: String rationale
On Mon, Oct 29, 2001 at 11:20:47PM +, Tom Hughes wrote:

2) But either can support converting directly if it wants.

The danger is that everybody tries to be clever and support direct conversion to and from as many other character sets as possible, which leads to lots of duplication.

Yeah. But that's a convention thing, I think. I also think that most people won't go to the bother of writing conversion functions that they don't have to. What we need to worry about is both, say, big5 and shiftjis writing both of the conversions. And it shouldn't come up all that much, because Unicode is /supposed to be/ lossless for most things.

I have already been thinking about this although it does get more complicated as you have to consider the encoding as well - if you have a single byte encoded ASCII string then transcoding to a single byte encoded Latin-1 string is a no-op, but that may not be true for other encodings if such a thing makes sense for those character types.

Hm. For all the encodings I can think of (which is rather limited -- the UTFs), you can scan for units (i.e. ints of the proper size) > 0x7f, and if you don't find any, it's 7bit, and you can just change the charset marker without doing any work. In any case, it's up to the encoding to tell if we've got a pure 7bit string. If that's complicated for it, it can just always return FALSE.

I suspect that the encode and decode methods in the encoding vtable are enough for doing chr/ord, aren't they?

Hmm... come to think of it, yes. chr will always create a utf32-encoded string with the given charset number (or unicode for the two-arg version), ord will return the codepoint within the current charset.

(This, BTW, means that only encodings that feel like it have to provide either, but all encodings must be able to convert to utf32.)

Powers-that-be (I'm looking at you, Dan), is that good?

-=- James Mastros
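The 7-bit scan James describes is short enough to sketch in C. The function names below are made up for illustration; each encoding would supply whichever variant matches its unit size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if every 8-bit unit is <= 0x7f, 0 otherwise.
 * Hypothetical helper for byte-oriented encodings such as UTF-8. */
int is_7bit_u8(const uint8_t *units, size_t n)
{
    assert(units != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (units[i] > 0x7f)
            return 0;
    }
    return 1;
}

/* Same check for 32-bit units such as UTF-32. */
int is_7bit_u32(const uint32_t *units, size_t n)
{
    assert(units != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (units[i] > 0x7f)
            return 0;
    }
    return 1;
}
```

An encoding for which the check is awkward can simply return 0 every time, at the cost of always taking the slow transcoding path.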
Re: String rationale
In message [EMAIL PROTECTED] Tom Hughes [EMAIL PROTECTED] wrote:

Other than that it looked quite good and I'll probably start looking at bending the existing code into the new model over the weekend.

Attached is my first pass at this - it's not fully ready yet but is something for people to cast an eye over before I spend lots of time going down the wrong path ;-)

The encoding_lookup() and chartype_lookup() routines will obviously need to load the relevant libraries on the fly when we have support for that.

The packfile stuff is just a hack to make it work for now. Presumably we will have to modify the byte code format to record the string types as names or something so we can look them up properly?

String comparison is not language sensitive here - as before it just compares based on character values.

Other than that I think it's aiming in the right direction and it does pass all the tests... Please correct me if I'm wrong.

Tom

--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/

# This is a patch for parrot to update it to parrot-ns
#
# To apply this patch:
# STEP 1: Chdir to the source directory.
# STEP 2: Run the 'applypatch' program with this patch file as input.
#
# If you do not have 'applypatch', it is part of the 'makepatch' package
# that you can fetch from the Comprehensive Perl Archive Network:
# http://www.perl.com/CPAN/authors/Johan_Vromans/makepatch-x.y.tar.gz
# In the above URL, 'x' should be 2 or higher.
#
# To apply this patch without the use of 'applypatch':
# STEP 1: Chdir to the source directory.
# If you have a decent Bourne-type shell:
# STEP 2: Run the shell with this file as input.
# If you don't have such a shell, you may need to manually create/delete
# the files/directories as shown below.
# STEP 3: Run the 'patch' program with this file as input.
#
# These are the commands needed to create/delete files/directories:
#
mkdir 'chartypes'
chmod 0755 'chartypes'
mkdir 'encodings'
chmod 0755 'encodings'
rm -f 'transcode.c'
rm -f 'strutf8.c'
rm -f 'strutf32.c'
rm -f 'strutf16.c'
rm -f 'strnative.c'
rm -f 'include/parrot/transcode.h'
rm -f 'include/parrot/strutf8.h'
rm -f 'include/parrot/strutf32.h'
rm -f 'include/parrot/strutf16.h'
rm -f 'include/parrot/strnative.h'
touch 'chartype.c'
chmod 0644 'chartype.c'
touch 'chartypes/unicode.c'
chmod 0644 'chartypes/unicode.c'
touch 'chartypes/usascii.c'
chmod 0644 'chartypes/usascii.c'
touch 'encoding.c'
chmod 0644 'encoding.c'
touch 'encodings/singlebyte.c'
chmod 0644 'encodings/singlebyte.c'
touch 'encodings/utf16.c'
chmod 0644 'encodings/utf16.c'
touch 'encodings/utf32.c'
chmod 0644 'encodings/utf32.c'
touch 'encodings/utf8.c'
chmod 0644 'encodings/utf8.c'
touch 'include/parrot/chartype.h'
chmod 0644 'include/parrot/chartype.h'
touch 'include/parrot/encoding.h'
chmod 0644 'include/parrot/encoding.h'
#
# This command terminates the shell and need not be executed manually.
exit
# End of Preamble
Patch data follows
diff -c 'parrot/MANIFEST' 'parrot-ns/MANIFEST'
Index: ./MANIFEST
*** ./MANIFEST Wed Oct 24 22:16:51 2001
--- ./MANIFEST Sat Oct 27 14:59:43 2001
***************
*** 1,5 ****
--- 1,8 ----
  assemble.pl
  ChangeLog
+ chartype.c
+ chartypes/unicode.c
+ chartypes/usascii.c
  classes/genclass.pl
  classes/intclass.c
  config_h.in
***************
*** 14,19 ****
--- 17,27 ----
  docs/parrotbyte.pod
  docs/strings.pod
  docs/vtables.pod
+ encoding.c
+ encodings/singlebyte.c
+ encodings/utf8.c
+ encodings/utf16.c
+ encodings/utf32.c
  examples/assembly/bsr.pasm
  examples/assembly/call.pasm
  examples/assembly/euclid.pasm
***************
*** 29,34 ****
--- 37,44 ----
  global_setup.c
  hints/mswin32.pl
  hints/vms.pl
+ include/parrot/chartype.h
+ include/parrot/encoding.h
  include/parrot/events.h
  include/parrot/exceptions.h
  include/parrot/global_setup.h
***************
*** 45,55 ****
  include/parrot/runops_cores.h
  include/parrot/stacks.h
  include/parrot/string.h
- include/parrot/strnative.h
- include/parrot/strutf16.h
- include/parrot/strutf32.h
- include/parrot/strutf8.h
- include/parrot/transcode.h
  include/parrot/trace.h
  include/parrot/unicode.h
  interpreter.c
--- 55,60 ----
***************
*** 107,116 ****
  runops_cores.c
  stacks.c
  string.c
- strnative.c
- strutf16.c
- strutf32.c
- strutf8.c
  test_c.in
  test_main.c
  Test/More.pm
--- 112,117 ----
***************
*** 128,134 ****
  t/op/time.t
  t/op/trans.t
  trace.c
- transcode.c
  Types_pm.in
  vtable_h.pl
  vtable.tbl
--- 129,134 ----
diff -c 'parrot/Makefile.in' 'parrot-ns/Makefile.in'
Index: ./Makefile.in
*** ./Makefile.in Wed Oct 24 19:23:47 2001
--- ./Makefile.in Sat Oct 27 15:02:45 2001
***************
*** 11,19 ****
  $(INC)/pmc.h $(INC)/resources.h
  O_FILES = global_setup$(O) interpreter$(O) parrot$(O) register$(O) \
! core_ops$(O) memory$(O) packfile$(O) stacks$(O) string$(O) strnative$(O) \
! strutf8$(O) strutf16$(O) strutf32$(O) transcode$(O) runops_cores$(O) \
! trace$(O) vtable_ops$(O)
Re: String rationale
In message [EMAIL PROTECTED] Tom Hughes [EMAIL PROTECTED] wrote:

Attached is my first pass at this - it's not fully ready yet but is something for people to cast an eye over before I spend lots of time going down the wrong path ;-)

Before anybody else spots it, let me just add what I forgot to mention in my original post, which is that transcoding isn't implemented yet as I'm still thinking about the best way to do it. There is a hook in place ready for it though.

Tom

--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
Re: String rationale
At 04:23 PM 10/27/2001 +0100, Tom Hughes wrote: In message [EMAIL PROTECTED] Tom Hughes [EMAIL PROTECTED] wrote: Other than that it looked quite good and I'll probably start looking at bending the existing code into the new model over the weekend. Attached is my first pass at this - it's not fully ready yet but is something for people to cast an eye over before I spend lots of time going down the wrong path ;-) It looks pretty good on first glance. The packfile stuff is just a hack to make it work for now. Presumably we will have to modify the byte code format to record the string types as names or something so we can look them up properly? Yup. I think tagging the strings with a few type integers and a set of name-type tables in the bytecode are going to be needed for this. String comparison is not language sensitive here - as before it just compares based on character values. I'm still unsure as to how to properly handle locale-aware comparison, which is an interesting problem in and of itself. Luckily we just need to make the facilities for it, and someone else handles the policy. :) Other than that I think it's aiming in the right direction and it does pass all the tests... Please correct me if I'm wrong. Let me mull it over a bit. I think I'm going to commit it, but a second think on it won't hurt. Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
String rationale
'Kay, here's the string background info I promised. If things are missing or unclear let me know and I'll fix it up until it is.

==Cut here with a very sharp knife===

=head1 TITLE

A parrot string backgrounder

=head1 Overview

Strings, in parrot, are compartmentalized, the same way so much else in Parrot is compartmentalized. There's no single 'blessed' string encoding--the closest we come is Unicode, and only as an encoding of last resort. (Unicode's not a good interchange format, as it loses information)

=head2 From the Outside

On the outside, the interpreter considers strings to be a sort of black box. The only bits of the interpreter that much care about the string data are the regex engine parts, and those only operate on fixed-sized data.

The interpreter can only peek inside a string if that string is of fixed length, and the interpreter doesn't actually care about the character set the data is in. All character sets must provide a way to transcode to Unicode, and all character encodings must provide a way to turn their characters into fixed-sized entities. (The size may be 8, 16, or 32 bits as need be for the character set)

Character sets may provide a way to transcode to non-Unicode sets, for example from EBCDIC to ASCII, but this is optional. If none is provided a transcoding from one set to another will use Unicode as an intermediate form, complete with potential data loss.

All character sets must provide the character lists the regular expression engine needs for the base character classes. (space, word, and digit characters) This permits the regular expression code to operate on the contents of a string without needing to know its actual character set.

=head2 From the Inside

=head2 Technical details

The base string structure looks like:

  struct parrot_string {
    void *bufstart;
    INTVAL buflen;
    INTVAL bufused;
    INTVAL flags;
    INTVAL strlen;
    STRING_VTABLE* encoding;
    INTVAL type;
    INTVAL language;
  };

=head2 Fields

=over 4

=item bufstart

Where the string buffer starts

=item buflen

How big the buffer is

=item bufused

How much of the buffer's used

=item flags

A variety of flags. Low 16 bits reserved to Parrot, the rest are free for the string encoding library to use

=item strlen

How long the string is in code points. (Note that, for encodings that are more than 8 bits per code point, or of variable length, this will B<not> be the same as the buffer used.)

=item encoding

Pointer to the library that handles the string encoding. Encoding is basically how the stream of bytes pointed to by C<bufstart> can be turned into a stream of 32-bit codepoints. Examples include UTF-8, Big 5, or Shift JIS. Unicode, Ascii, or EBCDIC are B<not> encodings.first

=item type

What the character set or type of data is encoded in the buffer. This includes things like ASCII, EBCDIC, Unicode, Chinese Traditional, Chinese Simplified, or Shift-JIS. (And yes, I know the latter's a combination of type and encoding. I'll update the doc as soon as I can reasonably separate the two)

=item language

The language the string is in. This is essential for proper sorting, if a sort function wants to be language-aware. Just an encoding/type is insufficient for proper sorting--for example knowing a string is UTF-32/Unicode doesn't tell you how the data should be ordered. This is especially important for those languages that overlap in the Unicode code space. Japanese and Chinese, for example, share many of the Unicode code points but sort those code points differently.

=back

Libraries for processing character sets and encodings are shareable libraries, and may be loaded on demand. They are looked up and referenced by name. An identifying number is given to them at load time and shouldn't be used outside the currently running process. (EBCDIC might be character set 3 in one run and set 7 in another)

The native encoding and character set is I<never> considered a 'real' encoding or character set. It just specifies what the default is if nothing else is specified, but when bytecode is frozen to disk the actual encoding or set name will be used instead.

Dan

--it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
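As an illustration of why strlen and bufused differ for variable-length encodings, here is a small C sketch that counts code points in a UTF-8 buffer. It assumes the buffer holds well-formed UTF-8, and the function name is invented for this example rather than taken from Parrot.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a UTF-8 buffer of `bufused` bytes by counting
 * every byte that is not a continuation byte (10xxxxxx).
 * Assumes well-formed UTF-8; no validation is attempted. */
size_t utf8_codepoints(const unsigned char *bufstart, size_t bufused)
{
    assert(bufstart != NULL || bufused == 0);
    size_t count = 0;
    for (size_t i = 0; i < bufused; i++) {
        if ((bufstart[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For a buffer holding "h", e-acute (two bytes in UTF-8), "l", "l", "o", bufused would be 6 while strlen would be 5.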
Re: String rationale
On Thu, 25 Oct 2001, Dan Sugalski wrote:

The only bits of the interpreter that much care about the string data are the regex engine parts, and those only operate on fixed-sized data.

Care to elaborate? I thought the mandate from Larry was to have regexes compile down to a stream of string ops. Doesn't that mean it should work regardless of the encoding of the string?

The interpreter can only peek inside a string if that string is of fixed length, and the interpreter doesn't actually care about the character set the data is in.

Why is this necessary at all? Wouldn't it be preferable to have all access go through the String vtable regardless of the encoding?

=item encoding Pointer to the library that handles the string encoding. Encoding is basically how the stream of bytes pointed to by C<bufstart> can be turned into a stream of 32-bit codepoints. Examples include UTF-8, Big 5, or Shift JIS. Unicode, Ascii, or EBCDIC are B<not> encodings.first

.first?

Aside from the above, this was a nice refresher.

-sam
Re: String rationale
At 12:19 PM 10/25/2001 -0400, Sam Tregar wrote:

On Thu, 25 Oct 2001, Dan Sugalski wrote:

The only bits of the interpreter that much care about the string data are the regex engine parts, and those only operate on fixed-sized data.

Care to elaborate? I thought the mandate from Larry was to have regexes compile down to a stream of string ops. Doesn't that mean it should work regardless of the encoding of the string?

Since the encoding just determines how the abstract code point numbers are represented in bytes, I'm OK with requiring strings we process internally to be in a fixed-size version.

And regexes will be done with a stream of parrot opcodes, presuming that's not too slow. There'll be ops to reference the code point at position X in a string and check to see if it's in a list of other code points and suchlike things. Basically we'll peek under the covers, but only for fixed-length strings.

The interpreter can only peek inside a string if that string is of fixed length, and the interpreter doesn't actually care about the character set the data is in.

Why is this necessary at all? Wouldn't it be preferable to have all access go through the String vtable regardless of the encoding?

Speed. We're going to take something of a hit decomposing to ops as it is--if we can safely cheat, I'm OK with mandating it to be required. :)

=item encoding Pointer to the library that handles the string encoding. Encoding is basically how the stream of bytes pointed to by C<bufstart> can be turned into a stream of 32-bit codepoints. Examples include UTF-8, Big 5, or Shift JIS. Unicode, Ascii, or EBCDIC are B<not> encodings.first

.first?

Trailing buffer gook.

Dan

--it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
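The two operations Dan mentions (fetch the code point at position X, then test it against a list of code points) might look something like this in C for fixed-width buffers. The function names and the width parameter are assumptions for the sake of the sketch; the buffer is assumed to be suitably aligned for its unit size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fetch the code point at index `pos` from a fixed-width buffer whose
 * units are 8, 16, or 32 bits wide. Because every unit has the same
 * size, this is a plain array index with no decoding loop. */
uint32_t codepoint_at(const void *bufstart, unsigned width_bits, size_t pos)
{
    assert(bufstart != NULL);
    switch (width_bits) {
    case 8:  return ((const uint8_t  *)bufstart)[pos];
    case 16: return ((const uint16_t *)bufstart)[pos];
    case 32: return ((const uint32_t *)bufstart)[pos];
    default:
        assert(0 && "unsupported unit width");
        return 0;
    }
}

/* Return 1 if `cp` appears in the list of code points, 0 otherwise.
 * A character set would supply such lists for space, word, and digit. */
int in_class(uint32_t cp, const uint32_t *members, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (members[i] == cp)
            return 1;
    }
    return 0;
}
```

The constant-time indexing in codepoint_at() is the "cheat" that fixed-size strings make possible; a variable-length encoding would need a scan from the start of the buffer instead.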
Re: String rationale
In message [EMAIL PROTECTED] Dan Sugalski [EMAIL PROTECTED] wrote:

=item type What the character set or type of data is encoded in the buffer. This includes things like ASCII, EBCDIC, Unicode, Chinese Traditional, Chinese Simplified, or Shift-JIS. (And yes, I know the latter's a combination of type and encoding. I'll update the doc as soon as I can reasonably separate the two)

Isn't this going to need to be a vtable pointer like encoding is? Only some things (like character classification and at least some transcoding tasks) will be character set based rather than encoding based.

Other than that it looked quite good and I'll probably start looking at bending the existing code into the new model over the weekend.

Tom

--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
Re: String rationale
At 11:59 PM 10/25/2001 +0100, Tom Hughes wrote:

In message [EMAIL PROTECTED] Dan Sugalski [EMAIL PROTECTED] wrote:

=item type What the character set or type of data is encoded in the buffer. This includes things like ASCII, EBCDIC, Unicode, Chinese Traditional, Chinese Simplified, or Shift-JIS. (And yes, I know the latter's a combination of type and encoding. I'll update the doc as soon as I can reasonably separate the two)

Isn't this going to need to be a vtable pointer like encoding is?

Yup. I'd intended it to be an index into a table of character set functions. Jarkko has convinced me that it's better to have it as a vtable pointer, but I haven't had a chance to update the docs yet.

Dan

--it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk