On Wed, 17 Oct 2001, Gregor N. Purdy wrote:
> Its still likely that I'm misunderstanding the intent, but I think
> that a .pbc file created by me with LANG=C is not necessarily going
> to generate string constants that have the same meaning when you go
> to run it on your platform of choice, which sounds bad to me.

It occurs to me that this applies to UTF-16 and UTF-32 too; they aren't unambiguous unless they start with a BOM (byte-order mark) or have their byte order specified some other way. Mayhap we should have utf8 (which is unambiguous, though apparently lossy for some Asian languages), and the string encodings utf16 and utf16bom, where utf16 is in the _running_ interpreter's byte order, and utf16bom always begins with a byte-order mark. (And so on for utf32.)
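To make the utf16bom-to-utf16 conversion concrete, here's a minimal sketch in Python rather than Parrot's C internals; the function name and the "valid BOM is always present" assumption are mine, not anything in the bytecode format:

```python
import sys

def utf16bom_to_native(data: bytes) -> bytes:
    """Convert utf16bom data (BOM + code units in the writer's byte
    order) to BOM-less utf16 in the running machine's byte order.
    Assumes the first two bytes really are a valid UTF-16 BOM."""
    native_bom = b'\xff\xfe' if sys.byteorder == 'little' else b'\xfe\xff'
    bom, body = data[:2], data[2:]
    if bom == native_bom:
        return body  # already in our byte order: just chop off the BOM
    # Foreign byte order: swap the two bytes of every 16-bit code unit.
    swapped = bytearray(body)
    swapped[0::2], swapped[1::2] = body[1::2], body[0::2]
    return bytes(swapped)

# Round trip: Python's 'utf-16' codec writes a BOM plus native-order
# code units, i.e. exactly the utf16bom layout described above.
stored = 'hi'.encode('utf-16')
native = utf16bom_to_native(stored)
native_codec = 'utf-16-le' if sys.byteorder == 'little' else 'utf-16-be'
assert native.decode(native_codec) == 'hi'
```

The same shape works for utf32 with a 4-byte BOM and 4-byte code units.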
Strings would always be /stored/ in utf16bom, and thus be machine-independent, but upon loading them you can always convert to utf16 trivially: chop off the first 2 bytes for UTF-16 (or the first 4 for UTF-32), and byteswap the rest if the BOM says the data was written in the other byte order. (Encodings without the BOM are easier to deal with because you don't have to worry about different byte orders -- all numbers have the machine's native byte order.)

I'd love to see the same thing happen with "native" encodings, but there seems to be no lossless encoding we can use.

Hm. Perhaps we need to separate "encoding" from "type": encoding specifies how the hunk o' storage encodes the codepoints, and type specifies which codepoints map to which characters. Thus, we could have BIG-5, utf8-encoded data. This is machine-independent, and also lossless for things originally coded in BIG5. (Of course, this could also easily have n^2 problems.)

-=- James Mastros
--
Put bin Laden out like a bad cigar: http://www.fieler.com/terror
"You know what happens when you bomb Afghanistan? Thats right, you knock over the rubble." -=- SLM