On Wed, 17 Oct 2001, Gregor N. Purdy wrote:
> Its still likely that I'm misunderstanding the intent, but I think
> that a .pbc file created by me with LANG=C is not necessarily going
> to generate string constants that have the same meaning when you go
> to run it on your platform of choice, which sounds bad to me.

It occurs to me that this applies to UTF-16 and UTF-32 too; they aren't unambiguous unless they start with a BOM (byte-order mark) or have their byte order specified some other way. Mayhap we should have utf8 (which is unambiguous, though apparently lossy for some Asian languages), and the string encodings utf16 and utf16bom, where utf16 is in the _running_ interpreter's byte order, and utf16bom always begins with a byte-order mark. (And so on for utf32.)
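To make the utf16bom-to-utf16 conversion concrete, here's a minimal sketch in Python rather than Parrot's C internals; the function name and the "valid BOM is always present" assumption are mine, not anything in the bytecode format:

```python
import sys

def utf16bom_to_native(data: bytes) -> bytes:
    """Convert utf16bom data (BOM + code units in the writer's byte
    order) to BOM-less utf16 in the running machine's byte order.
    Assumes the first two bytes really are a valid UTF-16 BOM."""
    native_bom = b'\xff\xfe' if sys.byteorder == 'little' else b'\xfe\xff'
    bom, body = data[:2], data[2:]
    if bom == native_bom:
        return body  # already in our byte order: just chop off the BOM
    # Foreign byte order: swap the two bytes of every 16-bit code unit.
    swapped = bytearray(body)
    swapped[0::2], swapped[1::2] = body[1::2], body[0::2]
    return bytes(swapped)

# Round trip: Python's 'utf-16' codec writes a BOM plus native-order
# code units, i.e. exactly the utf16bom layout described above.
stored = 'hi'.encode('utf-16')
native = utf16bom_to_native(stored)
native_codec = 'utf-16-le' if sys.byteorder == 'little' else 'utf-16-be'
assert native.decode(native_codec) == 'hi'
```

The same shape works for utf32 with a 4-byte BOM and 4-byte code units.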
Strings would always be /stored/ in utf16bom, and thus be machine-independent, but upon loading them you can always convert to utf16 trivially: chop off the first 2 bytes for UTF-16 (or the first 4 for UTF-32), and byteswap the rest if the BOM says the data was written in the other byte order. (Encodings without the BOM are easier to deal with because you don't have to worry about different byte orders -- all numbers have the machine's native byte order.)

I'd love to see the same thing happen with "native" encodings, but there seems to be no lossless encoding we can use.

Hm. Perhaps we need to separate "encoding" from "type": encoding specifies how the hunk o' storage encodes the codepoints, and type specifies which codepoints map to which characters. Thus, we could have BIG-5, utf8-encoded data. This is machine-independent, and also lossless for things originally coded in BIG5. (Of course, this could also easily have n^2 problems.)

-=- James Mastros
--
Put bin Laden out like a bad cigar: http://www.fieler.com/terror
"You know what happens when you bomb Afghanistan? Thats right, you knock over the rubble." -=- SLM