On Wed, Jul 16, 2008 at 1:13 AM, Moritz Lenz
<[EMAIL PROTECTED]> wrote:
> NotFound wrote:
>>> * Unicode isn't necessarily universal, or might stop being so in the future.
>>> If a character is not representable in Unicode, and you chose to use
>>> Unicode for everything, you're screwed
>> There are provisions for private-use codepoints.
> If we use them in parrot, we can't use them in HLLs, right? Do we really
> want that?

I don't understand that point. An HLL can use any codepoint it wants, no
matter whether there is a glyph for it in any available font. How it is
written in the source is not important to parrot; you just need to emit
valid pir, create a valid pbc, or whatever.
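
To illustrate (a Python sketch, not Parrot code; U+E000 from the Private
Use Area is only an example codepoint): once a private-use codepoint is in
a text string it behaves like any other codepoint.

    # Python sketch (not PIR): a private-use codepoint is just a codepoint.
    s = "\ue000" + "abc"         # U+E000 is in the Private Use Area
    print(len(s))                # 4 code points, glyph or no glyph
    print(s.encode("utf-8"))     # encodes fine: b'\xee\x80\x80abc'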

>>> * related to the previous point, some other character encodings might
>>> not have a lossless round-trip conversion.
>> Do we need that? The intention is that strings are stored in whatever
>> format is wanted and not recoded without a good reason.
> But if you can't work with non-Unicode text strings, you have to convert
> them, and in the process you possibly lose information. That's why we
> want to enable text strings with non-Unicode semantics.

But the point is precisely that we don't need to treat any text as non-Unicode.

>>> Introducing the "no character set" character set is just a special case
>>> of arbitrary character sets. I see no point in using the special case
>>> over the generic one.
>> Because it is special, and we need to deal with its specialness in any
>> case. Just concatenating it with any other string is plain wrong. Just
>> treating it as iso-8859-1 is not taking it as plain binary at all.
> Just as it is plain wrong to concatenate strings in two
> non-compatible character sets (unless you store the strings as trees,

Yes, and because of that the approach of considering Unicode the only
character set is simpler. That way, concatenating any pair of text
strings as text has no problem other than deciding the destination
encoding.
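
A rough Python sketch of what I mean, with made-up sample strings and
Python's str standing in for a Parrot text string:

    # Both operands already have Unicode semantics, so concatenation needs
    # no charset negotiation; only the final encoding is a choice.
    a = bytes([0xE9]).decode("iso-8859-1")   # "é" that arrived as latin-1
    b = "\u00fcber"                          # "über" from a utf-8 source
    result = a + b                           # plain codepoint concatenation
    print(result.encode("utf-8"))            # pick the destination encoding here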

>> But the main point is that the encoding issue is complicated enough
>> even inside Unicode, and adding another layer of complexity will make
>> it worse.
> I think that distinguishing incompatible character sets is no harder
> than distinguishing text and binary strings. It's not another layer,
> it's just a layer used in a more general way.

And what would that way be? In the current implementation we have the
ascii, iso-8859-1, and unicode charsets (not counting binary). Add another
charset, and we need conversions to and from all of these. Add yet
another, and the conversions sum and multiply.
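
Rough arithmetic for what "sum and multiply" means here, just counting
converters and nothing Parrot-specific:

    # n charsets with direct pairwise converters need n*(n-1) conversions;
    # with Unicode as the single pivot you only ever need 2*n.
    for n in (3, 4, 5, 10):
        print(n, "charsets:", n * (n - 1), "pairwise vs", 2 * n, "via Unicode")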

With the Unicode-and-encodings approach, adding any charset of 8 bits or
fewer, treated as a Unicode encoding, is just a matter of adding a table
of its 256 corresponding codepoints.
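
A minimal Python sketch of such a table, using a made-up 8-bit charset
rather than any real legacy one:

    # The whole "charset" is one 256-entry table of codepoints.  The mapping
    # below (identity for ASCII, high half sent to the U+0400 block) is
    # purely illustrative, not a real charset.
    TO_CODEPOINT = [i if i < 0x80 else 0x0400 + (i - 0x80) for i in range(256)]

    def decode_legacy(data: bytes) -> str:
        return "".join(chr(TO_CODEPOINT[b]) for b in data)

    print(decode_legacy(b"abc\xc1"))   # 'abc' followed by U+0441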

-- 
Salu2
