RE: Roundtripping Solved

Lars Kristan Wed, 15 Dec 2004 10:41:39 -0800

Arcane Jill wrote:

> solution, again without breaking the Unicode model. If I have

> It is for reasons of requirement (4) that Lars proposes the
> introduction of 128 BMP codepoints. His intention is that they be
> marked as "reserved - do not use", so that requirement 4 is met.

Actually, Jill, they are not reserved. No more than U+0041 is reserved.
They are simply dedicated for a particular use. Which is not true for my PUA solution.

And my solution does not break the Unicode model. The proposal would break the Unicode model only if my conversion were to replace the now-standard conversion. I can even show that the consequences of that would be no more serious than the filesystem problem I am solving. But at this point, I am not proposing that. I am proposing merely that these codepoints be assigned.

Breaking the model is not why UTC is refusing to consider this proposal. A couple of possible reasons:

* UTC feel that allowing (well, encouraging) a new way of handling invalid sequences might slow down the transition.
* UTC feel that allowing (well, encouraging) a new way of handling invalid sequences might lead to late detection of mislabelled data.

* UTC feel that the problem in question has nothing to do with Unicode.
* UTC feel that by stating filenames are binary data, they have solved the problem, ignoring the cost they may be causing.

* UTC should have realized the need for these codepoints years ago, but now prefer to stick with the original decision.

As for your solution, I didn't really analyze it. But it is escaping, isn't it? With a lot of overhead. Filesystems have limitations, say up to 255 characters for a filename. Representing a filename 255 (Unicode) characters long from Windows on UNIX (in UTF-8) is not always possible, and there is not much we can do about that. But representing a filename 255 characters (chars) long from UNIX on a Windows system? Currently always possible. An escaping technique with a lot of overhead breaks that.

Hence my pleas to consider assigning the 128 codepoints in the BMP, because otherwise an invalid sequence consisting of a single Latin 1 character maps to 2 UTF-16 shorts. And if filesystem limitations can be seen as a somewhat unnecessary goal, there is transmission overhead and one other thing: in C, you can guess (for performance reasons) the maximum amount of memory you need for a certain conversion. The multipliers are typically around 2 (bytes per byte). Even a plane other than the BMP raises that to 4; other escaping techniques are far worse.
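
To put rough numbers on the length argument, here is a minimal sketch in Python. It is only an illustration, not the proposed conversion itself: the two base values are hypothetical private-use stand-ins (the 128 codepoints under discussion are not assigned), and the handler names `escape_bmp` and `escape_supp` are made up for this example. Each byte that is invalid as UTF-8 is mapped to one code point, and the result is measured in UTF-16 code units.

```python
import codecs

# Hypothetical bases, chosen only for illustration: a BMP private-use
# range and a plane-15 private-use range. Neither is an assigned block.
BMP_BASE = 0xE000
SUPP_BASE = 0xF0000

def make_handler(base):
    """Build a decode error handler that maps each invalid byte to base + byte."""
    def handler(err):
        if not isinstance(err, UnicodeDecodeError):
            raise err
        bad = err.object[err.start:err.end]
        # One code point per offending byte; resume decoding after them.
        return "".join(chr(base + b) for b in bad), err.end
    return handler

codecs.register_error("escape_bmp", make_handler(BMP_BASE))
codecs.register_error("escape_supp", make_handler(SUPP_BASE))

def utf16_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

# A 255-byte UNIX filename made of Latin 1 'e acute' bytes, each of which
# is an invalid sequence when the name is read as UTF-8.
name = b"\xe9" * 255

bmp_units = utf16_units(name.decode("utf-8", "escape_bmp"))
supp_units = utf16_units(name.decode("utf-8", "escape_supp"))

print(bmp_units, supp_units)  # prints "255 510"
```

With a BMP mapping the worst-case name still fits in 255 UTF-16 units (2 bytes per input byte). With a mapping outside the BMP, every invalid byte becomes a surrogate pair, so the same name needs 510 units (4 bytes per input byte).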

Lars
