Title: Implementation of the roundtripping (was RE: Roundtripping in Unicode)

Mark Davis wrote:
> I see more of what you are trying to do; let me try to be more clear.
> Suppose that the conversion is defined in the following way, between
> Unicode strings (D29a-d, page 74) and UTFs using your proposed new
> characters, for now with private use code points U+E080..U+E0FF.

U+E080 is the first code point anyone (my implementor included) would choose for anything, and is therefore not very suitable. Also, AFAIK, U+E000..U+EDFF is used for the EUDCs (end-user defined characters) of some MBCS encodings. For the record, my choice was U+EE80..U+EEFF.

But I'll keep the rest of the response in line with your range.


>
> U8-UTF32. To convert a Unicode 8-bit string to UTF-32:
> 1. Set the pointer to the start.
> 2. If the sequence starting at the pointer is a valid UTF-8 sequence
> (checking, of course, to make sure it doesn't go off the end of the
> string), convert it and emit.

With one addition: if the obtained value falls into the range of the escape code points (E080..E0FF), jump to 3. Effectively, escape the escapes. Without this the roundtrip is not achieved; this is an oversight that my implementor also made, as did some other people in this thread. (A sketch follows after step 3 below.)


> 3. Otherwise take the byte B following the pointer, and emit [E000 + B].

Assuming by 'following the pointer' you meant 'at the pointer'.
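
To make the procedure concrete, here is a minimal sketch in Python. The steps and the E080..E0FF range are as above; the function and constant names are mine, and the escape-the-escapes addition shows up as the extra range check after a successful decode.

ESC_LO, ESC_HI = 0xE080, 0xE0FF   # escape code points [E000 + B], B in 80..FF

def u8_to_utf32(data: bytes) -> list[int]:
    """Convert a Unicode 8-bit string to a sequence of code points."""
    out = []
    i = 0                                       # 1. pointer at the start
    while i < len(data):
        cp = width = None
        # 2. Look for a valid UTF-8 sequence at the pointer, taking care
        # not to run off the end of the string. Trying lengths 1..4 and
        # accepting the first strict decode finds the minimal (and hence
        # unique) complete sequence, if there is one.
        for n in (1, 2, 3, 4):
            try:
                cp, width = ord(data[i:i + n].decode('utf-8')), n
                break
            except UnicodeDecodeError:
                continue
        if cp is not None and not (ESC_LO <= cp <= ESC_HI):
            out.append(cp)                      # 2. convert it and emit
            i += width
        else:
            # 3. Also reached when the decoded value falls in E080..E0FF
            # ("escape the escapes"): emit [E000 + B] for the byte at the
            # pointer and advance by one.
            out.append(0xE000 + data[i])
            i += 1
    return out

For example, the input b'\xee\x82\x80' (the UTF-8 form of U+E080 itself) comes out as [0xE0EE, 0xE082, 0xE080], which the reverse conversion turns back into the same three bytes.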


> Of course, one could apply this process between the Unicode bit
> strings and UTFs of other widths. And the same thing applies; one
> direction would roundtrip and the other wouldn't.

Yes. I have analyzed the consequences and the risks involved and concluded that they are either irrelevant or acceptable (or can be dealt with), and I have decided to use this approach. It suits my needs, but I also think it would suit someone else's needs.

After conversion to U8, it is possible to 'validate' the result: convert back and compare with the original. Any sequence of escape code points that does not roundtrip in the UTF-U8-UTF direction can be declared an 'invalid' or 'ill-formed' sequence of code points (in this context, not in the Unicode context). Note that all sequences obtained by U8-UTF conversion are 'valid' in this context (and I think precisely those are). Hence, an 'invalid' sequence can be seen as malicious. Admittedly, an 'invalid' sequence can also result from concatenation, but that does not apply to typical scenarios, at least not to those that need to worry about it.

Such 'validation' could be used where security concerns apply, but it is not required in every security scenario. On the contrary, I think it applies to very few, and only to those that actually perform the conversion themselves. If they simply process such sequences entirely in UTF (or in any combination of UTFs), then they remain as solid as they were.
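
To make the 'validation' concrete, here is the reverse direction and the roundtrip check, continuing the sketch above (utf32_to_u8 and is_valid_here are my names; u8_to_utf32 and the ESC_LO/ESC_HI constants are from the earlier sketch, and the input is assumed to contain only Unicode scalar values):

def utf32_to_u8(cps: list[int]) -> bytes:
    """UTF32-U8: escape code points become the single bytes they stand
    for; everything else is encoded as ordinary UTF-8."""
    out = bytearray()
    for cp in cps:
        if ESC_LO <= cp <= ESC_HI:
            out.append(cp - 0xE000)             # unescape: [E000 + B] -> B
        else:
            out.extend(chr(cp).encode('utf-8'))
    return bytes(out)

def is_valid_here(cps: list[int]) -> bool:
    """'Valid' in this context (not in the Unicode sense): the sequence
    survives the UTF-U8-UTF roundtrip."""
    return u8_to_utf32(utf32_to_u8(cps)) == cps

For instance, [0xE0C3, 0xE0A9] is 'invalid' here: it converts to the bytes C3 A9, which form a valid UTF-8 sequence and come back as the single code point U+00E9 rather than as the original escapes. That is exactly the kind of sequence that U8-UTF conversion can never produce.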


> (I realize that some of this may duplicate what others have said

Not really. There is a lot of confusion even about the algorithm itself and what it achieves. Right from the start I assumed everyone was familiar with UTF-8B and therefore didn't want to start from scratch. But perhaps we should.


Lars
