Re: Nicest UTF

Philippe Verdy Sun, 05 Dec 2004 10:14:03 -0800

From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>

"Philippe Verdy" <[EMAIL PROTECTED]> writes:
>> The point is that indexing should better be O(1).
> SCSU is also O(1) in terms of indexing complexity...
It is not. You can't extract the nth code point without scanning the
previous n-1 code points.

The question is why you would need to extract the nth code point so blindly. If you have a reason to, because you know the context in which this index is valid and usable, then you can just as well use that same knowledge to extract a sequence using an index into the SCSU encoding itself.

Linguistically, extracting substrings or characters at an arbitrary index in a sequence of code points will only cause you problems. In general, you are more likely to use an index to mark a known position that you have already reached by parsing sequentially.

However, it is true that if you have determined a good index position to allow future extraction of substrings, SCSU will be more complex, because you need to remember not only the index but also the state of the SCSU decoder at that position in order to decode the characters encoded from that index onward. This is not needed for the UTFs, for most legacy character encodings and national standards, or for GB18030, which looks like a valid UTF even though it is not part of the Unicode standard itself.
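To make the point about decoder state concrete, here is a rough sketch in Python. It is not a real SCSU decoder: the function decode_subset and the test data are my own illustration, and it handles only single-byte mode with the SC0..SC7 window-selection tags and the default dynamic window offsets from UTS #6. It is only meant to show that a bookmark into SCSU data has to carry the active window along with the byte offset, whereas a byte offset alone is enough for UTF-8.

```python
# Default dynamic window offsets from UTS #6 (SCSU).
# Window redefinition (SDn), quoting (SQn) and Unicode mode are NOT modeled.
DEFAULT_WINDOWS = (0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00)


def decode_subset(data, start=0, active=0):
    """Decode a tiny SCSU subset starting at byte offset `start`.

    `active` is the dynamic window assumed to be selected at `start`.
    Returns (text, active_window_at_end). Hypothetical helper, for
    illustration only; unsupported tag bytes raise ValueError.
    """
    out = []
    for b in data[start:]:
        if 0x10 <= b <= 0x17:
            # SCn tag: select dynamic window n for the bytes that follow.
            active = b - 0x10
        elif b >= 0x80:
            # Byte addressed relative to the currently active window.
            out.append(chr(DEFAULT_WINDOWS[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            # ASCII passes through unchanged in single-byte mode.
            out.append(chr(b))
        else:
            raise ValueError(f"tag byte 0x{b:02X} not handled in this sketch")
    return "".join(out), active


# "Москва" in SCSU: SC2 selects the Cyrillic window, then one byte per letter.
SCSU_MOSKVA = bytes([0x12, 0x9C, 0xBE, 0xC1, 0xBA, 0xB2, 0xB0])
UTF8_MOSKVA = "Москва".encode("utf-8")

# A bookmark at the start of "ква" needs (offset, active window) for SCSU...
scsu_bookmark = (4, 2)
tail_with_state, _ = decode_subset(SCSU_MOSKVA, *scsu_bookmark)

# ...because the same offset decoded with the initial state gives other characters.
tail_without_state, _ = decode_subset(SCSU_MOSKVA, start=4)

# For UTF-8 the byte offset alone is a sufficient bookmark.
utf8_tail = UTF8_MOSKVA[6:].decode("utf-8")

print(tail_with_state)     # ква
print(utf8_tail)           # ква
print(tail_with_state == tail_without_state)  # False
```

In a full decoder the saved state would be larger (the eight window offsets, the active window, and whether Unicode mode is in effect), but the principle is the same: the index is only usable together with the state that was current when it was recorded.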

But remember the context in which this discussion was introduced: which UTF would be best for representing (and storing) large sets of immutable strings. The discussion about indexes into substrings is not relevant in that context.
