How did the word “prohibited” enter this conversation?

Peter

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe Verdy
Sent: June 3, 2014 11:54 PM
To: Richard Wordingham
Cc: unicode@unicode.org
Subject: Re: UTF-16 Encoding Scheme and U+FFFE

U+FFFE is prohibited in interchange because, if the interchange specifies the UTF-16
encoding scheme (not UTF-16BE or UTF-16LE), it would be interpreted as a BOM where it
occurs at the start of a stream (with the consequence of being reparsed as U+FEFF
with the bytes swapped). In all other positions it cannot be a BOM.
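A minimal sketch of that byte-swap effect, using Python's built-in codecs (the sample
strings are illustrative only):

    # U+FFFE serialized in big-endian order yields the bytes FF FE, which is
    # exactly what a little-endian BOM (U+FEFF) looks like on the wire.
    data = "\uFFFE".encode("utf-16-be")              # b'\xff\xfe'

    # The byte-order-detecting "utf-16" codec consumes it as a BOM...
    print(repr(data.decode("utf-16")))               # '' (read as an LE BOM, dropped)

    # ...and decodes the following bytes with the swapped byte order.
    print((data + "A".encode("utf-16-le")).decode("utf-16"))    # 'A'

    # With an explicit byte order there is no BOM logic, so U+FFFE survives.
    print(repr(data.decode("utf-16-be")))            # '\ufffe'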

BOMs are *normally* only authorized in interchange at the "start" of streams.

But this is a problem for "live" streams that have no defined "start" but can
be synced at random positions (such as at the next newline, or at the start of a
network datagram; note that some network layers may fragment datagrams, so that the
BOM gets repeated, and also reunite them, leaving multiple BOMs in the same
datagram). So we can assume that U+FFFE anywhere in a UTF-16 "live" stream (not a
UTF-16BE or UTF-16LE stream) is each time a byte-swapped BOM, and not a legacy
ZWNBSP or a non-character.
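A tolerant reader for such a stream might look like this sketch (Python; the helper
name and the fixed byte order are my assumptions, not part of any standard API):

    def resync_utf16(chunk: bytes, byteorder: str = "le") -> str:
        """Decode a mid-stream chunk of a 'live' UTF-16 stream, tolerating
        any number of repeated BOMs left behind by fragmentation/reassembly.
        A real protocol would have detected byteorder from the first BOM."""
        text = chunk.decode("utf-16-" + byteorder)
        # Treat every U+FEFF as a redundant BOM, not as ZWNBSP content.
        return text.replace("\uFEFF", "")

    # Two datagrams reunited by a lower layer, each carrying its own BOM:
    payload = "\uFEFFhello ".encode("utf-16-le") + "\uFEFFworld".encode("utf-16-le")
    print(resync_utf16(payload))   # 'hello world'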

Streams that are known to be UTF-16BE or UTF-16LE are also not recommended for
interchange if these files or live streams may be transmitted without metadata
specifying their encoding explicitly (since many remote readers will interpret them
instead as UTF-16, possibly with multiple BOMs in resynchronizable live streams).

The problem of live streams is also a good reason why ZWNBSP (U+FEFF) has been
strongly discouraged in interchange in favor of the word joiner (U+2060). This also
applies to all other conforming UTFs (including UTF-8, UTF-16BE, UTF-16LE,
UTF-32, UTF-32LE, UTF-32BE), where it is strongly recommended not to use U+FEFF
and U+FFFE except as BOMs (possibly repeated in live streams).
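For legacy text that used U+FEFF as ZWNBSP, the migration is mechanical; a
hypothetical filter (Python, names are my own):

    def zwnbsp_to_word_joiner(text: str) -> str:
        """Keep at most one leading U+FEFF as a BOM; rewrite every
        interior U+FEFF (legacy ZWNBSP) as U+2060 WORD JOINER."""
        bom = ""
        if text.startswith("\uFEFF"):
            bom, text = "\uFEFF", text[1:]
        return bom + text.replace("\uFEFF", "\u2060")

    print(ascii(zwnbsp_to_word_joiner("\uFEFFfoo\uFEFFbar")))   # '\ufefffoo\u2060bar'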

You should note that conforming processes working in interchange (or storage)
should always be allowed to switch from one standard UTF to
another. And the same encoded streams may be consumed by various clients having
different native byte orders. It has now become difficult to define what a "local"
system is, when applications are converted to work in a cloud with more and more
heterogeneous clients and more intermediate third parties (providing things
like caching, archiving, proxying, backup of data and restoration on another
system...).

For long-term reusability of data, we are strongly encouraged not to use U+FFFE
and U+FEFF except as BOMs, and we should be tolerant about the number of BOMs
found (and in my opinion, UCA implementations should discard them on
input, treating them as fully ignorable, except that they delimit combining
sequences for the purpose of normalization, which conforming applications or
intermediate filters should be allowed to perform as they want). And we should
absolutely forget the legacy semantics of ZWNBSP.
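In that spirit, a pre-collation filter could be as simple as this sketch (Python; a
real UCA implementation would instead assign these code points zero collation
weights rather than rewriting the text):

    IGNORABLE = {"\uFEFF", "\uFFFE"}

    def pre_collation_filter(text: str) -> str:
        """Drop stray BOMs/noncharacters before handing text to a
        collator, treating them as fully ignorable."""
        return "".join(ch for ch in text if ch not in IGNORABLE)

    assert pre_collation_filter("a\uFEFFb\uFFFEc") == "abc"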

But this complexity and tolerance for one or more BOMs also means that all UTFs
not based on 8-bit code units should also be discouraged in interchange. This
means that UTF-16 and UTF-32 should be discouraged, leaving UTF-16BE,
UTF-16LE, or UTF-32BE not for storage or networking, but for temporary streams
in memory used by the "black box" internally implementing each conforming process.
For all the rest, most applications now use UTF-8, possibly packaged within a
generic compressed stream (binary compression of live streams remains possible,
even if you cannot predict, in the text encoding, where the resynchronization
points will occur: it's up to the protocol using this transport compression to
properly define the resynchronization points).
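For instance, the common UTF-8-inside-a-generic-compressor arrangement (a trivial
Python sketch; gzip stands in for whatever compressor the transport actually uses):

    import gzip

    # Text is encoded as plain UTF-8, then wrapped in a generic binary
    # compressor; the text layer never needs a BOM or byte-order logic.
    blob = gzip.compress("déjà vu\n".encode("utf-8"))
    print(gzip.decompress(blob).decode("utf-8"), end="")   # 'déjà vu'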

In UTF-8 streams we can completely omit U+FFFE and U+FEFF, whether as BOMs, ZWNBSP,
or non-characters (and we can also expect that many applications will just
discard them silently, as they only have a "no-op" role as BOMs in 8-bit
streams). If an application outputs an 8-bit stream that is not UTF-8, it will
drop all U+FEFF and U+FFFE found in its input, and will often output its own
encoding of U+FEFF at the start of the non-UTF-8 stream it generates, frequently as
a "magic" signature of that encoding. Secure digital signatures of text streams
should also ignore these code points silently, as they won't be relevant elsewhere
in the chain of producers or consumers of the data (these digital
signatures should be computed by dropping these discardable U+FEFF and U+FFFE,
normalizing the data, for example to NFC or NFD, and producing a specific UTF;
the easiest way to avoid complications is to use UTF-32BE or UTF-32LE with
a predetermined byte order, as specified by the digital signature algorithm).
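A sketch of that signing pipeline (Python; SHA-256 stands in for whatever digest the
signature scheme actually specifies):

    import hashlib
    import unicodedata

    def text_signature_digest(text: str) -> bytes:
        """Drop the discardable code points, normalize, then hash a
        fixed-byte-order UTF, as described above."""
        cleaned = text.replace("\uFEFF", "").replace("\uFFFE", "")
        normalized = unicodedata.normalize("NFC", cleaned)
        return hashlib.sha256(normalized.encode("utf-32-be")).digest()

    # Streams differing only by BOMs or normalization form digest identically:
    assert text_signature_digest("\uFEFFcafé") == text_signature_digest("cafe\u0301")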

Additionally, it is very easy to use as many U+FEFF code units as needed as
ignorable extra BOMs, for cases where a protocol needs a safe "padding filler"
if it wants to use fixed-size block I/O with random access and easy
resynchronization (in live streams), provided the producer safely breaks data
blocks at the boundaries of combining sequences (allowing these blocks to be
normalized separately and reunited later without creating problems).
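A padding-filler sketch under those assumptions (Python; BLOCK_CHARS and the helper
are hypothetical, and the example sticks to BMP text so one character is one code
unit):

    BLOCK_CHARS = 16   # assumed fixed block size, in UTF-16 code units

    def pad_block(text: str) -> str:
        """Pad a block to a fixed size with U+FEFF fillers, which a
        tolerant reader strips as extra BOMs. The producer must still
        avoid splitting combining sequences across blocks."""
        assert len(text) <= BLOCK_CHARS
        return text + "\uFEFF" * (BLOCK_CHARS - len(text))

    block = pad_block("hello").encode("utf-16-le")
    print(len(block))                                   # 32 bytes = 16 code units
    print(block.decode("utf-16-le").rstrip("\uFEFF"))   # 'hello'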


2014-06-04 1:50 GMT+02:00 Richard Wordingham <richard.wording...@ntlworld.com>:
On Tue, 3 Jun 2014 21:28:05 +0000
Peter Constable <peter...@microsoft.com> wrote:

> There's never been anything preventing a file from containing and
> beginning with U+FFFE. It's just not a very useful thing to do, hence
> not very likely.
Well, while U+FFFE was apparently prohibited from public interchange,
one could be very confident of not finding it in an external file.  In
an internally generated file, it would then be much more likely to be
in the UTF-16BE or UTF-16LE encoding scheme.

Richard.
