On Wed, Sep 13, 2000 at 01:33:33AM +0100, Markus Kuhn wrote:
> Jarkko Hietaniemi wrote on 2000-09-12 23:42 UTC:
> > '7' UTF-7
> > '8' UTF-8
> > '16be' UTF-16 big-endian
> > '16le' UTF-16 little-endian
> > '16ne' UTF-16 native-endian
> > '32be' UTF-32 big-endian
> > '32le' UTF-32 little-endian
> > '32ne' UTF-32 native-endian
>
> I would prefer:
>
> '7' UTF-7
> '8' UTF-8
> '16be' UTF-16 big-endian
> '16le' UTF-16 little-endian
> ! '16' UTF-16 native-endian
> '32be' UTF-32 big-endian
> '32le' UTF-32 little-endian
> ! '32' UTF-32 native-endian
>
> No need to introduce new acronyms and terms such as "ne".
True.
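
To make the shorter naming concrete, here is a minimal sketch of how such
labels could be resolved. It is written in Python purely for illustration;
the table `_FIXED_LABELS` and the function `codec_for_label` are hypothetical
names, not part of any proposal, and the plain '16' and '32' labels are
assumed to mean "whatever byte order the host uses".

```python
import sys

# Hypothetical table: labels with an explicit (or irrelevant) byte order,
# mapped to Python's standard codec names.
_FIXED_LABELS = {
    "7": "utf-7",
    "8": "utf-8",
    "16be": "utf-16-be",
    "16le": "utf-16-le",
    "32be": "utf-32-be",
    "32le": "utf-32-le",
}


def codec_for_label(label: str) -> str:
    """Resolve a short label to a codec name.

    The bare labels '16' and '32' are treated as native-endian, so no
    separate 'ne' suffix is needed.
    """
    if label in _FIXED_LABELS:
        return _FIXED_LABELS[label]
    if label in ("16", "32"):
        # sys.byteorder is either 'big' or 'little'.
        suffix = "be" if sys.byteorder == "big" else "le"
        return f"utf-{label}-{suffix}"
    raise ValueError(f"unknown encoding label: {label!r}")
```

With this sketch, `codec_for_label("16")` yields "utf-16-le" on a
little-endian host and "utf-16-be" on a big-endian one, and the old
"16ne" spelling is simply rejected.
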
> > =head2 Handling Malformed Data
>
> What exactly is malformed UTF-8 data here?
>
> Obviously, it covers at least everything listed in section R.7 of
> ISO 10646-1/Amd.2.
>
> Does it also cover overlong UTF-8 sequences, i.e. any string
> containing any of the following five bit patterns?
>
> 1100000x,
> 11100000 100xxxxx,
> 11110000 1000xxxx,
> 11111000 10000xxx,
> 11111100 100000xx
>
> Does it also cover UTF-8 encoded code positions U+D800 to U+DFFF (UTF-16
> surrogates) as well as U+FFFE (anti-BOM) and U+FFFF, all of which must
> not occur in proper UTF-8 and UTF-32 data according to the standard
> (see note 3 in section R.4 of UCS)?
>
> It would be useful if the spec were clearer here.
Thanks for the info.
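
For reference, the two checks asked about above could be sketched roughly
as follows. This is Python for illustration only, and all three function
names are hypothetical. It encodes the rules exactly as quoted in this
message (the five overlong bit patterns, plus U+D800..U+DFFF, U+FFFE and
U+FFFF), and makes no claim about what the spec should finally say.

```python
def has_overlong_prefix(seq: bytes) -> bool:
    """True if seq starts with one of the five overlong bit patterns."""
    if not seq:
        return False
    b0 = seq[0]
    if b0 & 0xFE == 0xC0:                      # 1100000x
        return True
    if len(seq) < 2:
        return False
    b1 = seq[1]
    return (
        (b0 == 0xE0 and b1 & 0xE0 == 0x80)     # 11100000 100xxxxx
        or (b0 == 0xF0 and b1 & 0xF0 == 0x80)  # 11110000 1000xxxx
        or (b0 == 0xF8 and b1 & 0xF8 == 0x80)  # 11111000 10000xxx
        or (b0 == 0xFC and b1 & 0xFC == 0x80)  # 11111100 100000xx
    )


def contains_overlong(data: bytes) -> bool:
    """True if any offset in data starts an overlong pattern.

    Checking every offset is safe: the first byte of each pattern is
    never a continuation byte (10xxxxxx), so it cannot match mid-sequence.
    """
    return any(has_overlong_prefix(data[i:i + 2]) for i in range(len(data)))


def contains_forbidden_code_point(data: bytes) -> bool:
    """True if data encodes a surrogate, U+FFFE or U+FFFF.

    'surrogatepass' lets the decoder hand back surrogates so they can be
    inspected; other malformed input still raises UnicodeDecodeError.
    """
    text = data.decode("utf-8", "surrogatepass")
    return any(
        0xD800 <= ord(ch) <= 0xDFFF or ord(ch) in (0xFFFE, 0xFFFF)
        for ch in text
    )
```

For example, the two-byte sequence C0 80 (an overlong NUL) is caught by
`contains_overlong`, while ED A0 80 (UTF-8 encoded U+D800) and EF BF BE
(U+FFFE) are caught by `contains_forbidden_code_point`.
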
> References:
>
> - ISO/IEC 10646-1:1993(E), Amd. 2,
> http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
>
> - http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
>
> Markus
>
> --
> Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
> Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen