Re: [NTG-context] Unicode question

Arthur Reutenauer Thu, 12 Mar 2015 09:38:46 -0700

> The luatex code contains the lines (in unistring.w)
> 
>     if (val == 0xFFFD)
>         utf_error();
>     return (val);
> 
> in a function str2uni. I didn't really try to understand the code
> but it looks as if 0xFFFD is used as "invalid marker":


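Interesting.  This is not actually correct: U+FFFD is a valid Unicode 
character, so a genuine U+FFFD in the input cannot be told apart from a 
decoding error.  It would be better to use U+FFFE or U+FFFF, which are 
noncharacters, for that.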

Note that U+FFFD is the recommended character to use when a character can't be 
recognised while converting to Unicode from another encoding, so its presence 
is usually a sign that something went wrong upstream, but I assume Manfred is 
aware of that.
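
To make the distinction concrete, here is a minimal sketch in C of a decoder 
that reports failure through its return value instead of through a magic 
code point, so that a real U+FFFD in the input passes through untouched.  
This is not LuaTeX's actual code; the name decode_utf8 and its signature are 
made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from unistring.w.
   Decode one UTF-8 sequence from s (len bytes available) into *out.
   Returns the number of bytes consumed, or 0 if the input is malformed.
   The error travels in the return value, so U+FFFD stays an ordinary
   character. */
size_t decode_utf8(const unsigned char *s, size_t len, uint32_t *out)
{
    uint32_t cp, min;
    size_t n, i;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {                       /* single byte (ASCII) */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {      /* 110xxxxx: 2 bytes */
        n = 2; cp = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {      /* 1110xxxx: 3 bytes */
        n = 3; cp = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {      /* 11110xxx: 4 bytes */
        n = 4; cp = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;    /* stray continuation byte or 5-/6-byte lead */
    }

    if (len < n)
        return 0;                            /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                        /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < min)
        return 0;                            /* overlong encoding */
    if (cp > 0x10FFFF)
        return 0;                            /* beyond the Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                            /* surrogate code point */

    *out = cp;
    return n;
}
```

With this shape, the bytes EF BF BD decode to U+FFFD and consume 3 bytes, 
while a stray FF byte yields a return value of 0 and leaves *out alone.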

> The comment in the code says
>
> /* the 5- and 6-byte UTF-8 sequences generate integers
>    that are outside of the valid UCS range, and therefore
>    unsupported
>  */

That's correct: the longest valid UTF-8 sequence is 4 bytes, since Unicode 
stops at U+10FFFF.
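
As a rough illustration of the 4-byte limit, the sequence length can be read 
off the lead byte alone.  A 4-byte sequence carries 3 + 3*6 = 21 payload 
bits, but Unicode ends at U+10FFFF, so lead bytes above F4 never start a 
valid sequence; that includes F8..FB and FC..FD, which introduced the old 5- 
and 6-byte forms.  The function name utf8_seq_len is invented for this 
sketch and does not come from the LuaTeX sources.

```c
/* Hypothetical helper: length in bytes of the UTF-8 sequence introduced
   by lead byte b, or 0 if b cannot start a valid sequence. */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;                 /* 00..7F: ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;                 /* C0 and C1 would only give overlongs */
    if (b >= 0xE0 && b <= 0xEF)
        return 3;
    if (b >= 0xF0 && b <= 0xF4)
        return 4;                 /* F5..F7 would exceed U+10FFFF */
    return 0;                     /* continuation bytes, C0, C1, F5..FF */
}
```

The lead byte only bounds the length; the continuation bytes still have to 
be checked before the sequence can be accepted.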

Best,

Arthur

___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the 
Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________
