Re: Best practices for replacing UTF-8 overlongs

Doug Ewell Mon, 19 Dec 2016 15:58:57 -0800

Karl Williamson wrote:

> It seems counterintuitive to me that the two byte sequence C0 80
> should be replaced by 2 replacement characters under best practices,
> or that E0 80 80 should also be replaced by 2. Each sequence was legal
> in early Unicode versions,


This is overstated at best. Decoders weren't required to detect overlong
sequences until 2000, but it was never legal to generate them. This was
stated explicitly in RFC 2279 and in Unicode 1.1, Appendix F. Correct
use of the instructions and table in RFC 2044 also precluded the
creation of overlong sequences. 
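
As a rough illustration of the point above (a sketch, not text from any RFC or from the Unicode Standard): an encoder that picks the sequence length from the code point range table can only ever emit the shortest form, while a decoder has to check for overlongs separately. The names encode_shortest, is_overlong and MIN_FOR_LENGTH are hypothetical, and the sketch assumes the modern 4-byte limit rather than the 6-byte forms of RFC 2044; it does not reject surrogates.

```python
# Illustrative sketch only; all names here are hypothetical.

# Smallest code point that legitimately needs each sequence length.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}

# Payload bits and tag bits of the lead byte for each multi-byte length.
LEAD_MASK = {2: 0x1F, 3: 0x0F, 4: 0x07}
LEAD_TAG = {2: 0xC0, 3: 0xE0, 4: 0xF0}


def encode_shortest(cp: int) -> bytes:
    """Encode one code point, choosing the length from the range table."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def is_overlong(seq: bytes) -> bool:
    """True if seq has the bit layout of a 2- to 4-byte sequence but
    carries a value that would have fit in fewer bytes."""
    n = len(seq)
    if n not in LEAD_MASK:
        return False
    # The lead byte must carry the tag bits for an n-byte sequence.
    if (seq[0] & ~LEAD_MASK[n] & 0xFF) != LEAD_TAG[n]:
        return False
    # Every trailing byte must look like 10xxxxxx.
    if any((b & 0xC0) != 0x80 for b in seq[1:]):
        return False
    # Reassemble the payload bits and compare against the minimum.
    cp = seq[0] & LEAD_MASK[n]
    for b in seq[1:]:
        cp = (cp << 6) | (b & 0x3F)
    return cp < MIN_FOR_LENGTH[n]
```

With this sketch, is_overlong(b"\xc0\x80") and is_overlong(b"\xe0\x80\x80") both return True, whereas nothing produced by encode_shortest is ever flagged, because the range test that selects the length is the same test that rules out the overlong forms.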
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
