Philippe Verdy <verd...@wanadoo.fr> wrote:
 [.]
 |- in UTF-8, you'll need to look backward between 1 and 3 positions before
 |your start position to find the leading 8-bit code unit (>= 0xC0).
 |
 |In both cases you have to check the value found. If you don't find it, in
 |the limited range of positions, the input is not valid UTF-8 or UTF-16 and
 |you have to handle an encoding error exception in the input stream.
 |
 |The Unicode standard does not specify how you'll handle this error situation
 |or from where you'll be able to resync the stream, or even if you should
 |resync from some further position; this is application-dependent. If the

«Unicode Security Considerations» [1] gives hints on how defective
byte sequences should or could be handled (in «3.6.1 Illegal Input
Byte Sequences»).  This talks about conversion, but should be
applicable everywhere.
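
For illustration, the backward scan described in the quoted text could
be sketched roughly as follows in Python.  The function name
find_lead_byte and the choice to raise ValueError are assumptions of
this sketch, not anything the standard or the report prescribes.  It
also accepts only lead bytes in the ranges 0xC2..0xDF, 0xE0..0xEF and
0xF0..0xF4, which is stricter than the ">= 0xC0" test quoted above,
and it does not validate the continuation bytes that follow the given
position.

```python
def find_lead_byte(buf: bytes, pos: int) -> int:
    """Return the index of the first byte of the UTF-8 sequence
    that covers buf[pos], looking back at most 3 positions.

    Raises ValueError if no suitable lead byte is found in that range.
    """
    if not 0 <= pos < len(buf):
        raise IndexError("pos out of range")

    start = pos
    # Step back over continuation bytes (bit pattern 10xxxxxx),
    # but never more than 3 positions.
    for _ in range(3):
        if buf[start] & 0xC0 != 0x80:
            break
        if start == 0:
            raise ValueError("continuation byte at start of input")
        start -= 1

    b = buf[start]
    if b < 0x80:
        length = 1                      # ASCII, single byte
    elif 0xC2 <= b <= 0xDF:
        length = 2
    elif 0xE0 <= b <= 0xEF:
        length = 3
    elif 0xF0 <= b <= 0xF4:
        length = 4
    else:
        # Still a continuation byte after 3 steps, or a byte that
        # can never start a sequence (0xC0, 0xC1, 0xF5..0xFF).
        raise ValueError("no valid lead byte within 3 positions")

    # The sequence announced by the lead byte must reach pos.
    if start + length <= pos:
        raise ValueError("stray continuation byte")
    return start
```

With buf = "a€b".encode("utf-8"), find_lead_byte(buf, 3) returns 1,
the index of the 0xE2 lead byte of the euro sign.  What to do once
ValueError is raised (abort, skip, or substitute U+FFFD) remains the
application's decision, as noted above.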

  [1] <http://www.unicode.org/reports/tr36/>

--steffen

