* Jim Monty wrote: >I mean the sixty-six code point Unicode reserves as noncharacters (e.g., >U+FFFE). > > http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters > >But I'm most keenly interested in a utility that detects broken UTF-16 >surrogate >pairs. By this, I mean a 16-bit code unit in the range from 0xD800 thru 0xDBFF >not immediately followed by a 16-bit code unit in the range from 0xDC00 thru >0xDFFF, and vice versa.
The simple solution to that is a small state machine that you put each byte through, http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ does that for UTF-8. UTF-16 is quite a bit simpler, so to use that approach you can make a map between bytes and character classes (to save space): my @byte2class = ( (map { 0 } 0x00 .. 0xD7), (map { 1 } 0xD8 .. 0xDB), (map { 2 } 0xDC .. 0xDF), (map { 0 } 0xE0 .. 0xFF), ); and then you create a transition table $d[0][0] = 1; # - seen a normal lead byte $d[0][1] = 2; # - seen a high surrogate lead $d[0][2] = ?; # - unexpected low surrogate $d[1][0] = 0; # after a normal lead byte $d[1][1] = 0; # anything goes, so this $d[1][2] = 0; # returns to the start state $d[2][0] = 3; # this is the second byte $d[2][1] = 3; # of a high surrogate which $d[2][2] = 3; # can also be anything $d[3][0] = ?; # - bad normal lead byte $d[3][1] = ?; # - bad high surrogate lead $d[3][2] = 4; # - what we are looking for $d[4][0] = 0; # this is the second byte $d[4][1] = 0; # of a low surrogate; when $d[4][2] = 0; # it's read, loop to start You did not say how you want to resynchronize the stream, so I left some of the entries blank (you could for instance make a state 5, check for that state, emit a replacement character in place of the byte you read, and then let the automaton go from state 5 back to state 0, as one error recovery strategy). You'd then simulate the state machine like so: while (...) { my $byte = ...; $state = $d[ $state ][ $byte2class[ $byte ] ]; # error handling would go here } You could handle other undesired sequences in the same way, but that'd make the automaton a bit hard to write manually. Also, you would have to make two separate automata, one for each of the variants (though you can of course use the same table and just set the start state accordingly). -- Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/