Re: Utility to report and repair broken surrogate pairs in UTF-16 text

Bjoern Hoehrmann Wed, 03 Nov 2010 17:48:00 -0700

* Jim Monty wrote:
>I mean the sixty-six code point Unicode reserves as noncharacters (e.g., 
>U+FFFE).
>
>    http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
>
>But I'm most keenly interested in a utility that detects broken UTF-16 
>surrogate 
>pairs. By this, I mean a 16-bit code unit in the range from 0xD800 thru 0xDBFF 
>not immediately followed by a 16-bit code unit in the range from 0xDC00 thru 
>0xDFFF, and vice versa.


The simple solution to that is a small state machine that you put each
byte through, http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ does that
for UTF-8. UTF-16 is quite a bit simpler, so to use that approach you
can make a map between bytes and character classes (to save space):

  my @byte2class = (
    (map { 0 } 0x00 .. 0xD7),
    (map { 1 } 0xD8 .. 0xDB),
    (map { 2 } 0xDC .. 0xDF),
    (map { 0 } 0xE0 .. 0xFF),
  );

and then you create a transition table

  $d[0][0] = 1; # - seen a normal lead byte
  $d[0][1] = 2; # - seen a high surrogate lead
  $d[0][2] = ?; # - unexpected low surrogate
  
  $d[1][0] = 0; # after a normal lead byte
  $d[1][1] = 0; # anything goes, so this 
  $d[1][2] = 0; # returns to the start state
  
  $d[2][0] = 3; # this is the second byte
  $d[2][1] = 3; # of a high surrogate which
  $d[2][2] = 3; # can also be anything
  
  $d[3][0] = ?; # - bad normal lead byte
  $d[3][1] = ?; # - bad high surrogate lead
  $d[3][2] = 4; # - what we are looking for
  
  $d[4][0] = 0; # this is the second byte
  $d[4][1] = 0; # of a low surrogate; when
  $d[4][2] = 0; # it's read, loop to start

You did not say how you want to resynchronize the stream, so I left some
of the entries blank (you could for instance make a state 5, check for
that state, emit a replacement character in place of the byte you read,
and then let the automaton go from state 5 back to state 0, as one error
recovery strategy). You'd then simulate the state machine like so:

  while (...) {
    my $byte = ...;
    $state = $d[ $state ][ $byte2class[ $byte ] ];
    # error handling would go here
  }

You could handle other undesired sequences in the same way, but that'd
make the automaton a bit hard to write manually. Also, you would have to
make two separate automata, one for each of the variants (though you can
of course use the same table and just set the start state accordingly).
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

Reply via email to