[scintilla] UTF-8 to UCS conversion and other encoding utilities

Reece Dunn Sat, 14 Jan 2006 17:14:15 -0800

The UTF8 <==> UCS conversion utilities in scintilla/src/UniConversion.h would be useful to the outside world. For example, in my application, I am returning the selected text and searching using UCS encoded Windows BSTRs.


What would also be useful are character navigation functions:


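int GetUTF8ByteCount( char ch ) // Source: wikipedia/UTF8
{
  // Classify by the high nibble; the cast avoids sign extension of char.
  switch( static_cast<unsigned char>( ch ) >> 4 )
  {
     case 0x8: case 0x9: case 0xA: case 0xB: return -1; // 10xxxxxx: continuation byte
     case 0xC: case 0xD: return 2;  // 110xxxxx: lead byte of a 2-byte sequence
     case 0xE: return 3;            // 1110xxxx: lead byte of a 3-byte sequence
     case 0xF: return 4;            // 11110xxx: lead byte of a 4-byte sequence
     default: return 1;             // 0x0 to 0x7: single-byte (ASCII) character
  }
}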

Return the position of the first non-partial character by scanning forward or backward, so that the index is not located mid-character:


int Normalize( int pos, bool forward )
{
  if( GetUTF8ByteCount( GetCharAt( pos )) == -1 )
  {
     pos = forward ? CharNext( pos ) : CharPrev( pos );
  }
  return pos;
}

Return the position of the next/previous non-partial character, skipping any continuation bytes:

int CharNext( int pos )
{
  while( GetUTF8ByteCount( GetCharAt( pos )) == -1 )
  {
     ++pos;
  }
  return pos;
}

int CharPrev( int pos )
{
  while( GetUTF8ByteCount( GetCharAt( pos )) == -1 )
  {
     --pos;
  }
  return pos;
}
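
To make the idea concrete outside the control, here is a minimal sketch of the same navigation over an in-memory buffer. The names IsContinuationByte, CharNextIn and CharPrevIn are hypothetical and not part of Scintilla; a std::string stands in for GetCharAt, and the bounds checks are an addition so the loops cannot run off either end of the buffer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for UTF-8 continuation bytes (10xxxxxx).
inline bool IsContinuationByte(char ch)
{
    return (static_cast<unsigned char>(ch) & 0xC0) == 0x80;
}

// Skip forward over continuation bytes, stopping at the end of the buffer.
inline std::size_t CharNextIn(const std::string &text, std::size_t pos)
{
    while (pos < text.size() && IsContinuationByte(text[pos]))
        ++pos;
    return pos;
}

// Skip backward over continuation bytes, stopping at the start of the buffer.
inline std::size_t CharPrevIn(const std::string &text, std::size_t pos)
{
    if (pos > text.size())
        pos = text.size();  // clamp out-of-range positions to the end
    while (pos > 0 && pos < text.size() && IsContinuationByte(text[pos]))
        --pos;
    return pos;
}
```

For the bytes 61 C3 A9 7A ("a", a two-byte character, "z"), position 2 falls on a continuation byte, so scanning forward lands on 3 and scanning backward lands on 1.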

This would help users of the Scintilla library write UCS interfaces to the Scintilla control.
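
As an illustration of what such a UCS interface needs, the sketch below maps between UTF-8 byte offsets and character indexes by counting bytes that are not continuation bytes. The function names are hypothetical, the text is assumed to be well-formed UTF-8, and the indexes only match UCS-2/BSTR offsets as long as every character fits in a single 16-bit unit.

```cpp
#include <cstddef>
#include <string>

// Number of characters that start before byte offset bytePos.
inline std::size_t CharIndexFromBytePos(const std::string &text, std::size_t bytePos)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < bytePos && i < text.size(); ++i)
    {
        // Every byte that is not a continuation byte (10xxxxxx) starts a character.
        if ((static_cast<unsigned char>(text[i]) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Byte offset at which character number charIndex starts,
// or text.size() if charIndex is past the last character.
inline std::size_t BytePosFromCharIndex(const std::string &text, std::size_t charIndex)
{
    std::size_t pos = 0;
    while (pos < text.size())
    {
        if ((static_cast<unsigned char>(text[pos]) & 0xC0) != 0x80)
        {
            if (charIndex == 0)
                break;      // found the lead byte of the requested character
            --charIndex;
        }
        ++pos;
    }
    return pos;
}
```

Both functions scan from the start of the buffer, so they are linear in the offset; a real interface would probably cache positions rather than rescan on every call.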

NOTE: The conversion algorithm doesn't handle 4-byte UTF8 sequences. I'm assuming this is due to lack of support for UTF16 surrogate pairs and Unicode supplementary-plane characters in Windows.
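
For reference, handling the 4-byte case would look roughly like the sketch below: decode the sequence to a code point, then split it into a UTF16 surrogate pair. This is not the code in UniConversion.h; the function names are hypothetical, and the input is assumed to be a valid 4-byte sequence (no validation is done).

```cpp
#include <cstdint>

// Decode a 4-byte UTF-8 sequence (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)
// into a code point. Assumes s points at four valid bytes.
inline std::uint32_t DecodeUTF8FourBytes(const unsigned char *s)
{
    return ((s[0] & 0x07u) << 18) | ((s[1] & 0x3Fu) << 12) |
           ((s[2] & 0x3Fu) << 6)  |  (s[3] & 0x3Fu);
}

// Split a code point in the range U+10000..U+10FFFF into UTF-16 surrogates.
inline void ToSurrogatePair(std::uint32_t cp, std::uint16_t &high, std::uint16_t &low)
{
    cp -= 0x10000;  // leaves a 20-bit value
    high = static_cast<std::uint16_t>(0xD800 + (cp >> 10));    // top 10 bits
    low  = static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF));  // bottom 10 bits
}
```

For example, the bytes F0 9F 98 80 decode to U+1F600, which splits into the surrogate pair D83D DE00.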


- Reece


_______________________________________________
Scintilla-interest mailing list
[email protected]
http://mailman.lyra.org/mailman/listinfo/scintilla-interest
