I wrote:
The UTF8 <==> UCS conversion utilities in scintilla/src/UniConversion.h
would be useful to the outside world. For example, in my application, I am
returning the selected text and searching using UCS encoded Windows BSTRs.
What would also be useful are character navigation functions:
int GetUTF8ByteCount( char ch ) // Source: wikipedia/UTF8
{
switch( ch & 0xF0 )
{
case 0x0: ... case 0x7: return 1;
case 0x8: ... case 0xB: return -1;
case 0xC: case 0xD: return 2;
case 0xE: return 3;
case 0xF: return 4;
}
}
This should be:
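int GetUTF8ByteCount( char ch ) // Source: wikipedia/UTF8
{
    switch( ch & 0xF0 )
    {
        case 0x00: case 0x10: case 0x20: case 0x30:
        case 0x40: case 0x50: case 0x60: case 0x70:
            return 1;  // single-byte (ASCII) character
        case 0x80: case 0x90: case 0xA0: case 0xB0:
            return -1; // continuation byte, not the start of a character
        case 0xC0: case 0xD0: return 2;
        case 0xE0: return 3;
        case 0xF0: return 4;
    }
    return -1; // Never reached; silences a level 4 compiler warning.
}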
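The -1 case above is the one the navigation functions depend on. As a rough sketch of an alternative (not something taken from Scintilla), a continuation byte can also be detected with a single mask, since every such byte has the bit pattern 10xxxxxx. The name IsUTF8TrailByte is hypothetical:

```cpp
// Hypothetical helper, not part of Scintilla.
// Returns true if `ch` is a UTF-8 continuation byte (bit pattern 10xxxxxx),
// i.e. a byte that can never be the first byte of a character.
bool IsUTF8TrailByte( unsigned char ch )
{
    return ( ch & 0xC0 ) == 0x80;
}
```

Taking an unsigned char here is a design choice to keep the bit test independent of whether plain char is signed on a given compiler.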
Return the nearest position that starts a character, scanning forward or
backward, so that the index is never located mid-character:
int Normalize( int pos, bool forward )
{
    // -1 means pos points at a continuation byte, i.e. inside a character.
    if( GetUTF8ByteCount( GetCharAt( pos )) == -1 )
    {
        pos = forward ? CharNext( pos ) : CharPrev( pos );
    }
    return pos;
}
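Return the position of the start of the next/previous character:
int CharNext( int pos )
{
    ++pos;
    // Skip any continuation bytes until a character start is reached.
    while( GetUTF8ByteCount( GetCharAt( pos )) == -1 )
    {
        ++pos;
    }
    return pos;
}
int CharPrev( int pos )
{
    --pos;
    // Step back over continuation bytes to the lead byte.
    while( GetUTF8ByteCount( GetCharAt( pos )) == -1 )
    {
        --pos;
    }
    return pos;
}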
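For anyone who wants to try the idea outside the control, here is a minimal sketch of the same navigation over a std::string. It assumes well-formed UTF-8, and the names IsTrailByte, CharNextIn, CharPrevIn and NormalizeIn are hypothetical, not Scintilla functions. Unlike the versions above, it clamps explicitly at both ends of the buffer instead of relying on what GetCharAt returns out of range:

```cpp
#include <cstddef>
#include <string>

// True if `ch` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool IsTrailByte( char ch )
{
    return ( static_cast<unsigned char>( ch ) & 0xC0 ) == 0x80;
}

// Index of the next character start after `pos`, clamped to text.size().
std::size_t CharNextIn( const std::string &text, std::size_t pos )
{
    if( pos >= text.size() ) return text.size();
    ++pos;
    while( pos < text.size() && IsTrailByte( text[pos] ))
    {
        ++pos;
    }
    return pos;
}

// Index of the previous character start before `pos`, clamped to 0.
std::size_t CharPrevIn( const std::string &text, std::size_t pos )
{
    if( pos > text.size() ) pos = text.size();
    if( pos == 0 ) return 0;
    --pos;
    while( pos > 0 && IsTrailByte( text[pos] ))
    {
        --pos;
    }
    return pos;
}

// If `pos` is inside a character, move it to a character start.
std::size_t NormalizeIn( const std::string &text, std::size_t pos, bool forward )
{
    if( pos < text.size() && IsTrailByte( text[pos] ))
    {
        pos = forward ? CharNextIn( text, pos ) : CharPrevIn( text, pos );
    }
    return pos;
}
```

With a curly quote (U+201C, bytes E2 80 9C) at the start of the text, an index of 1 or 2 falls inside the quote; NormalizeIn moves it to 3 when scanning forward and to 0 when scanning backward.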
These functions would let users of the Scintilla library write UCS
interfaces to the Scintilla control.
I have tested the above in my code. I am selecting the current word that is
being read, but it would sometimes display a box character after things like
curly quotes. This was because the cursor had been moved to the middle of a
UTF8 character. Using the above functions has removed the issue.
- Reece
_______________________________________________
Scintilla-interest mailing list
[email protected]
http://mailman.lyra.org/mailman/listinfo/scintilla-interest