Re: [fltk.development] Unicode spaces and line breaks

Duncan Gibson Tue, 18 May 2010 06:10:12 -0700

Ian:
>>> Would it be OK for me to push my tweak, that makes it honour
>>> ERRORS_TO_CP1252 for the C1 chars?


Me:
>> - fl_wcwidth(const char* src)
>>   This calls the fltk2 fl_utf8decode() function to get the UCS
>>   value first, and therefore does handle CP1252 C1 chars
>>   [if you have the macros defined of course]

Ian:
> My tweak was just to make fl_wcwidth_() do the "Right Thing" for
> the C1 chars. I envisaged that it would be cheaper to run than the
> more correct fl_wcwidth() version and so might still have a place
> (e.g. in the Fl_Text_* cases or etc.)

OK, sorry, I hadn't noticed the fine nuance of what you were asking.

But why would you want to do that? Unless you are only dealing in
CP1252, you will need to keep track of all of the UTF-8 boundaries
yourself so that you know whether this 0x8? byte is a CP1252
character or a continuation byte within a UTF-8 sequence.
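
To make the boundary-tracking point concrete, here is a minimal sketch.
The helper names are hypothetical and are not FLTK functions, and the
validity check is deliberately simplified (it does not reject overlong
forms or surrogates). It shows that a byte such as 0x82 can only be
classified by walking the buffer from a known boundary:

```cpp
#include <cstddef>

// Hypothetical helper: true for bytes of the form 10xxxxxx (0x80..0xBF).
bool is_continuation_byte(unsigned char b) {
  return (b & 0xC0) == 0x80;
}

// Hypothetical helper: length of the well-formed sequence starting at p,
// or 1 if the byte at p does not start a well-formed sequence.
std::size_t sequence_length(const unsigned char* p, const unsigned char* end) {
  std::size_t n = 0;
  if (*p < 0x80) return 1;                      // plain ASCII
  else if (*p >= 0xC2 && *p <= 0xDF) n = 2;
  else if (*p >= 0xE0 && *p <= 0xEF) n = 3;
  else if (*p >= 0xF0 && *p <= 0xF4) n = 4;
  else return 1;                                // not a lead byte at all
  if (static_cast<std::size_t>(end - p) < n) return 1;  // truncated
  for (std::size_t i = 1; i < n; ++i)
    if (!is_continuation_byte(p[i])) return 1;  // broken sequence
  return n;
}

// True if buf[pos] is a standalone byte in 0x80..0x9F, that is, a
// candidate for CP1252 interpretation rather than part of a sequence.
// Note the walk from the start of the buffer: there is no way to
// answer this by looking at buf[pos] alone.
bool is_stray_c1_byte(const unsigned char* buf, std::size_t len, std::size_t pos) {
  const unsigned char* p = buf;
  const unsigned char* end = buf + len;
  while (p < end) {
    std::size_t n = sequence_length(p, end);
    std::size_t start = static_cast<std::size_t>(p - buf);
    if (pos < start + n)
      return n == 1 && *p >= 0x80 && *p <= 0x9F;
    p += n;
  }
  return false;  // pos is past the end of the buffer
}
```

With the bytes E2 82 AC (the euro sign in UTF-8), position 1 holds 0x82
and is reported as part of a sequence. With the bytes 'a' 82 'b', the
same value at position 1 is reported as a stray C1 byte.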

My own feeling, gleaned from chipping away at the edges of Unicode,
CP1252 and UTF-8 without ever really having used them in anger, is
that we are currently balancing on the edge of maintainability.

By that I mean that almost everyone is accustomed to using plain
ASCII, or single byte encodings anyway, and manually testing for
newlines, spaces, tabs, etc. on a byte by byte basis. There is a
lot of code -- user code as well as FLTK code -- that simply steps
through a byte array, a byte at a time, tests for a byte value, or
does bit-twiddling against a byte value. With the advent of UTF-8
byte sequences, we now really need to convert all of that code to
use char* pointers to the start of such sequences. Furthermore, we
need to ensure the pre-condition that the char* is a pointer to the
first byte of a UTF-8 byte sequence, and not to the second, third
or fourth bytes of a UTF-8 byte sequence.

We have no control over user code, but we do control FLTK code,
and we should ensure that we get rid of as much byte-as-character
logic in the code as possible. We need to move away from using
an integer index into a byte array, and fl_utf8len(bytes[i]) to
calculate the index increment, because we know this doesn't always
do what we expect. Instead we need to use char* pointers into the
array, and increment them using fl_utf8fwd() because this uses
fl_utf8decode() internally and can handle any byte value sensibly.
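
As a rough illustration of the difference, the sketch below compares
the two stepping styles. The functions here are stand-ins written for
this example, not the FLTK implementations of fl_utf8len() or
fl_utf8fwd(), and the validation is simplified. The point is only that
trusting the lead byte goes wrong on malformed input, while a step that
checks the whole sequence does not:

```cpp
#include <cstddef>

// Stand-in for a "length from lead byte" lookup (hypothetical, not
// FLTK's fl_utf8len): it looks at one byte and trusts it.
std::size_t len_from_lead_byte(unsigned char b) {
  if (b < 0x80) return 1;
  if ((b & 0xE0) == 0xC0) return 2;
  if ((b & 0xF0) == 0xE0) return 3;
  if ((b & 0xF8) == 0xF0) return 4;
  return 1;  // continuation or invalid byte: step over it
}

// Stand-in for a validating forward step (hypothetical): advance by the
// full sequence only if the continuation bytes are really there,
// otherwise advance by a single byte. Precondition: p < end.
const char* step_forward(const char* p, const char* end) {
  std::size_t n = len_from_lead_byte(static_cast<unsigned char>(*p));
  if (static_cast<std::size_t>(end - p) < n) return p + 1;
  for (std::size_t i = 1; i < n; ++i)
    if ((static_cast<unsigned char>(p[i]) & 0xC0) != 0x80) return p + 1;
  return p + n;
}

// Index-based loop: the increment comes from the lead byte alone.
std::size_t count_by_index(const char* s, std::size_t len) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < len;
       i += len_from_lead_byte(static_cast<unsigned char>(s[i])))
    ++count;
  return count;
}

// Pointer-based loop: every step is validated against the buffer.
std::size_t count_by_pointer(const char* s, std::size_t len) {
  std::size_t count = 0;
  const char* end = s + len;
  for (const char* p = s; p < end; p = step_forward(p, end))
    ++count;
  return count;
}
```

For well-formed input such as E2 82 AC 'x' both loops count 2
characters. For the broken input E2 'A' 'B' 'C' the index loop counts 2,
because it swallows 'A' and 'B' as if they were continuation bytes,
whereas the pointer loop counts 4.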

The downside is that although the fl_utf8decode() strategy (with
CP1252 C1 0x80-0x9f conversion) is a lot safer than the fl_utf8len()
one, it is certainly not as Fast and Light. But if we don't convert
to use fl_utf8decode() and fl_utf8fwd(), I suspect that there will
be an awful lot of special case handling in the code to work round
these CP1252 characters.
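
For readers unfamiliar with the idea, a decode-with-fallback routine
might look roughly like the sketch below. This is an assumption-laden
illustration in the spirit of the strategy described above, not the
FLTK source: the function name is hypothetical, overlong forms and
surrogates are not rejected, and bytes that CP1252 leaves undefined
(0x81, 0x8D, 0x8F, 0x90, 0x9D) are simply passed through unchanged.

```cpp
#include <cstddef>

// Windows CP1252 mapping for bytes 0x80..0x9F.
const unsigned short kCp1252C1[32] = {
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Hypothetical decoder. Returns the code point at p and stores the
// number of bytes consumed in *len. A byte that does not start a
// well-formed sequence is consumed on its own: 0x80..0x9F go through
// the CP1252 table, anything else is returned as its own value.
// Precondition: p < end.
unsigned decode_with_cp1252_fallback(const unsigned char* p,
                                     const unsigned char* end,
                                     std::size_t* len) {
  unsigned char b = *p;
  *len = 1;
  if (b < 0x80) return b;  // ASCII fast path

  std::size_t n = 0;
  unsigned cp = 0;
  if ((b & 0xE0) == 0xC0)      { n = 2; cp = b & 0x1F; }
  else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; }
  else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; }

  if (n != 0 && static_cast<std::size_t>(end - p) >= n) {
    bool ok = true;
    for (std::size_t i = 1; i < n; ++i) {
      if ((p[i] & 0xC0) != 0x80) { ok = false; break; }
      cp = (cp << 6) | (p[i] & 0x3Fu);
    }
    if (ok) { *len = n; return cp; }
  }

  // Not a well-formed sequence: fall back to a single-byte reading.
  if (b <= 0x9F) return kCp1252C1[b - 0x80];
  return b;
}
```

Both the UTF-8 bytes E2 82 AC and the single CP1252 byte 0x80 decode to
U+20AC here, consuming 3 bytes and 1 byte respectively. The extra
branches and the continuation-byte loop are where the additional cost
over a plain lead-byte lookup comes from.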

It could be that my view has been somewhat tainted by trying to work
round all of the special case handling in the Fl_Text_* widgets, and
your mileage may vary.

Cheers
Duncan

PS. The other thing that I think we really need is to attract a lot
    of new users/developers who are working in CJK, Cyrillic, Hebrew,
    Arabic, etc. who will exercise this code in ways that I know
    that I cannot.
_______________________________________________
fltk-dev mailing list
fltk-dev@easysw.com
http://lists.easysw.com/mailman/listinfo/fltk-dev