Martin Michlmayr wrote:
> I received a bug report against cplay (a front-end for audio players
> written in Python and using ncurses) that it doesn't support UTF-8.
> While trying to solve this problem, the bigger question came up
> whether the Python bindings actually support UTF-8.
>
> In Debian, we have a libncurses5 library and a libncursesw5 for wide
> characters.  Is it just a matter of compiling the Python bindings
> against libncursesw5 or is there work needed on the bindings themselves
> so they support UTF-8?

I'm uncertain as to how ncurses deals with multi-byte characters - one might think that UTF-8 is just a multi-byte encoding (where a single character can require multiple bytes), and that they can be sent to the terminal as-is. This, of course, would require that control sequences are sent to the terminal only at character boundaries - something that the application would have to guarantee.
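To make the "character boundaries" point concrete, here is a minimal sketch in plain Python (no curses involved). The sample string and the helper name split_is_safe are made up for illustration; the sketch only shows that cutting a UTF-8 byte string in the middle of a character leaves invalid fragments, which is what would happen if a control sequence were inserted there.

```python
# Sample text with two non-ASCII letters, written with escapes: "Grüße".
text = "Gr\u00fc\u00dfe"
encoded = text.encode("utf-8")

char_count = len(text)      # number of characters
byte_count = len(encoded)   # number of bytes; larger, since two letters need 2 bytes each


def split_is_safe(data, index):
    """Return True if cutting `data` at `index` leaves two valid UTF-8 pieces.

    Hypothetical helper: a control sequence may only be inserted at an
    index for which this returns True.
    """
    try:
        data[:index].decode("utf-8")
        data[index:].decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Indexes at which the byte string may be interrupted by a control sequence.
safe_indexes = [i for i in range(len(encoded) + 1) if split_is_safe(encoded, i)]
```

The indexes missing from safe_indexes are exactly the ones that fall inside a multi-byte character.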

The other issue with multi-byte characters is columns - you cannot
equate "single byte == single column". OTOH, in "true" non-ASCII
applications, you cannot equate them anyway, since some characters
(e.g. CJK full-width characters) take two columns. Not sure how
curses deals with that phenomenon.

> I guess once there are UTF-8 aware Python curses bindings, I have to
> change cplay to use UTF-8 internally (however that may work), but
> right now I'm wondering about the bindings themselves.

libncursesw5 is certainly not about UTF-8 as such; it is about wide characters. Python's curses bindings do not support it at the moment.

It probably should, which would mean that, on the C API, you don't
pass char/char* anymore, but wchar_t/wchar_t*. On the Python API,
you would pass Unicode objects, instead of string objects.

IOW, there seem to be two options:
1. Use char*, and UTF-8, and Python byte strings. Make sure you
   always keep the multiple bytes of a character together;
   this is easiest to achieve by converting the byte string to
   Unicode temporarily. So instead of

   for c in data:
       if condition: output escape sequence
       output c

   do

   udata = data.decode("UTF-8")
   for c in udata:
       if condition: output escape sequence
       output c.encode("UTF-8")

2. Implement a true Unicode API for curses, using libncursesw.
   This would check the actual parameters to see whether they
   are byte strings or Unicode strings, and invoke the appropriate
   curses library (assuming you can mix curses and cursesw in a single
   terminal - or choke if somebody tries to mix byte strings and
   Unicode strings in a single terminal).

   Then the loop above becomes

   udata = data.decode("UTF-8")
   for c in udata:
       if condition: output escape sequence
       output c

   So the difference would be that you can directly send Unicode
   characters, instead of encoding them as UTF-8 first.
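As a rough, runnable version of the loop in option 1, assuming Python 3 byte strings: the function name emit_with_escapes, the needs_escape callback, and the particular escape sequence are all invented for illustration and are not part of any curses API. It collects the output in a buffer instead of writing to a terminal, so the result can be inspected.

```python
def emit_with_escapes(data, needs_escape, escape=b"\x1b[1m"):
    """Build terminal output from UTF-8 bytes, one whole character at a time.

    `data` is a UTF-8 byte string; `needs_escape` is a hypothetical
    predicate deciding whether `escape` should precede a character.
    Decoding first guarantees the escape never lands inside a
    multi-byte character.
    """
    out = bytearray()
    for ch in data.decode("utf-8"):
        if needs_escape(ch):
            out += escape
        # Re-encode the single character so its bytes stay together.
        out += ch.encode("utf-8")
    return bytes(out)


# "naïve" as UTF-8 bytes; the escape is requested before the "ï".
raw = "na\u00efve".encode("utf-8")
result = emit_with_escapes(raw, lambda ch: ch == "\u00ef")
```

With option 2, the re-encoding step would go away, since the library would accept the Unicode character directly.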

In either case, you might need to deal with the issue of full-width
characters (i.e. characters that take up twice as much horizontal
space as Latin letters). Not all terminals support full-width
in the first place; xterm is an example of a terminal that does.
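For the column question, one possible approximation in pure Python is to consult the Unicode East Asian Width property. This is only a sketch: display_width is a made-up helper, it treats ambiguous-width characters as one column, and a real terminal (or the C library's wcwidth) may disagree in corner cases.

```python
import unicodedata


def display_width(text):
    """Estimate how many terminal columns `text` occupies.

    Assumption: wide ("W") and full-width ("F") characters take two
    columns, combining marks take none, everything else takes one.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn over the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


ascii_width = display_width("abc")
# Three CJK ideographs, written with escapes.
cjk_width = display_width("\u65e5\u672c\u8a9e")
```

An application could use such an estimate to decide where to truncate or pad a string, rather than relying on its length in bytes or characters.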

Regards,
Martin

