Re: [PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD
moin all,

On 2009-02-20 06:20:18, Hans-Christoph Steiner h...@eds.org appears to have written:
> On Feb 19, 2009, at 4:13 PM, Bryan Jurish wrote:
>> moin Hans, moin list,
>>
>> On 2009-02-19 18:43:49, Hans-Christoph Steiner h...@eds.org appears to have written:
>>> One other thing, it seems that the ASCII char are handled differently than the UTF-8 chars in g_rtext.c, I think you could use instead wcswidth(), mbstowcs() or other UTF-8 functions as described in the UTF-8 FAQ http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
>>
>> Certainly, but (A) we already have the UTF-8 byte string in keysym, and we need to append that whole string to the buffer anyways, and (B) using wcswidth() & co. requires forcing the locale to have a UTF-8 LC_CTYPE. I know I did this in m_pd.c, but I think that was a HACK and that using locale functions here is the Wrong Way To Do It, because it's dangerous, unportable, and slow (warning: rant follows):
>>
>> __dangerous__: setting the locale is global for all threads of a process; in forcing the locale, we could conceivably mess with desired behavior elsewhere (e.g. in externals).
>>
>> __unportable__: we don't even know if all users' machines *have* a UTF-8 locale installed, and even if they do, we don't know what it's called. If we don't force the encoding, we're stuck with either "C" (e.g. ASCII; what we've got now in Pd-vanilla), or whatever the user is currently employing (after setlocale(LC_ALL,"")), which makes patches' appearance dependent on the user's encoding (e.g. what we've got now in Pd-vanilla), and doesn't even work in the case of variable-length encodings such as UTF-8.
>>
>> __slow__: many locale-based conversion functions are known to be pretty darned slow. If we assume we're always dealing with (valid) UTF-8, we can speed things up considerably.
>>
>> Going straight to wchar_t is another option, but would require many more changes on the C side, likely break the C API, and wouldn't solve the locale-dependency of patches' appearances, which I think is a really good argument for UTF-8.
>
> Isn't it pretty safe to assume these days that UTF-8 is supported?

Yes, but under what name? Also, I believe the relevant locale variable (LC_CTYPE) requires a language component prior to the charmap, and we cannot guarantee that e.g. en_US is installed everywhere. The only locale guaranteed to be installed everywhere is "C", and that determines language and charmap simultaneously.

Also, the "dangerous" property is impossible to get around, unless maybe we treat the locale like a stack and only force LC_CTYPE=(whatever).UTF-8 in code where we know we want/need UTF-8. I suspect this might slow things down enormously (although I haven't tested exactly what kind of overhead is involved). Adding threads to the picture means that we would have to add locking on LC_CTYPE (or similar), and that would only work if hypothetical locale-sensitive externals respected the same locks. All in all more trouble than it's worth, IM(ns)HO.

> One thing I just found out is that Windows uses a 2-byte char natively (UCS-2?),

Probably.

> I think Mac OS X uses UTF-8 natively.

... but not for wchar_t (which would be superfluous if sizeof(wchar_t)==1) !

> I think that most Linux tools should work with UTF-8 too, especially since it can work as ASCII.

Yes, but "working with UTF-8" is by no means synonymous with supporting a particular (and known) value of LC_CTYPE which happens to use UTF-8 as its charmap. Most text-processing tools work with UTF-8 because they can get away with just churning bytes -- this is not the case for Pd (which counts characters to move the selection, edit buffers, determine box widths, and maybe more)...

> So you think we can have full UTF-8 support without using those functions?

In a word: yes.
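
The "locale as a stack" idea from this message could look roughly like the sketch below. It is only an illustration of the approach being argued against: with_lc_ctype() and count_call() are hypothetical names, not Pd code, and the sketch does nothing about the threading problem raised above, since setlocale() still changes the locale for the whole process.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Callback type for code that needs a particular LC_CTYPE (hypothetical). */
typedef void (*ctype_fn)(void *arg);

/* Hypothetical helper: run fn(arg) with LC_CTYPE temporarily set to
 * `wanted`, then restore the previous setting ("push"/"pop").
 * Returns 0 if the locale could be set, -1 otherwise.
 * NOT thread-safe: setlocale() affects every thread of the process. */
int with_lc_ctype(const char *wanted, ctype_fn fn, void *arg)
{
    const char *cur;
    char *saved;
    int rc = -1;

    assert(wanted != NULL && fn != NULL);

    cur = setlocale(LC_CTYPE, NULL);      /* query only, changes nothing */
    if (cur == NULL)
        return -1;

    /* Copy the name: the returned string may be overwritten by the
     * next call to setlocale(). */
    saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, cur);

    if (setlocale(LC_CTYPE, wanted) != NULL) {   /* "push" */
        fn(arg);
        rc = 0;
    }
    setlocale(LC_CTYPE, saved);                  /* "pop" */
    free(saved);
    return rc;
}

/* Trivial example callback: counts how often it was invoked. */
void count_call(void *arg)
{
    ++*(int *)arg;
}
```

A caller would wrap each locale-dependent call (wcswidth(), mbstowcs(), ...) in with_lc_ctype("en_US.UTF-8", ...), assuming a locale of that name exists on the machine, which is exactly the portability worry raised above; the -1 return is the only signal that it does not.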

Specifically, I think we can have full UTF-8 support without using those functions *as provided by the C99 locale API*. That amounts to rolling our own versions of the same and/or similar functionality. In particular, the (utf8.c,utf8.h) code by Jeff Bezanson (see http://www.cprogramming.com/tutorial/unicode.html) has some attractive utilities for wrapping typical string-processing code (in particular, u8_inc() and u8_dec() for adapting old byte-string processing code using i++ and i--, respectively), in addition to wrappers for the usual locale-style functionality:

  wcswidth() -- (trivial) (I've written the code)
  mbstowcs() -- u8_toucs() (I've actually got a version of this too)

Others of Bezanson's utilities (isutf8(), u8_offset(), u8_charnum(), u8_nextchar()) are also potentially useful for adapting the C side, and in some cases, I'm not even sure how to wrap them with the C locale functions without converting the whole UTF-8 string to wchar_t, which I think we can agree we do not want to do.

Presumably, Bezanson's (public domain) code is safe for integration with anything, so I'll use that for now, if no one objects.

>> That said, a faster implementation would
Re: [PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD
This is good news! While the C changes aren't dead simple, they are not bad. I think they could be slightly simplified. One thing that would make the diff much easier to read is creating it without whitespace changes, like this:

  svn diff -x -w

As for the Tcl changes, I think we can include those now in Pd-devel, as long as they work ok with unchanged C code. Then once the new Tcl GUI is included we can refactor the C side of things with things like this.

One other thing, it seems that the ASCII chars are handled differently than the UTF-8 chars in g_rtext.c. I think you could instead use wcswidth(), mbstowcs() or other UTF-8 functions as described in the UTF-8 FAQ: http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod

.hc

On Feb 17, 2009, at 5:53 PM, Bryan Jurish wrote:
> morning Hans, morning list,
>
> So I've tried to get the pd-devel 0.41.4 branch to use UTF-8 across the board. The TK side was easy (as Hans predicted); really just a call to {fconfigure} in ::pd_connect::configure_socket. I also set the output encoding to UTF-8 on Tk's stdout and stderr, for debugging purposes; it's probably wisest to leave those encodings at the default (user's current locale LC_CTYPE) for a release-like version.
>
> The C side is much hairier. I think I've got things basically working (at least for message boxes and comments), but it has so far required changes in:
>
> FILE: g_editor.c
>  + changed handling of "Key" events as passed to the C side to generate UTF-8 symbol-strings rather than single-byte stringlets.
>  + currently use sprintf("%C") to get the UTF-8 string for the codepoint passed from Tk; a safer (and not too hard) way would be to pass the actual UTF-8 string from Tk and just copy that: this would avoid the m_pd.c hacks forcing LC_CTYPE=en_US.UTF-8 (see below). Another option would be actually just writing (or borrowing) the code to generate UTF-8 strings from Unicode codepoints. It's pretty simple stuff; I've still got the guts of it somewhere (only written for latin-1 so far, but the principle is the same for all codepoints).
>
> FILE: m_pd.c
>  + added calls to setlocale() to set LC_CTYPE to en_US.UTF-8; this is an ugly stinky nasty hack to get sprintf("%C") to output a UTF-8 encoded string from a unicode codepoint int, as called by canvas_key() in g_editor.c
>
> FILE: g_rtext.c
>  + added an 'else if' clause in rtext_key() to handle unicode codepoints as values of the 'keynum' parameter. should also be safe for any 8-bit fixed-width encoding.
>
> FILE: pd.tk
>  + set system encoding, also output encoding for stdout, stderr to UTF-8
>
> Attached is a screenshot and a test patch. UTF-8 input from the keyboard works with the test patch, and gets carried through properly to the .pd file (and back on load). I'd like to get symbol atoms working too (haven't tried yet), but there are still some nasty buglets with comments and message boxes, mostly that editing any multibyte characters is very tricky: looks like the Tk "point" (cursor) and selection are expressed in characters, and Pd's C side is still thinking in bytes, though I'm totally ignorant of where or how that can be changed. A non-critical buglet with the same cause (probably) is that the C side is computing the required width for message boxes based on byte lengths, not character lengths, so message boxes containing multibyte characters look too wide. I could live with that, but the editing thing is a real pain...
>
> I've attached a diff of my changes against branches/pd-devel/0.41.4/src (please excuse commented-out debugging code), in case anyone wants to try this stuff out. Since it's not working, I'm reluctant to check anything into the pd-devel/0.41.4 branch yet -- should I branch again for a work in progress, or do we just pass diffs around for now?
>
> marmosets,
> Bryan
>
> On 2009-02-12 06:24:44, Hans-Christoph Steiner h...@eds.org appears to have written:
>> On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
>>> On 2009-02-11 03:04:34, Hans-Christoph Steiner h...@eds.org appears to
>>
>> This is something that I would really like to have working properly in Pd-devel. Tcl/Tk is natively UTF-8, so it seems that we should support UTF-8 in Pd. Anyone feel like trying to fix it?
>
> --
> Bryan Jurish                 "There is *always* one more bug."
> jur...@ling.uni-potsdam.de   -Lubarsky's Law of Cybernetic Entomology
>
> test-utf8.pd
> test-utf8.png
>
> Index: m_pd.c
> ===================================================================
> --- m_pd.c	(revision 10779)
> +++ m_pd.c	(working copy)
> @@ -295,6 +295,18 @@
>  void glob_init(void);
>  void garray_init(void);
>
> +/*--BEGIN moo--*/
> +#include <locale.h>
> +void locale_init(void) {
> +  setlocale(LC_ALL,"");
> +  setlocale(LC_NUMERIC,"C");
> +  setlocale(LC_CTYPE,"en_US.UTF-8");
> +  /*
> +  printf("moo: locale=%s\n", setlocale(LC_ALL,NULL));
> +  printf("moo: LC_CTYPE=%s\n", setlocale(LC_CTYPE,NULL));
> +
Re: [PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD
On Feb 19, 2009, at 4:13 PM, Bryan Jurish wrote:
> moin Hans, moin list,
>
> On 2009-02-19 18:43:49, Hans-Christoph Steiner h...@eds.org appears to have written:
>> This is good news! While the C changes aren't dead simple, they are not bad. I think they could be slightly simplified. One thing that would make it much easier to read the diff is if you create it without whitespace changes. So like this:
>>
>>   svn diff -x -w
>
> oops, sorry... duly noted for future diffs ... I also set my emacs' tcl-indent-width to 8 ... sorry sorry sorry ...
>
>> As for the Tcl changes, I think we can include those now in Pd-devel, as long as they can work ok with unchanged C code.
>
> Done.
>
>> Then once the new Tcl GUI is included we can refactor the C side of things with things like this.
>>
>> One other thing, it seems that the ASCII char are handled differently than the UTF-8 chars in g_rtext.c, I think you could use instead wcswidth(), mbstowcs() or other UTF-8 functions as described in the UTF-8 FAQ http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
>
> Certainly, but (A) we already have the UTF-8 byte string in keysym, and we need to append that whole string to the buffer anyways, and (B) using wcswidth() & co. requires forcing the locale to have a UTF-8 LC_CTYPE. I know I did this in m_pd.c, but I think that was a HACK and that using locale functions here is the Wrong Way To Do It, because it's dangerous, unportable, and slow (warning: rant follows):
>
> __dangerous__: setting the locale is global for all threads of a process; in forcing the locale, we could conceivably mess with desired behavior elsewhere (e.g. in externals).
>
> __unportable__: we don't even know if all users' machines *have* a UTF-8 locale installed, and even if they do, we don't know what it's called. If we don't force the encoding, we're stuck with either "C" (e.g. ASCII; what we've got now in Pd-vanilla), or whatever the user is currently employing (after setlocale(LC_ALL,"")), which makes patches' appearance dependent on the user's encoding (e.g. what we've got now in Pd-vanilla), and doesn't even work in the case of variable-length encodings such as UTF-8.
>
> __slow__: many locale-based conversion functions are known to be pretty darned slow. If we assume we're always dealing with (valid) UTF-8, we can speed things up considerably.
>
> Going straight to wchar_t is another option, but would require many more changes on the C side, likely break the C API, and wouldn't solve the locale-dependency of patches' appearances, which I think is a really good argument for UTF-8.

Isn't it pretty safe to assume these days that UTF-8 is supported? One thing I just found out is that Windows uses a 2-byte char natively (UCS-2?). I think Mac OS X uses UTF-8 natively. I think that most Linux tools should work with UTF-8 too, especially since it can work as ASCII. So you think we can have full UTF-8 support without using those functions?

> (rant finished now, sorry)
>
> That said, a faster implementation would probably result from mixing (something like) wcswidth() and strncpy(...,keysym). Functions like wcswidth() and mbstowcs() are pretty easy to cook up if we assume wchar_t is UCS-4 and the multibyte encoding is UTF-8.

It seems to me that wcswidth() would be used for measuring the length of the text for display in boxes. I suppose strlen() could still be used for allocating and freeing memory, but I think that we should aim for clean code. If you think the current way in your diff is the best, that's fine by me.

> There are a number of libraries and code snippets floating about in the net making just such assumptions. In this context: are there any licensing restrictions on code included in pd-devel? So far, I've found one useful-looking (.c,.h) pair in the public domain, as well as some LGPL code from gnulib, which could be linked in statically. There's also code from the Unicode Consortium themselves, but it's pretty monstrous (read "pedantic") and limited to string-to-string conversions.

Well, Pd-vanilla is BSD licensed, and Pd-extended is GPL'ed. For this stage of Pd-devel, it would be good to keep it to something that can be BSD licensed.

.hc

> marmosets,
> Bryan
>
>> On Feb 17, 2009, at 5:53 PM, Bryan Jurish wrote:
>>> So I've tried to get the pd-devel 0.41.4 branch to use UTF-8 across the board. The TK side was easy (as Hans predicted); [snip] The C side is much hairier. [snip]
>
> --
> Bryan Jurish                 "There is *always* one more bug."
> jur...@ling.uni-potsdam.de   -Lubarsky's Law of Cybernetic Entomology

Access to computers should be unlimited and total. - the hacker ethic
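
The split discussed in this message, strlen() for memory and a character-based measure for box widths and cursor positions, can be sketched without any locale calls if the buffer is assumed to hold valid UTF-8. The names u8_count_chars() and u8_byte_offset() below are made up for illustration; they are not Bezanson's functions and not Pd code. Note also that this counts code points, whereas wcswidth() reports display columns, so wide (e.g. CJK) characters would still need extra handling.

```c
#include <assert.h>
#include <stddef.h>

/* Number of code points in a NUL-terminated, valid UTF-8 string.
 * Every byte that is not a continuation byte (10xxxxxx) starts a
 * new character, so counting those bytes counts characters. */
size_t u8_count_chars(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Byte offset of the charnum-th character (0-based), e.g. to turn a
 * cursor position counted in characters into a buffer index counted
 * in bytes. Returns the byte length of s if charnum is past the end. */
size_t u8_byte_offset(const char *s, size_t charnum)
{
    size_t i = 0;

    assert(s != NULL);
    while (charnum > 0 && s[i] != '\0') {
        i++;                                        /* lead byte */
        while (((unsigned char)s[i] & 0xC0) == 0x80)
            i++;                                    /* continuation bytes */
        charnum--;
    }
    return i;
}
```

For the six-byte string "h\xC3\xA9llo" ("héllo"), strlen() gives 6 while u8_count_chars() gives 5, and the character at index 2 starts at byte 3.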
[PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD
morning Hans, morning list,

So I've tried to get the pd-devel 0.41.4 branch to use UTF-8 across the board. The TK side was easy (as Hans predicted); really just a call to {fconfigure} in ::pd_connect::configure_socket. I also set the output encoding to UTF-8 on Tk's stdout and stderr, for debugging purposes; it's probably wisest to leave those encodings at the default (user's current locale LC_CTYPE) for a release-like version.

The C side is much hairier. I think I've got things basically working (at least for message boxes and comments), but it has so far required changes in:

FILE: g_editor.c
 + changed handling of "Key" events as passed to the C side to generate UTF-8 symbol-strings rather than single-byte stringlets.
 + currently use sprintf("%C") to get the UTF-8 string for the codepoint passed from Tk; a safer (and not too hard) way would be to pass the actual UTF-8 string from Tk and just copy that: this would avoid the m_pd.c hacks forcing LC_CTYPE=en_US.UTF-8 (see below). Another option would be actually just writing (or borrowing) the code to generate UTF-8 strings from Unicode codepoints. It's pretty simple stuff; I've still got the guts of it somewhere (only written for latin-1 so far, but the principle is the same for all codepoints).

FILE: m_pd.c
 + added calls to setlocale() to set LC_CTYPE to en_US.UTF-8; this is an ugly stinky nasty hack to get sprintf("%C") to output a UTF-8 encoded string from a unicode codepoint int, as called by canvas_key() in g_editor.c

FILE: g_rtext.c
 + added an 'else if' clause in rtext_key() to handle unicode codepoints as values of the 'keynum' parameter. should also be safe for any 8-bit fixed-width encoding.

FILE: pd.tk
 + set system encoding, also output encoding for stdout, stderr to UTF-8

Attached is a screenshot and a test patch. UTF-8 input from the keyboard works with the test patch, and gets carried through properly to the .pd file (and back on load).
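
The option mentioned under g_editor.c above, generating the UTF-8 string directly from the code point instead of going through sprintf("%C") and a forced locale, might look like the following sketch. The function name u8_encode() and its return convention are assumptions made for illustration; this is not the code from the attached diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * buf must have room for at least 5 bytes (4 data bytes + NUL).
 * Returns the number of bytes written, not counting the NUL,
 * or 0 for values that are not valid code points (surrogates,
 * or anything above U+10FFFF). No locale is involved. */
size_t u8_encode(unsigned long cp, char *buf)
{
    size_t n;

    assert(buf != NULL);

    if (cp < 0x80) {                        /* 0xxxxxxx (ASCII) */
        buf[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800) {                /* 110xxxxx 10xxxxxx */
        buf[0] = (char)(0xC0 | (cp >> 6));
        buf[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {              /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) { /* UTF-16 surrogates */
            buf[0] = '\0';
            return 0;
        }
        buf[0] = (char)(0xE0 | (cp >> 12));
        buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {            /* 11110xxx + 3 continuation */
        buf[0] = (char)(0xF0 | (cp >> 18));
        buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        buf[0] = '\0';
        return 0;
    }
    buf[n] = '\0';
    return n;
}
```

With something like this, the key handler could fill a small buffer from the float it receives from Tk and hand the result to gensym(), with no setlocale() call needed; how that fits into canvas_key() is left open here.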

I'd like to get symbol atoms working too (haven't tried yet), but there are still some nasty buglets with comments and message boxes, mostly that editing any multibyte characters is very tricky: looks like the Tk "point" (cursor) and selection are expressed in characters, and Pd's C side is still thinking in bytes, though I'm totally ignorant of where or how that can be changed. A non-critical buglet with the same cause (probably) is that the C side is computing the required width for message boxes based on byte lengths, not character lengths, so message boxes containing multibyte characters look too wide. I could live with that, but the editing thing is a real pain...

I've attached a diff of my changes against branches/pd-devel/0.41.4/src (please excuse commented-out debugging code), in case anyone wants to try this stuff out. Since it's not working, I'm reluctant to check anything into the pd-devel/0.41.4 branch yet -- should I branch again for a work in progress, or do we just pass diffs around for now?

marmosets,
Bryan

On 2009-02-12 06:24:44, Hans-Christoph Steiner h...@eds.org appears to have written:
> On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
>> On 2009-02-11 03:04:34, Hans-Christoph Steiner h...@eds.org appears to
>
> This is something that I would really like to have working properly in Pd-devel. Tcl/Tk is natively UTF-8, so it seems that we should support UTF-8 in Pd. Anyone feel like trying to fix it?

--
Bryan Jurish                 "There is *always* one more bug."
jur...@ling.uni-potsdam.de   -Lubarsky's Law of Cybernetic Entomology

test-utf8.pd
Description: application/puredata

inline: test-utf8.png

Index: m_pd.c
===================================================================
--- m_pd.c	(revision 10779)
+++ m_pd.c	(working copy)
@@ -295,6 +295,18 @@
 void glob_init(void);
 void garray_init(void);
 
+/*--BEGIN moo--*/
+#include <locale.h>
+void locale_init(void) {
+  setlocale(LC_ALL,"");
+  setlocale(LC_NUMERIC,"C");
+  setlocale(LC_CTYPE,"en_US.UTF-8");
+  /*
+  printf("moo: locale=%s\n", setlocale(LC_ALL,NULL));
+  printf("moo: LC_CTYPE=%s\n", setlocale(LC_CTYPE,NULL));
+  */
+}
+
 void pd_init(void)
 {
     mess_init();
@@ -302,5 +314,5 @@
     conf_init();
     glob_init();
     garray_init();
+    locale_init(); /*-- moo --*/
 }
-
Index: g_editor.c
===================================================================
--- g_editor.c	(revision 10779)
+++ g_editor.c	(working copy)
@@ -1468,9 +1468,16 @@
             gotkeysym = av[1].a_w.w_symbol;
         else if (av[1].a_type == A_FLOAT)
         {
+          /*-- moo: old
             char buf[3];
-            sprintf(buf, "%c", (int)(av[1].a_w.w_float));
+            sprintf(buf, "%c", (int)(av[1].a_w.w_float));
             gotkeysym = gensym(buf);
+          --*/
+            char buf[8];
+            sprintf(buf, "%C", (int)(av[1].a_w.w_float));
+            /*printf("moo: charcode %%d=%d, %%c=%c, %%C=%C\n", (int)(av[1].a_w.w_float), (int)(av[1].a_w.w_float), (int)(av[1].a_w.w_float));*/
+            /*printf("moo: buf='%s'\n", buf);*/
+            gotkeysym =