Re: [PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD

2009-02-20 Thread Bryan Jurish
moin all,

On 2009-02-20 06:20:18, Hans-Christoph Steiner h...@eds.org appears to
have written:
 On Feb 19, 2009, at 4:13 PM, Bryan Jurish wrote:
 moin Hans, moin list,

 On 2009-02-19 18:43:49, Hans-Christoph Steiner h...@eds.org appears to
 have written:
 One other thing: it seems that the ASCII chars are handled
 differently than the UTF-8 chars in g_rtext.c.  I think you could
 use wcswidth(), mbstowcs() or other UTF-8 functions instead, as
 described in the UTF-8 FAQ
 
 http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
 
 Certainly, but (A) we already have the UTF-8 byte string in keysym,
 and we need to append that whole string to the buffer anyways, and
 (B) using wcswidth() & co. requires forcing the locale to have a
 UTF-8 LC_CTYPE.  I know I did this in m_pd.c, but I think that was
 a HACK and that using locale functions here is the Wrong Way To Do
 It, because it's dangerous, unportable, and slow (warning: rant
 follows):
 
 __dangerous__: setting the locale is global for all threads of a 
 process; in  forcing the locale, we could conceivably mess with
 desired behavior elsewhere (e.g. in externals).
 
 __unportable__: we don't even know if all users' machines *have* a
 UTF-8 locale installed, and even if they do, we don't know what
 it's called. If we don't force the encoding, we're stuck with
 either C (e.g. ASCII; what we've got now in Pd-vanilla), or
 whatever the user is currently employing (after
 setlocale(LC_ALL,"")), which makes patches' appearance dependent on
 the user's encoding (e.g. what we've got now in Pd-vanilla), and
 doesn't even work in the case of variable-length encodings such as
 UTF-8.
 
 __slow__: many locale-based conversion functions are known to be
 pretty darned slow.  if we assume we're always dealing with (valid)
 UTF-8, we can speed things up considerably.  going straight to
 wchar_t is another option, but would require many more changes on
 the C side, likely break the C API, and wouldn't solve the
 locale-dependency of patches' appearances, which I think is a
 really good argument for UTF-8.
 
 Isn't it pretty safe to assume these days that UTF-8 is supported?

Yes, but under what name?  Also, I believe the relevant locale variable
(LC_CTYPE) requires a language component prior to the charmap, and we
cannot guarantee that e.g. en_US is installed everywhere.  The only
locale guaranteed to be installed everywhere is C, and that determines
language and charmap simultaneously.

Also, the dangerous property is impossible to get around, unless maybe
we treat the locale like a stack and only force
LC_CTYPE=(whatever).UTF-8 in code where we know we want/need UTF-8.  I
suspect this might slow things down enormously (although I haven't
tested exactly what kind of overhead is involved).  Adding threads to
the picture means that we would have to add locking on LC_CTYPE (or
similar) and that would only work if hypothetical locale-sensitive
externals respected the same locks.  All in all more trouble than it's
worth, IM(ns)HO.
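
To make the "locale as a stack" idea concrete, here is a minimal sketch, assuming a hypothetical helper with_ctype_locale() (not part of Pd or of any patch in this thread). It saves the current LC_CTYPE, switches to the requested locale, runs a callback, and restores the old value. count_call() is only a stand-in for code that would need UTF-8. Note that it does nothing about threads: setlocale() still changes process-wide state, so the locking problem remains.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: run fn(arg) with LC_CTYPE temporarily set to
 * `name`, then restore the previous setting.  Returns 0 on success,
 * -1 if the locale could not be queried, saved, or set (fn is then
 * not called). */
static int with_ctype_locale(const char *name, void (*fn)(void *), void *arg)
{
    const char *cur = setlocale(LC_CTYPE, NULL);  /* query only */
    char *saved;

    if (cur == NULL)
        return -1;
    /* Copy the name: a later setlocale() call may overwrite the string. */
    saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, cur);

    if (setlocale(LC_CTYPE, name) == NULL) {      /* locale not available */
        free(saved);
        return -1;
    }
    fn(arg);
    setlocale(LC_CTYPE, saved);                   /* pop the "stack" */
    free(saved);
    return 0;
}

/* Stand-in callback for illustration: counts how often it was called. */
static void count_call(void *arg)
{
    ++*(int *)arg;
}
```

Every call pays for two setlocale() calls plus an allocation, which is the overhead worried about above; I have not measured it.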

 One thing I just found out is that Windows uses a 2-byte char
 natively (UCS-2?),

Probably.

 I think Mac OS X uses UTF-8 natively. 

... but not for wchar_t (which would be superfluous if sizeof(wchar_t)==1) !

 I think that most Linux tools should work with UTF-8 too, especially since it
 can work as ASCII.

Yes, but working with UTF-8 is by no means synonymous with supporting
a particular (and known) value of LC_CTYPE which happens to use UTF-8 as
its charmap.  Most text-processing tools work with UTF-8 because they
can get away with just churning bytes -- this is not the case for Pd
(which counts characters to move the selection, edit buffers, determine
box widths, and maybe more)...
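
As an illustration of why byte-churning is not enough here, a minimal sketch of counting characters rather than bytes, assuming the input is valid, NUL-terminated UTF-8. The name utf8_count_chars() is made up for this example; it counts code points only and knows nothing about display width or combining characters.

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated string assumed to be valid
 * UTF-8: every byte that is not a continuation byte (10xxxxxx)
 * starts a new character. */
static size_t utf8_count_chars(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For the two-character string "\xe6\x97\xa5\xe6\x9c\xac" (U+65E5 U+672C), strlen() reports 6 while this reports 2, which is the kind of mismatch behind the too-wide boxes.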

 So you think we can have full UTF-8 support without using those
 functions?

In a word: yes.

Specifically, I think we can have full UTF-8 support without using those
functions *as provided by the C99 locale API*.  That amounts to rolling
our own versions of the same and/or similar functionality.  In
particular, the (utf8.c,utf8.h) code by Jeff Bezanson (see
http://www.cprogramming.com/tutorial/unicode.html) has some attractive
utilities for wrapping typical string-processing code (in particular,
u8_inc() and u8_dec() for adapting old byte-string processing code using
i++ and i--, respectively), in addition to wrappers for the usual
locale-style functionality:

 wcswidth() -- (trivial)   (I've written the code)
 mbstowcs() -- u8_toucs()  (I've actually got a version of this too)

Others of Bezanson's utilities (isutf8(), u8_offset(), u8_charnum(),
u8_nextchar()) are also potentially useful for adapting the C side, and
in some cases, I'm not even sure how to wrap them with the C locale
functions without converting the whole UTF-8 string to wchar_t, which I
think we can agree we do not want to do.  Presumably, Bezanson's
(public domain) code is safe for integration with anything, so I'll use
that for now, if no one objects.
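
To show what adapting i++ and i-- style code amounts to, here is a rough sketch of stepping a byte index forward and backward by one character. This is my own illustration under the assumption of valid UTF-8, not Bezanson's code; the names utf8_step_fwd() and utf8_step_back() are hypothetical.

```c
#include <stddef.h>

/* Nonzero if byte c is a UTF-8 continuation byte (10xxxxxx). */
#define IS_CONT(c) (((unsigned char)(c) & 0xC0) == 0x80)

/* Advance byte index *i past one character (replacement for i++).
 * Does nothing if *i already points at the terminating NUL. */
static void utf8_step_fwd(const char *s, size_t *i)
{
    if (s[*i] == '\0')
        return;
    do {
        ++*i;
    } while (IS_CONT(s[*i]));
}

/* Move byte index *i back to the start of the previous character
 * (replacement for i--).  Does nothing if *i is already 0. */
static void utf8_step_back(const char *s, size_t *i)
{
    if (*i == 0)
        return;
    do {
        --*i;
    } while (*i > 0 && IS_CONT(s[*i]));
}
```

With s = "a\xc3\xa9z", stepping forward from 0 visits byte offsets 1, 3 and 4, so the two-byte character is never split.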

 That said, a faster implementation would 

Re: [PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD

2009-02-19 Thread Hans-Christoph Steiner

This is good news!  While the C changes aren't dead simple, they are  
not bad.  I think they could be slightly simplified.  One thing that  
would make it much easier to read the diff is if you create it without  
whitespace changes.  So like this:

svn diff -x -w

As for the Tcl changes, I think we can include those now in Pd-devel,
as long as they work OK with unchanged C code.  Then once the new Tcl
GUI is included we can refactor the C side of things with things like
this.  One other thing: it seems that the ASCII chars are handled
differently than the UTF-8 chars in g_rtext.c.  I think you could use
wcswidth(), mbstowcs() or other UTF-8 functions instead, as described
in the UTF-8 FAQ

http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod

.hc

On Feb 17, 2009, at 5:53 PM, Bryan Jurish wrote:

 morning Hans, morning list,

 So I've tried to get the pd-devel 0.41.4 branch to use UTF-8 across the
 board.  The TK side was easy (as Hans predicted); really just a call to
 {fconfigure} in ::pd_connect::configure_socket.  I also set the output
 encoding to UTF-8 on Tk's stdout and stderr, for debugging purposes;
 it's probably wisest to leave those encodings at the default (user's
 current locale LC_CTYPE) for a release-like version.

 The C side is much hairier.  I think I've got things basically working
 (at least for message boxes and comments), but it has so far required
 changes in:

 FILE: g_editor.c
 + changed handling of Key events as passed to the C side to generate
 UTF-8 symbol-strings rather than single-byte stringlets.

 + currently use sprintf("%C") to get the UTF-8 string for the codepoint
 passed from Tk; a safer (and not too hard) way would be to pass the
 actual UTF-8 string from Tk and just copy that: this would avoid the
 m_pd.c hacks forcing LC_CTYPE=en_US.UTF-8 (see below).  Another option
 would be actually just writing (or borrowing) the code to generate UTF-8
 strings from Unicode codepoints.  It's pretty simple stuff; I've still
 got the guts of it somewhere (only written for latin-1 so far, but the
 principle is the same for all codepoints).

 FILE: m_pd.c
 + added calls to setlocale() to set LC_CTYPE to en_US.UTF-8; this is an
 ugly stinky nasty hack to get sprintf("%C") to output a UTF-8 encoded
 string from a Unicode codepoint int, as called by canvas_key() in
 g_editor.c

 FILE: g_rtext.c
 + added an 'else if' clause in rtext_key() to handle unicode codepoints
 as values of the 'keynum' parameter.  should also be safe for any 8-bit
 fixed-width encoding.

 FILE: pd.tk
 + set system encoding, also output encoding for stdout, stderr to UTF-8

 Attached is a screenshot and a test patch.  UTF-8 input from the
 keyboard works with the test patch, and gets carried through properly to
 the .pd file (and back on load).

 I'd like to get symbol atoms working too (haven't tried yet), but there
 are still some nasty buglets with comments and message boxes, mostly
 that editing any multibyte characters is very tricky: looks like the Tk
 point (cursor) and selection are expressed in characters, and Pd's C
 side is still thinking in bytes, though I'm totally ignorant of where or
 how that can be changed.  A non-critical buglet with the same cause
 (probably) is that the C side is computing the required width for
 message boxes based on byte lengths, not character lengths, so message
 boxes containing multibyte characters look too wide.  I could live with
 that, but the editing thing is a real pain...

 I've attached a diff of my changes against branches/pd-devel/0.41.4/src
 (please excuse commented-out debugging code), in case anyone wants to
 try this stuff out.  Since it's not working, I'm reluctant to check
 anything into the pd-devel/0.41.4 branch yet -- should I branch again
 for a work in progress, or do we just pass diffs around for now?

 marmosets,
   Bryan

 On 2009-02-12 06:24:44, Hans-Christoph Steiner h...@eds.org appears to
 have written:
 On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
 On 2009-02-11 03:04:34, Hans-Christoph Steiner h...@eds.org appears to
 This is something that I would really like to have working properly in
 Pd-devel.  Tcl/Tk is natively UTF-8, so it seems that we should support
 UTF-8 in Pd.  Anyone feel like trying to fix it?

 -- 
 Bryan Jurish   There is *always* one more bug.
 jur...@ling.uni-potsdam.de  -Lubarsky's Law of Cybernetic Entomology

 test-utf8.pd
 test-utf8.png
 Index: m_pd.c
 ===
 --- m_pd.c(revision 10779)
 +++ m_pd.c(working copy)
 @@ -295,6 +295,18 @@
 void glob_init(void);
 void garray_init(void);

 +/*--BEGIN moo--*/
 +#include <locale.h>
 +void locale_init(void) {
 +  setlocale(LC_ALL,"");
 +  setlocale(LC_NUMERIC,"C");
 +  setlocale(LC_CTYPE,"en_US.UTF-8");
 +  /*
 +  printf("moo: locale=%s\n", setlocale(LC_ALL,NULL));
 +  printf("moo: LC_CTYPE=%s\n", setlocale(LC_CTYPE,NULL));
 +  

Re: [PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD

2009-02-19 Thread Hans-Christoph Steiner

On Feb 19, 2009, at 4:13 PM, Bryan Jurish wrote:

 moin Hans, moin list,

 On 2009-02-19 18:43:49, Hans-Christoph Steiner h...@eds.org appears to
 have written:

 This is good news!  While the C changes aren't dead simple, they are not
 bad.  I think they could be slightly simplified.  One thing that would
 make it much easier to read the diff is if you create it without
 whitespace changes.  So like this:

 svn diff -x -w

 oops, sorry... duly noted for future diffs ... I also set my emacs'
 tcl-indent-width to 8 ... sorry sorry sorry ...

 As for the Tcl changes, I think we can include those now in Pd-devel, as
 long as they work OK with unchanged C code.

 Done.

 Then once the new Tcl GUI
 is included we can refactor the C side of things with things like this.

 One other thing: it seems that the ASCII chars are handled differently
 than the UTF-8 chars in g_rtext.c.  I think you could use wcswidth(),
 mbstowcs() or other UTF-8 functions instead, as described in the
 UTF-8 FAQ

 http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod

 Certainly, but (A) we already have the UTF-8 byte string in keysym, and
 we need to append that whole string to the buffer anyways, and (B)
 using wcswidth() & co. requires forcing the locale to have a UTF-8
 LC_CTYPE.  I know I did this in m_pd.c, but I think that was a HACK and
 that using locale functions here is the Wrong Way To Do It, because it's
 dangerous, unportable, and slow (warning: rant follows):

 __dangerous__: setting the locale is global for all threads of a
 process; in forcing the locale, we could conceivably mess with desired
 behavior elsewhere (e.g. in externals).

 __unportable__: we don't even know if all users' machines *have* a UTF-8
 locale installed, and even if they do, we don't know what it's called.
 If we don't force the encoding, we're stuck with either C (e.g. ASCII;
 what we've got now in Pd-vanilla), or whatever the user is currently
 employing (after setlocale(LC_ALL,"")), which makes patches' appearance
 dependent on the user's encoding (e.g. what we've got now in
 Pd-vanilla), and doesn't even work in the case of variable-length
 encodings such as UTF-8.

 __slow__: many locale-based conversion functions are known to be pretty
 darned slow.  if we assume we're always dealing with (valid) UTF-8, we
 can speed things up considerably.  going straight to wchar_t is another
 option, but would require many more changes on the C side, likely break
 the C API, and wouldn't solve the locale-dependency of patches'
 appearances, which I think is a really good argument for UTF-8.

Isn't it pretty safe to assume these days that UTF-8 is supported?   
One thing I just found out is that Windows uses a 2-byte char natively  
(UCS-2?), I think Mac OS X uses UTF-8 natively.  I think that most  
Linux tools should work with UTF-8 too, especially since it can work  
as ASCII.

So you think we can have full UTF-8 support without using those  
functions?

 (rant finished now, sorry)

 That said, a faster implementation would probably result from mixing
 (something like) wcswidth() and strncpy(...,keysym).  Functions like
 wcswidth() and mbstowcs() are pretty easy to cook up if we assume
 wchar_t is UCS-4 and the multibyte encoding is UTF-8.

It seems to me that the wcswidth() would be used for measuring the  
length of the text for display in boxes.  I suppose strlen() could  
still be used for allocating and freeing memory, but I think that we  
should aim for clean code.  If you think the current way in your diff  
is the best, that's fine by me.
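
For reference, the kind of hand-rolled mbstowcs() replacement discussed here (UTF-8 in, UCS-4 out, no locale involved) could be sketched as below. This is a rough illustration that assumes the input is valid UTF-8 and does no error reporting; utf8_to_ucs4() is a hypothetical name, not an existing Pd or libc function.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode at most `max` code points from a NUL-terminated string,
 * assumed to be valid UTF-8, into dst.  Returns the number of code
 * points written.  No locale is consulted and malformed input is
 * not reported. */
static size_t utf8_to_ucs4(uint32_t *dst, size_t max, const char *src)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    while (*p != '\0' && n < max) {
        uint32_t c = *p++;
        int extra = 0;               /* continuation bytes to read */

        if (c >= 0xF0)      { c &= 0x07; extra = 3; }
        else if (c >= 0xE0) { c &= 0x0F; extra = 2; }
        else if (c >= 0xC0) { c &= 0x1F; extra = 1; }

        /* Each continuation byte contributes its low 6 bits. */
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            c = (c << 6) | (uint32_t)(*p++ & 0x3F);

        dst[n++] = c;
    }
    return n;
}
```

The return value is the character count, which is what box-width measurement would want, while strlen() on the original string stays the right tool for buffer sizes.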

 There are a
 number of libraries and code snippets floating about in the net making
 just such assumptions. In this context: are there any licensing
 restrictions on code included in pd-devel?  So far, I've found one
 useful-looking (.c,.h) pair in the public domain, as well as some LGPL
 code from gnulib, which could be linked in statically.  There's also
 code from the Unicode Consortium themselves, but it's pretty monstrous
 (read pedantic) and limited to string-to-string conversions.

Well, Pd-vanilla is BSD licensed, and Pd-extended is GPL'ed.  For this  
stage of Pd-devel, it would be good to keep it to something that can  
be BSD licensed.

.hc



 marmosets,
   Bryan

 On Feb 17, 2009, at 5:53 PM, Bryan Jurish wrote:

 So I've tried to get the pd-devel 0.41.4 branch to use UTF-8 across the
 board.  The TK side was easy (as Hans predicted);
 [snip]
 The C side is much hairier.
 [snip]

 -- 
 Bryan Jurish   There is *always* one more bug.
 jur...@ling.uni-potsdam.de  -Lubarsky's Law of Cybernetic Entomology




Access to computers should be unlimited and total.  - the hacker ethic



___
Pd-list@iem.at mailing list
UNSUBSCRIBE and account-management - 
http://lists.puredata.info/listinfo/pd-list


[PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD

2009-02-17 Thread Bryan Jurish
morning Hans, morning list,

So I've tried to get the pd-devel 0.41.4 branch to use UTF-8 across the
board.  The TK side was easy (as Hans predicted); really just a call to
{fconfigure} in ::pd_connect::configure_socket.  I also set the output
encoding to UTF-8 on Tk's stdout and stderr, for debugging purposes;
it's probably wisest to leave those encodings at the default (user's
current locale LC_CTYPE) for a release-like version.

The C side is much hairier.  I think I've got things basically working
(at least for message boxes and comments), but it has so far required
changes in:

FILE: g_editor.c
+ changed handling of Key events as passed to the C side to generate
UTF-8 symbol-strings rather than single-byte stringlets.

+ currently use sprintf("%C") to get the UTF-8 string for the codepoint
passed from Tk; a safer (and not too hard) way would be to pass the
actual UTF-8 string from Tk and just copy that: this would avoid the
m_pd.c hacks forcing LC_CTYPE=en_US.UTF-8 (see below).  Another option
would be actually just writing (or borrowing) the code to generate UTF-8
strings from Unicode codepoints.  It's pretty simple stuff; I've still
got the guts of it somewhere (only written for latin-1 so far, but the
principle is the same for all codepoints).

FILE: m_pd.c
+ added calls to setlocale() to set LC_CTYPE to en_US.UTF-8; this is an
ugly stinky nasty hack to get sprintf("%C") to output a UTF-8 encoded
string from a Unicode codepoint int, as called by canvas_key() in
g_editor.c

FILE: g_rtext.c
+ added an 'else if' clause in rtext_key() to handle unicode codepoints
as values of the 'keynum' parameter.  should also be safe for any 8-bit
fixed-width encoding.

FILE: pd.tk
+ set system encoding, also output encoding for stdout, stderr to UTF-8
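
For the "other option" mentioned under g_editor.c (generating the UTF-8 string directly from the codepoint, with no locale involved), a minimal sketch might look like the following. The function name ucs4_to_utf8() is hypothetical and this is not the code from the attached diff; it simply follows the standard UTF-8 bit layout for 1 to 4 byte sequences.

```c
#include <stddef.h>
#include <stdint.h>

/* Write code point cp as a NUL-terminated UTF-8 string into buf,
 * which must hold at least 5 bytes.  Returns the number of bytes
 * written (excluding the NUL), or 0 if cp is a surrogate or lies
 * beyond U+10FFFF. */
static size_t ucs4_to_utf8(char *buf, uint32_t cp)
{
    size_t n = 0;

    if (cp < 0x80) {                      /* 0xxxxxxx */
        buf[n++] = (char)cp;
    } else if (cp < 0x800) {              /* 110xxxxx 10xxxxxx */
        buf[n++] = (char)(0xC0 | (cp >> 6));
        buf[n++] = (char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            /* 1110xxxx + 2 continuation */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            buf[0] = '\0';                /* surrogates are not characters */
            return 0;
        }
        buf[n++] = (char)(0xE0 | (cp >> 12));
        buf[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[n++] = (char)(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {          /* 11110xxx + 3 continuation */
        buf[n++] = (char)(0xF0 | (cp >> 18));
        buf[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
        buf[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[n++] = (char)(0x80 | (cp & 0x3F));
    }
    buf[n] = '\0';                        /* n stays 0 if cp was too large */
    return n;
}
```

Something along these lines would make both the sprintf("%C") call and the setlocale() hack unnecessary, since the result no longer depends on LC_CTYPE.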

Attached is a screenshot and a test patch.  UTF-8 input from the
keyboard works with the test patch, and gets carried through properly to
the .pd file (and back on load).

I'd like to get symbol atoms working too (haven't tried yet), but there
are still some nasty buglets with comments and message boxes, mostly
that editing any multibyte characters is very tricky: looks like the Tk
point (cursor) and selection are expressed in characters, and Pd's C
side is still thinking in bytes, though I'm totally ignorant of where or
how that can be changed.  A non-critical buglet with the same cause
(probably) is that the C side is computing the required width for
message boxes based on byte lengths, not character lengths, so message
boxes containing multibyte characters look too wide.  I could live with
that, but the editing thing is a real pain...
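
One way the character/byte mismatch could be bridged is to translate a character index (as Tk reports it) into a byte offset before touching the C-side buffer. A minimal sketch, assuming valid UTF-8 and a hypothetical helper name utf8_byte_offset(); where exactly this would hook into Pd's editing code is an open question:

```c
#include <stddef.h>

/* Return the byte offset of the charnum-th character (0-based) in a
 * NUL-terminated string assumed to be valid UTF-8.  If the string has
 * fewer characters, the offset of the terminating NUL is returned. */
static size_t utf8_byte_offset(const char *s, size_t charnum)
{
    size_t i = 0;

    while (charnum > 0 && s[i] != '\0') {
        ++i;                                        /* lead byte */
        while (((unsigned char)s[i] & 0xC0) == 0x80)
            ++i;                                    /* continuation bytes */
        --charnum;
    }
    return i;
}
```

For "a\xc3\xa9z", character indices 0, 1, 2 and 3 map to byte offsets 0, 1, 3 and 4.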

I've attached a diff of my changes against branches/pd-devel/0.41.4/src
(please excuse commented-out debugging code), in case anyone wants to
try this stuff out.  Since it's not working, I'm reluctant to check
anything into the pd-devel/0.41.4 branch yet -- should I branch again
for a work in progress, or do we just pass diffs around for now?

marmosets,
Bryan

On 2009-02-12 06:24:44, Hans-Christoph Steiner h...@eds.org appears to
have written:
 On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
 On 2009-02-11 03:04:34, Hans-Christoph Steiner h...@eds.org appears to
 This is something that I would really like to have working properly in
 Pd-devel.  Tcl/Tk is natively UTF-8, so it seems that we should support
 UTF-8 in Pd.  Anyone feel like trying to fix it?

-- 
Bryan Jurish   There is *always* one more bug.
jur...@ling.uni-potsdam.de  -Lubarsky's Law of Cybernetic Entomology


test-utf8.pd
Description: application/puredata
inline: test-utf8.png
Index: m_pd.c
===
--- m_pd.c  (revision 10779)
+++ m_pd.c  (working copy)
@@ -295,6 +295,18 @@
 void glob_init(void);
 void garray_init(void);
 
+/*--BEGIN moo--*/
+#include <locale.h>
+void locale_init(void) {
+  setlocale(LC_ALL,"");
+  setlocale(LC_NUMERIC,"C");
+  setlocale(LC_CTYPE,"en_US.UTF-8");
+  /*
+  printf("moo: locale=%s\n", setlocale(LC_ALL,NULL));
+  printf("moo: LC_CTYPE=%s\n", setlocale(LC_CTYPE,NULL));
+  */
+}
+
 void pd_init(void)
 {
 mess_init();
@@ -302,5 +314,5 @@
 conf_init();
 glob_init();
 garray_init();
+locale_init(); /*-- moo --*/
 }
-
Index: g_editor.c
===
--- g_editor.c  (revision 10779)
+++ g_editor.c  (working copy)
@@ -1468,9 +1468,16 @@
 gotkeysym = av[1].a_w.w_symbol;
 else if (av[1].a_type == A_FLOAT)
 {
+   /*-- moo: old
 char buf[3];
-sprintf(buf, "%c", (int)(av[1].a_w.w_float));
+   sprintf(buf, "%c", (int)(av[1].a_w.w_float));
 gotkeysym = gensym(buf);
+   --*/
+char buf[8];
+   sprintf(buf, "%C", (int)(av[1].a_w.w_float));
+   /*printf("moo: charcode %%d=%d, %%c=%c, %%C=%C\n", (int)(av[1].a_w.w_float), (int)(av[1].a_w.w_float), (int)(av[1].a_w.w_float));*/
+   /*printf("moo: buf='%s'\n", buf);*/
+gotkeysym =