Re: [PD-dev] UTF-8 for pd-devel (again)

Bryan Jurish Tue, 19 Jan 2010 13:57:21 -0800

morning all,

attached is a UTF-8 support patch against branches/pd-gui-rewrite/0.43
revision 13051 (HEAD as of an hour or so ago).  most of the bulk is new
files (s_utf8.c, s_utf8.h), most other changes are in g_rtext.c.  It's
not too monstrous, and I've tested it again here briefly with some utf-8
test patches (see other attachment), and things appear to be working as
expected.  if desired, I can check this in; otherwise feel free to do it
for me ;-)
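
for anyone who wants a feel for the bookkeeping before reading the diff:
g_rtext.c keeps byte offsets on the C side and hands character indices
to the GUI, so it converts between the two all the time.  below is a
minimal standalone sketch of that conversion.  the demo_* names are
made up for illustration and are not part of the patch (the patch uses
u8_offset() and u8_charnum() from s_utf8.c), and like the patch it
assumes the input is valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* true if c is NOT a UTF-8 continuation byte (10xxxxxx) */
#define DEMO_ISUTF(c) (((c) & 0xC0) != 0x80)

/* character index -> byte offset
   (hypothetical stand-in for u8_offset(); assumes valid UTF-8) */
static int demo_offset(const char *s, int charnum)
{
    int offs = 0;
    assert(s != NULL);
    while (charnum > 0 && s[offs]) {
        offs++;                                  /* step over the lead byte */
        while (s[offs] && !DEMO_ISUTF(s[offs]))  /* skip continuation bytes */
            offs++;
        charnum--;
    }
    return offs;
}

/* byte offset -> character index
   (hypothetical stand-in for u8_charnum(); assumes valid UTF-8) */
static int demo_charnum(const char *s, int offset)
{
    int charnum = 0, offs = 0;
    assert(s != NULL);
    while (offs < offset && s[offs]) {
        offs++;
        while (s[offs] && !DEMO_ISUTF(s[offs]))
            offs++;
        charnum++;
    }
    return charnum;
}
```

with the 6-byte string "J\xC3\xA4ger" (5 characters, since the a-umlaut
takes two bytes), demo_offset(s, 2) gives 3 and demo_charnum(s, 3)
gives 2.  both walks are O(n) in the length of the string.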


2 annoying things here during testing (I don't see how my patches could
have caused this, but you never know):

(1) all loaded patch windows appear at +0+0 (upper left corner), which
with my wm (windowmaker) means the title bar is off the screen, and I
have to resort to keyboard shortcuts to get them mouse-draggable, which
is a major pain in the wazoo: is this a known bug?

(2) I can't figure out how to get at the properties dialog for number,
number2, or any other gui-atom objects: should these be working already?

marmosets,
        Bryan

On 2010-01-18 23:09:34, Hans-Christoph Steiner <h...@eds.org> appears to
have written:
> 
> Awesome!  If it's big and complicated, I say post it to the list first;
> if it's not too bad, then just commit.
> 
> .hc
> 
> On Jan 18, 2010, at 4:47 AM, Bryan Jurish wrote:
> 
>> moin Hans, moin list,
>>
>> I think perhaps I never actually did post the cleaned-up patch anywhere
>> (bad programmer, no biscuit);  I guess I'll check out
>> branches/pd-gui-rewrite/0.43 and try patching my changes in; then I can
>> either commit or just post the (updated) patch.  Hopefully no major
>> additional changes will be required, so it ought to go pretty fast.
>>
>> marmosets,
>>     Bryan
>>
>> On 2010-01-17 22:57:33, Hans-Christoph Steiner <h...@eds.org> appears to
>> have written:
>>>
>>> Hey Bryan,
>>>
>>> I'd like to try to get your UTF-8 code into pd-gui-rewrite.  You mention
>>> in this posting back in May that you had the whole thing working.  I
>>> couldn't find the diff/patch for this.  Is it posted anywhere?  Do you
>>> want to try to check it in yourself directly to the pd-gui-rewrite/0.43
>>> branch?
>>>
>>> .hc
>>>
>>>
>>> On Mar 20, 2009, at 6:16 PM, Bryan Jurish wrote:
>>>
>>>> morning all,
>>>>
>>>> Of course I never really like to see my code wither away in the bit
>>>> bucket, but I personally don't have any pressing need for UTF-8 symbols,
>>>> comments, etc. in Pd -- I'm a native English speaker, after all ;-)
>>>>
>>>> Also, my changes are by no means the only way to do it (or even the best
>>>> way); we could gain a little speed by slapping on some more buffers
>>>> (mostly and possibly only in rtext_senditup()), but since this seems to
>>>> affect only GUI/editing stuff, I think we can live with a smidgeon of
>>>> additional cpu time ... after all, it's all O(n) anyways.
>>>>
>>>> Really I just wanted to see how easy (or difficult) it would be to get
>>>> Pd to use UTF-8 as its internal encoding... turned out to be harder than
>>>> I had thought, but (ever so slightly) easier than I had feared :-/
>>>>
>>>> marmosets,
>>>>    Bryan
>>>>
>>>> On 2009-03-20 18:39:06, Hans-Christoph Steiner <h...@eds.org> appears to
>>>> have written:
>>>>>
>>>>> I wonder what the best approach is to getting it included.  I also think
>>>>> it's a very valuable contribution.  I think we need to first get the
>>>>> Tcl/Tk-only changes done, since that was the mandate of the pd-devel
>>>>> 0.41 effort.  Then once Miller has accepted those changes, we can
>>>>> start with the C modifications there.  So how to proceed next, I think,
>>>>> depends on how eager you are, Bryan, to get this into a regular build.
>>>>>
>>>>> One option is making a pd-devel-utf8 branch, another is posting these
>>>>> patches to the patch tracker and waiting for Miller to make his next
>>>>> update with the Pd-devel Tcl-Tk code.
>>>>>
>>>>> Maybe we can get Miller to chime in on this topic.
>>>>>
>>>>> .hc
>>>>>
>>>>> On Mar 13, 2009, at 12:00 AM, dmotd wrote:
>>>>>
>>>>>> hey bryan,
>>>>>>
>>>>>> just a quick note of appreciation for getting this one out.. i hope it
>>>>>> gets picked up in miller's build soon.. a very useful and necessary
>>>>>> modification.
>>>>>>
>>>>>> well done!
>>>>>>
>>>>>> dmotd
>>>>>>
>>>>>> On Thursday 12 March 2009 08:07:50 Bryan Jurish wrote:
>>>>>>> moin folks,
>>>>>>>
>>>>>>> I believe I've finally got pd-devel 0.41-4 using UTF-8 across the board.
>>>>>>> So far, I've tested message boxes & comments (g_rtext), as well as
>>>>>>> symbol atoms, and all seems good.  I think we can still expect goofiness
>>>>>>> if someone names an abstraction using a multibyte character when the
>>>>>>> filesystem isn't UTF-8 encoded (raw 8-bit works for me here too), but I
>>>>>>> really don't want to open that particular can of worms.
>>>>>>>
>>>>>>> So I guess I have 2 questions:
>>>>>>>
>>>>>>> (1) what should I call the generic UTF-8 source files? (see my other
>>>>>>> post)
>>>>>>>
>>>>>>> (2) shall I commit these changes to pd-devel/0.41-4, or somewhere else,
>>>>>>> or just post a diff (ca. 33k, ought to be easier to read now; I've tried
>>>>>>> to follow the indentation conventions of the source files I modified)?
>>>>>>>
>>>>>>> marmosets,
>>>>>>>   Bryan
>>>>
>>>> -- 
>>>> Bryan Jurish                       "There is *always* one more bug."
>>>> jur...@ling.uni-potsdam.de  -Lubarsky's Law of Cybernetic Entomology
>>>
>>>
>>>
>>> ----------------------------------------------------------------------------
>>>
>>>
>>>
>>> The arc of history bends towards justice.  - Dr. Martin Luther King, Jr.
>>>
>>>
>>
>> -- 
>> ***************************************************
>>
>> Bryan Jurish
>> Deutsches Textarchiv
>> Berlin-Brandenburgische Akademie der Wissenschaften
>>
>> Jägerstr. 22/23
>> 10117 Berlin
>>
>> Tel.:      +49 (0)30 20370 539
>> E-Mail:    jur...@bbaw.de
>>
>> ***************************************************
>>
> 
> 
> 
> ----------------------------------------------------------------------------
> 
> 
> As we enjoy great advantages from inventions of others, we should be
> glad of an opportunity to serve others by any invention of ours; and
> this we should do freely and generously.         - Benjamin Franklin
> 
> 
> 

-- 
Bryan Jurish                       "There is *always* one more bug."
jur...@uni-potsdam.de       -Lubarsky's Law of Cybernetic Entomology

Index: src/Makefile.am
===================================================================
--- src/Makefile.am     (revision 13051)
+++ src/Makefile.am     (working copy)
@@ -24,6 +24,7 @@
     m_conf.c m_glob.c m_sched.c \
     s_main.c s_inter.c s_file.c s_print.c \
     s_loader.c s_path.c s_entry.c s_audio.c s_midi.c \
+    s_utf8.c \
     d_ugen.c d_ctl.c d_arithmetic.c d_osc.c d_filter.c d_dac.c d_misc.c \
     d_math.c d_fft.c d_array.c d_global.c \
     d_delay.c d_resample.c \
Index: src/g_editor.c
===================================================================
--- src/g_editor.c      (revision 13051)
+++ src/g_editor.c      (working copy)
@@ -9,6 +9,7 @@
 #include "s_stuff.h"
 #include "g_canvas.h"
 #include <string.h>
+#include "s_utf8.h" /*-- moo --*/
 
 void glist_readfrombinbuf(t_glist *x, t_binbuf *b, char *filename,
     int selectem);
@@ -1666,8 +1667,9 @@
         gotkeysym = av[1].a_w.w_symbol;
     else if (av[1].a_type == A_FLOAT)
     {
-        char buf[3];
-        sprintf(buf, "%c", (int)(av[1].a_w.w_float));
+        /*-- moo: assume keynum is a Unicode codepoint; encode as UTF-8 --*/
+        char buf[UTF8_MAXBYTES1];
+        u8_wc_toutf8_nul(buf, (UCS4)(av[1].a_w.w_float));
         gotkeysym = gensym(buf);
     }
     else gotkeysym = gensym("?");
Index: src/s_utf8.c
===================================================================
--- src/s_utf8.c        (revision 0)
+++ src/s_utf8.c        (revision 0)
@@ -0,0 +1,280 @@
+/*
+  Basic UTF-8 manipulation routines
+  by Jeff Bezanson
+  placed in the public domain Fall 2005
+
+  This code is designed to provide the utilities you need to manipulate
+  UTF-8 as an internal string encoding. These functions do not perform the
+  error checking normally needed when handling UTF-8 data, so if you happen
+  to be from the Unicode Consortium you will want to flay me alive.
+  I do this because error checking can be performed at the boundaries (I/O),
+  with these routines reserved for higher performance on data known to be
+  valid.
+
+  modified by Bryan Jurish (moo) March 2009
+  + removed some unneeded functions (escapes, printf etc), added others
+*/
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdarg.h>
+#ifdef WIN32
+#include <malloc.h>
+#else
+#include <alloca.h>
+#endif
+
+#include "s_utf8.h"
+
+static const u_int32_t offsetsFromUTF8[6] = {
+    0x00000000UL, 0x00003080UL, 0x000E2080UL,
+    0x03C82080UL, 0xFA082080UL, 0x82082080UL
+};
+
+static const char trailingBytesForUTF8[256] = {
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
+};
+
+
+/* returns length of next utf-8 sequence */
+int u8_seqlen(char *s)
+{
+    return trailingBytesForUTF8[(unsigned int)(unsigned char)s[0]] + 1;
+}
+
+/* conversions without error checking
+   only works for valid UTF-8, i.e. no 5- or 6-byte sequences
+   srcsz = source size in bytes, or -1 if 0-terminated
+   sz = dest size in # of wide characters
+
+   returns # characters converted
+   dest will always be L'\0'-terminated, even if there isn't enough room
+   for all the characters.
+   if sz = srcsz+1 (i.e. 4*srcsz+4 bytes), there will always be enough space.
+*/
+int u8_toucs(u_int32_t *dest, int sz, char *src, int srcsz)
+{
+    u_int32_t ch;
+    char *src_end = src + srcsz;
+    int nb;
+    int i=0;
+
+    while (i < sz-1) {
+        nb = trailingBytesForUTF8[(unsigned char)*src];
+        if (srcsz == -1) {
+            if (*src == 0)
+                goto done_toucs;
+        }
+        else {
+            if (src + nb >= src_end)
+                goto done_toucs;
+        }
+        ch = 0;
+        switch (nb) {
+            /* these fall through deliberately */
+#if UTF8_SUPPORT_FULL_UCS4
+        case 5: ch += (unsigned char)*src++; ch <<= 6;
+        case 4: ch += (unsigned char)*src++; ch <<= 6;
+#endif
+        case 3: ch += (unsigned char)*src++; ch <<= 6;
+        case 2: ch += (unsigned char)*src++; ch <<= 6;
+        case 1: ch += (unsigned char)*src++; ch <<= 6;
+        case 0: ch += (unsigned char)*src++;
+        }
+        ch -= offsetsFromUTF8[nb];
+        dest[i++] = ch;
+    }
+ done_toucs:
+    dest[i] = 0;
+    return i;
+}
+
+/* srcsz = number of source characters, or -1 if 0-terminated
+   sz = size of dest buffer in bytes
+
+   returns # characters converted
+   dest will only be '\0'-terminated if there is enough space. this is
+   for consistency; imagine there are 2 bytes of space left, but the next
+   character requires 3 bytes. in this case we could NUL-terminate, but in
+   general we can't when there's insufficient space. therefore this function
+   only NUL-terminates if all the characters fit, and there's space for
+   the NUL as well.
+   the destination string will never be bigger than the source string.
+*/
+int u8_toutf8(char *dest, int sz, u_int32_t *src, int srcsz)
+{
+    u_int32_t ch;
+    int i = 0;
+    char *dest_end = dest + sz;
+
+    while (srcsz<0 ? src[i]!=0 : i < srcsz) {
+        ch = src[i];
+        if (ch < 0x80) {
+            if (dest >= dest_end)
+                return i;
+            *dest++ = (char)ch;
+        }
+        else if (ch < 0x800) {
+            if (dest >= dest_end-1)
+                return i;
+            *dest++ = (ch>>6) | 0xC0;
+            *dest++ = (ch & 0x3F) | 0x80;
+        }
+        else if (ch < 0x10000) {
+            if (dest >= dest_end-2)
+                return i;
+            *dest++ = (ch>>12) | 0xE0;
+            *dest++ = ((ch>>6) & 0x3F) | 0x80;
+            *dest++ = (ch & 0x3F) | 0x80;
+        }
+        else if (ch < 0x110000) {
+            if (dest >= dest_end-3)
+                return i;
+            *dest++ = (ch>>18) | 0xF0;
+            *dest++ = ((ch>>12) & 0x3F) | 0x80;
+            *dest++ = ((ch>>6) & 0x3F) | 0x80;
+            *dest++ = (ch & 0x3F) | 0x80;
+        }
+        i++;
+    }
+    if (dest < dest_end)
+        *dest = '\0';
+    return i;
+}
+
+/* moo: get byte length of character number, or 0 if not supported */
+int u8_wc_nbytes(u_int32_t ch)
+{
+  if (ch < 0x80) return 1;
+  if (ch < 0x800) return 2;
+  if (ch < 0x10000) return 3;
+  if (ch < 0x200000) return 4;
+#if UTF8_SUPPORT_FULL_UCS4
+  /*-- moo: support full UCS-4 range? --*/
+  if (ch < 0x4000000) return 5;
+  if (ch < 0x7fffffffUL) return 6;
+#endif
+  return 0; /*-- bad input --*/
+}
+
+int u8_wc_toutf8(char *dest, u_int32_t ch)
+{
+    if (ch < 0x80) {
+        dest[0] = (char)ch;
+        return 1;
+    }
+    if (ch < 0x800) {
+        dest[0] = (ch>>6) | 0xC0;
+        dest[1] = (ch & 0x3F) | 0x80;
+        return 2;
+    }
+    if (ch < 0x10000) {
+        dest[0] = (ch>>12) | 0xE0;
+        dest[1] = ((ch>>6) & 0x3F) | 0x80;
+        dest[2] = (ch & 0x3F) | 0x80;
+        return 3;
+    }
+    if (ch < 0x110000) {
+        dest[0] = (ch>>18) | 0xF0;
+        dest[1] = ((ch>>12) & 0x3F) | 0x80;
+        dest[2] = ((ch>>6) & 0x3F) | 0x80;
+        dest[3] = (ch & 0x3F) | 0x80;
+        return 4;
+    }
+    return 0;
+}
+
+/*-- moo --*/
+int u8_wc_toutf8_nul(char *dest, u_int32_t ch)
+{
+  int sz = u8_wc_toutf8(dest,ch);
+  dest[sz] = '\0';
+  return sz;
+}
+
+/* charnum => byte offset */
+int u8_offset(char *str, int charnum)
+{
+    int offs=0;
+
+    while (charnum > 0 && str[offs]) {
+        (void)(isutf(str[++offs]) || isutf(str[++offs]) ||
+               isutf(str[++offs]) || ++offs);
+        charnum--;
+    }
+    return offs;
+}
+
+/* byte offset => charnum */
+int u8_charnum(char *s, int offset)
+{
+    int charnum = 0, offs=0;
+
+    while (offs < offset && s[offs]) {
+        (void)(isutf(s[++offs]) || isutf(s[++offs]) ||
+               isutf(s[++offs]) || ++offs);
+        charnum++;
+    }
+    return charnum;
+}
+
+/* reads the next utf-8 sequence out of a string, updating an index */
+u_int32_t u8_nextchar(char *s, int *i)
+{
+    u_int32_t ch = 0;
+    int sz = 0;
+
+    do {
+        ch <<= 6;
+        ch += (unsigned char)s[(*i)++];
+        sz++;
+    } while (s[*i] && !isutf(s[*i]));
+    ch -= offsetsFromUTF8[sz-1];
+
+    return ch;
+}
+
+/* number of characters */
+int u8_strlen(char *s)
+{
+    int count = 0;
+    int i = 0;
+
+    while (u8_nextchar(s, &i) != 0)
+        count++;
+
+    return count;
+}
+
+void u8_inc(char *s, int *i)
+{
+    (void)(isutf(s[++(*i)]) || isutf(s[++(*i)]) ||
+           isutf(s[++(*i)]) || ++(*i));
+}
+
+void u8_dec(char *s, int *i)
+{
+    (void)(isutf(s[--(*i)]) || isutf(s[--(*i)]) ||
+           isutf(s[--(*i)]) || --(*i));
+}
+
+/*-- moo --*/
+void u8_inc_ptr(char **sp)
+{
+  (void)(isutf(*(++(*sp))) || isutf(*(++(*sp))) ||
+        isutf(*(++(*sp))) || ++(*sp));
+}
+
+/*-- moo --*/
+void u8_dec_ptr(char **sp)
+{
+  (void)(isutf(*(--(*sp))) || isutf(*(--(*sp))) ||
+        isutf(*(--(*sp))) || --(*sp));
+}
Index: src/g_rtext.c
===================================================================
--- src/g_rtext.c       (revision 13051)
+++ src/g_rtext.c       (working copy)
@@ -13,6 +13,7 @@
 #include "m_pd.h"
 #include "s_stuff.h"
 #include "g_canvas.h"
+#include "s_utf8.h"
 
 
 #define LMARGIN 2
@@ -32,10 +33,10 @@
 
 struct _rtext
 {
-    char *x_buf;
-    int x_bufsize;
-    int x_selstart;
-    int x_selend;
+    char *x_buf;    /*-- raw byte string, assumed UTF-8 encoded (moo) --*/
+    int x_bufsize;  /*-- byte length --*/
+    int x_selstart; /*-- byte offset --*/
+    int x_selend;   /*-- byte offset --*/
     int x_active;
     int x_dragfrom;
     int x_height;
@@ -119,6 +120,15 @@
 
 /* LATER deal with tcl-significant characters */
 
+/* firstone(), lastone()
+ *  + returns byte offset of (first|last) occurrence of 'c' in 's[0..n-1]', or
+ *    -1 if none was found
+ *  + 's' is a raw byte string
+ *  + 'c' is a byte value
+ *  + 'n' is the length (in bytes) of the prefix of 's' to be searched.
+ *  + we could make these functions work on logical characters in utf8 strings,
+ *    but we don't really need to...
+ */
 static int firstone(char *s, int c, int n)
 {
     char *s2 = s + n;
@@ -155,6 +165,16 @@
     of the entire text in pixels.
     */
 
+   /*-- moo: 
+    * + some variables from the original version have been renamed
+    * + variables with a "_b" suffix are raw byte strings, lengths, or offsets
+    * + variables with a "_c" suffix are logical character lengths or offsets
+    *   (assuming valid UTF-8 encoded byte string in x->x_buf)
+    * + a fair amount of O(n) computations required to convert between raw byte
+    *   offsets (needed by the C side) and logical character offsets (needed by
+    *   the GUI)
+    */
+
     /* LATER get this and sys_vgui to work together properly,
         breaking up messages as needed.  As of now, there's
         a limit of 1950 characters, imposed by sys_vgui(). */
@@ -171,14 +191,16 @@
 {
     t_float dispx, dispy;
     char smallbuf[200], *tempbuf;
-    int outchars = 0, nlines = 0, ncolumns = 0,
+    int outchars_b = 0, nlines = 0, ncolumns = 0,
         pixwide, pixhigh, font, fontwidth, fontheight, findx, findy;
     int reportedindex = 0;
     t_canvas *canvas = glist_getcanvas(x->x_glist);
-    int widthspec = x->x_text->te_width;
-    int widthlimit = (widthspec ? widthspec : BOXWIDTH);
-    int inindex = 0;
-    int selstart = 0, selend = 0;
+    int widthspec_c = x->x_text->te_width;
+    int widthlimit_c = (widthspec_c ? widthspec_c : BOXWIDTH);
+    int inindex_b = 0;
+    int inindex_c = 0;
+    int selstart_b = 0, selend_b = 0;
+    int x_bufsize_c = u8_charnum(x->x_buf, x->x_bufsize);
         /* if we're a GOP (the new, "goprect" style) borrow the font size
         from the inside to preserve the spacing */
     if (pd_class(&x->x_text->te_pd) == canvas_class &&
@@ -193,65 +215,76 @@
     if (x->x_bufsize >= 100)
          tempbuf = (char *)t_getbytes(2 * x->x_bufsize + 1);
     else tempbuf = smallbuf;
-    while (x->x_bufsize - inindex > 0)
+    while (x_bufsize_c - inindex_c > 0)
     {
-        int inchars = x->x_bufsize - inindex;
-        int maxindex = (inchars > widthlimit ? widthlimit : inchars);
+        int inchars_b  = x->x_bufsize - inindex_b;
+        int inchars_c  = x_bufsize_c  - inindex_c;
+        int maxindex_c = (inchars_c > widthlimit_c ? widthlimit_c : inchars_c);
+        int maxindex_b = u8_offset(x->x_buf + inindex_b, maxindex_c);
         int eatchar = 1;
-        int foundit = firstone(x->x_buf + inindex, '\n', maxindex);
-        if (foundit < 0)
+        int foundit_b  = firstone(x->x_buf + inindex_b, '\n', maxindex_b);
+        int foundit_c;
+        if (foundit_b < 0)
         {
-            if (inchars > widthlimit)
+            if (inchars_c > widthlimit_c)
             {
-                foundit = lastone(x->x_buf + inindex, ' ', maxindex);
-                if (foundit < 0)
+                foundit_b = lastone(x->x_buf + inindex_b, ' ', maxindex_b);
+                if (foundit_b < 0)
                 {
-                    foundit = maxindex;
+                    foundit_b = maxindex_b;
+                    foundit_c = maxindex_c;
                     eatchar = 0;
                 }
+                else
+                    foundit_c = u8_charnum(x->x_buf + inindex_b, foundit_b);
             }
             else
             {
-                foundit = inchars;
+                foundit_b = inchars_b;
+                foundit_c = inchars_c;
                 eatchar = 0;
             }
         }
+        else
+            foundit_c = u8_charnum(x->x_buf + inindex_b, foundit_b);
+
         if (nlines == findy)
         {
             int actualx = (findx < 0 ? 0 :
-                (findx > foundit ? foundit : findx));
-            *indexp = inindex + actualx;
+                (findx > foundit_c ? foundit_c : findx));
+            *indexp = inindex_b + u8_offset(x->x_buf + inindex_b, actualx);
             reportedindex = 1;
         }
-        strncpy(tempbuf+outchars, x->x_buf + inindex, foundit);
-        if (x->x_selstart >= inindex &&
-            x->x_selstart <= inindex + foundit + eatchar)
-                selstart = x->x_selstart + outchars - inindex;
-        if (x->x_selend >= inindex &&
-            x->x_selend <= inindex + foundit + eatchar)
-                selend = x->x_selend + outchars - inindex;
-        outchars += foundit;
-        inindex += (foundit + eatchar);
-        if (inindex < x->x_bufsize)
-            tempbuf[outchars++] = '\n';
-        if (foundit > ncolumns)
-            ncolumns = foundit;
+        strncpy(tempbuf+outchars_b, x->x_buf + inindex_b, foundit_b);
+        if (x->x_selstart >= inindex_b &&
+            x->x_selstart <= inindex_b + foundit_b + eatchar)
+                selstart_b = x->x_selstart + outchars_b - inindex_b;
+        if (x->x_selend >= inindex_b &&
+            x->x_selend <= inindex_b + foundit_b + eatchar)
+                selend_b = x->x_selend + outchars_b - inindex_b;
+        outchars_b += foundit_b;
+        inindex_b += (foundit_b + eatchar);
+        inindex_c += (foundit_c + eatchar);
+        if (inindex_b < x->x_bufsize)
+            tempbuf[outchars_b++] = '\n';
+        if (foundit_c > ncolumns)
+            ncolumns = foundit_c;
         nlines++;
     }
     if (!reportedindex)
-        *indexp = outchars;
+        *indexp = outchars_b;
     dispx = text_xpix(x->x_text, x->x_glist);
     dispy = text_ypix(x->x_text, x->x_glist);
     if (nlines < 1) nlines = 1;
-    if (!widthspec)
+    if (!widthspec_c)
     {
         while (ncolumns < 3)
         {
-            tempbuf[outchars++] = ' ';
+            tempbuf[outchars_b++] = ' ';
             ncolumns++;
         }
     }
-    else ncolumns = widthspec;
+    else ncolumns = widthspec_c;
     pixwide = ncolumns * fontwidth + (LMARGIN + RMARGIN);
     pixhigh = nlines * fontheight + (TMARGIN + BMARGIN);
 
@@ -259,31 +292,32 @@
         sys_vgui("pdtk_text_new .x%lx.c {%s %s text} %f %f {%.*s} %d %s\n",
             canvas, x->x_tag, rtext_gettype(x)->s_name,
             dispx + LMARGIN, dispy + TMARGIN,
-            outchars, tempbuf, sys_hostfontsize(font),
+            outchars_b, tempbuf, sys_hostfontsize(font),
             (glist_isselected(x->x_glist,
                 &x->x_glist->gl_gobj)? "blue" : "black"));
     else if (action == SEND_UPDATE)
     {
         sys_vgui("pdtk_text_set .x%lx.c %s {%.*s}\n",
-            canvas, x->x_tag, outchars, tempbuf);
+            canvas, x->x_tag, outchars_b, tempbuf);
         if (pixwide != x->x_drawnwidth || pixhigh != x->x_drawnheight) 
             text_drawborder(x->x_text, x->x_glist, x->x_tag,
                 pixwide, pixhigh, 0);
         if (x->x_active)
         {
-            if (selend > selstart)
+            if (selend_b > selstart_b)
             {
                 sys_vgui(".x%lx.c select from %s %d\n", canvas, 
-                    x->x_tag, selstart);
+                    x->x_tag, u8_charnum(x->x_buf, selstart_b));
                 sys_vgui(".x%lx.c select to %s %d\n", canvas, 
-                    x->x_tag, selend + (sys_oldtclversion ? 0 : -1));
+                    x->x_tag, u8_charnum(x->x_buf, selend_b)
+                             + (sys_oldtclversion ? 0 : -1));
                 sys_vgui(".x%lx.c focus \"\"\n", canvas);        
             }
             else
             {
                 sys_vgui(".x%lx.c select clear\n", canvas);
                 sys_vgui(".x%lx.c icursor %s %d\n", canvas, x->x_tag,
-                    selstart);
+                    u8_charnum(x->x_buf, selstart_b));
                 sys_vgui(".x%lx.c focus %s\n", canvas, x->x_tag);        
             }
         }
@@ -448,12 +482,12 @@
                 ....
             } */
             if (x->x_selstart && (x->x_selstart == x->x_selend))
-                x->x_selstart--;
+                u8_dec(x->x_buf, &x->x_selstart);
         }
         else if (n == 127)      /* delete */
         {
             if (x->x_selend < x->x_bufsize && (x->x_selstart == x->x_selend))
-                x->x_selend++;
+                u8_inc(x->x_buf, &x->x_selend);
         }
         
         ndel = x->x_selend - x->x_selstart;
@@ -466,7 +500,13 @@
 /* at Guenter's suggestion, use 'n>31' to test wither a character might
 be printable in whatever 8-bit character set we find ourselves. */
 
-        if (n == '\n' || (n > 31 && n != 127))
+/*-- moo:
+  ... but test with "<" rather than "!=" in order to accommodate unicode
+  codepoints for n (which we get since Tk is sending the "%A" substitution
+  for bind <Key>), effectively reducing the coverage of this clause to 7
+  bits.  Case n>127 is covered by the next clause.
+*/
+        if (n == '\n' || (n > 31 && n < 127))
         {
             newsize = x->x_bufsize+1;
             x->x_buf = resizebytes(x->x_buf, x->x_bufsize, newsize);
@@ -476,20 +516,39 @@
             x->x_bufsize = newsize;
             x->x_selstart = x->x_selstart + 1;
         }
+       /*--moo: check for unicode codepoints beyond 7-bit ASCII --*/
+       else if (n > 127)
+        {
+            int ch_nbytes = u8_wc_nbytes(n);
+            newsize = x->x_bufsize + ch_nbytes;
+            x->x_buf = resizebytes(x->x_buf, x->x_bufsize, newsize);
+            for (i = newsize-1; i >= x->x_selstart + ch_nbytes; i--)
+                x->x_buf[i] = x->x_buf[i-ch_nbytes];
+            x->x_bufsize = newsize;
+            /*-- moo: assume canvas_key() has encoded keysym as UTF-8 */
+            strncpy(x->x_buf+x->x_selstart, keysym->s_name, ch_nbytes);
+            x->x_selstart = x->x_selstart + ch_nbytes;
+        }
         x->x_selend = x->x_selstart;
         x->x_glist->gl_editor->e_textdirty = 1;
     }
     else if (!strcmp(keysym->s_name, "Right"))
     {
         if (x->x_selend == x->x_selstart && x->x_selstart < x->x_bufsize)
-            x->x_selend = x->x_selstart = x->x_selstart + 1;
+        {
+            u8_inc(x->x_buf, &x->x_selstart);
+            x->x_selend = x->x_selstart;
+        }
         else
             x->x_selstart = x->x_selend;
     }
     else if (!strcmp(keysym->s_name, "Left"))
     {
         if (x->x_selend == x->x_selstart && x->x_selstart > 0)
-            x->x_selend = x->x_selstart = x->x_selstart - 1;
+        {
+            u8_dec(x->x_buf, &x->x_selstart);
+            x->x_selend = x->x_selstart;
+        }
         else
             x->x_selend = x->x_selstart;
     }
@@ -497,18 +556,18 @@
     else if (!strcmp(keysym->s_name, "Up"))
     {
         if (x->x_selstart)
-            x->x_selstart--;
+            u8_dec(x->x_buf, &x->x_selstart);
         while (x->x_selstart > 0 && x->x_buf[x->x_selstart] != '\n')
-            x->x_selstart--;
+            u8_dec(x->x_buf, &x->x_selstart);
         x->x_selend = x->x_selstart;
     }
     else if (!strcmp(keysym->s_name, "Down"))
     {
         while (x->x_selend < x->x_bufsize &&
             x->x_buf[x->x_selend] != '\n')
-            x->x_selend++;
+            u8_inc(x->x_buf, &x->x_selend);
         if (x->x_selend < x->x_bufsize)
-            x->x_selend++;
+            u8_inc(x->x_buf, &x->x_selend);
         x->x_selstart = x->x_selend;
     }
     rtext_senditup(x, SEND_UPDATE, &w, &h, &indx);
Index: src/s_utf8.h
===================================================================
--- src/s_utf8.h        (revision 0)
+++ src/s_utf8.h        (revision 0)
@@ -0,0 +1,88 @@
+#ifndef S_UTF8_H
+#define S_UTF8_H
+
+/*--moo--*/
+#ifndef u_int32_t
+# define u_int32_t unsigned int
+#endif
+
+#ifndef UCS4
+# define UCS4 u_int32_t
+#endif
+
+/* UTF8_SUPPORT_FULL_UCS4
+ *  define this to support the full potential range of UCS-4 codepoints
+ *  (in anticipation of a future UTF-8 standard)
+ */
+/*#define UTF8_SUPPORT_FULL_UCS4 1*/
+#undef UTF8_SUPPORT_FULL_UCS4
+
+/* UTF8_MAXBYTES
+ *   maximum number of bytes required to represent a single character in UTF-8
+ *
+ * UTF8_MAXBYTES1 = UTF8_MAXBYTES+1 
+ *  maximum bytes per character including NUL terminator
+ */
+#ifdef UTF8_SUPPORT_FULL_UCS4
+# ifndef UTF8_MAXBYTES
+#  define UTF8_MAXBYTES  6
+# endif
+# ifndef UTF8_MAXBYTES1
+#  define UTF8_MAXBYTES1 7
+# endif
+#else
+# ifndef UTF8_MAXBYTES
+#  define UTF8_MAXBYTES  4
+# endif
+# ifndef UTF8_MAXBYTES1
+#  define UTF8_MAXBYTES1 5
+# endif
+#endif
+/*--/moo--*/
+
+/* is c the start of a utf8 sequence? */
+#define isutf(c) (((c)&0xC0)!=0x80)
+
+/* convert UTF-8 data to wide character */
+int u8_toucs(u_int32_t *dest, int sz, char *src, int srcsz);
+
+/* the opposite conversion */
+int u8_toutf8(char *dest, int sz, u_int32_t *src, int srcsz);
+
+/* moo: get byte length of character number, or 0 if not supported */
+int u8_wc_nbytes(u_int32_t ch);
+
+/* moo: compute required storage for UTF-8 encoding of 's[0..n-1]' */
+int u8_wcs_nbytes(u_int32_t *ucs, int size);
+
+/* single character to UTF-8, no NUL termination */
+int u8_wc_toutf8(char *dest, u_int32_t ch);
+
+/* moo: single character to UTF-8, with NUL termination */
+int u8_wc_toutf8_nul(char *dest, u_int32_t ch);
+
+/* character number to byte offset */
+int u8_offset(char *str, int charnum);
+
+/* byte offset to character number */
+int u8_charnum(char *s, int offset);
+
+/* return next character, updating an index variable */
+u_int32_t u8_nextchar(char *s, int *i);
+
+/* move to next character */
+void u8_inc(char *s, int *i);
+
+/* move to previous character */
+void u8_dec(char *s, int *i);
+
+/* moo: move pointer to next character */
+void u8_inc_ptr(char **sp);
+
+/* moo: move pointer to previous character */
+void u8_dec_ptr(char **sp);
+
+/* returns length of next utf-8 sequence */
+int u8_seqlen(char *s);
+
+#endif /* S_UTF8_H */
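
a note on the n > 127 branch of the key handler in g_rtext.c: inserting
a multibyte character means the tail of the buffer has to move right by
the full encoded length, not by a single byte, before the bytes are
copied in.  the following is only a sketch of that idea outside of Pd.
demo_encode() and demo_insert() are hypothetical names, the buffer is
plain malloc()ed memory rather than something managed by resizebytes(),
and codepoints above U+10FFFF are simply rejected.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* encode one codepoint as UTF-8 into dest (room for 4 bytes required);
   returns the number of bytes written, or 0 if ch is above U+10FFFF.
   hypothetical stand-in for u8_wc_toutf8() */
static int demo_encode(char *dest, unsigned int ch)
{
    if (ch < 0x80) {
        dest[0] = (char)ch;
        return 1;
    }
    if (ch < 0x800) {
        dest[0] = (char)((ch >> 6) | 0xC0);
        dest[1] = (char)((ch & 0x3F) | 0x80);
        return 2;
    }
    if (ch < 0x10000) {
        dest[0] = (char)((ch >> 12) | 0xE0);
        dest[1] = (char)(((ch >> 6) & 0x3F) | 0x80);
        dest[2] = (char)((ch & 0x3F) | 0x80);
        return 3;
    }
    if (ch < 0x110000) {
        dest[0] = (char)((ch >> 18) | 0xF0);
        dest[1] = (char)(((ch >> 12) & 0x3F) | 0x80);
        dest[2] = (char)(((ch >> 6) & 0x3F) | 0x80);
        dest[3] = (char)((ch & 0x3F) | 0x80);
        return 4;
    }
    return 0;
}

/* insert codepoint ch at byte offset pos of the heap buffer *bufp, which
   holds *sizep bytes and is not NUL-terminated.  on success the buffer is
   grown, *sizep is updated, and the number of inserted bytes is returned;
   on failure nothing changes and 0 is returned. */
static int demo_insert(char **bufp, int *sizep, int pos, unsigned int ch)
{
    char enc[4];
    char *buf;
    int nb, newsize, i;

    assert(bufp != NULL && sizep != NULL);
    assert(pos >= 0 && pos <= *sizep);

    nb = demo_encode(enc, ch);
    if (nb == 0)
        return 0;
    newsize = *sizep + nb;
    buf = (char *)realloc(*bufp, (size_t)newsize);
    if (buf == NULL)
        return 0;

    /* open a gap of nb bytes at pos: walk down from the end so that no
       byte is overwritten before it has been moved */
    for (i = newsize - 1; i >= pos + nb; i--)
        buf[i] = buf[i - nb];
    memcpy(buf + pos, enc, (size_t)nb);

    *bufp = buf;
    *sizep = newsize;
    return nb;
}
```

the caller would then advance its cursor by the return value, which is
what the key handler does with x_selstart and ch_nbytes.  memmove()
would do the same job as the explicit loop; the loop is kept here only
because it reads like the code in g_rtext.c.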

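one loose end in the header: s_utf8.h declares u8_wcs_nbytes(), but the
s_utf8.c in this patch carries no definition for it.  nothing in the
patch calls it, so nothing breaks.  if it is wanted later, something
along these lines might do.  this is a guess at the intended semantics
based on the header comment ("compute required storage for UTF-8
encoding of 's[0..n-1]'"); the demo_* names are hypothetical, and the
4-byte limit is set at U+10FFFF to agree with the encoder rather than
with the 0x200000 bound used by u8_wc_nbytes().

```c
#include <assert.h>
#include <stddef.h>

/* bytes needed to encode one codepoint as UTF-8,
   or 0 if ch lies above U+10FFFF (treated here as unsupported) */
static int demo_wc_nbytes(unsigned int ch)
{
    if (ch < 0x80) return 1;
    if (ch < 0x800) return 2;
    if (ch < 0x10000) return 3;
    if (ch < 0x110000) return 4;
    return 0;
}

/* total bytes needed to encode ucs[0..size-1] as UTF-8, not counting a
   NUL terminator.  unsupported codepoints contribute 0 bytes, which is an
   assumption of this sketch, not something the patch specifies. */
static int demo_wcs_nbytes(const unsigned int *ucs, int size)
{
    int i, nbytes = 0;
    assert(ucs != NULL || size <= 0);
    for (i = 0; i < size; i++)
        nbytes += demo_wc_nbytes(ucs[i]);
    return nbytes;
}
```

for the four codepoints U+004A, U+00E4, U+20AC and U+1F600 this adds up
to 1 + 2 + 3 + 4 = 10 bytes, so a caller wanting a C string would
allocate 11.
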
test-utf8.pd
Description: application/puredata

_______________________________________________
Pd-dev mailing list
Pd-dev@iem.at
http://lists.puredata.info/listinfo/pd-dev
