On Apr 11 2007 20:28, Egmont Koblinger wrote:

>I send a reworked version of the patch.
>
>Removed from the first version:
> - any sign of '.' as substitute glyph
> - don't ignore zero-width characters (except for a few zero-width spaces
>   that are ignored in the current kernel too). However, I kept the code
>   organized and commented so that someone can have the other behavior very
>   easily (by removing a pair of comment signs).
>
>Kept features, fixes:
> - lots of UTF-8 decoder fixes. Emit one U+FFFD for every standalone
>   continuation byte and for every incomplete sequence, as Markus Kuhn
>   recommends. Reject overlong sequences too.
> - D800..DFFF and FFFE..FFFF are substituted by FFFD too, since these are
>   not valid Unicode code points.
> - no "random" replacement glyph (e.g. u with double acute instead of
>   u with circumflex) in UTF-8 mode
> - if U+FFFD is not found in the font, the fallback replacement '?' (ascii
>   question mark) is printed with inverse color attributes
> - U+200A was ignored so far as a zero-width space character. I think it
>   was a mistake, it's not zero-width.
> - print an extra space for double-wide characters for the cursor to stand
>   at the right place. Yet again the code is organized so that it is very
>   easy to change to jump only one character cell, should someone prefer
>   that behavior (which I still see no good reason to).
>
>Signed-off-by: Egmont Koblinger <[EMAIL PROTECTED]>
>
>@@ -1934,6 +1943,99 @@
> char con_buf[CON_BUF_SIZE];
> DECLARE_MUTEX(con_buf_sem);
>
>+/* is_{zero,double}_width() are based on the wcwidth() implementation by
>+ * Markus Kuhn -- 2003-05-20 (Unicode 4.0)
>+ * Latest version: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
>+ */
>+struct interval {
>+	int first;
>+	int last;
>+};

CodingStyle? uint16_t instead of int?

>+static int is_zero_width(long ucs)
>+{
>+	static const struct interval zero_width[] = {
>+		{ 0x0300, 0x0357 }, { 0x035D, 0x036F }, { 0x0483, 0x0486 },
[...]
>+		{ 0xFB1E, 0xFB1E }, { 0xFE00, 0xFE0F }, { 0xFE20, 0xFE23 },
>+		{ 0xFEFF, 0xFEFF }, { 0xFFF9, 0xFFFB }, { 0x1D167, 0x1D169 },
>+		{ 0x1D173, 0x1D182 }, { 0x1D185, 0x1D18B }, { 0x1D1AA, 0x1D1AD },
>+		{ 0xE0001, 0xE0001 }, { 0xE0020, 0xE007F }, { 0xE0100, 0xE01EF }
>+	};

Since Unicode above 0xFFFF is unsupported, could not these entries be killed?

>+static int is_double_width(long ucs)
>+{
>+	static const struct interval double_width[] = {
>+		{ 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
>+		{ 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
>+		{ 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 }, { 0xFFE0, 0xFFE6 },
>+		{ 0x20000, 0x2FFFD }, { 0x30000, 0x3FFFD }
>+	};

Similarly.

>@@ -1950,6 +2052,10 @@
> 	unsigned int currcons;
> 	unsigned long draw_from = 0, draw_to = 0;
> 	struct vc_data *vc;
>+	unsigned char vc_attr;
>+	int rescan;

unsigned int rescan:1;

>+	int inverse;

unsigned int inverse:1;

>+	int width;

unsigned int width; or even uint8_t.

> 	u16 himask, charmask;
> 	const unsigned char *orig_buf = NULL;
> 	int orig_count;
>@@ -2012,51 +2118,81 @@
> 		buf++;
> 		n++;
> 		count--;
>+		rescan = 0;
>+		inverse = 0;
>+		width = 1;
>
> 		/* Do no translation at all in control states */
> 		if (vc->vc_state != ESnormal) {
> 			tc = c;
> 		} else if (vc->vc_utf && !vc->vc_disp_ctrl) {
>-			/* Combine UTF-8 into Unicode */
>-			/* Malformed sequences as sequences of replacement glyphs */
>+			/* Combine UTF-8 into Unicode in vc_utf_char */
>+			/* vc_utf_count is the number of continuation bytes still expected to arrive */
>+			/* vc_npar is the number of continuation bytes arrived so far */
> rescan_last_byte:
>-			if(c > 0x7f) {
>+			if ((c & 0xc0) == 0x80) {
>+				/* Continuation byte received */
>+				static const int utf8_length_changes[] = { 0x0000007f, 0x000007ff, 0x0000ffff, 0x001fffff, 0x03ffffff, 0x7fffffff };

I would not mind unsigned.


Jan
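
For illustration, the two comments on the interval tables (16-bit fields, and dropping the entries above 0xFFFF) depend on each other: the bounds fit into uint16_t only once the non-BMP ranges are gone. A minimal userspace sketch of how that could look follows. It is not part of the patch; the names struct interval16, in_table16() and is_double_width_bmp() are made up for this example, the lookup is assumed to be a plain binary search, and the table holds only the BMP ranges quoted from is_double_width() above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-bit variant of the patch's struct interval. */
struct interval16 {
	uint16_t first;
	uint16_t last;
};

/* Binary search in a sorted table of non-overlapping intervals. */
static int in_table16(uint32_t ucs, const struct interval16 *table, size_t n)
{
	size_t lo = 0, hi = n;

	/* Nothing above the BMP can be in a 16-bit table. */
	if (ucs > 0xFFFF)
		return 0;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ucs > table[mid].last)
			lo = mid + 1;
		else if (ucs < table[mid].first)
			hi = mid;
		else
			return 1;
	}
	return 0;
}

/* BMP-only ranges copied from the quoted double_width[] table. */
static int is_double_width_bmp(uint32_t ucs)
{
	static const struct interval16 double_width[] = {
		{ 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
		{ 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
		{ 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 }, { 0xFFE0, 0xFFE6 }
	};

	return in_table16(ucs, double_width,
			  sizeof(double_width) / sizeof(double_width[0]));
}
```

With this layout each table entry takes 4 bytes instead of 8 (assuming a 32-bit int), at the cost of treating every code point above 0xFFFF as single-width.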