subject:"\[PATCH\] console UTF\-8 fixes"

Re: [PATCH] console UTF-8 fixes

2007-06-19 Thread Bodo Eggert

On Tue, 19 Jun 2007, Egmont Koblinger wrote:
> On Tue, Jun 19, 2007 at 03:54:52PM +0200, Bodo Eggert wrote:
> 
> > Does the FLUSH DTRT by design, or does it just shrink and hide the original
> > race?



> But you may be right: yes, it might be a bug (or misfeature) in the FB code,
> too. Could you please look after it? I don't think I'd be able to do it and
> have time for this.

I have neither the insight nor the time to do that. I just read the bug 
description and drew my conclusions.
-- 
Never stand when you can sit, never sit when you can lie down, never stay
awake when you can sleep.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-06-19 Thread Egmont Koblinger

On Tue, Jun 19, 2007 at 03:54:52PM +0200, Bodo Eggert wrote:

> Does the FLUSH DTRT by design, or does it just shrink and hide the original
> race?

I haven't deeply studied this aspect of the source, don't know what
_exactly_ this FLUSH does, but of course I have an inner feeling about this.
Kind of this macro does the _real_ print on the screen. It actually calls
vc->vc_sw->con_putcs, whereas I guess vc_sw refers to the particular
implementation (vga, fb) and con_putcs might be doing the real job.

Without these two new FLUSHes the behavior was very funny: applications
(e.g. text editors) highlighed incorrect cells, but either switching between
VTs, or gpming over these cells corrected the display. So the logical
content behind the scenes (reflected by /dev/vcsa*) was maintained
correctly, the bug is somewhere when the content is first displayed.

I saw that this FLUSH was already used 3 times inside do_con_write(), I
placed it here 2 more times (when changing color attr to inverse and when
reverting to normal). I think it cannot hurt, and apparently it fixes the
problem.

But you may be right: yes, it might be a bug (or misfeature) in the FB code,
too. Could you please look after it? I don't think I'd be able to do it and
have time for this.

> And why wasn't VGA affected, too?

I guess tt must be some difference between the actual VGA and FB code that
is called from here. I don't know whether it's a bug in FB or just some
difference.

Perhaps it's that the FB driver's con_putcs() call uses vc->vc_attr (the
current attr for the console) for each char cell to be printed, instead of
the own attr values of the cells themselves. Just an idea...

-- 
Egmont
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-06-19 Thread Bodo Eggert

Egmont Koblinger <[EMAIL PROTECTED]> wrote:

> 2. My patch introduced "question mark with inverted color attributes" as a
>last resort fallback glyph. Though it perfectly works on VGA console, on
>framebuffer you may end up with question marks that are highlighed but
>shouldn't be, and normal characters that are accidentally highlighed.
>This is caused by missing FLUSHes when changing the color attribute.

Does the FLUSH DTRT by design, or does it just shrink and hide the original
race? And why wasn't VGA affected, too?
-- 
Backups? We doan NEED no steenking baX%^~,VbKx NO CARRIER 

Friß, Spammer: [EMAIL PROTECTED]
 [EMAIL PROTECTED] [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-06-19 Thread Egmont Koblinger

Hi folks,

Recently my console UTF-8 patch went mainline. Here I send a very small
additinal patch that fixes two nasty issues and improves a third one,
namely:

1. My patch changed the behavior if a glyph is not found in the Unicode
   mapping table. Previously for Unicode values less than 256 or 512 the
   kernel tried to display the glyph from that position of the glyph table,
   which could lead to a different accented letter being displayed. I
   removed this fallback possibility and changed it to display the
   replacement symbol.

   As Behdad pointed out, some fonts (e.g. sun12x22 from the kbd package)
   lack Unicode mapping information, hence all you get is lots of question
   marks. Though theoretically it's actually a user-space bug (the font
   should be fixed), Behdad and I both believe that it'd be good to work
   around in the kernel by re-introducing the fallback solution for ASCII
   characters only. This sounds a quite reasonable decision, since all fonts
   ship the ASCII characters in the first 128 positions. This way users
   won't be surprised by lots of question marks just because s/he issued a
   not-so-perfectly parameterized setfont command. As this fallback is only
   re-introduced for code points below 128, you still won't see an accented
   letter replaced by another, but at least you'll always get the English
   letters right.

2. My patch introduced "question mark with inverted color attributes" as a
   last resort fallback glyph. Though it perfectly works on VGA console, on
   framebuffer you may end up with question marks that are highlighed but
   shouldn't be, and normal characters that are accidentally highlighed.
   This is caused by missing FLUSHes when changing the color attribute.

3. I've updated the table of double-width character based on Markus's
   updated version. Only ten new code poings (one interval) is added.


Please review and commit this patch... I'm afraid it might be too late, but
it would be nice to see it in 2.6.22 -- as this will be the first stable
release that introduces my work, and the first two issues addressed now are
regressions since 2.6.21.

Thanks,

Signed-off-by: Egmont Koblinger <[EMAIL PROTECTED]>

diff -Naur linux-2.6.22-rc5.orig/drivers/char/vt.c 
linux-2.6.22-rc5/drivers/char/vt.c
--- linux-2.6.22-rc5.orig/drivers/char/vt.c 2007-06-17 04:09:12.0 
+0200
+++ linux-2.6.22-rc5/drivers/char/vt.c  2007-06-19 13:49:55.0 +0200
@@ -1956,7 +1956,7 @@
 DEFINE_MUTEX(con_buf_mtx);
 
 /* is_double_width() is based on the wcwidth() implementation by
- * Markus Kuhn -- 2003-05-20 (Unicode 4.0)
+ * Markus Kuhn -- 2007-05-26 (Unicode 5.0)
  * Latest version: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
  */
 struct interval {
@@ -1988,8 +1988,8 @@
static const struct interval double_width[] = {
{ 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
{ 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
-   { 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 }, { 0xFFE0, 0xFFE6 },
-   { 0x2, 0x2FFFD }, { 0x3, 0x3FFFD }
+   { 0xFE10, 0xFE19 }, { 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 },
+   { 0xFFE0, 0xFFE6 }, { 0x2, 0x2FFFD }, { 0x3, 0x3FFFD }
};
return bisearch(ucs, double_width,
sizeof(double_width) / sizeof(*double_width) - 1);
@@ -2187,9 +2187,12 @@
continue; /* nothing to display */
}
/* Glyph not found */
-   if (!(vc->vc_utf && !vc->vc_disp_ctrl) && !(c & 
~charmask)) {
+   if ((!(vc->vc_utf && !vc->vc_disp_ctrl) || c < 
128) && !(c & ~charmask)) {
/* In legacy mode use the glyph we get by a 
1:1 mapping.
-  This would make absolutely no sense with 
Unicode in mind. */
+  This would make absolutely no sense with 
Unicode in mind,
+  but do this for ASCII characters since a 
font may lack
+  Unicode mapping info and we don't want 
to end up with
+  having question marks only. */
tc = c;
} else {
/* Display U+FFFD. If it's not found, 
display an inverse question mark. */
@@ -2213,6 +2216,7 @@
} else {
vc_attr = ((vc->vc_attr) & 0x88) | 
(((vc->vc_attr) & 0x70) >> 4) | (((vc->vc_attr) & 0x07) << 4);
}
+   FLUSH
}
 
while (1) {
@@ -2246,6 +2250,10 @@
if (tc < 0) tc = ' ';
}
 
+   if (in

[PATCH] console UTF-8 fixes

2007-04-17 Thread Egmont Koblinger

Hi Andrew,

I've been told to send this patch to you for inclusion in your patchset, and
hopefully sooner or later in the mainline kernel too. It has been discussed
on lkml and finally found to be OK by HPA and Jan.

The UTF-8 part of the vt driver suffers from the following issues which are
addressed in my patch:

1) If there's no glyph found for a particular valid UTF-8 character, we try
   to display U+FFFD. However if this one is not found either, here's what
   the current kernel does:

   - First, if the Unicode value is less than the number of glyphs, use the
 glyph directly from that position of the glyph table. While it may be a
 good idea in the 8-bit world, it has absolutely no sense with Unicode
 in mind. For example, if a Latin-2 font is loaded and an application
 prints U+00FB ("u with circumflex", not present in Latin-2) then as a
 fallback solution the glyph from the 0xFB position of the Latin-2
 fontset (which is an "u with double accent" - a different character) is
 displayed.

   - Second, if this fallback fails too, a simple ASCII question mark is
 printed, which is visually undistinguishable from a real question mark.

   I changed the code to skip the first step (except if in non-UTF-8 mode),
   and changed the second step to print the question mark with inverse color
   attributes, so it is visually clear that it's not a real question mark,
   and resembles more to the common glyph of U+FFFD.

2) The UTF-8 decoder is buggy in many ways:

   - Lone continuation bytes (section 3.1 of Markus Kuhn's UTF-8 stress
 test) are not caught, they are displayed as some "random" (taken
 directly form the font table, see above) glyphs instead the replacement
 character.

   - Incomplete sequences (sections 3.2 and 3.3 of the stress test) emit no
 replacement character, but rather cause the subsequent valid character
 to be displayed more times(!).

   - The decoder is not safe: overlong sequences are not caught currently,
 they are displayed as if these were valid representations. This may
 even have security impacts.

   - The decoder does not handle D800..DFFF and FFFE.. specially, it
 just emits these code points and lets it be looked up in the glyph
 table. Since these are invalid code points, I replace them by U+FFFD
 and hence give no chance for them to be looked up in the glyph table. 
 (Assuming no font ships glyphs for these code points, this change is
 not visible to the users since the glyph shown will be the same.)

   With my fixes to the decoder it now behaves exactly as Markus Kuhn's
   stress test recommends.

3) It has no concept of double-width (CJK) characters. It's way beyond the
   scope of my patch to try to display them, but at least I think it's
   important for the cursor to jump two positions when printing such
   characters, since this is what applications (such as text editors)
   expect. Currently the cursor only jumps one position, and hence
   applications suffer from displaying and refreshing problems, and editing
   some English letters that are preceded by some CJK characters in the same
   line is a nightmare. With my patch an additional space is inserted after
   the CJK character has been printed (which usually means a replacement
   symbol of course). (If U+FFFD isn't availble and hence an inverse
   question mark is displayed in the first cell, I keep the inverted state
   for the space in the 2nd column so it's quite easy to see that they are
   tied together.)

4) There is a small built-in table of zero-width spaces that are not to be
   printed but silently skipped. U+200A is included there, but it's not a
   zero-width character, so I remove it from there.


The patch is exactly the same as I've sent to the list the last time
(labelled as "version 4").


Signed-off-by: Egmont Koblinger <[EMAIL PROTECTED]>

diff -Naur linux-2.6.20.orig/drivers/char/consolemap.c 
linux-2.6.20/drivers/char/consolemap.c
--- linux-2.6.20.orig/drivers/char/consolemap.c 2007-02-04 19:44:54.0 
+0100
+++ linux-2.6.20/drivers/char/consolemap.c  2007-04-11 18:45:45.0 
+0200
@@ -626,10 +626,10 @@
   
/* Only 16-bit codes supported at this time */
if (ucs > 0x)
-   ucs = 0xfffd;   /* U+FFFD: REPLACEMENT CHARACTER */
-   else if (ucs < 0x20 || ucs >= 0xfffe)
+   return -4;  /* Not found */
+   else if (ucs < 0x20)
return -1;  /* Not a printable character */
-   else if (ucs == 0xfeff || (ucs >= 0x200a && ucs <= 0x200f))
+   else if (ucs == 0xfeff || (ucs >= 0x200b && ucs <= 0x200f))
return -2;  /* Zero-width space */
/*
 * UNI_DIRECT_BASE indicates the start of the region in the User Zone
diff -Naur linux-2.6.20.orig/drivers/char/vt.c linux-2.6.20/drivers/char/vt.c
--- linux-2.6.20.orig/drivers/char/vt.c 2007-02-04 19:44:54.00

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Jan Engelhardt


On Apr 12 2007 18:55, Egmont Koblinger wrote:
>
>> >> I've been thinking on it and I'm not sure which one the right
>> >> way is. The reason for choosing this was probably that this way
>> >> information that is not used by the code can be omitted by the
>> >> compiler.
>> >
>> > Then let's leave it out of the source.
>> 
>> I however will put this into my rpm if it works/isstable.
>
>We've arrived at another coding policy :)
>
>There are two possible behaviors, each have pros and cons. HPA
>prefers one, while Jan and me would prefer the other.

I don't really care how it's done - however it should at least not
hurt in the eyes (<->CodingStyle). I just welcome this
functionality/bugfix, because I find myself editing text with
undisplayable characters more often than not. Should be ok now.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Roman Zippel

Hi,

On Thu, 12 Apr 2007, Egmont Koblinger wrote:

> On Thu, Apr 12, 2007 at 05:52:49PM +0200, Roman Zippel wrote:
> 
> > Well, it often doesn't end there, other users may report these as bugs and 
> > want to get them fixed, so we have to look ahead a little for possible 
> > problems.
> 
> They may even report that the current behavior of not knowing anything about
> double-width characters is wrong. If Unicode is updated and someone reports
> the kernel code is outdated, it's going to be fairly easy to update that one
> as well.

As far as Unicode is concerned we could go really wild here, e.g. how do 
we handle decomposed characters, there are also halfwidth characters...
So the primary question is still: is this a kernel problem? The kernel 
only implements a simple fixed char width output device. Does it really 
make sense to pretend being able to handle characters, we can't display 
in first place?

> > It has everything to do with this. It describes the terminal capabilities, 
> > so why should variable width support not one part of it?
> 
> Nice idea. Terminfo database should tell whether a particular terminal
> supports double-width characters. Applications should expect the cursor to
> move one or two cells depending on this. So both terminfo databases as well
> as plenty of libraries and applications would need update.

If everyone reimplements this kind of stuff, something is seriously 
wrong...

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread H. Peter Anvin


Egmont Koblinger wrote:

On Thu, Apr 12, 2007 at 10:35:24AM -0700, H. Peter Anvin wrote:


Yes, I didn't realize at the time that that was dead code.


:-

Version 4 of the patch follows. Dead code omitted from version 3.


Looks good to me.

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Egmont Koblinger

On Thu, Apr 12, 2007 at 10:35:24AM -0700, H. Peter Anvin wrote:

> Yes, I didn't realize at the time that that was dead code.

:-

Version 4 of the patch follows. Dead code omitted from version 3.


Signed-off-by: Egmont Koblinger <[EMAIL PROTECTED]>

diff -Naur linux-2.6.20.orig/drivers/char/consolemap.c 
linux-2.6.20/drivers/char/consolemap.c
--- linux-2.6.20.orig/drivers/char/consolemap.c 2007-02-04 19:44:54.0 
+0100
+++ linux-2.6.20/drivers/char/consolemap.c  2007-04-11 18:45:45.0 
+0200
@@ -626,10 +626,10 @@
   
/* Only 16-bit codes supported at this time */
if (ucs > 0x)
-   ucs = 0xfffd;   /* U+FFFD: REPLACEMENT CHARACTER */
-   else if (ucs < 0x20 || ucs >= 0xfffe)
+   return -4;  /* Not found */
+   else if (ucs < 0x20)
return -1;  /* Not a printable character */
-   else if (ucs == 0xfeff || (ucs >= 0x200a && ucs <= 0x200f))
+   else if (ucs == 0xfeff || (ucs >= 0x200b && ucs <= 0x200f))
return -2;  /* Zero-width space */
/*
 * UNI_DIRECT_BASE indicates the start of the region in the User Zone
diff -Naur linux-2.6.20.orig/drivers/char/vt.c linux-2.6.20/drivers/char/vt.c
--- linux-2.6.20.orig/drivers/char/vt.c 2007-02-04 19:44:54.0 +0100
+++ linux-2.6.20/drivers/char/vt.c  2007-04-12 19:38:51.0 +0200
@@ -1934,6 +1934,46 @@
 char con_buf[CON_BUF_SIZE];
 DECLARE_MUTEX(con_buf_sem);
 
+/* is_double_width() is based on the wcwidth() implementation by
+ * Markus Kuhn -- 2003-05-20 (Unicode 4.0)
+ * Latest version: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
+ */
+struct interval {
+   uint32_t first;
+   uint32_t last;
+};
+
+static int bisearch(uint32_t ucs, const struct interval *table, int max)
+{
+   int min = 0;
+   int mid;
+
+   if (ucs < table[0].first || ucs > table[max].last)
+   return 0;
+   while (max >= min) {
+   mid = (min + max) / 2;
+   if (ucs > table[mid].last)
+   min = mid + 1;
+   else if (ucs < table[mid].first)
+   max = mid - 1;
+   else
+   return 1;
+   }
+   return 0;
+}
+
+static int is_double_width(uint32_t ucs)
+{
+   static const struct interval double_width[] = {
+   { 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
+   { 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
+   { 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 }, { 0xFFE0, 0xFFE6 },
+   { 0x2, 0x2FFFD }, { 0x3, 0x3FFFD }
+   };
+   return bisearch(ucs, double_width,
+   sizeof(double_width) / sizeof(*double_width) - 1);
+}
+
 /* acquires console_sem */
 static int do_con_write(struct tty_struct *tty, const unsigned char *buf, int 
count)
 {
@@ -1950,6 +1990,10 @@
unsigned int currcons;
unsigned long draw_from = 0, draw_to = 0;
struct vc_data *vc;
+   unsigned char vc_attr;
+   uint8_t rescan;
+   uint8_t inverse;
+   uint8_t width;
u16 himask, charmask;
const unsigned char *orig_buf = NULL;
int orig_count;
@@ -2012,51 +2056,81 @@
buf++;
n++;
count--;
+   rescan = 0;
+   inverse = 0;
+   width = 1;
 
/* Do no translation at all in control states */
if (vc->vc_state != ESnormal) {
tc = c;
} else if (vc->vc_utf && !vc->vc_disp_ctrl) {
-   /* Combine UTF-8 into Unicode */
-   /* Malformed sequences as sequences of replacement glyphs */
+   /* Combine UTF-8 into Unicode in vc_utf_char */
+   /* vc_utf_count is the number of continuation bytes still 
expected to arrive */
+   /* vc_npar is the number of continuation bytes arrived so 
far */
 rescan_last_byte:
-   if(c > 0x7f) {
+   if ((c & 0xc0) == 0x80) {
+   /* Continuation byte received */
+   static const uint32_t utf8_length_changes[] = { 
0x007f, 0x07ff, 0x, 0x001f, 0x03ff, 0x7fff };
if (vc->vc_utf_count) {
-  if ((c & 0xc0) == 0x80) {
-  vc->vc_utf_char = (vc->vc_utf_char << 6) 
| (c & 0x3f);
-  if (--vc->vc_utf_count) {
-  vc->vc_npar++;
-  continue;
-  }
-  tc = c = vc->vc_utf_char;
-  } else
-  goto replacement_glyph;
-   } else {
-

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread H. Peter Anvin


Egmont Koblinger wrote:

On Thu, Apr 12, 2007 at 09:58:38AM -0700, H. Peter Anvin wrote:

Not leaving dead code in the kernel is long-standing policy; it's 
nothing new.  We constantly remove #if 0'd code that the authors have 
left in.


I see. However, you wrote it recently:


Besides, would it not make more sense to have a single table with the
width information, if you insist on having one, instead of multiple ones?


If I rework the code according to this then the zero-width information will
no longer be dead code, right? The unchanged kernel would be slightly larger
and would run slightly slower, but it would be much easier to patch it to
reach the other possible behavior regarding combining characters.



Yes, I didn't realize at the time that that was dead code.

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Egmont Koblinger

On Thu, Apr 12, 2007 at 09:58:38AM -0700, H. Peter Anvin wrote:

> Not leaving dead code in the kernel is long-standing policy; it's 
> nothing new.  We constantly remove #if 0'd code that the authors have 
> left in.

I see. However, you wrote it recently:

> Besides, would it not make more sense to have a single table with the
> width information, if you insist on having one, instead of multiple ones?

If I rework the code according to this then the zero-width information will
no longer be dead code, right? The unchanged kernel would be slightly larger
and would run slightly slower, but it would be much easier to patch it to
reach the other possible behavior regarding combining characters.

-- 
Egmont
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread H. Peter Anvin


Egmont Koblinger wrote:


We've arrived at another coding policy :)

There are two possible behaviors, each have pros and cons. HPA prefers one,
while Jan and me would prefer the other. The difference is one function that
contains a large table and an invocation of that function in a small if
branch.

In my latest version of the patch I've put the function itself inside
comments too. So it's not the real compiler, it's the preprocessor that
omits this code for the default behavior. HPA, I hope you don't mind if the
other reasonable behavior is there in the source, within comments. I think
let's make the job easier for those who have a different opinion. Or is it a
completely stupid idea?



Not leaving dead code in the kernel is long-standing policy; it's 
nothing new.  We constantly remove #if 0'd code that the authors have 
left in.


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Egmont Koblinger

On Thu, Apr 12, 2007 at 06:41:22PM +0200, Jan Engelhardt wrote:

> >> I've been thinking on it and I'm not sure which one the right way is. The
> >> reason for choosing this was probably that this way information that is
> >> not
> >> used by the code can be omitted by the compiler.
> >
> > Then let's leave it out of the source.
> 
> I however will put this into my rpm if it works/isstable.

We've arrived at another coding policy :)

There are two possible behaviors, each have pros and cons. HPA prefers one,
while Jan and me would prefer the other. The difference is one function that
contains a large table and an invocation of that function in a small if
branch.

In my latest version of the patch I've put the function itself inside
comments too. So it's not the real compiler, it's the preprocessor that
omits this code for the default behavior. HPA, I hope you don't mind if the
other reasonable behavior is there in the source, within comments. I think
let's make the job easier for those who have a different opinion. Or is it a
completely stupid idea?

-- 
Egmont
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Jan Engelhardt


On Apr 12 2007 08:36, H. Peter Anvin wrote:
> Egmont Koblinger wrote:
>
>> > Besides, would it not make more sense to have a single table with the
>> > width information, if you insist on having one, instead of multiple
>> > ones?
>> 
>> I've been thinking on it and I'm not sure which one the right way is. The
>> reason for choosing this was probably that this way information that is
>> not
>> used by the code can be omitted by the compiler.
>
> Then let's leave it out of the source.

I however will put this into my rpm if it works/isstable.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Egmont Koblinger

On Thu, Apr 12, 2007 at 05:52:49PM +0200, Roman Zippel wrote:

> Well, it often doesn't end there, other users may report these as bugs and 
> want to get them fixed, so we have to look ahead a little for possible 
> problems.

They may even report that the current behavior of not knowing anything about
double-width characters is wrong. If Unicode is updated and someone reports
the kernel code is outdated, it's going to be fairly easy to update that one
as well.

I still can't see why my proposed version is by any means worse than the
current one. (I can see however that my version is still not perfect.)

> It has everything to do with this. It describes the terminal capabilities, 
> so why should variable width support not one part of it?

Nice idea. Terminfo database should tell whether a particular terminal
supports double-width characters. Applications should expect the cursor to
move one or two cells depending on this. So both terminfo databases as well
as plenty of libraries and applications would need update. I'll ask Thomas
E. Dickey for his opinion.

-- 
Egmont
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Roman Zippel

Hi,

On Thu, 12 Apr 2007, Egmont Koblinger wrote:

> > Considering this possible volatility I'm not certain we really need this 
> > in the kernel.
> > The other point is that I have problems imagining, that this should be 
> > enough to edit random text files with a random editor without problems. 
> 
> No, this would not be enough for all special corner cases. It would just be
> better than currently it is. That's my goal.

Well, it often doesn't end there, other users may report these as bugs and 
want to get them fixed, so we have to look ahead a little for possible 
problems.

> > OTOH if the editor has to all this parsing anyway, the whole thing could 
> > be pushed to userspace and the Linux terminal could be marked as handling 
> > all characters equally (a good hint would be if the terminal doesn't even 
> > support wide characters). The terminfo database exists for a good reason 
> 
> Currently not even applications and terminal emulators agree on the width. I
> don't think terminfo has anything to do with it.

It has everything to do with this. It describes the terminal capabilities, 
so why should variable width support not one part of it?

> So, there are several (maybe a few hundred or few thousands) characters that
> are handled differently by different applications/libraries. On the other
> hand, there are approximately 42.000 characters in BMP that are double-width
> according to every width specification or implementation I've seen so far.
> That's roughly the 2/3 of all the code points in BMP. Why is a problem if
> the kernel knows to jump 2 character cells on them?

Wrong question: why is it a problem the kernel doesn't know this?
It's usually easy to "fix" this in the kernel, because it's last in line, 
but we have to look at the whole picture and the kernel isn't solely 
responsible for unicode handling, so why can't this be fixed at 
application level?

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread H. Peter Anvin


Egmont Koblinger wrote:


I don't think width information for characters in BMP is going to change
that often.

By the way, a note about the size: the larger one of the two tables is
unused and hence optimised away by the compiler. I just left in the source
so that it only takes a minor modification for people go get a different
sane behavior (ie. ignore combining chars). So only the small table, with 11
pairs of longs (88 bytes) are compiled to the kernel.



Every version has added combining chars.  But anyway, please don't leave 
unused code in the kernel.  I agree doublewidth characters are largely 
range-based and thus not all that likely to change.


At least please put them in a separate .c file and include a script to 
generate them clean from UnicodeData.txt.


I'll look at it, but I didn't want to alter the building procedure, modify
Makefiles... Or do you mean I should only ship the generated .c file plus
the script, instead of the (1MB) UnicodeData.txt and generating it compile
time? Sounds reasonable...


Right.  However, see above.

Besides, would it not make more sense to have a single table with the 
width information, if you insist on having one, instead of multiple ones?


I've been thinking on it and I'm not sure which one the right way is. The
reason for choosing this was probably that this way information that is not
used by the code can be omitted by the compiler.


Then let's leave it out of the source.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Egmont Koblinger

On Thu, Apr 12, 2007 at 04:38:54PM +0200, Roman Zippel wrote:

> Considering this possible volatility I'm not certain we really need this 
> in the kernel.
> The other point is that I have problems imagining, that this should be 
> enough to edit random text files with a random editor without problems. 

No, this would not be enough for all special corner cases. It would just be
better than currently it is. That's my goal.

> OTOH if the editor has to all this parsing anyway, the whole thing could 
> be pushed to userspace and the Linux terminal could be marked as handling 
> all characters equally (a good hint would be if the terminal doesn't even 
> support wide characters). The terminfo database exists for a good reason 

Currently not even applications and terminal emulators agree on the width. I
don't think terminfo has anything to do with it. Width information should
come from glibc, but unfortunately its database is quite old, so
applications tend to implement their own version. (Example: according to
glibc 2.5, U+0221 is not printable. Still it's present in many fonts, and at
least vim and joe display them.)

So, there are several (maybe a few hundred or few thousands) characters that
are handled differently by different applications/libraries. On the other
hand, there are approximately 42.000 characters in BMP that are double-width
according to every width specification or implementation I've seen so far.
That's roughly the 2/3 of all the code points in BMP. Why is a problem if
the kernel knows to jump 2 character cells on them?

I'm not seeking for a perfect solution. (Taking a look at the current state
of specs/libs/apps perfect solution doesn't even exist.) I'm seeking for
something that is just way better than the current situation.

-- 
Egmont
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Roman Zippel

Hi,

On Thu, 12 Apr 2007, Egmont Koblinger wrote:

> I tried to create such a script using ideas for regexps from glibc's
> charmaps/UTF-8, but it seemed to be quite hopeless to create a small table.
> It seems that Markus probably performed some reasonal manual optimisations
> that cannot really be scripted. For example, within a huge CJK area there
> are plenty of codepoints that are not (yet?) assigned. However, if they'll
> be assigned in the future, most likely they'll be double-wide characters
> too.

Considering this possible volatility I'm not certain we really need this 
in the kernel.
The other point is that I have problems imagining, that this should be 
enough to edit random text files with a random editor without problems. 
Especially cursor positioning becomes a whole lot more complex and the 
editor has to do exactly the same parsing as the kernel, i.e. changes to 
this table are not easily possible anymore or later you may need to add 
the possibility to read the table.
OTOH if the editor has to all this parsing anyway, the whole thing could 
be pushed to userspace and the Linux terminal could be marked as handling 
all characters equally (a good hint would be if the terminal doesn't even 
support wide characters). The terminfo database exists for a good reason 
and a good editor has to support a wide range of terminals anyway, so I 
don't really see the problem as to mark variable width characters as not 
supported and keep the kernel implementation sane and simple.

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Egmont Koblinger

On Thu, Apr 12, 2007 at 02:13:06PM +0100, Alan Cox wrote:

> You can pack them a little differently and they'll shrink a lot.

The smaller table would actually slightly grow instead of shrinking. In my
patch there are 11 intervals, each consume 2*4 bytes, that's 88 bytes. Your
variant would store each interval in 3 bytes, that's 33 bytes so far, plus
you need 4 byte values in the toptab array, that's 64 more bytes.

The larger table is 984 byte large now (123 intervals). You cannot compress
each interval to 3 bytes (at least not the way you described) since the
index for toptab needs more than 16 possible values. But it's easily
possible to compresss each interval in 4 bytes. That's 492 bytes for the
intervals themselves, plus at least 64 bytes for toptab (oh, probably it's
possible to store 3 byte values in toptab, too).

You might gain approx. 400 bytes in a table that's actually commented out in
the current patch since HPA doesn't like the behavior where it would be
used. On the other hand, you get a database that is harder to understand,
maintain, verify, you get code that runs a little bit slower. We're from
different worlds. You're a great kernel hacker and bit-magician. I'm rather
developing applications, so for me having a less straightforward code to
save 400 bytes is simply not worth it.

Which version would fit in the spirit of the kernel better? I don't know,
I'd let you decide it :)


bye,
Egmont
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Alan Cox

> So please accept these hard-coded tables in the first round. Maybe one day
> somebody will come up with a better solution. This one should be okay until
> then. (I can also send the script I've written so he can improve it.)

You can pack them a little differently and they'll shrink a lot.

Firstly store
start, size

Secondly pack the offset as 16 bit with the top 4 bits

0-3 value
4 - F
5 - 1D
6 - E0

And the size as an 8 bit range. You need to open code the rule for 
2/3 but that is tiny anyway.

Unpack is trivial

start = (toptab[*entry>>4]|(*entry&0x0F)) << 8 | entry[1];
end = start + entry[2];

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Egmont Koblinger

Hi,

I send the third version. No major modifications from the second version,
only small cleanups, coding style...

H. Peter Anvin wrote:

> I'm still unhappy about these large search tables in the kernel, not 
> because they take a huge amount of space (it's not that much), but 
> because they're invariably going to be stale, as they're Unicode-version 
> dependent.
> 
> At least please put them in a separate .c file and include a script to 
> generate them clean from UnicodeData.txt.

I tried to create such a script using ideas for regexps from glibc's
charmaps/UTF-8, but it seemed to be quite hopeless to create a small table.
It seems that Markus probably performed some reasonal manual optimisations
that cannot really be scripted. For example, within a huge CJK area there
are plenty of codepoints that are not (yet?) assigned. However, if they'll
be assigned in the future, most likely they'll be double-wide characters
too. A simple script, on the other hand, would skip them and create more
smaller intervals of CJK characters. However, automatically assuming
double-width for consecutive sequence of unassigned codepoints between two
double-wide character may not be a good idea either.

For zero-width the situation is the same. I couldn't find which piece of
information within UnicodeData.txt I could 100% rely on to create a simple
table.

So please accept these hard-coded tables in the first round. Maybe one day
somebody will come up with a better solution. This one should be okay until
then. (I can also send the script I've written so he can improve it.)

Jan: "unsigned int rescan:1" only seems to work within structure
definitions, but you recommended it for variables inside a function. So now
I use uint8_t.

Pavel and Jan: rbtree is an excellent data structure if you have to modify
the contents run-time. Here we have a constant data structure, and static
initialization of an rbtree (with lots of pointers and other internal data)
seems to be (quite) impossible for me.

Signed-off-by: Egmont Koblinger <[EMAIL PROTECTED]>

diff -Naur linux-2.6.20.orig/drivers/char/consolemap.c 
linux-2.6.20/drivers/char/consolemap.c
--- linux-2.6.20.orig/drivers/char/consolemap.c 2007-02-04 19:44:54.0 
+0100
+++ linux-2.6.20/drivers/char/consolemap.c  2007-04-11 18:45:45.0 
+0200
@@ -626,10 +626,10 @@

/* Only 16-bit codes supported at this time */
if (ucs > 0x)
-   ucs = 0xfffd;   /* U+FFFD: REPLACEMENT CHARACTER */
-   else if (ucs < 0x20 || ucs >= 0xfffe)
+   return -4;  /* Not found */
+   else if (ucs < 0x20)
return -1;  /* Not a printable character */
-   else if (ucs == 0xfeff || (ucs >= 0x200a && ucs <= 0x200f))
+   else if (ucs == 0xfeff || (ucs >= 0x200b && ucs <= 0x200f))
return -2;  /* Zero-width space */
/*
 * UNI_DIRECT_BASE indicates the start of the region in the User Zone
diff -Naur linux-2.6.20.orig/drivers/char/vt.c linux-2.6.20/drivers/char/vt.c
--- linux-2.6.20.orig/drivers/char/vt.c 2007-02-04 19:44:54.0 +0100
+++ linux-2.6.20/drivers/char/vt.c  2007-04-12 14:18:05.0 +0200
@@ -1934,6 +1934,97 @@
 char con_buf[CON_BUF_SIZE];
 DECLARE_MUTEX(con_buf_sem);

+/* is_{zero,double}_width() are based on the wcwidth() implementation by
+ * Markus Kuhn -- 2003-05-20 (Unicode 4.0)
+ * Latest version: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
+ */
+struct interval {
+   uint32_t first;
+   uint32_t last;
+};
+
+static int bisearch(uint32_t ucs, const struct interval *table, int max)
+{
+   int min = 0;
+   int mid;
+
+   if (ucs < table[0].first || ucs > table[max].last)
+   return 0;
+   while (max >= min) {
+   mid = (min + max) / 2;
+   if (ucs > table[mid].last)
+   min = mid + 1;
+   else if (ucs < table[mid].first)
+   max = mid - 1;
+   else
+   return 1;
+   }
+   return 0;
+}
+
+/*
+static int is_zero_width(uint32_t ucs)
+{
+   static const struct interval zero_width[] = {
+   { 0x0300, 0x0357 }, { 0x035D, 0x036F }, { 0x0483, 0x0486 },
+   { 0x0488, 0x0489 }, { 0x0591, 0x05A1 }, { 0x05A3, 0x05B9 },
+   { 0x05BB, 0x05BD }, { 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 },
+   { 0x05C4, 0x05C4 }, { 0x0600, 0x0603 }, { 0x0610, 0x0615 },
+   { 0x064B, 0x0658 }, { 0x0670, 0x0670 }, { 0x06D6, 0x06E4 },
+   { 0x06E7, 0x06E8 }, { 0x06EA, 0x06ED }, { 0x070F, 0x070F },
+   { 0x0711, 0x0711 }, { 0x0730, 0x074A }, { 0x07A6, 0x07B0 },
+   { 0x0901, 0x0902 }, { 0x093C, 0x093C }, { 0x0941, 0x0948 },
+   { 0x094D, 0x094D }, { 0x0951, 0x0954 }, { 0x0962, 0x0963 },
+   { 0x0981, 0x0981 }, { 0x09BC, 0x09BC }, { 0x09C1, 0x09C4 },
+   { 0x09CD, 0x0

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Egmont Koblinger

On Wed, Apr 11, 2007 at 09:00:49PM +0200, Jan Engelhardt wrote:

> >+struct interval {
> >+  int first;
> >+  int last;
> >+};
> 
> CodingStyle? uint16_t instead of int?

> >+{ 0x1D173, 0x1D182 }, { 0x1D185, 0x1D18B }, { 0x1D1AA, 0x1D1AD },
> >+{ 0xE0001, 0xE0001 }, { 0xE0020, 0xE007F }, { 0xE0100, 0xE01EF }
> >+  };
> 
> Since Unicode above 0x is unsupported, could not these entries be killed?

The UTF-8 decoder part already supports full 31-bit Unicode (including 5 and
6 byte long UTF-8 sequences). It's only the font handling part that doesn't
support Unicode beyond BMP. If an application prints a non-BMP character
that is double-wide, or is a zero-width space, the expected behavior is to
move the cursor by two or zero positions. In order to do this, width
information is needed even beyond BMP.  It's a completely different story
that there would be no real glyph displayed, just e.g. a replacement symbol
followed by a space to pretend a real double-width character was printed.

> unsigned int rescan:1;
> unsigned int inverse:1;
> unsigned int width; or even uint8_t.
> I would not mind unsigned.

Okay.

-- 
Egmont
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Egmont Koblinger

On Wed, Apr 11, 2007 at 11:36:40AM -0700, H. Peter Anvin wrote:
> Egmont Koblinger wrote:
> >+static int is_zero_width(long ucs)
> >+{
> >+  static const struct interval zero_width[] = {
> /* lots */
> >+  };
> 
> I'm still unhappy about these large search tables in the kernel, not 
> because they take a huge amount of space (it's not that much), but 
> because they're invariably going to be stale, as they're Unicode-version 
> dependent.

I don't think width information for characters in BMP is going to change
that often.

By the way, a note about the size: the larger one of the two tables is
unused and hence optimised away by the compiler. I just left in the source
so that it only takes a minor modification for people go get a different
sane behavior (ie. ignore combining chars). So only the small table, with 11
pairs of longs (88 bytes) are compiled to the kernel.

> At least please put them in a separate .c file and include a script to 
> generate them clean from UnicodeData.txt.

I'll look at it, but I didn't want to alter the building procedure, modify
Makefiles... Or do you mean I should only ship the generated .c file plus
the script, instead of the (1MB) UnicodeData.txt and generating it compile
time? Sounds reasonable...

> Besides, would it not make more sense to have a single table with the 
> width information, if you insist on having one, instead of multiple ones?

I've been thinking on it and I'm not sure which one the right way is. The
reason for choosing this was probably that this way information that is not
used by the code can be omitted by the compiler.

-- 
Egmont
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-12 Thread Jan Engelhardt


On Apr 11 2007 19:36, Pavel Machek wrote:
>
>> +  while (max >= min) {
>> +mid = (min + max) / 2;
>> +if (ucs > table[mid].last)
>> +  min = mid + 1;
>> +else if (ucs < table[mid].first)
>> +  max = mid - 1;
>> +else
>> +  return 1;
>> +  }
>> +
>> +  return 0;
>> +}
>
>(Don't we already have rbtrees handling this just fine?)

Static initialization of an rbtre would become a beauty.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-11 Thread Pavel Machek

Hi!

> I hope you like it. :)

Well, more or less... but you need signed-off-by line, and

> @@ -70,6 +70,16 @@
>   * malformed UTF sequences represented as sequences of replacement glyphs,
>   * original codes or '?' as a last resort if replacement glyph is undefined
>   * by Adam Tla/lka <[EMAIL PROTECTED]>, Aug 2006
> + *
> + * More robust UTF-8 decoder. Make it work on malformed sequences as Markus 
> Kuhn's
> + * UTF-8 decoder stress test suggests. Emit a U+FFFD on illegal sequences as 
> well
> + * as for invalid Unicode code points.
> + * If U+FFFD is not available in the font, print an inverse question mark 
> instead.
> + * Display an inverted dot for valid characters that are not available in 
> the font.
> + * Do not print zero-width characters, pad double-width characters with an 
> extra
> + * space so that the cursor moves by zero/two positions in these cases.
> + * 6 April 2007, Egmont Koblinger <[EMAIL PROTECTED]>,
> + * using Markus Kuhn's wcwidth() implementation.
>   */

We no longer put changelogs in code.

> +/* wcwidth() based on the implementation by
> + * Markus Kuhn -- 2003-05-20 (Unicode 4.0)
> + * Latest version: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
> + */
> +struct interval {
> +  int first;
> +  int last;
> +};
> +
> +static int bisearch(long ucs, const struct interval *table, int max) {
> +  int min = 0;
> +  int mid;
> +
> +  if (ucs < table[0].first || ucs > table[max].last)
> +return 0;

...and you really need to read coding style.

> +  while (max >= min) {
> +mid = (min + max) / 2;
> +if (ucs > table[mid].last)
> +  min = mid + 1;
> +else if (ucs < table[mid].first)
> +  max = mid - 1;
> +else
> +  return 1;
> +  }
> +
> +  return 0;
> +}

(Don't we already have rbtrees handling this just fine?)

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-11 Thread Jan Engelhardt


On Apr 11 2007 20:28, Egmont Koblinger wrote:

>I send a reworked version of the patch.
>
>Removed from the first version:
>  - any sign of '.' as substitute glyph
>  - don't ignore zero-width characters (except for a few zero-width spaces
>that are ignored in the current kernel too). However, I kept the code
>organized and commented so that someone can have the other behavior very
>easily (by removing a pair of comment signs).
>
>Kept features, fixes:
>  - lots of UTF-8 decoder fixes. Emit one U+FFFD for every standalone
>continuation byte and for every incomplete sequence, as Markus Kuhn
>recommends. Reject overlong sequences too.
>  - D800..DFFF and FFFE.. are substituted by FFFD too, since these are
>not valid Unicode code points.
>  - no "random" replacement glyph (e.g. u with double acute instead of
>u with circumflex) in UTF-8 mode
>  - if U+FFFD is not found in the font, the fallback replacement '?' (ascii
>question mark) is printed with inverse color attributes
>  - U+200A was ignored so far as a zero-width space character. I think it
>was a mistake, it's not zero-width.
>  - print an extra space for double-wide characters for the cursor to stand
>at the right place. Yet again the code is organized so that it is very
>easy to change to jump only one character cell, should someone prefer
>that behavior (which I still see no good reason to).
>
>Signed-off-by: Egmont Koblinger <[EMAIL PROTECTED]>
>
>@@ -1934,6 +1943,99 @@
> char con_buf[CON_BUF_SIZE];
> DECLARE_MUTEX(con_buf_sem);
> 
>+/* is_{zero,double}_width() are based on the wcwidth() implementation by
>+ * Markus Kuhn -- 2003-05-20 (Unicode 4.0)
>+ * Latest version: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
>+ */
>+struct interval {
>+  int first;
>+  int last;
>+};

CodingStyle? uint16_t instead of int?

>+static int is_zero_width(long ucs)
>+{
>+  static const struct interval zero_width[] = {
>+{ 0x0300, 0x0357 }, { 0x035D, 0x036F }, { 0x0483, 0x0486 },
[...]
>+{ 0xFB1E, 0xFB1E }, { 0xFE00, 0xFE0F }, { 0xFE20, 0xFE23 },
>+{ 0xFEFF, 0xFEFF }, { 0xFFF9, 0xFFFB }, { 0x1D167, 0x1D169 },
>+{ 0x1D173, 0x1D182 }, { 0x1D185, 0x1D18B }, { 0x1D1AA, 0x1D1AD },
>+{ 0xE0001, 0xE0001 }, { 0xE0020, 0xE007F }, { 0xE0100, 0xE01EF }
>+  };

Since Unicode above 0x is unsupported, could not these entries be killed?

>+static int is_double_width(long ucs)
>+{
>+  static const struct interval double_width[] = {
>+{ 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
>+{ 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
>+{ 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 }, { 0xFFE0, 0xFFE6 },
>+{ 0x2, 0x2FFFD }, { 0x3, 0x3FFFD }
>+  };

Similarly.

>@@ -1950,6 +2052,10 @@
>   unsigned int currcons;
>   unsigned long draw_from = 0, draw_to = 0;
>   struct vc_data *vc;
>+  unsigned char vc_attr;
>+  int rescan;
unsigned int rescan:1;
>+  int inverse;
unsigned int inverse:1;
>+  int width;
unsigned int width; or even uint8_t.

>   u16 himask, charmask;
>   const unsigned char *orig_buf = NULL;
>   int orig_count;

>@@ -2012,51 +2118,81 @@
>   buf++;
>   n++;
>   count--;
>+  rescan = 0;
>+  inverse = 0;
>+  width = 1;
> 
>   /* Do no translation at all in control states */
>   if (vc->vc_state != ESnormal) {
>   tc = c;
>   } else if (vc->vc_utf && !vc->vc_disp_ctrl) {
>-  /* Combine UTF-8 into Unicode */
>-  /* Malformed sequences as sequences of replacement glyphs */
>+  /* Combine UTF-8 into Unicode in vc_utf_char */
>+  /* vc_utf_count is the number of continuation bytes still 
>expected to arrive */
>+  /* vc_npar is the number of continuation bytes arrived so 
>far */
> rescan_last_byte:
>-  if(c > 0x7f) {
>+  if ((c & 0xc0) == 0x80) {
>+  /* Continuation byte received */
>+  static const int utf8_length_changes[] = { 0x007f, 
>0x07ff, 0x, 0x001f, 0x03ff, 0x7fff };

I would not mind unsigned.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-11 Thread H. Peter Anvin


Egmont Koblinger wrote:

+static int is_zero_width(long ucs)
+{
+  static const struct interval zero_width[] = {

/* lots */

+  };


I'm still unhappy about these large search tables in the kernel, not 
because they take a huge amount of space (it's not that much), but 
because they're invariably going to be stale, as they're Unicode-version 
dependent.


At least please put them in a separate .c file and include a script to 
generate them clean from UnicodeData.txt.


Besides, would it not make more sense to have a single table with the 
width information, if you insist on having one, instead of multiple ones?


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-11 Thread Egmont Koblinger

Hi,

I send a reworked version of the patch.

Removed from the first version:
  - any sign of '.' as substitute glyph
  - don't ignore zero-width characters (except for a few zero-width spaces
that are ignored in the current kernel too). However, I kept the code
organized and commented so that someone can have the other behavior very
easily (by removing a pair of comment signs).

Kept features, fixes:
  - lots of UTF-8 decoder fixes. Emit one U+FFFD for every standalone
continuation byte and for every incomplete sequence, as Markus Kuhn
recommends. Reject overlong sequences too.
  - D800..DFFF and FFFE.. are substituted by FFFD too, since these are
not valid Unicode code points.
  - no "random" replacement glyph (e.g. u with double acute instead of
u with circumflex) in UTF-8 mode
  - if U+FFFD is not found in the font, the fallback replacement '?' (ascii
question mark) is printed with inverse color attributes
  - U+200A was ignored so far as a zero-width space character. I think it
was a mistake, it's not zero-width.
  - print an extra space for double-wide characters for the cursor to stand
at the right place. Yet again the code is organized so that it is very
easy to change to jump only one character cell, should someone prefer
that behavior (which I still see no good reason to).

Signed-off-by: Egmont Koblinger <[EMAIL PROTECTED]>

diff -Naur linux-2.6.20.orig/drivers/char/consolemap.c 
linux-2.6.20/drivers/char/consolemap.c
--- linux-2.6.20.orig/drivers/char/consolemap.c 2007-02-04 19:44:54.0 
+0100
+++ linux-2.6.20/drivers/char/consolemap.c  2007-04-11 18:45:45.0 
+0200
@@ -626,10 +626,10 @@
   
/* Only 16-bit codes supported at this time */
if (ucs > 0x)
-   ucs = 0xfffd;   /* U+FFFD: REPLACEMENT CHARACTER */
-   else if (ucs < 0x20 || ucs >= 0xfffe)
+   return -4;  /* Not found */
+   else if (ucs < 0x20)
return -1;  /* Not a printable character */
-   else if (ucs == 0xfeff || (ucs >= 0x200a && ucs <= 0x200f))
+   else if (ucs == 0xfeff || (ucs >= 0x200b && ucs <= 0x200f))
return -2;  /* Zero-width space */
/*
 * UNI_DIRECT_BASE indicates the start of the region in the User Zone
diff -Naur linux-2.6.20.orig/drivers/char/vt.c linux-2.6.20/drivers/char/vt.c
--- linux-2.6.20.orig/drivers/char/vt.c 2007-02-04 19:44:54.0 +0100
+++ linux-2.6.20/drivers/char/vt.c  2007-04-11 19:29:19.0 +0200
@@ -70,6 +70,15 @@
  * malformed UTF sequences represented as sequences of replacement glyphs,
  * original codes or '?' as a last resort if replacement glyph is undefined
  * by Adam Tla/lka <[EMAIL PROTECTED]>, Aug 2006
+ *
+ * More robust UTF-8 decoder. Make it work on malformed sequences as Markus 
Kuhn's
+ * UTF-8 decoder stress test suggests. Emit a U+FFFD on illegal sequences 
(including
+ * overlong representations) as well as for invalid Unicode code points.
+ * If U+FFFD is not available in the font, print an inverse question mark 
instead.
+ * Pad double-width characters with an extra space so that the cursor moves by 
two
+ * positions in these cases.
+ * 11 April 2007, Egmont Koblinger <[EMAIL PROTECTED]>,
+ * using code from Markus Kuhn's wcwidth() implementation.
  */
 
 #include 
@@ -1934,6 +1943,99 @@
 char con_buf[CON_BUF_SIZE];
 DECLARE_MUTEX(con_buf_sem);
 
+/* is_{zero,double}_width() are based on the wcwidth() implementation by
+ * Markus Kuhn -- 2003-05-20 (Unicode 4.0)
+ * Latest version: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
+ */
+struct interval {
+  int first;
+  int last;
+};
+
+static int bisearch(long ucs, const struct interval *table, int max) {
+  int min = 0;
+  int mid;
+
+  if (ucs < table[0].first || ucs > table[max].last)
+return 0;
+  while (max >= min) {
+mid = (min + max) / 2;
+if (ucs > table[mid].last)
+  min = mid + 1;
+else if (ucs < table[mid].first)
+  max = mid - 1;
+else
+  return 1;
+  }
+
+  return 0;
+}
+
+static int is_zero_width(long ucs)
+{
+  static const struct interval zero_width[] = {
+{ 0x0300, 0x0357 }, { 0x035D, 0x036F }, { 0x0483, 0x0486 },
+{ 0x0488, 0x0489 }, { 0x0591, 0x05A1 }, { 0x05A3, 0x05B9 },
+{ 0x05BB, 0x05BD }, { 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 },
+{ 0x05C4, 0x05C4 }, { 0x0600, 0x0603 }, { 0x0610, 0x0615 },
+{ 0x064B, 0x0658 }, { 0x0670, 0x0670 }, { 0x06D6, 0x06E4 },
+{ 0x06E7, 0x06E8 }, { 0x06EA, 0x06ED }, { 0x070F, 0x070F },
+{ 0x0711, 0x0711 }, { 0x0730, 0x074A }, { 0x07A6, 0x07B0 },
+{ 0x0901, 0x0902 }, { 0x093C, 0x093C }, { 0x0941, 0x0948 },
+{ 0x094D, 0x094D }, { 0x0951, 0x0954 }, { 0x0962, 0x0963 },
+{ 0x0981, 0x0981 }, { 0x09BC, 0x09BC }, { 0x09C1, 0x09C4 },
+{ 0x09CD, 0x09CD }, { 0x09E2, 0x09E3 }, { 0x0A01, 0x0A02 },
+{ 0x0A3C, 0x0A3C }, { 0x0A41, 0x0A42 }, { 0x0A47, 0x0A48 },
+{ 0x0A4B, 0x0A4D },

Re: [PATCH] console UTF-8 fixes

2007-04-11 Thread Jan Engelhardt


On Apr 10 2007 20:51, Egmont Koblinger wrote:
>On Tue, Apr 10, 2007 at 10:30:07AM -0700, H. Peter Anvin wrote:
>
>> Really?  Why is CJK so much more fundamental than, say, Arabic?
>
>Not more fundamental at all. It's just perhaps easier to "support" (I mean
>keep track of the cursor, not to really support them of course).
>
>I can't see any reason why these two scripts should be handled identically:
>either support both or none. If it's technically easier to support one and
>harder to support the other, why not implement the first now? Maybe someone
>will implement the other one later.
>
>Oh! Wait a moment! I haven't yet looked at bidi in Unicode, but taking the
>first glimpse it seems to me that U+200E and U+200F control the writing
>direction. Currently the kernel already skips 200E and 200F, doesn't print
>anything, not even a replacement character (see char/consolemap.c). This
>means that RTL is already "supported" at this level: eventually (when RTL
>mode is turned off) the cursor stands where it is expected to stand. In
>between, you either see the right number of replacement symbols, or (if your
>font supports Arabic) you may see the symbols in reverse order. So, after
>all, it's not worse at all than what I want to reach with CJK.

RTL is not supported, and perhaps we should not try. xterm does not do it
either AFAICT.

http://ttyrpld.svn.sourceforge.net/viewvc/ttyrpld/trunk/locale/fa_IR/LC_MESSAGES/ttyrpld.po?revision=15&view=markup
for simple sample data.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-10 Thread Egmont Koblinger

On Tue, Apr 10, 2007 at 10:30:07AM -0700, H. Peter Anvin wrote:

> Really?  Why is CJK so much more fundamental than, say, Arabic?

Not more fundamental at all. It's just perhaps easier to "support" (I mean
keep track of the cursor, not to really support them of course).

I can't see any reason why these two scripts should be handled identically:
either support both or none. If it's technically easier to support one and
harder to support the other, why not implement the first now? Maybe someone
will implement the other one later.

Oh! Wait a moment! I haven't yet looked at bidi in Unicode, but taking the
first glimpse it seems to me that U+200E and U+200F control the writing
direction. Currently the kernel already skips 200E and 200F, doesn't print
anything, not even a replacement character (see char/consolemap.c). This
means that RTL is already "supported" at this level: eventually (when RTL
mode is turned off) the cursor stands where it is expected to stand. In
between, you either see the right number of replacement symbols, or (if your
font supports Arabic) you may see the symbols in reverse order. So, after
all, it's not worse at all than what I want to reach with CJK.



-- 
Egmont
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-10 Thread H. Peter Anvin


Alan Cox wrote:

What do you exactly mean by this? Doing a binary search in a table of 11
intervals to find out whether a character is double-wide? Adding
approximately 30 lines of code (including the table and the binary search
routine) to the kernel to handle this case? I don't think it's bloat. It's a


I don't have a problem with it. It is a situation people find themselves
in and framebuffer consoles can handle CJK although PC text mode ones
can't do it well. It all comes down to a clean and small implementation.


All the CJK framebuffer consoles that handle CJK I've seen run in 
userspace on top of the kernel framebuffer.  Keeping a CJK font in the 
kernel seems prohibitive regardless of the hardware.


Have you seen anything different?

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-10 Thread Alan Cox

> What do you exactly mean by this? Doing a binary search in a table of 11
> intervals to find out whether a character is double-wide? Adding
> approximately 30 lines of code (including the table and the binary search
> routine) to the kernel to handle this case? I don't think it's bloat. It's a

I don't have a problem with it. It is a situation people find themselves
in and framebuffer consoles can handle CJK although PC text mode ones
can't do it well. It all comes down to a clean and small implementation.

Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-10 Thread H. Peter Anvin


Egmont Koblinger wrote:

On Tue, Apr 10, 2007 at 08:43:14AM -0700, H. Peter Anvin wrote:


I don't see the point in dealing with one particular corner case,


I wouldn't really call CJK a *corner* case, just think of how many people
use these writing systems.

Theoretically it's just one particular case, I agree. In practice, however,
this case occurs relatively often compared to all the nontrivial cases a
full-featured terminal should cope with.



Really?  Why is CJK so much more fundamental than, say, Arabic?

-hpa

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-10 Thread Egmont Koblinger

On Tue, Apr 10, 2007 at 08:43:14AM -0700, H. Peter Anvin wrote:

> I don't see the point in dealing with one particular corner case,

I wouldn't really call CJK a *corner* case, just think of how many people
use these writing systems.

Theoretically it's just one particular case, I agree. In practice, however,
this case occurs relatively often compared to all the nontrivial cases a
full-featured terminal should cope with.

> especially a corner case for which font support is inherently impossible.

I'd not make this chance to properly display CJK. I'd make this change to
properly display and to be able to edit English letters that happen to
follow some CJK within the same row.

> It's bloat.

What do you exactly mean by this? Doing a binary search in a table of 11
intervals to find out whether a character is double-wide? Adding
approximately 30 lines of code (including the table and the binary search
routine) to the kernel to handle this case? I don't think it's bloat. It's a
small, simple, easy to understand piece of code that would make the kernel
slightly better, and in some cases would make the users' life easier by
letting some text editors work correctly in one not-so-special case. Why not
then?

If there's no chance to make something perfect then isn't it worth it to
improve that? I don't think so.

I still haven't heard any arguments from you that the resulting *behavior*
would be worse in any respect. I have arguments that it'd be better. It's
only 30 lines of code... And even if it's a bloat, it wouldn't be the first
one in the kernel, would it? :-) A "bloat" that makes it better...

Opinions from anyone else, maybe?

I already have fixes to the UTF-8 decoder, but I still have to adjust that
patch according to what we've already discussed. I really want this CJK
issue to get accepted, unless I'm convinced that it has drawbacks. I am
happy to discuss and argue, but I'm not going to fight (I hope the
difference is clear). So please don't be on my way unless you do have a good
reason. If possible, I don't want to create two incremental patches either
(first for the decoder, second for the double-width) because it does take
some valuable time for me to stress-test each and every patch before I send
it. On the other hand, I don't know what's really needed for a patch to be
accepted. How much your word counts, for example. (I guess it counts much,
being almost the only person with worthful comments on the topic.) How much
do my chances change if I include the CJK part? I don't know...

I'd be very glad if -- based on my arguments -- you changed your mind and
wouldn't be _against_ this feature. A "don't care" state would be perfect
for me :-)

-- 
Egmont
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-10 Thread H. Peter Anvin


Egmont Koblinger wrote:


I know that correctly handling all Unicode scripts, including CJK, Hebrew,
Arabic, Indic are a much more complicated story and it's way beyond the
scope of kernel. I don't even know whether there's any graphical user-space
application handling all these issues perfectly. So I really don't want to
address them.

I'd like only one small modification: the cursor to jump two columns for CJK
characters instead of just one. (Either two FFFD's, or rather an FFFD
followed by a space printed.) This would allow you to edit English words
within a mixed CJK-English text file. As my experiences show, such a minor
change in the terminal driver would solve cursor aligning issues in *many*
cases. With this change the console would still be very-very far from being
perfect, it just would be simply better in practice.



I don't see the point in dealing with one particular corner case, 
especially a corner case for which font support is inherently impossible.



Are you still definitely against this change? I see no drawbacks this could
cause, while it would make the console better in some circumstances. I think
this is just a small step towards a better console driver.


It's bloat.

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-10 Thread Egmont Koblinger

On Sat, Apr 07, 2007 at 10:59:19AM -0700, H. Peter Anvin wrote:

> As far as width handling -- in order to make all the text line up under 
> all circumstances you need more than width handling.  [...]
> 
> is is ridiculous.  It's much better to draw a line in the sand and say 
> that this is beyond the scope of the in-kernel Linux console.

Hi,

It seems this is the only detail we don't yet agree on and I'd like to come
to an agreement before I update the patch.

I know that correctly handling all Unicode scripts, including CJK, Hebrew,
Arabic, Indic are a much more complicated story and it's way beyond the
scope of kernel. I don't even know whether there's any graphical user-space
application handling all these issues perfectly. So I really don't want to
address them.

I'd like only one small modification: the cursor to jump two columns for CJK
characters instead of just one. (Either two FFFD's, or rather an FFFD
followed by a space printed.) This would allow you to edit English words
within a mixed CJK-English text file. As my experiences show, such a minor
change in the terminal driver would solve cursor aligning issues in *many*
cases. With this change the console would still be very-very far from being
perfect, it just would be simply better in practice.

(Also note that handling two-column characters is probably amongst the first
things a good text editor implements, while handling RTL and other stuff are
trickier for applications too. I guess there are a plenty of text editors
that handle CJK correctly (provided that the underlying terminal emulator
does so too) but have troubles with combining accents, bidi and other more
complicated stuff.)

Are you still definitely against this change? I see no drawbacks this could
cause, while it would make the console better in some circumstances. I think
this is just a small step towards a better console driver.

-- 
Egmont
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-07 Thread H. Peter Anvin


Egmont Koblinger wrote:

On Sat, Apr 07, 2007 at 01:00:48PM +0200, Jan Engelhardt wrote:

Hi,


Please, no dot, and no inverse color.
Imagine someone had the following bitmap for :


No dot, I'm already convinced. To clarify the inverse thingy:

This is what the current kernel does:
  1) tries to display the desired symbol
  2) if it fails, tries to display U+FFFD (which usually looks similar to an
 inverted question mark)
  3) if this fails again then displays a normal '?'
 (or a different symbol due to a bug discussed below)

Here's my proposal. This only alters the 3rd step, not the first two:
  1) tries to display the desired symbol
  2) if it fails, tries to display U+FFFD, still with _normal_ attributes
  3) if this fails then display an ascii '?' with inverted attributes

So you won't get "double" inversion. If you do have U+FFFD in your font then
this will introduce no chance. If you don't have U+FFFD, you'll see inverse
question marks instead of normal ones.



This seems fine.




I blame your latin2 unicode map. (See above about 'Û'.)


There's nothing wrong with my latin2 unicode map, and I've located and
changed the part _in the kernel_ that displays a false glyph using the
algorithm I've outlined. It just uses "the glyph at that code position
within the glyph table" as a fallback, which might be okay in 8-bit mode
(and I haven't modified the behavior in that case), but I got rid of this
behavior in UTF-8 mode since it's definitely a fault in the world of
Unicode.


It should perhaps display a regular 'u' if it cannot display 'û',


I rather think it should display U+FFFD but YMMV.


That's a policy decision for the maker of the Unicode map.  The kernel 
cannot by default know that a pre-composed ű is a modified u; obviously, 
if the ű is send in decomposed form the kernel probably will display it 
as u? or some such.



but definitely not 'ü' (which is not called a double accent, btw).


This is not the character I've been talking about, I actually _did_ talk
about u with double acute accent (ű - you might not have seen this character
so far, AFAIK it's only used in Hungarian, no other languages). But we agree
that the kernel definitely shouldn't display a character with a different
accent on it. This is one of the bugs my patch addresses.


As far as width handling -- in order to make all the text line up under 
all circumstances you need more than width handling.  The wcwidth() 
stuff is specific to CJK -- a character set which is totally implausible 
to display on the builtin console.  You also need bidir support (in case 
you encounter Hebrew or Arabic), you need Indic shape handling (Indic 
langauges have some *very* odd composing rules), etc, and this is just 
to know how much space to take up on the screen.


is is ridiculous.  It's much better to draw a line in the sand and say 
that this is beyond the scope of the in-kernel Linux console.


-hpa

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-07 Thread Egmont Koblinger

On Sat, Apr 07, 2007 at 01:00:48PM +0200, Jan Engelhardt wrote:

Hi,

> Please, no dot, and no inverse color.
> Imagine someone had the following bitmap for :

No dot, I'm already convinced. To clarify the inverse thingy:

This is what the current kernel does:
  1) tries to display the desired symbol
  2) if it fails, tries to display U+FFFD (which usually looks similar to an
 inverted question mark)
  3) if this fails again then displays a normal '?'
 (or a different symbol due to a bug discussed below)

Here's my proposal. This only alters the 3rd step, not the first two:
  1) tries to display the desired symbol
  2) if it fails, tries to display U+FFFD, still with _normal_ attributes
  3) if this fails then display an ascii '?' with inverted attributes

So you won't get "double" inversion. If you do have U+FFFD in your font then
this will introduce no chance. If you don't have U+FFFD, you'll see inverse
question marks instead of normal ones.

> I blame your latin2 unicode map. (See above about 'Û'.)

There's nothing wrong with my latin2 unicode map, and I've located and
changed the part _in the kernel_ that displays a false glyph using the
algorithm I've outlined. It just uses "the glyph at that code position
within the glyph table" as a fallback, which might be okay in 8-bit mode
(and I haven't modified the behavior in that case), but I got rid of this
behavior in UTF-8 mode since it's definitely a fault in the world of
Unicode.

> It should perhaps display a regular 'u' if it cannot display 'û',

I rather think it should display U+FFFD but YMMV.

> but definitely not 'ü' (which is not called a double accent, btw).

This is not the character I've been talking about, I actually _did_ talk
about u with double acute accent (ű - you might not have seen this character
so far, AFAIK it's only used in Hungarian, no other languages). But we agree
that the kernel definitely shouldn't display a character with a different
accent on it. This is one of the bugs my patch addresses.

bye,

Egmont
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-07 Thread Jan Engelhardt

Hi,


I just wanted to give my opinion on things...

(and enable utf8 to read this properly)

On Apr 7 2007 11:24, Egmont Koblinger wrote:
>
>> I strongly disagree.  First of all, you're changing the semantics of a 
>> 13-year-old API.  The semantics of the Linux console is that by 
>> specifying U+FFFD SUBSTITUTION GLYPH in your unicode table, you have 
>> specified the fallback glyph.
>
>OK, I'm not against using U+FFFD for missing glyphs. In the mean time I
>think it's still a good idea to clearly separate the two cases in the code
>(that is, the case of invalid sequence from the case of missing glyph), but
>we can still use the same replacement character in these two cases. I'll
>send an updated patch after Easter if it sounds good for you.

I am quite ok with the way things are right now.

 - vc displays  for illegal sequences

 - vc displays e.g. "U" (latin capital U) in place when Û (latin capital
   U with accent circumflex) is not available in this font 
   (determined by the unicodemap) (I do use an unicode map, because I
   use a 4096-byte cp437 "DOS" font which requires one)

 - vc displays  for sequences it does not know how to print

 - xterm displays  for illegal sequences

 - xterm seems to display  on undefined glyphs (U+DFFF for ex.,
   using the "Unicode Best" font from the xterm menu)

 - xterm seems to display nothing on undefined glyphs (U+E000 for ex.,
   "Unicode Best" again)

>> What's worse, you've hard-coded the uses of specific visual 
>> representations.  That is completely unacceptable.
>
>Now that we've dropped the idea of "dot" for missing glyphs, the other thing
>
>[...]
>
>Sorry, I wasn't clear enough and I think you misunderstood me. The symbol I
>choose for fallback is still '?' (the ASCII question mark), I just invert
>the color attributes of the cell where this is printed. This way it becomes
>visually distinguisable from the literal question mark. Using the current
>kernel you just cannot know whether the character printed is a real question
>mark, or a replacement glyph. Still, should you stongly disagree with this
>decision, the color inverting part can easily be removed.

Please, no dot, and no inverse color.
Imagine someone had the following bitmap for :





####
####

####
####

####
####





Then inverting that again would be susceptible to confusion with
the regular '?' at 0x3F. 

(cp437 for example maps unknown/illegal to 0xFD which happens to be the
block graphic '■', but YMMV depending on font.)

>I think I've (mostly) described it above. Set everything to UTF-8, load a
>latin2 font (containing 256 glyphs, e.g. "setfont lat2-16"), make an
>application print U+00FB (alt + numpad 251 is one trivial way), you'll see
>an "u with double accent", though the symbol to be displayed is "u with
>circumflex". This isn't present in the current font, so the replacement
>character should appear, not a different letter.

I blame your latin2 unicode map. (See above about 'Û'.)
It should perhaps display a regular 'u' if it cannot display 'û',
but definitely not 'ü' (which is not called a double accent, btw).

>> To be able to do CJK you need something like Kon anyway.  This feels 
>> like bloat.
>
>I don't want CJK support. All that I want is to be able to edit English
>words within a file that contains mixture of English and CJK, with a text
>editor like vim or joe.

+1 for this one :)

xterm## echo "韓国と日本にようこそ!" >/tmp/foobar.txt
vc## cat foobar.txt

currently gets things not so right, because multibyte characters are not
displayed with as many  as they are wide.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] console UTF-8 fixes

2007-04-07 Thread Egmont Koblinger

On Fri, Apr 06, 2007 at 12:43:03PM -0700, H. Peter Anvin wrote:

Hi,

> I strongly disagree.  First of all, you're changing the semantics of a 
> 13-year-old API.  The semantics of the Linux console is that by 
> specifying U+FFFD SUBSTITUTION GLYPH in your unicode table, you have 
> specified the fallback glyph.

OK, I'm not against using U+FFFD for missing glyphs. In the mean time I
think it's still a good idea to clearly separate the two cases in the code
(that is, the case of invalid sequence from the case of missing glyph), but
we can still use the same replacement character in these two cases. I'll
send an updated patch after Easter if it sounds good for you.

> What's worse, you've hard-coded the uses of specific visual 
> representations.  That is completely unacceptable.

Now that we've dropped the idea of "dot" for missing glyphs, the other thing
that remains is the hard-coded '?' if and only if U+FFFD is not present in
the font. It is even hardcoded in the current code and I have no better
idea, there must be a last-resort hardcoded fallback. The only thing I
changed is that I inverted the color attributes for this question mark. Do
you think that the old behavior, a normal question mark would be better? No
problem, I'll adjust the code in this case. Just please tell me what the
expected behavior is, I'm not sure I clearly understand your thoughts.

> >  - Another possible thing the current code may do (for latin1-compatible
> >characters) is to simply display the glyph loaded in that position.
> >Suppose I have loaded a latin2 font. In latin2, 0xFB is an "u with
> >double accent". An applications prints U+00FB, which is an "u with
> >circumflex". Since this glyph is not present in latin2, it cannot be
> >printed with the current font. Still, the current code falls back to
> >printing the glyph from the 0xFB position of the glyph table. Hence my
> >app asked to print "u with circumflex" but an "u with double accent"
> >appears on the screen. This is totally contrary to the goals of Unicode
> >and shouldn't ever happen.
> 
> When does that happen?  That is clearly a bug.

I think I've (mostly) described it above. Set everything to UTF-8, load a
latin2 font (containing 256 glyphs, e.g. "setfont lat2-16"), make an
application print U+00FB (alt + numpad 251 is one trivial way), you'll see
an "u with double accent", though the symbol to be displayed is "u with
circumflex". This isn't present in the current font, so the replacement
character should appear, not a different letter.

> >- The replacement character for invalid UTF-8 sequences is U+FFFD, falling
> >  back to a question mark. I've changed the fallback version to an inverted
> >  question mark. This way it's more similar to the common glyph of U+FFFD,
> >  and it's more trivial to the user that it's not a literal question mark
> >  but rather some erroneous situation.
> 
> Brilliant.  You've picked a fallback glyph which is unlikely to exist in 
> all fonts.  The whole point of falling back to ? is that it's an ASCII 
> character, which means that if the font designer failed to designate a 
> fallback glyph -- which is an error!!! -- there is at least some hope of 
> conveying the error back to the user.

Sorry, I wasn't clear enough and I think you misunderstood me. The symbol I
choose for fallback is still '?' (the ASCII question mark), I just invert
the color attributes of the cell where this is printed. This way it becomes
visually distinguisable from the literal question mark. Using the current
kernel you just cannot know whether the character printed is a real question
mark, or a replacement glyph. Still, should you stongly disagree with this
decision, the color inverting part can easily be removed.

> >- There's no concept of double-width characters. It's way beyond the scope
> >  of my patch to try to display them, but at least I think it's important
> >  for the cursor to jump two positions when printing such characters, since
> >  this is what applications (such as text editors) expect. If the cursor
> >  didn't jump two positions, applications would suffer from displaying and
> >  refreshing problems, and editing some English letters that are preceded 
> >  by
> >  some CJK characters in the same line became a nightmare. With my patch an
> >  inverted dot followed by an inverted space is displayed for double-width
> >  characters so it's quite easy to see that they are tied together.
> 
> To be able to do CJK you need something like Kon anyway.  This feels 
> like bloat.

I don't want CJK support. All that I want is to be able to edit English
words within a file that contains mixture of English and CJK, with a text
editor like vim or joe. Just try it with the current kernel, and with my
patch. Suppose that within a line some CJK text is followed by an English
word, and you want to edit the latter one. It's going to be a huge headache
with the current kernel. Where you see the cursor is not wh

Re: [PATCH] console UTF-8 fixes

2007-04-06 Thread H. Peter Anvin


Egmont Koblinger wrote:


- If a certain (otherwise valid UTF-8) character is not found in the glyph
  table, the current code does one of these two (depending on other
  circumstances):

  - Either it displays the replacement character U+FFFD, falling back to a
simple question mark. Note that the Unicode replacement character U+FFFD
is to be used for invalid sequences. However, it shouldn't necessarily
be used when replacing a valid but undisplayable character. Think of
Pango for example that renders these as four hex digits inside a square.
To be able to visually distinguish between illegal sequences and legal
but undisplayable characters, I think U+FFFD or the question mark are
bad choices. In fact, any symbol that may normally occur in the text is
a bad choice if is displayed simply. Hence I chose to display an
inverted dot.



I strongly disagree.  First of all, you're changing the semantics of a 
13-year-old API.  The semantics of the Linux console is that by 
specifying U+FFFD SUBSTITUTION GLYPH in your unicode table, you have 
specified the fallback glyph.


What's worse, you've hard-coded the uses of specific visual 
representations.  That is completely unacceptable.



  - Another possible thing the current code may do (for latin1-compatible
characters) is to simply display the glyph loaded in that position.
Suppose I have loaded a latin2 font. In latin2, 0xFB is an "u with
double accent". An applications prints U+00FB, which is an "u with
circumflex". Since this glyph is not present in latin2, it cannot be
printed with the current font. Still, the current code falls back to
printing the glyph from the 0xFB position of the glyph table. Hence my
app asked to print "u with circumflex" but an "u with double accent"
appears on the screen. This is totally contrary to the goals of Unicode
and shouldn't ever happen.


When does that happen?  That is clearly a bug.


- The replacement character for invalid UTF-8 sequences is U+FFFD, falling
  back to a question mark. I've changed the fallback version to an inverted
  question mark. This way it's more similar to the common glyph of U+FFFD,
  and it's more trivial to the user that it's not a literal question mark
  but rather some erroneous situation.


Brilliant.  You've picked a fallback glyph which is unlikely to exist in 
all fonts.  The whole point of falling back to ? is that it's an ASCII 
character, which means that if the font designer failed to designate a 
fallback glyph -- which is an error!!! -- there is at least some hope of 
conveying the error back to the user.



- Overlong sequences are not caught currently, they're displayed as if these
  were valid representations. This may even have security impacts.

- Lone continuation bytes (section 3.1 of the UTF-8 stress test) are
  currently displayed as some "random" glyphs rather than the replacement
  character.

- Incomplete sequences (sections 3.2 and 3.3) emit no replacement character,
  but rather cause the subsequent valid character to be displayed more
  times(!).


These are valid issues.


- There's no concept of double-width characters. It's way beyond the scope
  of my patch to try to display them, but at least I think it's important
  for the cursor to jump two positions when printing such characters, since
  this is what applications (such as text editors) expect. If the cursor
  didn't jump two positions, applications would suffer from displaying and
  refreshing problems, and editing some English letters that are preceded by
  some CJK characters in the same line became a nightmare. With my patch an
  inverted dot followed by an inverted space is displayed for double-width
  characters so it's quite easy to see that they are tied together.


To be able to do CJK you need something like Kon anyway.  This feels 
like bloat.



- There's no concept of zero-width characters (such as combining accents)
  either. Yet again it's beyond the scope of my patch to properly handle
  them. Instead of the current behavior (write a replacement character) I
  just ignore them so that full-screen applications can keep track of the
  cursor position correctly.


There is a concept of combining sequences.  Anything else, I suspect 
it's better to let the user know that something bad is happening.



- I believe (at least I do hope) that my code is cleaner, more
  straightforward, easier to understand, and is slightly better documented
  than the current version. The current code doesn't separate UTF-8 decoding
  and glyph displaying parts. I clearly separated them. First I perform
  UTF-8 decoding (this emits U+FFFD for invalid sequences), then check for
  the width of the resulting character, change it to U+FFFD if it's
  unprintable (e.g. an UTF-16 surrogate), and finally comes the part that
  does its best in displaying the character on the screen.

I hope you like it. :)


Please see above comments.

-hpa
-
To unsu

[PATCH] console UTF-8 fixes

2007-04-06 Thread Egmont Koblinger

Hi folks,

I send a patch to the UTF-8 part of the vt driver. I know that this code has
recently been updated to be better than it had been previously, but it still
suffers from plenty of bugs. My patch addresses all the issues known by me.
The difference in the behavior can easily be seen by trying either Markus
Kuhn's UTF-8 stress-test, or editing a multi-lingual file (such as
applications' .desktop files), I believe that my re-write does not only fix
several bugs but also yields better usability of the console as well as
cleaner source code.

Here is the description of the bugs or misfeatures that I fix:

- If a certain (otherwise valid UTF-8) character is not found in the glyph
  table, the current code does one of these two (depending on other
  circumstances):

  - Either it displays the replacement character U+FFFD, falling back to a
simple question mark. Note that the Unicode replacement character U+FFFD
is to be used for invalid sequences. However, it shouldn't necessarily
be used when replacing a valid but undisplayable character. Think of
Pango for example that renders these as four hex digits inside a square.
To be able to visually distinguish between illegal sequences and legal
but undisplayable characters, I think U+FFFD or the question mark are
bad choices. In fact, any symbol that may normally occur in the text is
a bad choice if is displayed simply. Hence I chose to display an
inverted dot.

  - Another possible thing the current code may do (for latin1-compatible
characters) is to simply display the glyph loaded in that position.
Suppose I have loaded a latin2 font. In latin2, 0xFB is an "u with
double accent". An applications prints U+00FB, which is an "u with
circumflex". Since this glyph is not present in latin2, it cannot be
printed with the current font. Still, the current code falls back to
printing the glyph from the 0xFB position of the glyph table. Hence my
app asked to print "u with circumflex" but an "u with double accent"
appears on the screen. This is totally contrary to the goals of Unicode
and shouldn't ever happen.

- The replacement character for invalid UTF-8 sequences is U+FFFD, falling
  back to a question mark. I've changed the fallback version to an inverted
  question mark. This way it's more similar to the common glyph of U+FFFD,
  and it's more trivial to the user that it's not a literal question mark
  but rather some erroneous situation.

- Overlong sequences are not caught currently, they're displayed as if these
  were valid representations. This may even have security impacts.

- Lone continuation bytes (section 3.1 of the UTF-8 stress test) are
  currently displayed as some "random" glyphs rather than the replacement
  character.

- Incomplete sequences (sections 3.2 and 3.3) emit no replacement character,
  but rather cause the subsequent valid character to be displayed more
  times(!).

- There's no concept of double-width characters. It's way beyond the scope
  of my patch to try to display them, but at least I think it's important
  for the cursor to jump two positions when printing such characters, since
  this is what applications (such as text editors) expect. If the cursor
  didn't jump two positions, applications would suffer from displaying and
  refreshing problems, and editing some English letters that are preceded by
  some CJK characters in the same line became a nightmare. With my patch an
  inverted dot followed by an inverted space is displayed for double-width
  characters so it's quite easy to see that they are tied together.

- There's no concept of zero-width characters (such as combining accents)
  either. Yet again it's beyond the scope of my patch to properly handle
  them. Instead of the current behavior (write a replacement character) I
  just ignore them so that full-screen applications can keep track of the
  cursor position correctly.

- I believe (at least I do hope) that my code is cleaner, more
  straightforward, easier to understand, and is slightly better documented
  than the current version. The current code doesn't separate UTF-8 decoding
  and glyph displaying parts. I clearly separated them. First I perform
  UTF-8 decoding (this emits U+FFFD for invalid sequences), then check for
  the width of the resulting character, change it to U+FFFD if it's
  unprintable (e.g. an UTF-16 surrogate), and finally comes the part that
  does its best in displaying the character on the screen.

I hope you like it. :)


cheers,

Egmont





diff -Naur linux-2.6.20.orig/drivers/char/consolemap.c 
linux-2.6.20/drivers/char/consolemap.c
--- linux-2.6.20.orig/drivers/char/consolemap.c 2007-02-04 19:44:54.0 
+0100
+++ linux-2.6.20/drivers/char/consolemap.c  2007-04-06 17:28:05.0 
+0200
@@ -627,10 +627,8 @@
/* Only 16-bit codes supported at this time */
if (ucs > 0x)
ucs = 0xfffd;   /* U+FFFD: REPLACEMENT CHA

43 matches

Mail list logo