Markus,
First of all, POSIX, XPG, and ANSI C are being consolidated into a single
specification in the next release of the TOG/IEEE/ISO SUS, so my use of
"POSIX" isn't so incorrect. If you need further evidence, see <wchar.h> in
the Base Definitions volume of IEEE 1003.1-200x.
As for your argument A: I believe most commercial Unix variants support at
least EUC by downloading the width information of the current locale/codeset,
which is usually about 6 bytes in size, to the ldterm kernel module through
ioctl(2), so that canonical input mode (and shells that rely on it) can do
erase/kill operations correctly.
These days, I am sure that ldterm supports not only EUC but also many other
PC codepage based multibyte codesets/locales and UTF-8 as well; at least
Solaris supports all of these codesets in canonical input mode.
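To make the erase case concrete, here is a minimal sketch of what a line
discipline could do with such a small width record. The struct euc_widths
layout and the names euc_codeset() and euc_erase_last() are hypothetical
and chosen for illustration only; this is not the actual Solaris ldterm
structure or code. It also assumes the line has to be scanned from its
start, since EUC bytes cannot be classified reliably going backwards.

```c
#include <stddef.h>

/* Hypothetical per-locale width record: total byte length and screen
 * columns for each of the four EUC codesets (index 0 is ASCII).
 * An illustration only, not the actual Solaris ioctl structure. */
struct euc_widths {
    unsigned char bytes[4];  /* bytes per character, including SS2/SS3 */
    unsigned char cols[4];   /* screen columns per character */
};

#define SS2 0x8E  /* single shift 2: introduces codeset 2 */
#define SS3 0x8F  /* single shift 3: introduces codeset 3 */

/* Classify the character that starts at byte c. Simplification: every
 * other byte with the high bit set is treated as codeset 1. */
static int euc_codeset(unsigned char c)
{
    if (c < 0x80) return 0;
    if (c == SS2) return 2;
    if (c == SS3) return 3;
    return 1;
}

/* Scan the line from the start to find its last character, then report
 * how many bytes to drop and how many columns to blank for one erase.
 * Returns 0 on success, -1 if the line is empty or ends mid-character. */
static int euc_erase_last(const struct euc_widths *w,
                          const unsigned char *line, size_t len,
                          size_t *drop_bytes, size_t *blank_cols)
{
    size_t pos = 0, last = 0;
    int cs = 0;

    if (len == 0)
        return -1;
    while (pos < len) {
        cs = euc_codeset(line[pos]);
        if (w->bytes[cs] == 0 || pos + w->bytes[cs] > len)
            return -1;
        last = pos;
        pos += w->bytes[cs];
    }
    *drop_bytes = len - last;
    *blank_cols = w->cols[cs];
    return 0;
}
```

With EUC-JP style values, for example bytes = {1, 2, 2, 3} and
cols = {1, 2, 1, 2}, erasing after a two-byte codeset 1 character drops
two bytes and blanks two columns, while a byte-oriented erase would drop
one byte and leave half a character in the buffer.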
I also saw your implementation that came with xterm in XFree86 4.0.2. I think
it's a good implementation, but I don't think it achieves the best possible
performance, since it does a binary search and has many if expressions.
Yes, it's a trade-off between space and speed, and no one piece of code is
absolutely better, but for the cost in memory space that I mentioned, all you
need to do to get the width of a UTF-8 character is about two arithmetic
operations and a few branch operations, like below:
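    if (wc) {
        /* Each ucode[] entry holds the width indices of four characters. */
        i = wc / 4;     /* which table entry */
        j = wc % 4;     /* which of its four fields */
        switch (j) {
        case 0:
            return (width_tbl[ucode[i].u0]);
        case 1:
            return (width_tbl[ucode[i].u1]);
        case 2:
            return (width_tbl[ucode[i].u2]);
        case 3:
            return (width_tbl[ucode[i].u3]);
        }
    }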
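For readers without the Solaris sources at hand, a self-contained sketch
of the same idea follows. It is not the Solaris code: it swaps the
u0..u3 fields for shifts and masks on a byte array (bit-field layout is
implementation-defined), covers a single plane only, and the class codes,
set_class(), packed_width() and plane0 are names assumed for illustration.
It also shows where the 16KB per plane figure comes from: 65536 characters
at 2 bits each is 16384 bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define PLANE_CHARS 65536u             /* code points per Unicode plane */
#define PLANE_BYTES (PLANE_CHARS / 4)  /* 2 bits each: 16384 bytes = 16KB */

/* Hypothetical 2-bit class codes; the real encoding is not given here. */
enum { W_ZERO = 0, W_ONE = 1, W_TWO = 2, W_INVALID = 3 };

/* Maps a 2-bit class to the value wcwidth() would return. */
static const int8_t width_tbl[4] = { 0, 1, 2, -1 };

/* One plane's worth of packed classes, four characters per byte. */
static uint8_t plane0[PLANE_BYTES];

/* Store a class for code point wc (0..0xFFFF). */
static void set_class(uint8_t *tbl, uint32_t wc, unsigned cls)
{
    unsigned shift = (wc % 4) * 2;

    tbl[wc / 4] = (uint8_t)((tbl[wc / 4] & ~(3u << shift)) |
                            ((cls & 3u) << shift));
}

/* Look up the width of wc: one divide, one modulo, one shift, one mask. */
static int packed_width(const uint8_t *tbl, uint32_t wc)
{
    if (wc == 0)
        return 0;   /* wcwidth() returns 0 for the null wide character */
    if (wc >= PLANE_CHARS)
        return -1;  /* outside this one-plane sketch */
    return width_tbl[(tbl[wc / 4] >> ((wc % 4) * 2)) & 3u];
}
```

A real table would be generated from the locale's width data; here every
entry starts out as class 0 until set_class() fills it in.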
Since I already replied in my previous email on why we also need a proper
line discipline for multibyte codesets in the ldterm kernel module,
I won't reply in this email to your argument that "it should only be in
user space clients."
If you want to see the whole function, by the way, please download the
Solaris 8 sources.
With regards,
Ienup
] X-URL: http://www.cl.cam.ac.uk/~mgk25/
] Date: Thu, 25 Jan 2001 11:27:24 +0000
] From: Markus Kuhn <[EMAIL PROTECTED]>
] Subject: Re: kernel tty patches
] To: [EMAIL PROTECTED]
] MIME-version: 1.0
]
] Ienup Sung wrote on 2001-01-24 22:23 UTC:
] > The wcwidth(3C) of POSIX requires to return, in general sense, four
] > distinctive values, 0, 1, 2, and -1.
]
] There is no wcwidth in POSIX. It is found in X/Open SUS
]
] http://www.opengroup.org/onlinepubs/007908799/xsh/wcwidth.html
]
] and it was mentioned in an informal appendix of ISO C 90 Amd. 1 but
] is not part of the C standard (was in a draft, but removed from the
] final version).
]
] > Hence we need only two bits per
] > each character. Which means for a plane, you will need 2 x 65536 bits
] > and that is 16KB for a plane if you want to have a faster algorithm for
] > width calculation and thus want to have the width values available for
] > all Unicode characters in the plane. This will require that the kernel will
] > eventually need to have 16KB x 17 planes = 272KB to represent all possible
] > Unicode characters. As a starter, though, we will only need 16KB x 3 = 48KB
] > of memory space if we count only BMP, SMP, and SIP planes.
] >
] > Further compaction of memory space is possible if necessary of course.
]
] A) A classic Unix kernel doesn't need wcwidth at all, because the kernel
] is ignorant of absolute kernel positions and only uses LF and BS for
] cursor control.
]
] B) My reference implementation of wcwidth on
]
] http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
]
] is 2.5 kilobytes long (compiled code plus data tables covering all UCS
] planes), and in this form more than sufficiently efficient for editors.
]
] Adding wcwidth to the kernel would be rather trivial, but if we have to
] add it, then only to the terminal emulator, not to the tty editor.
]
] Markus
]
] --
] Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
] Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
]
] -
] Linux-UTF8: i18n of Linux on all levels
] Archive: http://mail.nl.linux.org/lists/