Re: Unicode Keyboard Input Linux

Pablo Saratxaga Mon, 14 Jun 2004 12:34:49 -0700

Kaixo!

On Mon, Jun 14, 2004 at 08:43:38AM -0700, Elvis Presley wrote:


> Unicode Keyboard Input Linux

In fact unicode (through utf-8 of course) mostly works on the console.
The drawbacks are currently tied to the nature of the console (in
the current text mode) and not to the encoding.

The main drawbacks are:
- display is limited to 512 different glyphs; that is enough for
  most alphabetic languages, but it is not enough for CJK languages,
  for example.
- display is limited to the 1 char=1 glyph=1 cell paradigm; that means
  languages like Thai, where a sequence of chars can have their glyphs
  stacked one on top of the other in a single cell, will display horribly;
  languages needing glyph recomposition, like those using indic alphabets,
  are simply impossible.
  Note that even some languages using the latin alphabet are hurt, as they
  use some accented letters that have no precomposed form in unicode and
  are encoded as base letter plus combining accent.
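
To make the last point concrete, here is a small Python sketch (my own
illustration, nothing console-specific). It assumes "x with caron" as an
example of a letter without a precomposed code point; "e with acute" is
the contrasting case that does have one.

```python
import unicodedata

def nfc_length(text):
    """Number of code points left after canonical composition (NFC)."""
    return len(unicodedata.normalize("NFC", text))

e_acute = "e\u0301"   # e + COMBINING ACUTE ACCENT: precomposed U+00E9 exists
x_caron = "x\u030c"   # x + COMBINING CARON: no precomposed code point exists

print(nfc_length(e_acute))   # composes into a single code point
print(nfc_length(x_caron))   # stays base letter + combining accent
```

A display that insists on 1 char = 1 cell has no good way to show the
second case: the accent ends up in a cell of its own or gets dropped.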

The difference with xterm-like terminals here is huge; on X11
powerful font functions are available, and there are text terminals
that are able to nicely display scripts where 1 char is not necessarily
equal to 1 glyph and not necessarily equal to 1 cell; and you are not
limited in the number of glyphs, so you can write in chinese without
problem.
Plus, the resolution is much better, and the range of available and
choosable fonts much, much, much wider.

There are also input problems on the console.
Typing unicode chars directly (with 1 keystroke = 1 char) is not a
problem at all; it is just tedious to write the keymaps, and if you want
to support both utf-8 and one or several old encodings, you have to
provide a different keyboard file for each encoding. That is very bad;
it would be much better to be able to have a single keyboard description
file, in unicode, and just tell loadkeys the character set wanted
(the default being whatever the glibc says is the default for the
current locale).
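
A rough Python sketch of that idea, to show it is not hard in principle.
Everything here is hypothetical: the KEYMAP table, the keycodes and the
keymap_bytes() helper are invented for illustration and have nothing to
do with how loadkeys actually works.

```python
import locale

# Hypothetical single keymap description: keycode -> Unicode character.
# The keycodes are arbitrary illustration values.
KEYMAP = {16: "q", 18: "e", 40: "\u00e9"}   # 40 -> e with acute

def keymap_bytes(keymap, charset=None):
    """Return keycode -> byte string in the requested character set.

    Keys whose character does not exist in the charset are skipped.
    """
    if charset is None:
        # Default: whatever the current locale says.
        charset = locale.getpreferredencoding(False)
    table = {}
    for keycode, char in keymap.items():
        try:
            table[keycode] = char.encode(charset)
        except UnicodeEncodeError:
            continue
    return table

print(keymap_bytes(KEYMAP, "utf-8"))
print(keymap_bytes(KEYMAP, "iso-8859-1"))
```

One description in unicode, and the per-encoding tables are derived
from it instead of being written by hand.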

For composing however it's bad; kernel composing tables use "char",
and so it is not possible to properly use dead keys or the compose key
while in unicode on the console (if you only compose chars that are also
in the iso-8859-1 character set, it more or less works, you just have to
type an extra keystroke, which is lost in outer space; but I doubt it
will work for other chars, I suspect the fact that it "mostly works" for
some chars is because their iso-8859-1 8 bit code is the same numeric
value as their unicode code).
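
The numeric coincidence I suspect can be checked with a few lines of
Python (an illustration of the encodings only, not of the kernel
compose tables; the helper names are mine):

```python
def same_as_latin1(char):
    """True if the char's iso-8859-1 byte equals its Unicode code point."""
    try:
        return char.encode("iso-8859-1")[0] == ord(char)
    except UnicodeEncodeError:
        return False   # the char is not in iso-8859-1 at all

def legacy_byte(char, charset):
    """Single byte used for `char` in an 8-bit charset."""
    return char.encode(charset)[0]

e_acute = "\u00e9"          # in iso-8859-1
o_double_acute = "\u0151"   # in iso-8859-2, not in iso-8859-1

print(same_as_latin1(e_acute), same_as_latin1(o_double_acute))
print(hex(legacy_byte(o_double_acute, "iso-8859-2")), hex(ord(o_double_acute)))
```

For iso-8859-1 the 8 bit code and the unicode code are always the same
number; for any other 8 bit charset they generally are not, so a table
of "char" values cannot work by accident there.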

For languages needing the help of an input method, the console is mostly
unusable (it would be very nice to be able to have a single input method
backend usable on both the console and X11; but so far I know of none
that does that and that is usable and widely used).

Input works (almost) perfectly on X11 (the problem is due to the input
framework of XFree86, which doesn't allow switching input methods; so
you cannot type some words in korean, then switch to chinese input...
but some programs have started to bypass it, and xorg seems to use
an input framework that solves that long standing annoyance).

And output works on X11.

So it could be a good thing to have the engine of a good xterm-like
terminal be used for the console, of course removing any unneeded
linking to X11 libs; and it would solve a lot of things.
Of course it would only work on screens with graphical capabilities,
not on real vt100, French minitels or hp48 screens; but nobody is
expecting to be able to write in devanagari on such devices I think.

> The real console is essentially a graphical device,

Not always.
Not on some local screens on old PCs (it has always been a graphical
device on local screens for all non-PC ports of linux; but for the PC
itself the text mode on the local screen as a graphical device is
something quite new; you can look at when the "framebuffer" appeared
on the i386 branch of linux to see the exact date).
Also, you can redirect the console to a device other than a local screen
(again, it was there first on non-PC branches, I think the SUN ports
were first; on PC you can redirect the console to a serial port).

In fact, whether the console is physically a graphical device or not,
for the operating system it is not; it is just text.
That doesn't mean there couldn't be a graphical device, nor that
such device couldn't be used for the console, nor that such graphical
console couldn't do nice graphical things with text, like it is done
on modern xterms on X11.
But that is not done through the normal I/O channels; programs see the
console just as a text device, and send text flows, with some control
codes to place the cursor, change color, etc; but there is no way to
play with individual pixels through the console I/O API, for example.
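
To illustrate what "text flows with some control codes" means, a
minimal Python sketch using the usual ECMA-48 escape sequences (the
helper names are mine; which sequences a given terminal honours
depends on the terminal):

```python
import sys

ESC = "\x1b"

def cursor_to(row, col):
    """ECMA-48 'cursor position' sequence (1-based row and column)."""
    return f"{ESC}[{row};{col}H"

def color(code):
    """ECMA-48 'select graphic rendition' sequence, e.g. 31 = red foreground."""
    return f"{ESC}[{code}m"

def demo(out=sys.stdout):
    # Everything sent is just bytes of text; no pixel is addressed anywhere.
    out.write(cursor_to(5, 10) + color(31) + "hello" + color(0) + "\n")

if __name__ == "__main__":
    demo()
```

The program never sees anything but a character stream; whatever
drawing happens is the business of the thing at the other end.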

> with screen(=display), keyboard and mouse, and
> whatever else might be considered interesting...
> Applications do not open the real console directly,
> but in theory, they could --in DOS they could: the
> interface could be made public; there would have to be
> a device special file for the real console, and the
> virtual consoles too, and the pseudo terminals... Have
> I forgotten anything?

It seems you are calling "console" what I would call "framebuffer".
For me "console" is the system that allows the kernel to display
text and get input locally; the /dev/console


> You could not really use keymaps in a traditional tty
> configuration anyway, because the ascii terminal can't
> display unicode characters,

keymaps and ascii-only display are unrelated issues.
You can write in unaccented French or German in ascii-only (it is ugly,
it is bad, but it is possible) and yet want to use a French
or German keyboard layout.

Or simply you want to write in English in ascii only, but you
like dvorak layout...

> Of course, the tty module still must understand
> unicode. I don't think this is a big problem, beacuse
> the basic repetoire remains the same (=ascii) thanks
> to the utf-8 encoding, but I'm sure there a few hidden
> traps.

A lot.

ascii makes a lot of assumptions that are simply false in utf-8:
1. one char = 1 byte (that is false in utf-8 after U+007F)
2. one char = 1 cell (that is false, see combining diacritics etc)
3. one char = 1 glyph (that is false again, arabic char "noon" has
        4 different looking shapes depending on what comes after
        and before it)
4. text is written left to right (that is false for several scripts
        and languages; some scripts are even truly bidirectional;
        and there are even scripts written vertically only (not CJK,
        which can also be written horizontally) that are currently
        completely unsupported (but encoded in unicode))
5. Del and Backspace are similar (false again, a Del removes the content
        of a cell, which can be several chars (and one char can be
        several bytes too); while Backspace only removes one char)
6. text selection in bidirectional environment is hairy

etc.

of course, even a minimalistic utf-8 support is better than nothing;
but saying "understand unicode" is a misleading thing; there are various
different levels of understanding possible, and various different levels
of support possible; things aren't as simple as with ascii.
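
The first three points of the list can be seen with a short Python
sketch. The cell count is only an approximation of what a terminal
does (it handles combining marks and wide CJK characters, nothing
else), and the helper names are mine.

```python
import unicodedata

def measure(text):
    """Return (bytes in UTF-8, code points, approximate terminal cells)."""
    cells = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                  # combining mark: shares the base cell
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            cells += 2                # wide CJK character
        else:
            cells += 1
    return len(text.encode("utf-8")), len(text), cells

# The four contextual shapes of arabic "noon" exist as compatibility
# code points (isolated, final, initial, medial); all are one letter.
NOON_FORMS = "\ufee5\ufee6\ufee7\ufee8"

def base_letter(ch):
    """Map a presentation form back to its abstract character (NFKC)."""
    return unicodedata.normalize("NFKC", ch)

for sample in ("a", "e\u0301", "\u4e2d"):
    print(repr(sample), measure(sample))
print({base_letter(form) for form in NOON_FORMS})
```

For plain ascii the three numbers are always equal; as soon as you
leave it, bytes, chars and cells go their separate ways.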
 
> Anything (module or program) which opens the master
> side of a pseudo-terminal is called a terminal
> emulator, therefore a 'vc' and an 'xterm' perform the
> same function, but in different spaces. I wonder how
> much of the software can be reused. You need vc's in
> the kernel in the absence of X, to support Linux
> virtual terminals.

No, you don't.
It would be perfectly ok to provide only very minimalistic kernel
support (even simpler and lighter than the current one) and
have a user space 'vc' loaded early in the boot process.

In fact it would be much saner.

[...]

> Comparing characters would be easy, they compare as
> unsigned integers, but sorting them would be a
> problem, because you'd want to group all the
> (accented) vowels together, according to language
> specific rules.

That is not new to unicode, it was already the case with
other encodings, including ascii.

And it is completely irrelevant to console/terminal anyway.

> In Greek, this wouldn't be a problem,
> because monotonic vowels and polytonic vowels,

No, it's not a problem if the proper sorting rules are used
(you choose them with the LC_COLLATE variable).

I don't know how accurate the sorting of polytonic letters is with
currently used greek locales; but that is easy to fix anyway; the
problem is not technical at all.
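
A small Python sketch of the difference between comparing code points
and using the collation rules of a locale. It assumes a locale named
"fr_FR.UTF-8" is installed; if it is not, the helper simply returns
None. The helper name is mine.

```python
import locale

words = ["zebra", "\u00e9tude", "eagle"]

# Plain comparison orders by code point, so U+00E9 lands after 'z'.
by_codepoint = sorted(words)

def sorted_for_locale(items, name):
    """Sort with the collation rules of locale `name`, or None if unavailable."""
    old = locale.setlocale(locale.LC_COLLATE)   # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return sorted(items, key=locale.strxfrm)
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_COLLATE, old)

print(by_codepoint)
print(sorted_for_locale(words, "fr_FR.UTF-8"))
```

How good the result is depends entirely on the collation data of the
locale, which is exactly why this is a data problem and not a
technical one.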

> The editor 'vi' would have to be modified to get/put
> wcar_t, so I don't understand why you'd need a
> separate unicode editor, or separate unicode
> application, whatever it might be.

The problem with text editors is the same as with command line
editing: cursor positioning and character deletion.
With unicode you cannot assume anymore that 1 char = 1 byte = 1 cell.

Editors assuming 1 byte = 1 char are irremediably broken anyway, and
completely unusable (the cursor displays at a completely different
place from where it really is!);
character selection and deletion is also complicated by the fact that
1 cell can be made of several characters.
And there are bidirectionality problems as well.
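
The deletion problem, as a minimal Python sketch. It treats a "cell" as
one base character plus the combining marks that follow it, which is a
simplification of the real rules (see Unicode Standard Annex #29 on
grapheme clusters); the function names are mine.

```python
import unicodedata

def delete_last_char(text):
    """Remove exactly one code point."""
    return text[:-1]

def delete_last_cell(text):
    """Remove the last base character together with its combining marks."""
    end = len(text)
    # Skip backwards over trailing combining marks...
    while end > 0 and unicodedata.combining(text[end - 1]):
        end -= 1
    # ...then drop the base character they were attached to.
    return text[:max(end - 1, 0)]

line = "ax\u030c"                      # 'a', then x + combining caron
print(repr(delete_last_char(line)))    # only the accent is gone
print(repr(delete_last_cell(line)))    # the whole accented letter is gone
```

An editor has to decide which of the two it means for each key, and
has to move the cursor accordingly.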

So editors that are deficient in some of those aspects need to be
fixed; or replaced with other editors able to do the job.

plain "vi" is fine to handle raw ascii, but useless to edit real human
text in utf-8.
(vim on the other hand is decent)

> 1) Does 'sort' work on utf-8 input?

yes.

> 2) Does 'grep' (Unix search) work on utf-8 input?

yes.

> 3) Is there a laundry list or Unix filters which need
> to be changed to support Internationalization? I know
> 'cat' doesn't.

I don't know.

> Why do Greek newspapers still use ISO 8859-7?

For the same reason that a majority of English language web sites
still use windows-1252, I suppose.

> Since utf-8 doubles the size of a file,

It doesn't; it depends on the text;
but anyway, even in the worst case, the increase in size for text is
ridiculous compared to the huge size taken by images, sound, video,
etc.
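
You can measure it yourself; a quick Python sketch comparing utf-8
with iso-8859-7 for three made-up samples (the sample strings are
mine):

```python
def size_ratio(text, legacy_charset):
    """UTF-8 size divided by the size in an 8-bit legacy charset."""
    return len(text.encode("utf-8")) / len(text.encode(legacy_charset))

english = "good morning"
greek = "\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"   # "kalimera"
page = '<p class="greeting">' + greek + "</p>"

for sample in (english, greek, page):
    print(size_ratio(sample, "iso-8859-7"))
```

Only the pure greek string doubles; ascii text does not grow at all,
and anything mixed with markup falls somewhere in between.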

> it looks like
> these older character sets will be around for a long
> time.

Yes, but not for that reason; they are around because there is a
lot of *OLD* data in those encodings, and it needs to be supported.

But charset encoding is, for a majority of end users, a moot point;
they simply don't care, nor do they even know what encoding is used;
they just see text on screen, that's all; it is the program
that does any charset conversion for them, if needed.

Note that nowadays, a majority of programs have already switched
to use unicode internally.

> Unicode is a much nicer solution, except it's
> prejudiced against non-english speakers.

??? It's exactly the opposite! Unicode is of all existing
charset encodings the only one that is not prejudiced
against any particular language.

> All tags are
> ascii, but the content can be anything, just switch
> keymaps, no need to tag the content again. However,
> double the size of the file and you double the
> download time too. Now you need a server twice as big.

I suggest you look at the size of your html files and your
image/sound/etc files on your typical web server; even if you double
the size of the html files, the percent increase in total is small.
And you don't double the size, there is a lot of html tagging in
ascii that just doesn't change.
In fact, in some cases the size may decrease, if you replace
a bunch of ugly &html;&mar;&ent;&iti;&es;&nbsp;&ug;&ly; &as;&hell;
with real and readable utf-8 characters.
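
A quick check of the entity case in Python, with a made-up French
sample (html.unescape is from the standard library):

```python
import html

with_entities = "&eacute;t&eacute; &agrave; No&euml;l"
as_text = html.unescape(with_entities)

entity_size = len(with_entities.encode("ascii"))
utf8_size = len(as_text.encode("utf-8"))
print(entity_size, utf8_size)
```

The entity version is pure ascii and still much bigger than the same
text written as real utf-8 characters.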

Note also that the same "size increase" argument was made
when 7bit encodings (like ascii or koi7) were replaced with 8bit
ones (like cp1252, iso-8859-7 or koi8-r), yet the new 8bit encodings
were overwhelmingly adopted, simply because 7bit only was too
limiting.

> It looks to me like the most important distinction
> between locales is not language, but national currency
> symbol.

Not for countries using euro currency :)

Differences are a combination of both language and national preferences.
Some things are very largely on the side of language difference,
like sorting order (LC_COLLATE), uppercase/lowercase changing,
definition of what is a letter, etc (LC_CTYPE), or the
date format (LC_TIME); others are more influenced by political
boundaries, like monetary conventions (LC_MONETARY), paper size,
or telephone number notation.

> 1) What are utf-8 locales? I would have thought that
> utf-8 would be applicable across all locales.

No; each locale defines an encoding.

en_US.ISO-8859-1 and en_US.UTF-8 are not the same
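
As a small Python illustration: the codeset is part of the locale
name, and the same text comes out as different bytes under each. The
locale_codeset() helper is mine and only parses the name; it does not
ask the C library anything.

```python
def locale_codeset(name):
    """Extract the codeset from a POSIX-style locale name, or None."""
    name = name.split("@", 1)[0]      # drop a modifier such as @euro
    if "." not in name:
        return None
    return name.split(".", 1)[1]

text = "caf\u00e9"
for loc in ("en_US.ISO-8859-1", "en_US.UTF-8"):
    print(loc, text.encode(locale_codeset(loc)))
```

Same language, same country, same text, different bytes on output.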
 
> Hypothesis: There could be an iso 8859 locale and a
> unicode locale for the same "region" for historical
> reasons.

Yes; the simple fact that non utf-8 encodings existed (and still exist)
makes it necessary to recognize them.

> This is causing the confusion. I've never
> worked in Latin-1, or Latin-2, just ascii and unicode,

I very much doubt you worked in "just ascii" (maybe in 1969; but clearly
not in the 1980s, much less in the 1990s);
most probably you were using one of cp437 (in DOS) or iso-8859-1 (in
unix)

> and I don't even want to think about using a different
> copy of the same program for each.

??

There is no confusion; nor need for different copies.

The situation is actually quite simple: a program either is
internationalized, or it is not.
If it is not, it is broken, and doomed to die as people stop using it.
If it is internationalized, then the character encoding it will use
for I/O will be transparent to the user; it will follow the locale
and work smoothly.

> Now, what about left->right and double-column
> characters?

And don't forget the zero-column ones.

> Can I run a copy of X windows in an xterm?

an xterm is a text terminal only.
 
> Is there a version of X which runs as a Microsoft
> Window

A lot of them actually.
I once used XWin32 to launch X clients from a unix box and display
them on the screen of a win95 machine that had a much better screen
and graphics card.


> Is there a version of Linux which runs as a Microsoft
> Window (not cygwin)?

?? What you say doesn't make sense.
(you can on the other hand run an operating system inside of
a virtual computer box inside another operating system)

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://chanae.walon.org/pablo/          PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Catalan or Esperanto]
[min povas skribi en valona, esperanta, angla aux latinidaj lingvoj]
