* Alex Malinovich ([EMAIL PROTECTED]) [030614 03:59]:
> I've been working on converting my system over to using UTF-8 wherever
> possible. I've already configured galeon, evolution, gnome-terminal and
> just about every other graphical application to use UTF-8 by default.
> I've set my locale to "en_US.UTF-8". And just about everything works
> just fine. Unfortunately, as I'm not all that familiar with all of the
> details of an i18n interface, there are a few things that still elude
> me.
> 
> 1) I've set up an .Xmodmap file to map my left Windows key to Multi_key
> so that I can type extended characters. However, I have to run "xmodmap
> .Xmodmap" manually every time I restart X. I'm guessing that I should
> put this in an X startup script. A .bashrc equivalent for X.
> Unfortunately, I'm not sure what the proper file to put it in is.

If this is all your .Xmodmap file does, you might think about just using
        
        Option          "XkbModel"      "pc104compose"

in your /etc/X11/XF86Config-4 .  This will make the change "global":
every time the X server starts, the right Windows key will be Multi_key.
No futzing with xmodmap required.  See /etc/X11/xkb/symbols/us (and
other files in that directory, if not using us) for different things you
can use for your XkbModel.
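
For reference, that Option line goes in the keyboard InputDevice section
of XF86Config-4.  Roughly like this (the Identifier on your system may
differ; this is just a sketch):

        Section "InputDevice"
                Identifier      "Generic Keyboard"
                Driver          "keyboard"
                Option          "CoreKeyboard"
                Option          "XkbRules"      "xfree86"
                Option          "XkbModel"      "pc104compose"
                Option          "XkbLayout"     "us"
        EndSection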

In fact, even if you are using xmodmap for other things, you might
consider making the above change and removing the rwin->Multi_key
mapping from your ~/.Xmodmap .  That's up to you.

As to getting xmodmap to load your ~/.Xmodmap each time you start X, you
might want to craft your own ~/.xsession .  Assuming
/etc/X11/Xsession.options contains allow-user-xsession (which it should,
by default), all you need to do is create a ~/.xsession file, and the
global Xsession (/etc/X11/Xsession) will exec it by default, after
setting up any other neat tricks the Debian packages have added to
/etc/X11/Xsession.d (e.g. starting an ssh-agent, etc.).
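
(If you want to double-check that the option is there, something like

        grep allow-user-xsession /etc/X11/Xsession.options

should print the line if it's enabled.)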

I probably shouldn't have mentioned so many files above; it may have
been confusing.  The short of it is that you create a ~/.xsession file
and put something like this in it:

xmodmap ~/.Xmodmap
exec x-session-manager

# EOF

For another example, on my laptop I have no x-session-manager; I just
use WindowMaker.  My ~/.xsession looks like this:

xscreensaver -nosplash &
exec x-window-manager


> 5) Just to satisfy my own curiosity, could someone explain the
> difference between all of the different UTF flavors? I've seen UTF-7,
> UTF-8, UTF-16, etc. My first guess would be that the number represents
> the number of bits used to represent any single character. Yet that
> seems unlikely since UTF-8 has WELL over 255 characters. Could anyone
> enlighten me?

Before UTF-8 came along, there were UCS-2 and UCS-4, which used 2 and 4
bytes per character respectively.  The negative aspects were that files
consisting of only ASCII characters encoded in UCS-4, for example, would
be 4 times larger and incompatible with non-Unicode-aware tools.  UCS-2
could represent U0000-UFFFF, and UCS-4 U00000000-U7FFFFFFF.  I don't
know much about UTF-16 and UTF-32, but I know that they're compatible
with UCS-2 and UCS-4 respectively.  I believe UTF-16 extends UCS-2 with
surrogate pairs, which gives it a 21-bit capacity (up to U0010FFFF).
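
To see the size difference concretely, you can run the ASCII letter 'A'
through iconv (this assumes glibc's iconv and xxd are installed; it's
just an illustration):

        $ echo -n A | iconv -t UCS-2BE | xxd -p
        0041
        $ echo -n A | iconv -t UCS-4BE | xxd -p
        00000041

One byte of data becomes two or four bytes on disk, most of them zero.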

UTF-8 is a variable-length encoding.  (As I type that, I think to myself
that I should point out that I'm no expert in this field, and I may not
be using the canonical terminology.)  UTF-8 can represent up to
U7FFFFFFF, which means a whole heckuvalot of characters.  It works by
using 1-6 bytes per character.  The first 128 characters are encoded as
single bytes, identical to the ASCII character set.  This is one of the
reasons UTF-8 is great; it's backwards-compatible with ASCII, but it's
not limited to 256 characters.  Let
me switch back to hex, since it's easier on my brain.  Then check out
this table:

U00000000-U0000007F     0xxxxxxx
U00000080-U000007FF     110xxxxx 10xxxxxx
U00000800-U0000FFFF     1110xxxx 10xxxxxx 10xxxxxx
U00010000-U001FFFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U00200000-U03FFFFFF     111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U04000000-U7FFFFFFF     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

(that was a fun little exercise in hex arithmetic!)

The 'x' characters represent bits used in encoding the character data.
The others are the overhead.  10xxxxxx is used as a continuation byte,
and any byte starting with 11xxxxxx is the start of a multi-byte
sequence.  The number of initial ones shows how long this sequence will
be.  The largest sequences, starting with 1111110x, can represent a
31-bit character.
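
A quick worked example (assuming a UTF-8 terminal and locale, like the
en_US.UTF-8 you've set up): e-acute (é) is U00E9, which is 11101001 in
binary.  That falls in the two-byte row, so its bits get padded out to
eleven (00011 101001) and packed into 110xxxxx 10xxxxxx, giving
11000011 10101001, i.e. the bytes 0xC3 0xA9.  You can check this from
the shell:

        $ printf 'é' | xxd -p
        c3a9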

I believe that there's a Unicode HOWTO around.  I learned what I know
from Markus Kuhn's web site.  He also taught me about the ISO paper
sizes, which made me want to go out and buy A4 the next time I run out
of paper! (stupid bass-ackwards US ...)

good times,
Vineet
-- 
http://www.doorstop.net/
-- 
Microsoft has argued that open source is bad for business, but you
have to ask, "Whose business? Theirs, or yours?"     --Tim O'Reilly
