Unicode character support -- suggested improvement

Bill Janssen Tue, 20 Nov 2001 16:38:32 -0800

The fact that HTML is full Unicode is a bit disconcerting.  I'd like
to support that in the DB format, and eventually in the viewer as
well.  But at the same time I'd like to keep the text display fairly
simple so that the Latin-1 Palms can handle it.


Right now the TextParser recognizes a Unicode character like #8211
(en-dash) and puts in an ASCII hyphen instead, and does the same sort
of substitution for half-a-dozen other Unicode characters.  All
remaining characters are just pushed in as "&#CODE;", where CODE is
the decimal code for the character.
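
As a rough illustration of the behaviour described above (not the actual
TextParser code), the substitution might look something like the sketch
below.  The names flatten_char, flatten_text and ASCII_FALLBACKS are made
up for this example, the table holds only two entries (the en-dash one
from this message, the em-dash one borrowed from the proposal further
down), and passing Latin-1 characters through untouched is an assumption.

```python
# Hypothetical fallback table.  Only the en-dash mapping comes from the
# message; the em-dash mapping is borrowed from the proposal below.
ASCII_FALLBACKS = {
    0x2013: "-",   # en-dash, decimal 8211
    0x2014: "--",  # em-dash, decimal 8212
}


def flatten_char(ch):
    """Return the text emitted for a single input character."""
    code = ord(ch)
    if code < 256:
        # Assumption: Latin-1 characters are passed through unchanged.
        return ch
    if code in ASCII_FALLBACKS:
        return ASCII_FALLBACKS[code]
    # Everything else becomes a decimal character reference.
    return "&#%d;" % code


def flatten_text(text):
    """Apply flatten_char to every character of a string."""
    return "".join(flatten_char(ch) for ch in text)
```

With this sketch, flatten_text("a\u2013b") gives "a-b", while a character
with no table entry, such as Greek alpha, comes out as "&#945;".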

I think it would be nice to use a function code for this, as follows:

When a non-Latin-1 (non-ASCII?) character is encountered, add a
function code which carries two things: the numeric character code,
and the length of an alternative Latin-1 (ASCII?) string (like "-" for
en-dash, or "--" for em-dash).  The function code would be immediately
followed by the alternative string.  Given that information, the
viewer could do one of three things:

(1) ignore the function code, in which case the alternative string
    is displayed (this is the same text that's currently presented);
(2) process the function code, display the real Unicode character,
    and "eat" the characters following the function code (the
    alternative string);
(3) some of (1) and some of (2), depending on what characters the
    viewer has available.
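
The scheme can be sketched roughly as follows.  The byte layout here is
an assumption made purely for illustration: an escape byte of 0x00, the
function code 0x80, a 16-bit big-endian character code, and a one-byte
length for the alternative string.  The message does not specify the
layout, and encode_char and render are hypothetical names.  A 16-bit
field also limits this sketch to code points below 0x10000.

```python
import struct

ESCAPE = 0x00        # assumed marker that introduces a function code
FUNC_UNICODE = 0x80  # function code mentioned in the message
HEADER_LEN = 5       # escape + code + 2-byte char code + 1-byte alt length


def encode_char(code_point, fallback):
    """Emit the function code followed by the alternative string."""
    alt = fallback.encode("latin-1")
    header = struct.pack(">BBHB", ESCAPE, FUNC_UNICODE, code_point, len(alt))
    return header + alt


def render(record, supported=frozenset()):
    """Render a record; 'supported' is the set of code points the viewer
    can draw.  Empty set = option (1), full set = option (2), anything
    in between = option (3)."""
    out = []
    i = 0
    while i < len(record):
        is_func = (record[i] == ESCAPE and i + 1 < len(record)
                   and record[i + 1] == FUNC_UNICODE)
        if is_func:
            _, _, code_point, alt_len = struct.unpack_from(">BBHB", record, i)
            i += HEADER_LEN
            if code_point in supported:
                out.append(chr(code_point))
                i += alt_len  # "eat" the alternative string
            # Otherwise skip only the header; the alternative string
            # that follows is then rendered as ordinary text.
        else:
            out.append(chr(record[i]))  # Latin-1 byte maps to same code point
            i += 1
    return "".join(out)
```

For a record built as b"a" + encode_char(0x2013, "-") + b"b", calling
render(record) shows "a-b", and render(record, {0x2013}) shows the real
en-dash between "a" and "b".  Because the length of the alternative
string travels with the function code, a viewer never has to know the
fallback table to skip the string correctly.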

I've implemented this, and tried it out, and approach (1) at least
works fine with the current viewer.  I've used function code 0x80 for
this purpose.

Bill
