The fact that HTML is full Unicode is a bit disconcerting. I'd like to support that in the DB format, and eventually in the viewer as well. But at the same time I'd like to keep the text display fairly simple so that the Latin-1 Palms can handle it.
Right now the TextParser looks at Unicode characters like #8211 (en-dash) and puts in an ASCII hyphen instead, and does the same for about half a dozen other Unicode characters. Any other character is pushed in as "&#CODE;", where CODE is the decimal code for the character.

I think it would be nice to use a function code for this, as follows. When a non-Latin-1 (non-ASCII?) character is encountered, add a function code which describes two things: the numeric character code, and the length of an alternative Latin-1 (ASCII?) string (like "-" for en-dash, or "--" for em-dash). The function code would be immediately followed by the alternative string, which is the same text the parser currently emits.

Given that information, the viewer could do any of the following:

(1) ignore the function code, in which case the alternative string is displayed, exactly as it is today;
(2) process the function code, display the real Unicode character, and "eat" the alternative string that follows the function code;
(3) some of (1) and some of (2), depending on which characters the viewer has available.

I've implemented this and tried it out, and approach (1) at least works fine with the current viewer. I've used function code 0x80 for this purpose.

Bill
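
To make the proposal concrete, here is a minimal Python sketch of the encoding and of the three viewer behaviours. Only the function code value 0x80 and the two dash replacements come from the description above. Everything else is an assumption made for illustration: the 0x00 byte that introduces a function code, the argument layout (one length byte, then a 16-bit big-endian character code), and the names ALTERNATIVES, encode, decode and can_display. It is not the actual parser or viewer code.

```python
import struct

ESCAPE = 0x00        # assumed marker byte introducing a function code
FUNC_UNICODE = 0x80  # function code named in the post; argument layout is assumed

# Hypothetical fallback table; only these two entries are given in the post.
ALTERNATIVES = {0x2013: "-", 0x2014: "--"}


def encode(text):
    """Emit Latin-1 bytes, tagging each non-Latin-1 character with a function code.

    Assumes the text contains no NUL characters, since 0x00 is used as the marker.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x100:
            out.append(cp)  # plain Latin-1, stored as a single byte
            continue
        alt = ALTERNATIVES.get(cp, "&#%d;" % cp).encode("ascii")
        if cp <= 0xFFFF:
            # marker, function code, alternative length, 16-bit character code
            out += struct.pack(">BBBH", ESCAPE, FUNC_UNICODE, len(alt), cp)
        # characters above 0xFFFF do not fit the assumed 16-bit field,
        # so they get the alternative string with no function code
        out += alt
    return bytes(out)


def decode(data, can_display=lambda cp: False):
    """Viewer side. can_display decides, per character, between (1) and (2)."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == ESCAPE and i + 1 < len(data) and data[i + 1] == FUNC_UNICODE:
            _, _, alt_len, cp = struct.unpack_from(">BBBH", data, i)
            i += 5  # skip marker, function code and its three argument bytes
            if can_display(cp):
                out.append(chr(cp))  # approach (2): show the real character
                i += alt_len         # and eat the alternative string
            # approach (1): do nothing, the alternative is read as ordinary text
            continue
        out.append(chr(data[i]))
        i += 1
    return "".join(out)
```

With the default can_display the viewer always falls back to the alternative string, which is approach (1). Passing a predicate that returns True for every code gives approach (2), and a predicate that accepts only the characters present in the device's font gives approach (3).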