> I haven't been contributing to LyX for too long. So I decided to write
> encoding converters for
>
> Big5
> CNS11643
> GB2312 (GB7589, GB7590, GB8565 and subset 7 are ignored until someone
> GB12345 supplies me an explanation: XLC_LOCALE file does not tell me
> how to map these subsets into EUC)
> JIS (EUC mapped and Shift-JIS)
> KSC5601-1987 (Wansung + Unified Hangul)
> KSC5601-1992 (Johab)
> and
> KanXi
> Morohashi
Great!
> The encodings in the first group are standard and the mapping tables can be
> more or less automatically generated with the help of scripts. The last two
> are dictionaries with a lot of characters, and I happen to have BDF files.
> Each dictionary contains approx. 50,000 characters, therefore most of them
> must be mapped to the private region 0x000f0000--0x0010ffff or, if you prefer
> the UTF-16 format, from 0xDB80 0xDC00 to 0xDBFF 0xDFFF (variable length encoding).
Hmm, that's a surprise to me. I suppose we have to turn to 32-bit wstrings...
Variable length encoding is not fun.
Alternatively, we could use a selective encoding, where only the glyphs
that are actually used are mapped to the private region within 16 bits. However,
I think that would be too complicated to handle, so 32-bit strings are
probably the best solution.
> Current implementation for ISO-8859-X encodings uses
> - table lookup for toUnicode
> - sequential search for fromUnicode
Notice that the implementation for the ISO-8859-X encodings does not
use sequential search, but binary search. There are fewer than 100
characters in the ISO-8859-X family of encodings that need to be searched,
and since they are sorted, the binary search makes at most 7 comparisons.
Don't be tricked by the code: there is no explicit binary search in it.
Instead, I use the STL "lower_bound" algorithm, which performs the
binary search internally.
If you have 50,000 characters, the corresponding search takes at most
16 comparisons (ceil(log2(50,000))). I think this is OK.
> Obviously for DBCS encodings, sequential search is simply too inefficient:
> Each encodings contains approx. 10,000 characters.
This corresponds to at most 14 comparisons.
> fromUnicode tables for lookup will be half filled, half empty.
> What am I to do? Are we going to have tables for both way conversions
> (just like Qt 2.0), or is it a waste of memory?
> - lookup table for toUnicode and sequential table for fromUnicode,
> both tables are arrays of size 10,000, we use binary search for
> fromUnicode instead of sequential search;
This is what the ISO-8859-X converters do. I think this will work fine
for 10,000 (or even 50,000) characters as well.
Greets,
Asger