> I haven't been contributing to LyX for too long.  So I decided to write
> encoding converters for
> 
>   Big5
>   CNS11643
>   GB2312       (GB7589, GB7590, GB8565 and subset 7 are ignored until someone
>   GB12345       supplies me an explanation: XLC_LOCALE file does not tell me
>                 how to map these subsets into EUC)
>   JIS          (EUC mapped and Shift-JIS)
>   KSC5601-1987 (Wansung + Unified Hangul)
>   KSC5601-1992 (Johab)
> and
>   KanXi
>   Morohashi

Great!

> The encodings in the first group are standard and the mapping tables can be
> more or less automatically generated with help of scripts.  The last two
> are dictionaries with a lot of characters, and I happen to have BDF files.
> Each dictionary contains approx. 50,000 characters, therefore most of them
> must be mapped to the private region 0x000f0000--0x0010ffff or, if you prefer
> UTF-16 format, from 0xDB80 0xDC00 to 0xDBFF 0xDFFF (variable length encoding).

Hmm, that's a surprise to me.  I suppose we have to turn to 32 bit wstrings...
Variable length encoding is not fun.

Alternatively, we could use a selective encoding, where only the glyphs
that are used are mapped to the private region within 16 bits.  However,
I think that will be too complicated to handle, so the 32 bit strings are
probably the best solution.
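
Just to illustrate the arithmetic (a rough sketch in plain C++, not LyX
code): a character from that private region either goes straight into a
32 bit string, or gets split into a high/low surrogate pair for UTF-16:

#include <cstdio>

// Sketch only: shows why code points above 0xFFFF force either 32 bit
// strings or UTF-16 surrogate pairs.  The plane 15/16 private region
// 0x000f0000--0x0010ffff mentioned above does not fit in 16 bits.
int main()
{
    unsigned long ucs4 = 0x000F0042;  // hypothetical dictionary character

    // 32 bit string: store the code point as-is, fixed length.
    unsigned long wide = ucs4;

    // UTF-16: split into a surrogate pair, variable length.
    unsigned long v = ucs4 - 0x10000;
    unsigned short high = 0xD800 | (v >> 10);    // 0xDB80..0xDBFF for this region
    unsigned short low  = 0xDC00 | (v & 0x3FF);  // 0xDC00..0xDFFF

    std::printf("UCS-4: U+%06lX  UTF-16: %04hX %04hX\n", wide, high, low);
    return 0;
}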

> Current implementation for ISO-8859-X encodings uses
>   - table lookup for toUnicode
>   - sequential search for fromUnicode

Notice that the implementations for the ISO-8859-X encodings do not
use sequential search, but binary search.  There are fewer than 100
characters in the ISO-8859-X family of encodings that need to be searched,
and since they are sorted, the binary search makes at most 7 comparisons.

Don't be misled by the code: there is no explicit binary search
implemented.  Instead, I use the STL "lower_bound" algorithm, which
performs the binary search.
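
Roughly, the fromUnicode side looks like this (a sketch with a few
ISO-8859-2 entries for illustration, not the generated tables themselves):

#include <algorithm>

// Sketch of a fromUnicode lookup via binary search, as described above.
// The table layout here is illustrative; the real converter tables are
// generated from the encoding definitions.
struct Mapping {
    unsigned short unicode;   // Unicode code point
    unsigned char  encoded;   // byte in the 8 bit encoding
};

// Sorted by unicode so that std::lower_bound can binary-search it.
// (A few ISO-8859-2 examples, not a complete table.)
static Mapping const from_unicode[] = {
    { 0x0104, 0xA1 },  // A with ogonek
    { 0x010C, 0xC8 },  // C with caron
    { 0x0141, 0xA3 },  // L with stroke
    { 0x0158, 0xD8 },  // R with caron
};

static bool operator<(Mapping const & a, Mapping const & b)
{
    return a.unicode < b.unicode;
}

// Returns the encoded byte, or 0 if the character cannot be represented.
unsigned char fromUnicode(unsigned short ucs2)
{
    Mapping key;
    key.unicode = ucs2;
    Mapping const * end = from_unicode
        + sizeof(from_unicode) / sizeof(from_unicode[0]);
    Mapping const * it = std::lower_bound(from_unicode, end, key);
    if (it != end && it->unicode == ucs2)
        return it->encoded;
    return 0;
}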

If you have 50,000 characters, the corresponding search takes at most
about 16 comparisons.  I think this is ok.

> Obviously for DBCS encodings, sequential search is simply too inefficient:
> Each encoding contains approx. 10,000 characters.

This corresponds to about 14 comparisons with binary search.

> fromUnicode tables for lookup will be half filled, half empty.
> What am I to do?  Are we going to have tables for both way conversions
> (just like Qt 2.0), or is it a waste of memory?

>   - lookup table for toUnicode and sequential table for fromUnicode,
>     both tables are arrays of size 10,000, we use binary search for
>     fromUnicode instead of sequential search;

This is what the ISO-8859-X converters do.  I think this will work fine
for 10,000 (or even 50,000) characters as well.
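
Something along those lines should work.  Here is a sketch with
hypothetical names (a flat toUnicode array plus a sorted array searched
with lower_bound; the real tables would come from the generated mappings):

#include <algorithm>
#include <vector>

// Sketch of the scheme with hypothetical names: a flat array, indexed by
// the two-byte code (or a compacted row/cell index), gives toUnicode
// directly; a sorted array of (unicode, code) pairs gives fromUnicode
// via binary search instead of sequential search.
struct Pair {
    unsigned short unicode;
    unsigned short code;   // the two-byte DBCS code
    bool operator<(Pair const & o) const { return unicode < o.unicode; }
};

class DbcsConverter {
public:
    // to_unicode is assumed to come from the generated mapping table
    // (roughly 10,000 non-zero entries); from_unicode_ is built from it once.
    explicit DbcsConverter(std::vector<unsigned short> const & to_unicode)
        : to_unicode_(to_unicode)
    {
        for (unsigned int code = 0; code < to_unicode_.size(); ++code) {
            if (to_unicode_[code]) {
                Pair p;
                p.unicode = to_unicode_[code];
                p.code = static_cast<unsigned short>(code);
                from_unicode_.push_back(p);
            }
        }
        std::sort(from_unicode_.begin(), from_unicode_.end());
    }

    // Direct table lookup.
    unsigned short toUnicode(unsigned short code) const
    {
        return code < to_unicode_.size() ? to_unicode_[code] : 0;
    }

    // Binary search, about 14 comparisons for 10,000 entries.
    unsigned short fromUnicode(unsigned short ucs2) const
    {
        Pair key;
        key.unicode = ucs2;
        key.code = 0;
        std::vector<Pair>::const_iterator it =
            std::lower_bound(from_unicode_.begin(), from_unicode_.end(), key);
        if (it != from_unicode_.end() && it->unicode == ucs2)
            return it->code;
        return 0;
    }

private:
    std::vector<unsigned short> to_unicode_;
    std::vector<Pair> from_unicode_;
};

The sorted array only stores the codes that actually exist, so the
half-filled fromUnicode table from the question above is avoided.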

Greets,

Asger
