"Asger Alstrup Nielsen" <[EMAIL PROTECTED]> wrote:

> Also, the size issue is wrong:  Having an external file will require more
> space, because you need both the data and then a parser for the data on top of
> that.  Also, the size of all of these 14 encoding files is less than 20k, if
> you compile with -O2, and -fno-exceptions.  I think we can afford that
> considering what we get.
>
> The current approach is the simplest one, and that's the main reason for doing
> things like this.  We don't really need dynamically loaded encodings, IMO.

I haven't been contributing to LyX for long, so I decided to write
encoding converters for

  Big5
  CNS11643
  GB2312       (GB7589, GB7590, GB8565 and subset 7 are ignored until someone
  GB12345       supplies me with an explanation: the XLC_LOCALE file does not
                tell me how to map these subsets into EUC)
  JIS          (EUC mapped and Shift-JIS)
  KSC5601-1987 (Wansung + Unified Hangul)
  KSC5601-1992 (Johab)
and
  KanXi
  Morohashi

The encodings in the first group are standard, and the mapping tables can be
generated more or less automatically with the help of scripts.  The last two
are dictionaries with a lot of characters, and I happen to have BDF files for
them.  Each dictionary contains approx. 50,000 characters, therefore most of
them must be mapped to the private region 0x000f0000--0x0010ffff or, if you
prefer the UTF-16 form, to surrogate pairs from 0xDB80 0xDC00 to 0xDBFF 0xDFFF
(a variable-length encoding).

The current implementation for the ISO-8859-X encodings uses
  - a lookup table for toUnicode
  - a sequential search for fromUnicode
The asymmetry here seems to come from the fact that sparsely populated
fromUnicode lookup tables would waste too much memory.  The tables for
sequential search are much smaller.
Obviously, for DBCS encodings a sequential search is simply too inefficient:
each encoding contains approx. 10,000 characters.  On the other hand, direct
fromUnicode lookup tables would be half filled, half empty.
What am I to do?  Are we going to have tables for conversion in both
directions (just like Qt 2.0), or is that a waste of memory?
A few possible ways are:
  - lookup tables in both directions: toUnicode is an array of 10,000
    elements, fromUnicode is an array of 25,000 elements;
  - a lookup table for toUnicode and a sequential table for fromUnicode,
    both arrays of size 10,000, with a binary search for fromUnicode
    instead of the sequential search;
  - a lookup table for toUnicode, with a skip list built dynamically in
    the EncSomething class constructor for the fromUnicode conversion.
Any better ideas?

Regards,
        SMiyata
