Re: [HACKERS] Supporting SJIS as a database encoding

Heikki Linnakangas Tue, 13 Sep 2016 01:44:40 -0700

On 09/08/2016 09:35 AM, Kyotaro HORIGUCHI wrote:

Returning in UTF-8 bloats the result string by about 1.5 times so
it doesn't seem to make sense comparing with it. But it takes
real = 47.35s.


Nice!

I was hoping that this would also make the binaries smaller. A few dozen kB of storage is perhaps not a big deal these days, but still. And smaller tables would also consume less memory and CPU cache.

I removed the #include "../../Unicode/utf8_to_sjis.map" line, so that the old table isn't included anymore, compiled, and ran "strip utf8_and_sjis.so". Without this patch, it's 126 kB, and with it, it's 160 kB. So the radix tree takes a little bit more space.

That's not too bad, and I'm sure we could live with that, but with a few simple tricks, we could do better. First, since all the values we store in the tree are < 0xffff, we could store them in int16 instead of int32, and halve the size of the table right off the bat. That won't work for all encodings, of course, but it might be worth it to have two versions of the code, one for int16 and another for int32.
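To make the size argument concrete, here is a minimal sketch of narrowing one 64-entry leaf chunk from 32-bit to 16-bit entries. The names (chunk32, chunk16, fits_in_16, narrow_chunk) are hypothetical and not taken from the patch, and the sketch assumes an unsigned 16-bit type, since values such as 0xfa58 exceed the range of a signed int16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one leaf chunk covers the 64 possible trailing bytes. */
#define CHUNK_ENTRIES 64

/* The chunk as a 32-bit table would store it. */
typedef struct { uint32_t v[CHUNK_ENTRIES]; } chunk32;

/* The same chunk with 16-bit entries; usable only if every value fits. */
typedef struct { uint16_t v[CHUNK_ENTRIES]; } chunk16;

/* Return 1 if all n values can be stored in 16 bits, 0 otherwise. */
static int
fits_in_16(const uint32_t *vals, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (vals[i] > UINT16_MAX)
            return 0;
    }
    return 1;
}

/* Copy a 32-bit chunk into a 16-bit chunk; the caller must check the fit first. */
static void
narrow_chunk(const chunk32 *src, chunk16 *dst)
{
    for (size_t i = 0; i < CHUNK_ENTRIES; i++)
    {
        assert(src->v[i] <= UINT16_MAX);
        dst->v[i] = (uint16_t) src->v[i];
    }
}
```

An encoding whose values all pass fits_in_16 could use the narrow layout; any encoding with wider values would stay on the 32-bit variant, which is why two versions of the lookup code would be needed.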

Another trick is to eliminate redundancies in the tables. Many of the tables contain lots of zeros, as in:

  /*   c3xx */{
    /*   c380 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    /*   c388 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    /*   c390 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x817e,
    /*   c398 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    /*   c3a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    /*   c3a8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    /*   c3b0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x8180,
    /*   c3b8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000
  },

and

  /* e388xx */{
    /* e38880 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    /* e38888 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    /* e38890 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    /* e38898 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    /* e388a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    /* e388a8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    /* e388b0 */ 0x0000, 0xfa58, 0x878b, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    /* e388b8 */ 0x0000, 0x878c, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000
  },

You could overlay the last row of the first table, which is all zeros, with the first row of the second table, which is also all zeros. (Many of the tables have a lot more zero-rows than this example.)
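One way to picture the overlay is to pack all chunks into a single flat array and let each new chunk start inside the tail of the previous one wherever the contents agree. The following is only a sketch under that assumption; append_overlapped and the flat-pool layout are hypothetical and not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Append chunk[0..n) to pool[0..*len), reusing the longest suffix of the
 * pool that is identical to a prefix of the chunk. Returns the offset in
 * the pool at which the chunk starts. The caller must ensure that the pool
 * has room for n more entries.
 */
static size_t
append_overlapped(uint16_t *pool, size_t *len, const uint16_t *chunk, size_t n)
{
    size_t k = (*len < n) ? *len : n;   /* largest overlap worth trying */

    /* Shrink the candidate overlap until pool tail and chunk head match. */
    while (k > 0 &&
           memcmp(pool + *len - k, chunk, k * sizeof(uint16_t)) != 0)
        k--;

    size_t start = *len - k;

    /* Only the non-overlapping remainder of the chunk needs new space. */
    memcpy(pool + *len, chunk + k, (n - k) * sizeof(uint16_t));
    *len += n - k;
    return start;
}
```

With the two tables shown above, the c3xx chunk ends in one all-zero row and the e388xx chunk begins with several, so the second chunk would start 8 entries before the end of the first and the pair would occupy 120 entries instead of 128. The lookup code would then index the pool by the returned offsets rather than by fixed chunk boundaries.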

But yes, this patch looks very promising in general. I think we should switch over to radix trees for all the encodings.


- Heikki



