>> That's fine when not every code point is used, but it's different for
>> GB18030 where almost all code points are used. Using a plain array
>> saves space and saves a binary search.
>
> Well, it doesn't save any space: if we get rid of the additional linear
> ranges in the lookup table, what remains is 30733 entries requiring about
> 256K, same as (or a bit less than) what you suggest.
We could do both. What about something like this:

static unsigned int utf32_to_gb18030_from_0x0001[1105] = {
	/* 0x0 */ 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
	...
};

static unsigned int utf32_to_gb18030_from_0x2010[1587] = {
	/* 0x0 */ 0xa95c, 0x8136a532, 0x8136a533, 0xa843, 0xa1aa, 0xa844, 0xa1ac, 0x8136a534,
	...
};

static unsigned int utf32_to_gb18030_from_0x2E81[28965] = {
	/* 0x0 */ 0xfe50, 0x8138fd39, 0x8138fe30, 0xfe54, 0x8138fe31, 0x8138fe32, 0x8138fe33, 0xfe57,
	...
};

static unsigned int utf32_to_gb18030_from_0xE000[2149] = {
	/* 0x0 */ 0xaaa1, 0xaaa2, 0xaaa3, 0xaaa4, 0xaaa5, 0xaaa6, 0xaaa7, 0xaaa8,
	...
};

static unsigned int utf32_to_gb18030_from_0xF92C[254] = {
	/* 0x0 */ 0xfd9c, 0x84308535, 0x84308536, 0x84308537, 0x84308538, 0x84308539, 0x84308630, 0x84308631,
	...
};

static unsigned int utf32_to_gb18030_from_0xFE30[464] = {
	/* 0x0 */ 0xa955, 0xa6f2, 0x84318538, 0xa6f4, 0xa6f5, 0xa6e0, 0xa6e1, 0xa6f0,
	...
};

static uint32
conv_utf8_to_18030(uint32 code)
{
	uint32		ucs = utf8word_to_unicode(code);

#define conv_lin(minunicode, maxunicode, mincode) \
	if (ucs >= minunicode && ucs <= maxunicode) \
		return gb_unlinear(ucs - minunicode + gb_linear(mincode))

	/*
	 * Note the exclusive upper bound: each array covers [min, max) with
	 * (max - min) entries, and ucs == maxunicode is handled by the
	 * conv_lin() range that follows it.
	 */
#define conv_array(minunicode, maxunicode) \
	if (ucs >= minunicode && ucs < maxunicode) \
		return utf32_to_gb18030_from_##minunicode[ucs - minunicode];

	conv_array(0x0001, 0x0452);
	conv_lin(0x0452, 0x200F, 0x8130D330);
	conv_array(0x2010, 0x2643);
	conv_lin(0x2643, 0x2E80, 0x8137A839);
	conv_array(0x2E81, 0x9FA6);
	conv_lin(0x9FA6, 0xD7FF, 0x82358F33);
	conv_array(0xE000, 0xE865);
	conv_lin(0xE865, 0xF92B, 0x8336D030);
	conv_array(0xF92C, 0xFA2A);
	conv_lin(0xFA2A, 0xFE2F, 0x84309C38);
	conv_array(0xFE30, 0x10000);
	conv_lin(0x10000, 0x10FFFF, 0x90308130);

	/* No mapping exists */
	return 0;
}

> The point about possibly being able to do this with a simple lookup table
> instead of binary search is valid, but I still say it's a mistake to
> suppose that we should consider that only for GB18030.
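For reference, here is a minimal sketch of the gb_linear()/gb_unlinear() helpers the code above assumes (the names and the uint32 typedef are assumptions here, not existing API). The point is only that the four-byte GB18030 space (b1 in 0x81..0xFE, b2 in 0x30..0x39, b3 in 0x81..0xFE, b4 in 0x30..0x39, held big-endian in a uint32) is contiguous once you number its codes sequentially, so the "mincode plus offset" arithmetic in conv_lin() works:

```c
#include <stdint.h>

typedef uint32_t uint32;

/*
 * Number a four-byte GB18030 code sequentially, starting from
 * gb_linear(0x81308130) == 0.  Each b1 step spans 10*126*10 codes,
 * each b2 step 126*10, each b3 step 10.
 */
static uint32
gb_linear(uint32 gb)
{
	uint32		b1 = (gb >> 24) & 0xFF;
	uint32		b2 = (gb >> 16) & 0xFF;
	uint32		b3 = (gb >> 8) & 0xFF;
	uint32		b4 = gb & 0xFF;

	return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10
		+ (b4 - 0x30);
}

/* Inverse: rebuild the four-byte code from its linear number. */
static uint32
gb_unlinear(uint32 lin)
{
	uint32		b4 = 0x30 + lin % 10;
	uint32		b3 = 0x81 + (lin / 10) % 126;
	uint32		b2 = 0x30 + (lin / (10 * 126)) % 10;
	uint32		b1 = 0x81 + lin / (10 * 126 * 10);

	return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
}
```

With this numbering, gb_unlinear(gb_linear(0x8130D330) + 1) yields 0x8130D331, i.e. consecutive linear values walk the four-byte space one code at a time, which is exactly what conv_lin() relies on.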
> With the reduced
> table size, the GB18030 conversion tables are not all that far out of line
> with the other Far Eastern conversions:
>
> $ size utf8*.so | sort -n
>    text    data     bss     dec     hex filename
>    1880     512      16    2408     968 utf8_and_ascii.so
>    2394     528      16    2938     b7a utf8_and_iso8859_1.so
>    6674     512      16    7202    1c22 utf8_and_cyrillic.so
>   24318     904      16   25238    6296 utf8_and_win.so
>   28750     968      16   29734    7426 utf8_and_iso8859.so
>  121110     512      16  121638   1db26 utf8_and_euc_cn.so
>  123458     512      16  123986   1e452 utf8_and_sjis.so
>  133606     512      16  134134   20bf6 utf8_and_euc_kr.so
>  185014     512      16  185542   2d4c6 utf8_and_sjis2004.so
>  185522     512      16  186050   2d6c2 utf8_and_euc2004.so
>  212950     512      16  213478   341e6 utf8_and_euc_jp.so
>  221394     512      16  221922   362e2 utf8_and_big5.so
>  274772     512      16  275300   43364 utf8_and_johab.so
>  277776     512      16  278304   43f20 utf8_and_uhc.so
>  332262     512      16  332790   513f6 utf8_and_euc_tw.so
>  350640     512      16  351168   55bc0 utf8_and_gbk.so
>  496680     512      16  497208   79638 utf8_and_gb18030.so
>
> If we were to get excited about reducing the conversion time for GB18030,
> it would clearly make sense to use similar infrastructure for GBK, and
> perhaps the EUC encodings too.

I'll check them as well. If they have linear ranges, it should work.

> However, I'm not that excited about changing it. We have not heard field
> complaints about these converters being too slow. What's more, there
> doesn't seem to be any practical way to apply the same idea to the other
> conversion direction, which means if you do feel there's a speed problem
> this would only halfway fix it.

It does work for the other direction if you linearize the GB18030 codes
first. That's why we need to convert to UTF-32 first as well: decoding
UTF-8 to UTF-32 is itself a form of linearization.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers