Hello, thank you for polishing this.

At Wed, 9 Nov 2016 02:19:01 +0100, Daniel Gustafsson <dan...@yesql.se> wrote in <80f34f25-bf6d-4bcd-9c38-42ed10d3f...@yesql.se>
> > On 08 Nov 2016, at 17:37, Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote:
> >
> > On 10/31/16 12:11 PM, Daniel Gustafsson wrote:
> >> I took a small stab at doing some cleaning of the Perl scripts, mainly around
> >> using the more modern (well, modern as in +15 years old) form for open(..),
> >> avoiding global filehandles for passing scalar references and enforcing use
> >> strict.  Some smaller typos and fixes were also included.  It seems my Perl has
> >> become a bit rusty so I hope the changes make sense.  The produced files are
> >> identical with these patches applied, they are merely doing cleaning as opposed
> >> to bugfixing.
> >>
> >> The attached patches are against the 0001-0006 patches from Heikki and you in
> >> this series of emails, the separation is intended to make them easier to read.
> >
> > Cool.  See also here:
> > https://www.postgresql.org/message-id/55E52225.4040305%40gmx.net
> Nice, not having hacked much Perl in quite a while I had all but forgotten
> about perlcritic.

I tried it on CentOS7. Installation failed, saying that Module::Build is too old. It had been installed with yum, so I removed it and installed it from CPAN. That failed again with many 'Could not create MYMETA files' errors. Then I tried to install CPAN::Meta, and it failed saying that CPAN::Meta::YAML is too *new*. That sucks.

So your patch is greatly helpful. Thank you.

| -my @mapnames = map { s/\.map//; $_ } values %plainmaps;
| +my @mapnames = map { my $m = $_; $m =~ s/\.map//; $m } values %plainmaps;

It surprised me to learn that perlcritic catches such things.

> Running it on the current version of the patchset yields mostly warnings on
> string values used in the require “convutils.pm” statement.  There were however
> two more interesting reports: one more open() call not using the three
> parameter form and an instance of map which alters the input value.

Sorry for overlooking it.

> The latter
> is not causing an issue since we don’t use the input list past the map but
> fixing it seems like good form.

Agreed.

> Attached is a patch that addresses the perlcritic reports (running without any
> special options).

Thanks. The attached patch includes the perlcritic fixes. 0001, 0002 and 0003 are Heikki's patches, unmodified since they were first proposed. They are a bit too big, so I am not attaching them to this mail (again).

https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd...@iki.fi

0004 is the radix-tree stuff, which applies on top of the three patches above. There is a hidden fifth patch, which is 20MB in size, but it can be generated by running make in the Unicode directory.

[$(TOP)]$ ./configure ...
[$(TOP)]$ make
[Unicode]$ make
[Unicode]$ make distclean
[Unicode]$ git add .
[Unicode]$ commit
=== COMMIT MESSAGE
Replace map files with radix tree files.

These encodings no longer use the former map files and use the new radix tree files instead. All existing authority files in this directory are removed.
===

regards,
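
P.S. For anyone wondering why perlcritic flags the map above: map aliases $_ to each element of its input list, so an in-place s/// rewrites the input as a side effect. The following is only a minimal sketch of that behavior, not code from the patch. It assumes a plain array as a stand-in for "values %plainmaps", and the names @input_a, @input_b, @names_inplace and @names_copy are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the list of .map file names (not from the patch).
my @input_a = ('big5_to_utf8.map', 'utf8_to_big5.map');
my @input_b = ('big5_to_utf8.map', 'utf8_to_big5.map');

# In-place form: $_ is an alias for each element of @input_a, so the
# substitution also strips ".map" from @input_a itself.
my @names_inplace = map { s/\.map//; $_ } @input_a;

# Copying form (the shape used in the fix): the substitution is applied
# to a private copy, so @input_b is left untouched.
my @names_copy = map { my $m = $_; $m =~ s/\.map//; $m } @input_b;

print "in-place result: @names_inplace\n";
print "in-place input : @input_a\n";
print "copy result    : @names_copy\n";
print "copy input     : @input_b\n";
```

Both forms return the same list of names; they differ only in whether the input list survives unchanged, which matters as soon as that list is used again after the map.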
From e6718001cbe3f1937e4f052c66bff68f7217c43c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyot...@lab.ntt.co.jp>
Date: Thu, 27 Oct 2016 14:19:47 +0900
Subject: [PATCH 4/5] Make map generators to generate radix tree files.

This introduces radix tree conversion to map-referring (non-arithmetic)
code conversions. UCS_to*.pl files generate _radix.map files in addition
to the .map files currently generated. These _radix.map files are
referenced by conversion procs instead of .map files.

Radix trees are not easily verified, so a checker is provided. "make
mapcheck" builds and runs it. It verifies radix maps against the
corresponding .map files.

Now 'make distclean' removes authority files and unused .map files since
they should not be contained in the source archive. On the other hand,
'make maintainer-clean' leaves them and removes all map files. This seems
somewhat strange, but it comes from the special characteristics of this
directory.

All Perl scripts are turned into modern style: use strict, modern usage
of file handles, and reindentation by pgperltidy. pgperltidy has been
applied to the Perl scripts in this commit.

When "make all" is run for the first time, some UCS_to*.pl scripts run
many times because of a limitation of gmake.
---
 src/backend/utils/mb/Unicode/Makefile | 78 +-
 src/backend/utils/mb/Unicode/UCS_to_BIG5.pl | 33 +-
 src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl | 36 +-
 .../utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl | 69 +-
 src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl | 42 +-
 src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl | 12 +-
 src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl | 23 +-
 src/backend/utils/mb/Unicode/UCS_to_GB18030.pl | 32 +-
 src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl | 10 +-
 .../utils/mb/Unicode/UCS_to_SHIFT_JIS_2004.pl | 95 +-
 src/backend/utils/mb/Unicode/UCS_to_SJIS.pl | 37 +-
 src/backend/utils/mb/Unicode/UCS_to_UHC.pl | 40 +-
 src/backend/utils/mb/Unicode/UCS_to_most.pl | 17 +-
 src/backend/utils/mb/Unicode/convutils.pm | 1025 ++++++++++++++++++--
 src/backend/utils/mb/Unicode/download_srctxts.sh | 127 +++
 src/backend/utils/mb/Unicode/make_mapchecker.pl | 71 ++
 src/backend/utils/mb/Unicode/map_checker.c | 147 +++
 src/backend/utils/mb/conv.c | 171 +++-
 .../conversion_procs/utf8_and_big5/utf8_and_big5.c | 8 +-
 .../utf8_and_cyrillic/utf8_and_cyrillic.c | 16 +-
 .../utf8_and_euc2004/utf8_and_euc2004.c | 8 +-
 .../utf8_and_euc_cn/utf8_and_euc_cn.c | 8 +-
 .../utf8_and_euc_jp/utf8_and_euc_jp.c | 8 +-
 .../utf8_and_euc_kr/utf8_and_euc_kr.c | 8 +-
 .../utf8_and_euc_tw/utf8_and_euc_tw.c | 8 +-
 .../utf8_and_gb18030/utf8_and_gb18030.c | 8 +-
 .../conversion_procs/utf8_and_gbk/utf8_and_gbk.c | 8 +-
 .../utf8_and_iso8859/utf8_and_iso8859.c | 127 ++-
 .../utf8_and_johab/utf8_and_johab.c | 8 +-
 .../conversion_procs/utf8_and_sjis/utf8_and_sjis.c | 8 +-
 .../utf8_and_sjis2004/utf8_and_sjis2004.c | 8 +-
 .../conversion_procs/utf8_and_uhc/utf8_and_uhc.c | 8 +-
 .../conversion_procs/utf8_and_win/utf8_and_win.c | 98 +-
 src/include/mb/pg_wchar.h | 31 +-
 34 files changed, 1919 insertions(+), 514 deletions(-)
 create mode 100755 src/backend/utils/mb/Unicode/download_srctxts.sh
 create mode 100755 src/backend/utils/mb/Unicode/make_mapchecker.pl
 create mode 100644
src/backend/utils/mb/Unicode/map_checker.c diff --git a/src/backend/utils/mb/Unicode/Makefile b/src/backend/utils/mb/Unicode/Makefile index e11f1d7..f184f65 100644 --- a/src/backend/utils/mb/Unicode/Makefile +++ b/src/backend/utils/mb/Unicode/Makefile @@ -40,23 +40,30 @@ WINMAPS = win866_to_utf8.map utf8_to_win866.map \ GENERICMAPS = $(ISO8859MAPS) $(WINMAPS) \ gbk_to_utf8.map utf8_to_gbk.map \ - koi8r_to_utf8.map utf8_to_koi8r.map + koi8r_to_utf8.map utf8_to_koi8r.map \ + koi8u_to_utf8.map utf8_to_koi8u.map SPECIALMAPS = euc_cn_to_utf8.map utf8_to_euc_cn.map \ euc_jp_to_utf8.map utf8_to_euc_jp.map \ + euc_jis_2004_to_utf8.map utf8_to_euc_jis_2004.map \ euc_kr_to_utf8.map utf8_to_euc_kr.map \ euc_tw_to_utf8.map utf8_to_euc_tw.map \ sjis_to_utf8.map utf8_to_sjis.map \ + shift_jis_2004_to_utf8.map utf8_to_shift_jis_2004.map \ gb18030_to_utf8.map utf8_to_gb18030.map \ big5_to_utf8.map utf8_to_big5.map \ johab_to_utf8.map utf8_to_johab.map \ uhc_to_utf8.map utf8_to_uhc.map \ - euc_jis_2004_to_utf8.map euc_jis_2004_to_utf8_combined.map \ + +COMBINEDMAPS = euc_jis_2004_to_utf8.map euc_jis_2004_to_utf8_combined.map \ utf8_to_euc_jis_2004.map utf8_to_euc_jis_2004_combined.map \ shift_jis_2004_to_utf8.map shift_jis_2004_to_utf8_combined.map \ utf8_to_shift_jis_2004.map utf8_to_shift_jis_2004_combined.map -MAPS = $(GENERICMAPS) $(SPECIALMAPS) +MAPS = $(GENERICMAPS) $(SPECIALMAPS) $(COMBINEDMAPS) + +RADIXGENERICMAPS = $(subst .map,_radix.map,$(GENERICMAPS)) +RADIXMAPS = $(subst .map,_radix.map,$(GENERICMAPS) $(SPECIALMAPS)) ISO8859TEXTS = 8859-2.TXT 8859-3.TXT 8859-4.TXT 8859-5.TXT \ 8859-6.TXT 8859-7.TXT 8859-8.TXT 8859-9.TXT \ @@ -68,53 +75,76 @@ WINTEXTS = CP866.TXT CP874.TXT CP936.TXT \ CP1252.TXT CP1253.TXT CP1254.TXT CP1255.TXT \ CP1256.TXT CP1257.TXT CP1258.TXT +SPECIALTEXTS = BIG5.TXT CNS11643.TXT \ + CP932.TXT CP950.TXT \ + JIS0201.TXT JIS0208.TXT JIS0212.TXT SHIFTJIS.TXT \ + JOHAB.TXT KSX1001.TXT windows-949-2000.xml \ + euc-jis-2004-std.txt sjis-0213-2004-std.txt \ 
+ gb-18030-2000.xml + GENERICTEXTS = $(ISO8859TEXTS) $(WINTEXTS) \ KOI8-R.TXT KOI8-U.TXT -all: $(MAPS) +TEXTS = $(GENERICTEXTS) $(WINTEXTS) $(ISO8859TEXTS) $(SPECIALTEXTS) + +OBJS = map_checker.o + +BINS = map_checker + +all: $(MAPS) $(RADIXMAPS) $(BINS) -$(GENERICMAPS): UCS_to_most.pl $(GENERICTEXTS) +map_checker.h: make_mapchecker.pl $(MAPS) $(RADIXMAPS) $(PERL) $< -johab_to_utf8.map utf8_to_johab.map: UCS_to_JOHAB.pl JOHAB.TXT +map_checker.o: map_checker.c map_checker.h + +map_checker: map_checker.o + +$(GENERICMAPS) $(RADIXGENERICMAPS): UCS_to_most.pl $(GENERICTEXTS) + $(PERL) $< + +johab_to_utf8.map utf8_to_johab.map johab_to_utf8_radix.map utf8_to_johab_radix.map: UCS_to_JOHAB.pl JOHAB.TXT $(PERL) $< -uhc_to_utf8.map utf8_to_uhc.map: UCS_to_UHC.pl windows-949-2000.xml +uhc_to_utf8.map utf8_to_uhc.map uhc_to_utf8_radix.map utf8_to_uhc_radix.map: UCS_to_UHC.pl windows-949-2000.xml $(PERL) $< -euc_jp_to_utf8.map utf8_to_euc_jp.map: UCS_to_EUC_JP.pl CP932.TXT JIS0212.TXT +euc_jp_to_utf8.map utf8_to_euc_jp.map euc_jp_to_utf8_radix.map utf8_to_euc_jp_radix.map: UCS_to_EUC_JP.pl CP932.TXT JIS0212.TXT $(PERL) $< -euc_cn_to_utf8.map utf8_to_euc_cn.map: UCS_to_EUC_CN.pl gb-18030-2000.xml +euc_cn_to_utf8.map utf8_to_euc_cn.map euc_cn_to_utf8_radix.map utf8_to_euc_cn_radix.map: UCS_to_EUC_CN.pl gb-18030-2000.xml $(PERL) $< -euc_kr_to_utf8.map utf8_to_euc_kr.map: UCS_to_EUC_KR.pl KSX1001.TXT +euc_kr_to_utf8.map utf8_to_euc_kr.map euc_kr_to_utf8_radix.map utf8_to_euc_kr_radix.map: UCS_to_EUC_KR.pl KSX1001.TXT $(PERL) $< -euc_tw_to_utf8.map utf8_to_euc_tw.map: UCS_to_EUC_TW.pl CNS11643.TXT +euc_tw_to_utf8.map utf8_to_euc_tw.map euc_tw_to_utf8_radix.map utf8_to_euc_tw_radix.map: UCS_to_EUC_TW.pl CNS11643.TXT $(PERL) $< -sjis_to_utf8.map utf8_to_sjis.map: UCS_to_SJIS.pl CP932.TXT +sjis_to_utf8.map utf8_to_sjis.map sjis_to_utf8_radix.map utf8_to_sjis_radix.map: UCS_to_SJIS.pl CP932.TXT $(PERL) $< -gb18030_to_utf8.map utf8_to_gb18030.map: UCS_to_GB18030.pl gb-18030-2000.xml 
+gb18030_to_utf8.map utf8_to_gb18030.map gb18030_to_utf8_radix.map utf8_to_gb18030_radix.map: UCS_to_GB18030.pl gb-18030-2000.xml $(PERL) $< -big5_to_utf8.map utf8_to_big5.map: UCS_to_BIG5.pl BIG5.TXT CP950.TXT +big5_to_utf8.map utf8_to_big5.map big5_to_utf8_radix.map utf8_to_big5_radix.map: UCS_to_BIG5.pl BIG5.TXT CP950.TXT $(PERL) $< -euc_jis_2004_to_utf8.map euc_jis_2004_to_utf8_combined.map utf8_to_euc_jis_2004.map utf8_to_euc_jis_2004_combined.map: UCS_to_EUC_JIS_2004.pl euc-jis-2004-std.txt +euc_jis_2004_to_utf8.map euc_jis_2004_to_utf8_radix.map euc_jis_2004_to_utf8_combined.map utf8_to_euc_jis_2004.map utf8_to_euc_jis_2004_radix.map utf8_to_euc_jis_2004_combined.map: UCS_to_EUC_JIS_2004.pl euc-jis-2004-std.txt $(PERL) $< -shift_jis_2004_to_utf8.map shift_jis_2004_to_utf8_combined.map utf8_to_shift_jis_2004.map utf8_to_shift_jis_2004_combined.map: UCS_to_SHIFT_JIS_2004.pl sjis-0213-2004-std.txt +shift_jis_2004_to_utf8.map shift_jis_2004_to_utf8_radix.map shift_jis_2004_to_utf8_combined.map utf8_to_shift_jis_2004.map utf8_to_shift_jis_2004_radix.map utf8_to_shift_jis_2004_combined.map: UCS_to_SHIFT_JIS_2004.pl sjis-0213-2004-std.txt $(PERL) $< -distclean: clean - rm -f $(TEXTS) +distclean: + rm -f $(TEXTS) $(GENERICMAPS) $(SPECIALMAPS) $(OBJS) $(BINS) map_checker.h -maintainer-clean: distclean - rm -f $(MAPS) +# maintainer-clean intentionally leaves $(TEXTS) +maintainer-clean: + rm -f $(MAPS) $(RADIXMAPS) $(GENERICMAPS) $(SPECIALMAPS) $(OBJS) $(BINS) map_checker.h +mapcheck: $(MAPS) $(RADIXMAPS) map_checker + ./map_checker DOWNLOAD = wget -O $@ --no-use-server-timestamps #DOWNLOAD = curl -o $@ @@ -122,12 +152,12 @@ DOWNLOAD = wget -O $@ --no-use-server-timestamps BIG5.TXT CNS11643.TXT: $(DOWNLOAD) http://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/$(@F) +gb-18030-2000.xml windows-949-2000.xml: + $(DOWNLOAD) http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/$(@F) + euc-jis-2004-std.txt sjis-0213-2004-std.txt: $(DOWNLOAD) 
http://x0213.org/codetable/$(@F) -gb-18030-2000.xml: - $(DOWNLOAD) https://ssl.icu-project.org/repos/icu/data/trunk/charset/data/xml/$(@F) - GB2312.TXT: $(DOWNLOAD) 'http://trac.greenstone.org/browser/trunk/gsdl/unicode/MAPPINGS/EASTASIA/GB/GB2312.TXT?rev=1842&format=txt' @@ -143,7 +173,7 @@ KOI8-R.TXT KOI8-U.TXT: $(ISO8859TEXTS): $(DOWNLOAD) http://ftp.unicode.org/Public/MAPPINGS/ISO8859/$(@F) -$(filter-out CP8%,$(WINTEXTS)): +$(filter-out CP8%,$(WINTEXTS)) $(filter CP9%, $(SPECIALTEXTS)): $(DOWNLOAD) http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/$(@F) $(filter CP8%,$(WINTEXTS)): diff --git a/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl b/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl index 6a1321b..822ab28 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl @@ -24,8 +24,10 @@ # UCS-2 code in hex # # and Unicode name (not used in this script) +use strict; +require convutils; -require "convutils.pm"; +my $this_script = $0; # Load BIG5.TXT my $all = &read_source("BIG5.TXT"); @@ -33,9 +35,10 @@ my $all = &read_source("BIG5.TXT"); # Load CP950.TXT my $cp950txt = &read_source("CP950.TXT"); -foreach my $i (@$cp950txt) { +foreach my $i (@$cp950txt) +{ my $code = $i->{code}; - my $ucs = $i->{ucs}; + my $ucs = $i->{ucs}; # Pick only the ETEN extended characters in the range 0xf9d6 - 0xf9dc # from CP950.TXT @@ -44,20 +47,22 @@ foreach my $i (@$cp950txt) { && $code >= 0xf9d6 && $code <= 0xf9dc) { - push @$all, {code => $code, - ucs => $ucs, - comment => $i->{comment}, - direction => "both"}; + push @$all, + { code => $code, + ucs => $ucs, + comment => $i->{comment}, + direction => "both" }; } } -foreach my $i (@$all) { +foreach my $i (@$all) +{ my $code = $i->{code}; - my $ucs = $i->{ucs}; + my $ucs = $i->{ucs}; - # BIG5.TXT maps several BIG5 characters to U+FFFD. The UTF-8 to BIG5 mapping can - # contain only one of them. 
XXX: Doesn't really make sense to include any of them, - # but for historical reasons, we map the first one of them. +# BIG5.TXT maps several BIG5 characters to U+FFFD. The UTF-8 to BIG5 mapping can +# contain only one of them. XXX: Doesn't really make sense to include any of them, +# but for historical reasons, we map the first one of them. if ($i->{ucs} == 0xFFFD && $i->{code} != 0xA15A) { $i->{direction} = "to_unicode"; @@ -65,4 +70,6 @@ foreach my $i (@$all) { } # Output -print_tables("BIG5", $all); +print_tables($this_script, "BIG5", $all); +print_radix_trees($this_script, "BIG5", $all); + diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl index 8df23f8..a933c12 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl @@ -13,32 +13,36 @@ # where the "u" field is the Unicode code point in hex, # and the "b" field is the hex byte sequence for GB18030 -require "convutils.pm"; +use strict; +require convutils; + +my $this_script = $0; # Read the input -$in_file = "gb-18030-2000.xml"; +my $in_file = "gb-18030-2000.xml"; -open(FILE, $in_file) || die("cannot open $in_file"); +open(my $in, '<', $in_file) || die("cannot open $in_file"); my @mapping; -while (<FILE>) +while (<$in>) { next if (!m/<a u="([0-9A-F]+)" b="([0-9A-F ]+)"/); - $u = $1; - $c = $2; + my ($u, $c) = ($1, $2); $c =~ s/ //g; - $ucs = hex($u); - $code = hex($c); + my $ucs = hex($u); + my $code = hex($c); # The GB-18030 character set, which we use as the source, contains # a lot of extra characters on top of the GB2312 character set that # EUC_CN encodes. Filter out those extra characters. 
+ next if (($code & 0xFF) < 0xA1); +#<<< do not let perltidy touch this next if (!($code >= 0xA100 && $code <= 0xA9FF || $code >= 0xB000 && $code <= 0xF7FF)); - +#>>> next if ($code >= 0xA2A1 && $code <= 0xA2B0); next if ($code >= 0xA2E3 && $code <= 0xA2E4); next if ($code >= 0xA2EF && $code <= 0xA2F0); @@ -65,12 +69,12 @@ while (<FILE>) $ucs = 0x2015; } - push @mapping, { - ucs => $ucs, - code => $code, - direction => 'both' - } + push @mapping, + { ucs => $ucs, + code => $code, + direction => 'both' }; } -close(FILE); +close($in); -print_tables("EUC_CN", \@mapping); +print_tables($this_script, "EUC_CN", \@mapping); +print_radix_trees($this_script, "EUC_CN", \@mapping); diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl index b4e140b..38c0b2a 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl @@ -7,53 +7,54 @@ # Generate UTF-8 <--> EUC_JIS_2004 code conversion tables from # "euc-jis-2004-std.txt" (http://x0213.org) -require "convutils.pm"; +use strict; +require convutils; + +my $this_script = $0; # first generate UTF-8 --> EUC_JIS_2004 table -$in_file = "euc-jis-2004-std.txt"; +my $in_file = "euc-jis-2004-std.txt"; -open(FILE, $in_file) || die("cannot open $in_file"); +open(my $in, '<', $in_file) || die("cannot open $in_file"); my @all; -while ($line = <FILE>) +while (my $line = <$in>) { if ($line =~ /^0x(.*)[ \t]*U\+(.*)\+(.*)[ \t]*#(.*)$/) { - $c = $1; - $u1 = $2; - $u2 = $3; - $rest = "U+" . $u1 . "+" . $u2 . $4; - $code = hex($c); - $ucs1 = hex($u1); - $ucs2 = hex($u2); - - push @all, { direction => 'both', - ucs => $ucs1, - ucs_second => $ucs2, - code => $code, - comment => $rest }; - next; + # combined characters + my ($c, $u1, $u2) = ($1, $2, $3); + my $rest = "U+" . $u1 . "+" . $u2 . 
$4; + my $code = hex($c); + my $ucs1 = hex($u1); + my $ucs2 = hex($u2); + + push @all, + { direction => 'both', + ucs => $ucs1, + ucs_second => $ucs2, + code => $code, + comment => $rest }; } elsif ($line =~ /^0x(.*)[ \t]*U\+(.*)[ \t]*#(.*)$/) { - $c = $1; - $u = $2; - $rest = "U+" . $u . $3; - } - else - { - next; + # non-combined characters + my ($c, $u, $rest) = ($1, $2, "U+" . $2 . $3); + my $ucs = hex($u); + my $code = hex($c); + + next if ($code < 0x80 && $ucs < 0x80); + + push @all, + { direction => 'both', + ucs => $ucs, + code => $code, + comment => $rest }; } - - $ucs = hex($u); - $code = hex($c); - - next if ($code < 0x80 && $ucs < 0x80); - - push @all, { direction => 'both', ucs => $ucs, code => $code, comment => $rest }; } -close(FILE); +close($in); -print_tables("EUC_JIS_2004", \@all, 1); +print_tables($this_script, "EUC_JIS_2004", \@all, 1); +print_radix_trees($this_script, "EUC_JIS_2004", \@all); diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl index 0e9dd29..5ac3542 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl @@ -12,14 +12,17 @@ # organization's ftp site. use strict; -require "convutils.pm"; +require convutils; + +my $this_script = $0; # Load JIS0212.TXT my $jis0212 = &read_source("JIS0212.TXT"); my @mapping; -foreach my $i (@$jis0212) { +foreach my $i (@$jis0212) +{ # We have a different mapping for this in the EUC_JP to UTF-8 direction. if ($i->{code} == 0x2243) { @@ -46,13 +49,14 @@ foreach my $i (@$jis0212) { # Load CP932.TXT. my $ct932 = &read_source("CP932.TXT"); -foreach my $i (@$ct932) { +foreach my $i (@$ct932) +{ my $sjis = $i->{code}; # We have a different mapping for this in the EUC_JP to UTF-8 direction. 
- if ($sjis == 0xeefa || - $sjis == 0xeefb || - $sjis == 0xeefc) + if ( $sjis == 0xeefa + || $sjis == 0xeefb + || $sjis == 0xeefc) { next; } @@ -61,9 +65,10 @@ foreach my $i (@$ct932) { { my $jis = &sjis2jis($sjis); - $i->{code} = $jis | ($jis < 0x100 ? 0x8e00 : - ($sjis >= 0xeffd ? 0x8f8080 : 0x8080)); - +#<<< do not let perltidy touch this + $i->{code} = $jis | ($jis < 0x100 ? 0x8e00: + ($sjis >= 0xeffd ? 0x8f8080 : 0x8080)); +#>>> # Remember the SJIS code for later. $i->{sjis} = $sjis; @@ -71,13 +76,14 @@ foreach my $i (@$ct932) { } } -foreach my $i (@mapping) { +foreach my $i (@mapping) +{ my $sjis = $i->{sjis}; # These SJIS characters are excluded completely. - if ($sjis >= 0xed00 && $sjis <= 0xeef9 || - $sjis >= 0xfa54 && $sjis <= 0xfa56 || - $sjis >= 0xfa58 && $sjis <= 0xfc4b) + if ( $sjis >= 0xed00 && $sjis <= 0xeef9 + || $sjis >= 0xfa54 && $sjis <= 0xfa56 + || $sjis >= 0xfa58 && $sjis <= 0xfc4b) { $i->{direction} = "none"; next; @@ -90,6 +96,7 @@ foreach my $i (@mapping) { next; } +#<<< do not let perltidy touch this if ($sjis == 0x8790 || $sjis == 0x8791 || $sjis == 0x8792 || $sjis == 0x8795 || $sjis == 0x8796 || $sjis == 0x8797 || $sjis == 0x879a || $sjis == 0x879b || $sjis == 0x879c || @@ -190,8 +197,11 @@ push @mapping, ( {direction => 'to_unicode', ucs => 0x2121, code => 0x8ff4ad, comment => '# TELEPHONE SIGN'}, {direction => 'to_unicode', ucs => 0x3231, code => 0x8ff4ab, comment => '# PARENTHESIZED IDEOGRAPH STOCK'} ); +#>>> + +print_tables($this_script, "EUC_JP", \@mapping); +print_radix_trees($this_script, "EUC_JP", \@mapping); -print_tables("EUC_JP", \@mapping); ####################################################################### # sjis2jis ; SJIS => JIS conversion @@ -213,12 +223,12 @@ sub sjis2jis if ($pos >= 114 * 0x5e && $pos <= 115 * 0x5e + 0x1b) { # This region (115-ku) is out of range of JIS code but for - # convenient to generate code in EUC CODESET 3, move this to + # convenience to generate code in EUC CODESET 3, move this to # 
seemingly duplicate region (83-84-ku). $pos = $pos - ((31 * 0x5e) + 12); # after 85-ku 82-ten needs to be moved 2 codepoints - $pos = $pos - 2 if ($pos >= 84 * 0x5c + 82) + $pos = $pos - 2 if ($pos >= 84 * 0x5c + 82); } my $hi2 = $pos / 0x5e; diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl index a917d06..d17d777 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl @@ -16,7 +16,10 @@ # UCS-2 code in hex # # and Unicode name (not used in this script) -require "convutils.pm"; +use strict; +require convutils; + +my $this_script = $0; # Load the source file. @@ -28,10 +31,13 @@ foreach my $i (@$mapping) } # Some extra characters that are not in KSX1001.TXT -push @$mapping, ( +#<<< do not let perltidy touch this +push @$mapping,( {direction => 'both', ucs => 0x20AC, code => 0xa2e6, comment => '# EURO SIGN'}, {direction => 'both', ucs => 0x00AE, code => 0xa2e7, comment => '# REGISTERED SIGN'}, {direction => 'both', ucs => 0x327E, code => 0xa2e8, comment => '# CIRCLED HANGUL IEUNG U'} ); +#>>> -print_tables("EUC_KR", $mapping); +print_tables($this_script, "EUC_KR", $mapping); +print_radix_trees($this_script, "EUC_KR", $mapping); diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl index aceef54..603edc4 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl @@ -17,7 +17,10 @@ # UCS-2 code in hex # # and Unicode name (not used in this script) -require "convutils.pm"; +use strict; +require convutils; + +my $this_script = $0; my $mapping = &read_source("CNS11643.TXT"); @@ -25,8 +28,8 @@ my @extras; foreach my $i (@$mapping) { - my $ucs = $i->{ucs}; - my $code = $i->{code}; + my $ucs = $i->{ucs}; + my $code = $i->{code}; my $origcode = $i->{code}; my $plane = ($code & 0x1f0000) >> 16; @@ -49,15 +52,15 @@ foreach my $i (@$mapping) # Some codes are mapped 
twice in the EUC_TW to UTF-8 table. if ($origcode >= 0x12121 && $origcode <= 0x20000) { - push @extras, { - ucs => $i->{ucs}, - code => ($i->{code} + 0x8ea10000), - rest => $i->{rest}, - direction => 'to_unicode' - } + push @extras, + { ucs => $i->{ucs}, + code => ($i->{code} + 0x8ea10000), + rest => $i->{rest}, + direction => 'to_unicode' }; } } push @$mapping, @extras; -print_tables("EUC_TW", $mapping); +print_tables($this_script, "EUC_TW", $mapping); +print_radix_trees($this_script, "EUC_TW", $mapping); diff --git a/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl b/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl index f583610..e20b4a8 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl @@ -13,33 +13,35 @@ # where the "u" field is the Unicode code point in hex, # and the "b" field is the hex byte sequence for GB18030 -require "convutils.pm"; +use strict; +require convutils; + +my $this_script = $0; # Read the input -$in_file = "gb-18030-2000.xml"; +my $in_file = "gb-18030-2000.xml"; -open(FILE, $in_file) || die("cannot open $in_file"); +open(my $in, '<', $in_file) || die("cannot open $in_file"); my @mapping; -while (<FILE>) +while (<$in>) { next if (!m/<a u="([0-9A-F]+)" b="([0-9A-F ]+)"/); - $u = $1; - $c = $2; + my ($u, $c) = ($1, $2); $c =~ s/ //g; - $ucs = hex($u); - $code = hex($c); + my $ucs = hex($u); + my $code = hex($c); if ($code >= 0x80 && $ucs >= 0x0080) { - push @mapping, { - ucs => $ucs, - code => $code, - direction => 'both' - } + push @mapping, + { ucs => $ucs, + code => $code, + direction => 'both' }; } } -close(FILE); +close($in); -print_tables("GB18030", \@mapping); +print_tables($this_script, "GB18030", \@mapping); +print_radix_trees($this_script, "GB18030", \@mapping); diff --git a/src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl b/src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl index b98f9a7..2dc9fb3 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl +++ 
b/src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl @@ -15,17 +15,23 @@ # UCS-2 code in hex # # and Unicode name (not used in this script) -require "convutils.pm"; +use strict; +require convutils; + +my $this_script = $0; # Load the source file. my $mapping = &read_source("JOHAB.TXT"); # Some extra characters that are not in JOHAB.TXT +#<<< do not let perltidy touch this push @$mapping, ( {direction => 'both', ucs => 0x20AC, code => 0xd9e6, comment => '# EURO SIGN'}, {direction => 'both', ucs => 0x00AE, code => 0xd9e7, comment => '# REGISTERED SIGN'}, {direction => 'both', ucs => 0x327E, code => 0xd9e8, comment => '# CIRCLED HANGUL IEUNG U'} ); +#>>> -print_tables("JOHAB", $mapping); +print_tables($this_script, "JOHAB", $mapping); +print_radix_trees($this_script, "JOHAB", $mapping); diff --git a/src/backend/utils/mb/Unicode/UCS_to_SHIFT_JIS_2004.pl b/src/backend/utils/mb/Unicode/UCS_to_SHIFT_JIS_2004.pl index 16a53ad..51ab6a1 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_SHIFT_JIS_2004.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_SHIFT_JIS_2004.pl @@ -7,75 +7,68 @@ # Generate UTF-8 <--> SHIFT_JIS_2004 code conversion tables from # "sjis-0213-2004-std.txt" (http://x0213.org) -require "convutils.pm"; +use strict; +require convutils; # first generate UTF-8 --> SHIFT_JIS_2004 table -$in_file = "sjis-0213-2004-std.txt"; +my $this_script = $0; -open(FILE, $in_file) || die("cannot open $in_file"); +my $in_file = "sjis-0213-2004-std.txt"; + +open(my $in, '<', $in_file) || die("cannot open $in_file"); my @mapping; -while ($line = <FILE>) +while (my $line = <$in>) { if ($line =~ /^0x(.*)[ \t]*U\+(.*)\+(.*)[ \t]*#(.*)$/) { - $c = $1; - $u1 = $2; - $u2 = $3; - $rest = "U+" . $u1 . "+" . $u2 . $4; - $code = hex($c); - $ucs1 = hex($u1); - $ucs2 = hex($u2); + my ($c, $u1, $u2) = ($1, $2, $3); + my $rest = "U+" . $u1 . "+" . $u2 . 
$4; + my $code = hex($c); + my $ucs1 = hex($u1); + my $ucs2 = hex($u2); - push @mapping, { - code => $code, - ucs => $ucs1, + push @mapping, + { code => $code, + ucs => $ucs1, ucs_second => $ucs2, - comment => $rest, - direction => 'both' - }; + comment => $rest, + direction => 'both' }; next; } elsif ($line =~ /^0x(.*)[ \t]*U\+(.*)[ \t]*#(.*)$/) { - $c = $1; - $u = $2; - $rest = "U+" . $u . $3; - } - else - { - next; - } + my ($c, $u, $rest) = ($1, $2, "U+" . $2 . $3); + my ($ucs, $code) = (hex($u), hex($c)); + my $direction; - $ucs = hex($u); - $code = hex($c); + if ($code < 0x80 && $ucs < 0x80) + { + next; + } + elsif ($code < 0x80) + { + $direction = 'from_unicode'; + } + elsif ($ucs < 0x80) + { + $direction = 'to_unicode'; + } + else + { + $direction = 'both'; + } - if ($code < 0x80 && $ucs < 0x80) - { - next; + push @mapping, + { code => $code, + ucs => $ucs, + comment => $rest, + direction => $direction }; } - elsif ($code < 0x80) - { - $direction = 'from_unicode'; - } - elsif ($ucs < 0x80) - { - $direction = 'to_unicode'; - } - else - { - $direction = 'both'; - } - - push @mapping, { - code => $code, - ucs => $ucs, - comment => $rest, - direction => $direction - }; } -close(FILE); +close($in); -print_tables("SHIFT_JIS_2004", \@mapping, 1); +print_tables($this_script, "SHIFT_JIS_2004", \@mapping, 1); +print_radix_trees($this_script, "SHIFT_JIS_2004", \@mapping); diff --git a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl index c8ff712..ffeb65f 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl @@ -11,38 +11,45 @@ # ftp site. 
use strict; -require "convutils.pm"; +require convutils; -my $charset = read_source("CP932.TXT"); +my $this_script = $0; + +my $mapping = read_source("CP932.TXT"); # Drop these SJIS codes from the source for UTF8=>SJIS conversion -my @reject_sjis =( +#<<< do not let perltidy touch this +my @reject_sjis = ( 0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782, - 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797, + 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797, 0x879a..0x879c -); + ); +#>>> -foreach my $i (@$charset) +foreach my $i (@$mapping) { my $code = $i->{code}; - my $ucs = $i->{ucs}; + my $ucs = $i->{ucs}; - if (grep {$code == $_} @reject_sjis) + if (grep { $code == $_ } @reject_sjis) { $i->{direction} = "to_unicode"; } } # Add these UTF8->SJIS pairs to the table. -push @$charset, ( - {direction => "from_unicode", ucs => 0x00a2, code => 0x8191, comment => '# CENT SIGN'}, - {direction => "from_unicode", ucs => 0x00a3, code => 0x8192, comment => '# POUND SIGN'}, - {direction => "from_unicode", ucs => 0x00a5, code => 0x5c, comment => '# YEN SIGN'}, - {direction => "from_unicode", ucs => 0x00ac, code => 0x81ca, comment => '# NOT SIGN'}, +#<<< do not let perltidy touch this +push @$mapping, ( + {direction => "from_unicode", ucs => 0x00a2, code => 0x8191, comment => '# CENT SIGN'}, + {direction => "from_unicode", ucs => 0x00a3, code => 0x8192, comment => '# POUND SIGN'}, + {direction => "from_unicode", ucs => 0x00a5, code => 0x5c, comment => '# YEN SIGN'}, + {direction => "from_unicode", ucs => 0x00ac, code => 0x81ca, comment => '# NOT SIGN'}, {direction => "from_unicode", ucs => 0x2016, code => 0x8161, comment => '# DOUBLE VERTICAL LINE'}, {direction => "from_unicode", ucs => 0x203e, code => 0x7e, comment => '# OVERLINE'}, {direction => "from_unicode", ucs => 0x2212, code => 0x817c, comment => '# MINUS SIGN'}, {direction => "from_unicode", ucs => 0x301c, code => 0x8160, comment => '# WAVE DASH'} -); + ); +#>>> -print_tables("SJIS", $charset); 
+print_tables($this_script, "SJIS", $mapping); +print_radix_trees($this_script, "SJIS", $mapping); diff --git a/src/backend/utils/mb/Unicode/UCS_to_UHC.pl b/src/backend/utils/mb/Unicode/UCS_to_UHC.pl index b6bf3bd..2905b95 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_UHC.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_UHC.pl @@ -13,39 +13,45 @@ # where the "u" field is the Unicode code point in hex, # and the "b" field is the hex byte sequence for UHC -require "convutils.pm"; +use strict; +require convutils; + +my $this_script = $0; # Read the input -$in_file = "windows-949-2000.xml"; +my $in_file = "windows-949-2000.xml"; -open(FILE, $in_file) || die("cannot open $in_file"); +open(my $in, '<', $in_file) || die("cannot open $in_file"); my @mapping; -while (<FILE>) +while (<$in>) { next if (!m/<a u="([0-9A-F]+)" b="([0-9A-F ]+)"/); - $u = $1; - $c = $2; + my ($u, $c) = ($1, $2); $c =~ s/ //g; - $ucs = hex($u); - $code = hex($c); + my $ucs = hex($u); + my $code = hex($c); next if ($code == 0x0080 || $code == 0x00FF); if ($code >= 0x80 && $ucs >= 0x0080) { - push @mapping, { - ucs => $ucs, - code => $code, - direction => 'both' - } + push @mapping, + { ucs => $ucs, + code => $code, + direction => 'both' }; } } -close(FILE); +close($in); # One extra character that's not in the source file. 
-push @mapping, { direction => 'both', code => 0xa2e8, ucs => 0x327e, comment => 'CIRCLED HANGUL IEUNG U' }; - -print_tables("UHC", \@mapping); +push @mapping, + { direction => 'both', + code => 0xa2e8, + ucs => 0x327e, + comment => 'CIRCLED HANGUL IEUNG U' }; + +print_tables($this_script, "UHC", \@mapping); +print_radix_trees($this_script, "UHC", \@mapping); diff --git a/src/backend/utils/mb/Unicode/UCS_to_most.pl b/src/backend/utils/mb/Unicode/UCS_to_most.pl index a3cf436..572c232 100755 --- a/src/backend/utils/mb/Unicode/UCS_to_most.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_most.pl @@ -15,9 +15,12 @@ # UCS-2 code in hex # # and Unicode name (not used in this script) -require "convutils.pm"; +use strict; +require convutils; -%filename = ( +my $this_script = $0; + +my %filename = ( 'WIN866' => 'CP866.TXT', 'WIN874' => 'CP874.TXT', 'WIN1250' => 'CP1250.TXT', @@ -46,11 +49,13 @@ require "convutils.pm"; 'KOI8U' => 'KOI8-U.TXT', 'GBK' => 'CP936.TXT'); -@charsets = keys(%filename); -@charsets = @ARGV if scalar(@ARGV); -foreach $charset (@charsets) +# make maps for all encodings if not specified +my @charsets = (scalar(@ARGV) > 0) ? 
@ARGV : keys(%filename); + +foreach my $charset (@charsets) { my $mapping = &read_source($filename{$charset}); - print_tables($charset, $mapping); + print_tables($this_script, $charset, $mapping); + print_radix_trees($this_script, $charset, $mapping); } diff --git a/src/backend/utils/mb/Unicode/convutils.pm b/src/backend/utils/mb/Unicode/convutils.pm index ac8eda4..26a42d3 100644 --- a/src/backend/utils/mb/Unicode/convutils.pm +++ b/src/backend/utils/mb/Unicode/convutils.pm @@ -3,13 +3,15 @@ # # src/backend/utils/mb/Unicode/convutils.pm +use strict; + ####################################################################### # convert UCS-4 to UTF-8 # sub ucs2utf { - local ($ucs) = @_; - local $utf; + my ($ucs) = @_; + my $utf; if ($ucs <= 0x007f) { @@ -44,29 +46,33 @@ sub read_source my ($fname) = @_; my @r; - open(my $in, $fname) || die("cannot open $fname"); + open(my $in, '<', $fname) || die("cannot open $fname"); while (<$in>) { next if (/^#/); chop; - next if (/^$/); # Ignore empty lines + next if (/^$/); # Ignore empty lines next if (/^0x([0-9A-F]+)\s+(#.*)$/); # Skip the first column for JIS0208.TXT + #<<< do not let perltidy touch this if (!/^0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)\s+(?:0x([0-9A-Fa-f]+)\s+)?(#.*)$/) { print STDERR "READ ERROR at line $. in $fname: $_\n"; exit; } - my $out = {f => $fname, l => $., - code => hex($1), - ucs => hex($2), - comment => $4, - direction => "both" - }; + #>>> + + my $out = { + f => $fname, + l => $., + code => hex($1), + ucs => hex($2), + comment => $4, + direction => "both" }; # Ignore pure ASCII mappings. PostgreSQL character conversion code # never even passes these to the conversion code. @@ -83,6 +89,7 @@ sub read_source # print_tables : output mapping tables # # Arguments: +# this_script - the name of the *caller script* of this feature # charset - string name of the character set. 
# table - mapping table (see format below) # verbose - if 1, output comment on each line, @@ -104,7 +111,7 @@ sub read_source # sub print_tables { - my ($charset, $table, $verbose) = @_; + my ($this_script, $charset, $table, $verbose) = @_; # Build an array with only the to-UTF8 direction mappings my @to_unicode; @@ -116,165 +123,1031 @@ sub print_tables { if (defined $i->{ucs_second}) { - my $entry = {utf8 => ucs2utf($i->{ucs}), - utf8_second => ucs2utf($i->{ucs_second}), - code => $i->{code}, - comment => $i->{comment}, - f => $i->{f}, l => $i->{l}}; + my $entry = { + utf8 => ucs2utf($i->{ucs}), + utf8_second => ucs2utf($i->{ucs_second}), + code => $i->{code}, + comment => $i->{comment}, + f => $i->{f}, + l => $i->{l} }; if ($i->{direction} eq "both" || $i->{direction} eq "to_unicode") { push @to_unicode_combined, $entry; } - if ($i->{direction} eq "both" || $i->{direction} eq "from_unicode") + if ( $i->{direction} eq "both" + || $i->{direction} eq "from_unicode") { push @from_unicode_combined, $entry; } } else { - my $entry = {utf8 => ucs2utf($i->{ucs}), - code => $i->{code}, - comment => $i->{comment}, - f => $i->{f}, l => $i->{l}}; + my $entry = { + utf8 => ucs2utf($i->{ucs}), + code => $i->{code}, + comment => $i->{comment}, + f => $i->{f}, + l => $i->{l} }; if ($i->{direction} eq "both" || $i->{direction} eq "to_unicode") { push @to_unicode, $entry; } - if ($i->{direction} eq "both" || $i->{direction} eq "from_unicode") + if ( $i->{direction} eq "both" + || $i->{direction} eq "from_unicode") { push @from_unicode, $entry; } } } - print_to_utf8_map($charset, \@to_unicode, $verbose); - print_to_utf8_combined_map($charset, \@to_unicode_combined, $verbose) if (scalar @to_unicode_combined > 0); - print_from_utf8_map($charset, \@from_unicode, $verbose); - print_from_utf8_combined_map($charset, \@from_unicode_combined, $verbose) if (scalar @from_unicode_combined > 0); + print_to_utf8_map($this_script, $charset, \@to_unicode, $verbose); + if (scalar 
@to_unicode_combined > 0) + { + print_to_utf8_combined_map($this_script, $charset, + \@to_unicode_combined, $verbose); + } + print_from_utf8_map($this_script, $charset, \@from_unicode, $verbose); + if (scalar @from_unicode_combined > 0) + { + print_from_utf8_combined_map($this_script, $charset, + \@from_unicode_combined, $verbose); + } } sub print_from_utf8_map { - my ($charset, $table, $verbose) = @_; + my ($this_script, $charset, $table, $verbose) = @_; my $last_comment = ""; my $fname = lc("utf8_to_${charset}.map"); print "- Writing UTF8=>${charset} conversion table: $fname\n"; - open(my $out, "> $fname") || die "cannot open output file : $fname\n"; - printf($out "/* src/backend/utils/mb/Unicode/$fname */\n\n". - "static const pg_utf_to_local ULmap${charset}[ %d ] = {", - scalar(@$table)); + open(my $out, '>', $fname) || die "cannot open output file : $fname\n"; + printf $out "/* src/backend/utils/mb/Unicode/$fname */\n" + . "/* This file is generated by $this_script */\n\n" + . "static const pg_utf_to_local ULmap${charset}[ %d ] = {", + scalar(@$table); my $first = 1; - foreach my $i (sort {$$a{utf8} <=> $$b{utf8}} @$table) - { - print($out ",") if (!$first); + foreach my $i (sort { $a->{utf8} <=> $b->{utf8} } @$table) + { + print $out "," if (!$first); $first = 0; - print($out "\t/* $last_comment */") if ($verbose); + print $out "\t/* $last_comment */" if ($verbose); - printf($out "\n {0x%04x, 0x%04x}", $$i{utf8}, $$i{code}); + printf $out "\n {0x%04x, 0x%04x}", $i->{utf8}, $i->{code}; if ($verbose >= 2) { - $last_comment = "$$i{f}:$$i{l} $$i{comment}"; + $last_comment = + sprintf("%s:%d %s", $i->{f}, $i->{l}, $i->{comment}); } else { - $last_comment = $$i{comment}; + $last_comment = $i->{comment}; } } - print($out "\t/* $last_comment */") if ($verbose); + print $out "\t/* $last_comment */" if ($verbose); print $out "\n};\n"; close($out); } sub print_from_utf8_combined_map { - my ($charset, $table, $verbose) = @_; + my ($this_script, $charset, $table, 
$verbose) = @_; my $last_comment = ""; my $fname = lc("utf8_to_${charset}_combined.map"); print "- Writing UTF8=>${charset} conversion table: $fname\n"; - open(my $out, "> $fname") || die "cannot open output file : $fname\n"; - printf($out "/* src/backend/utils/mb/Unicode/$fname */\n\n". - "static const pg_utf_to_local_combined ULmap${charset}_combined[ %d ] = {", - scalar(@$table)); + open(my $out, '>', $fname) || die "cannot open output file : $fname\n"; + printf $out "/* src/backend/utils/mb/Unicode/$fname */\n" + . "/* This file is generated by $this_script */\n\n" + . "static const pg_utf_to_local_combined ULmap${charset}_combined[ %d ] = {", + scalar(@$table); my $first = 1; - foreach my $i (sort {$$a{utf8} <=> $$b{utf8}} @$table) - { - print($out ",") if (!$first); + foreach my $i (sort { $a->{utf8} <=> $b->{utf8} } @$table) + { + print $out "," if (!$first); $first = 0; - print($out "\t/* $last_comment */") if ($verbose); + print $out "\t/* $last_comment */" if ($verbose); - printf($out "\n {0x%08x, 0x%08x, 0x%04x}", $$i{utf8}, $$i{utf8_second}, $$i{code}); - $last_comment = "$$i{comment}"; + printf $out "\n {0x%08x, 0x%08x, 0x%04x}", + $i->{utf8}, $i->{utf8_second}, $i->{code}; + $last_comment = $i->{comment}; } - print($out "\t/* $last_comment */") if ($verbose); + print $out "\t/* $last_comment */" if ($verbose); print $out "\n};\n"; close($out); } sub print_to_utf8_map { - my ($charset, $table, $verbose) = @_; + my ($this_script, $charset, $table, $verbose) = @_; my $last_comment = ""; my $fname = lc("${charset}_to_utf8.map"); print "- Writing ${charset}=>UTF8 conversion table: $fname\n"; - open(my $out, "> $fname") || die "cannot open output file : $fname\n"; - printf($out "/* src/backend/utils/mb/Unicode/${fname} */\n\n". - "static const pg_local_to_utf LUmap${charset}[ %d ] = {", - scalar(@$table)); + open(my $out, '>', $fname) || die "cannot open output file : $fname\n"; + printf $out "/* src/backend/utils/mb/Unicode/$fname */\n" + . 
"/* This file is generated by $this_script */\n\n" + . "static const pg_local_to_utf LUmap${charset}[ %d ] = {", + scalar(@$table); my $first = 1; - foreach my $i (sort {$$a{code} <=> $$b{code}} @$table) - { - print($out ",") if (!$first); + foreach my $i (sort { $a->{code} <=> $b->{code} } @$table) + { + print $out "," if (!$first); $first = 0; - print($out "\t/* $last_comment */") if ($verbose); + print $out "\t/* $last_comment */" if ($verbose); - printf($out "\n {0x%04x, 0x%x}", $$i{code}, $$i{utf8}); + printf $out "\n {0x%04x, 0x%x}", $i->{code}, $i->{utf8}; if ($verbose >= 2) { - $last_comment = "$$i{f}:$$i{l} $$i{comment}"; + $last_comment = + sprintf("%s:%d %s", $i->{f}, $i->{l}, $i->{comment}); } else { - $last_comment = $$i{comment}; + $last_comment = $i->{comment}; } } - print($out "\t/* $last_comment */") if ($verbose); + print $out "\t/* $last_comment */" if ($verbose); print $out "\n};\n"; close($out); } sub print_to_utf8_combined_map { - my ($charset, $table, $verbose) = @_; + my ($this_script, $charset, $table, $verbose) = @_; my $last_comment = ""; my $fname = lc("${charset}_to_utf8_combined.map"); print "- Writing ${charset}=>UTF8 conversion table: $fname\n"; - open(my $out, "> $fname") || die "cannot open output file : $fname\n"; - printf($out "/* src/backend/utils/mb/Unicode/${fname} */\n\n". - "static const pg_local_to_utf_combined LUmap${charset}_combined[ %d ] = {", - scalar(@$table)); + open(my $out, '>', $fname) || die "cannot open output file : $fname\n"; + printf $out "/* src/backend/utils/mb/Unicode/$fname */\n" + . "/* This file is generated by $this_script */\n\n" + . 
"static const pg_local_to_utf_combined LUmap${charset}_combined[ %d ] = {", + scalar(@$table); my $first = 1; - foreach my $i (sort {$$a{code} <=> $$b{code}} @$table) - { - print($out ",") if (!$first); + foreach my $i (sort { $a->{code} <=> $b->{code} } @$table) + { + print $out "," if (!$first); $first = 0; - print($out "\t/* $last_comment */") if ($verbose); + print $out "\t/* $last_comment */" if ($verbose); - printf($out "\n {0x%04x, 0x%08x, 0x%08x}", $$i{code}, $$i{utf8}, $$i{utf8_second}); - $last_comment = "$$i{comment}"; + printf $out "\n {0x%04x, 0x%08x, 0x%08x}", + $i->{code}, $i->{utf8}, $i->{utf8_second}; + $last_comment = $i->{comment}; } - print($out "\t/* $last_comment */") if ($verbose); + print $out "\t/* $last_comment */" if ($verbose); print $out "\n};\n"; close($out); } +############################################################################# +# RADIX TREE STUFF + +# C struct type names : see wchar.h +my $radix_type = "pg_mb_radix_tree"; +my $radix_node_type = "pg_mb_radix_index"; + +######################################### +# read_maptable(<map file name>) +# +# extract data from map files and returns a character map table. +# returns a reference to a hash <in code> => <out code> +sub read_maptable +{ + my ($fname) = @_; + my %c; + + open(my $in, '<', $fname) || die("cannot open $fname"); + + while (<$in>) + { + if (/^[ \t]*{0x([0-9a-f]+), *0x([0-9a-f]+)},?/) + { + $c{ hex($1) } = hex($2); + } + } + + close($in); + return \%c; +} + +######################################### +# generate_index(<charmap hash ref>) +# +# generate a radix tree data from a character table +# returns a hashref to an index data. +# { +# csegs => <character segment index> +# b2idx => [<tree index of 1st byte of 2-byte code>] +# b3idx => [<idx for 1st byte for 3-byte code>, <2nd>] +# b4idx => [<idx for 1st byte for 4-byte code>, <2nd>, <3rd>] +# } +# +# Tables are in two forms, flat and segmented. 
A segmented table is +# logically a two-dimensional table but physically a sequence of +# segments, fixed-length blocks of items. This structure allows us to +# shrink table size by overlapping a shared sequence of zeros between +# two successive segments. overlap_segments does that step. +# +# A flat table is a simple set of key and value pairs. The value is a +# segment id of the next segmented table. The next table is referenced +# using the segment id and the next byte of a code. +# +# flat table (b2idx, b3idx1, b4idx1) +# { +# attr => { +# segmented => <true(1) if this index is segmented> +# min => <minimum value of index key> +# max => <maximum value of index key> +# nextidx => <hash reference to the next level table> +# } +# i => { # index data +# <byte> => <pointer value> # pointer to the next index +# ... +# } +# +# Each segment in a segmented table is equivalent to the flat table +# above. +# +# segmented table (csegs, b3idx2, b4idx2, b4idx3) +# { +# attr => { +# segmented => <true(1) if this index is segmented> +# min => <minimum value of index key> +# max => <maximum value of index key> +# width => <required hex width only for cseg table> +# is32bit => true if values are 32bit width, false means 16bit. +# has0page => only for cseg. true if 0 page is for single byte chars +# next => <hash reference to the next level table, if any> +# } +# i => { # segment data +# <segid> => { # key for this segment +# lower => <minimum value> +# upper => <maximum value> +# offset => <position of this segment in the whole table> +# label => <label string of this segment> +# d => { # segment data +# <byte> => { # pointer to the next index +# label => <label string for this item> +# segid => <target segid of next level> +# segoffset => <offset of the target segid> +# } +# ... 
+# } +# } +# } +# } + +sub generate_index +{ + my ($c) = @_; + my (%csegs, %b2idx, %b3idx1, %b3idx2, %b4idx1, %b4idx2, %b4idx3); + my @all_tables = + (\%csegs, \%b2idx, \%b3idx1, \%b3idx2, \%b4idx1, \%b4idx2, \%b4idx3); + my $si; + + # initialize attributes of index tables + #<<< do not let perltidy touch this + $csegs{attr} = {name => "csegs", chartbl => 1, segmented => 1, + is32bit => 0, has0page => 0}; + #>>> + $csegs{attr} = { + name => "csegs", + chartbl => 1, + segmented => 1, + is32bit => 0, + has0page => 0 }; + $b2idx{attr} = { name => "b2idx", segmented => 0, nextidx => \%csegs }; + $b3idx1{attr} = { name => "b3idx1", segmented => 0, nextidx => \%b3idx2 }; + $b3idx2{attr} = { name => "b3idx2", segmented => 1, nextidx => \%csegs }; + $b4idx1{attr} = { name => "b4idx1", segmented => 0, nextidx => \%b4idx2 }; + $b4idx2{attr} = { name => "b4idx2", segmented => 1, nextidx => \%b4idx3 }; + $b4idx3{attr} = { name => "b4idx3", segmented => 1, nextidx => \%csegs }; + + foreach my $in (keys %$c) + { + if ($in < 0x100) + { + my $b1 = $in; + + # 1 byte code doesn't have index. 
the first segment #0 of + # character table stores them + $csegs{attr}{has0page} = 1; + $si = { + segid => 0, + off => $in, + label => "1byte-", + char => $c->{$in} }; + } + elsif ($in < 0x10000) + { + # 2-byte code index consists of just one flat table + my $b1 = $in >> 8; + my $b2 = $in & 0xff; + my $csegid = $in >> 8; + + if (!defined $b2idx{i}{$b1}) + { + &set_min_max($b2idx{attr}, $b1); + $b2idx{i}{$b1}{segid} = $csegid; + } + $si = { + segid => $csegid, + off => $b2, + label => sprintf("%02x", $b1), + char => $c->{$in} }; + } + elsif ($in < 0x1000000) + { + # 3-byte code index consists of one flat table and one + # segmented table + my $b1 = $in >> 16; + my $b2 = ($in >> 8) & 0xff; + my $b3 = $in & 0xff; + my $l1id = $in >> 16; + my $csegid = $in >> 8; + + if (!defined $b3idx1{i}{$b1}) + { + &set_min_max($b3idx1{attr}, $b1); + $b3idx1{i}{$b1}{segid} = $l1id; + } + if (!defined $b3idx2{i}{$l1id}{d}{$b2}) + { + &set_min_max($b3idx2{attr}, $b2); + $b3idx2{i}{$l1id}{label} = sprintf("%02x", $b1); + $b3idx2{i}{$l1id}{d}{$b2} = { + segid => $csegid, + label => sprintf("%02x%02x", $b1, $b2) }; + } + + $si = { + segid => $csegid, + off => $b3, + label => sprintf("%02x%02x", $b1, $b2), + char => $c->{$in} }; + } + elsif ($in < 0x100000000) + { + # 4-byte code index consists of one flat table, and two + # segmented tables + my $b1 = $in >> 24; + my $b2 = ($in >> 16) & 0xff; + my $b3 = ($in >> 8) & 0xff; + my $b4 = $in & 0xff; + my $l1id = $in >> 24; + my $l2id = $in >> 16; + my $csegid = $in >> 8; + + if (!defined $b4idx1{i}{$b1}) + { + &set_min_max($b4idx1{attr}, $b1); + $b4idx1{i}{$b1}{segid} = $l1id; + } + + if (!defined $b4idx2{i}{$l1id}{d}{$b2}) + { + &set_min_max($b4idx2{attr}, $b2); + $b4idx2{i}{$l1id}{d}{$b2} = { + segid => $l2id, + label => sprintf("%02x", $b1) }; + } + if (!defined $b4idx3{i}{$l2id}{d}{$b3}) + { + &set_min_max($b4idx3{attr}, $b3); + $b4idx3{i}{$l2id}{d}{$b3} = { + segid => $csegid, + label => sprintf("%02x%02x", $b1, $b2) }; + } + + $si = { + 
segid => $csegid, + off => $b4, + label => sprintf("%02x%02x%02x", $b1, $b2, $b3), + char => $c->{$in} }; + } + else + { + die sprintf("up to 4 byte code is supported: %x", $in); + } + + &set_min_max($csegs{attr}, $si->{off}); + $csegs{i}{ $si->{segid} }{d}{ $si->{off} } = $si->{char}; + $csegs{i}{ $si->{segid} }{label} = $si->{label}; + $csegs{attr}{is32bit} = 1 if ($si->{char} >= 0x10000); + &update_width($csegs{attr}, $si->{char}); + if ($si->{char} >= 0x100000000) + { + die "character width is over 32bit. abort."; + } + } + + # calculate segment attributes + foreach my $t (@all_tables) + { + next if (!defined $t->{i} || !$t->{attr}{segmented}); + + # segments are to be aligned in the numerical order of segment id + my @keylist = sort { $a <=> $b } keys $t->{i}; + next if ($#keylist < 0); + my $offset = 1; + my $segsize = $t->{attr}{max} - $t->{attr}{min} + 1; + + for my $k (@keylist) + { + my $seg = $t->{i}{$k}; + $seg->{lower} = $t->{attr}{min}; + $seg->{upper} = $t->{attr}{max}; + $seg->{offset} = $offset; + $offset += $segsize; + } + + # overlapping successive zeros between segments + &overlap_segments($t); + } + + # make link among tables + foreach my $t (@all_tables) + { + &make_index_link($t, $t->{attr}{nextidx}); + } + + return { + name_prefix => "", + csegs => \%csegs, + b2idx => [ \%b2idx ], + b3idx => [ \%b3idx1, \%b3idx2 ], + b4idx => [ \%b4idx1, \%b4idx2, \%b4idx3 ], + all => \@all_tables }; +} + + +######################################### +# set_min_max - internal routine to maintain min and max value of a table +sub set_min_max +{ + my ($a, $v) = @_; + + $a->{min} = $v if (!defined $a->{min} || $v < $a->{min}); + $a->{max} = $v if (!defined $a->{max} || $v > $a->{max}); +} + +######################################### +# update_width - internal routine to maintain the maximum hex width +sub update_width +{ + my ($a, $v) = @_; + + my $nnibbles = int((int(log($v) / log(16)) + 1) / 2) * 2; + if (!defined $a->{width} || $nnibbles > $a->{width}) + { + $a->{width} = 
$nnibbles; + } +} + +######################################### +# overlap_segments +# +# removes the duplicate region between two successive segments. + +sub overlap_segments +{ + my ($h) = @_; + + # don't touch if undefined + return if (!defined $h->{i} || !$h->{attr}{segmented}); + my $index = $h->{i}; + my ($min, $max) = ($h->{attr}{min}, $h->{attr}{max}); + my ($prev, $first); + my @segids = sort { $a <=> $b } keys $index; + return if ($#segids < 1); + + $first = 1; + undef $prev; + + for my $segid (@segids) + { + my $seg = $index->{$segid}; + + # smin and smax are the range excluding preceding and trailing zeros + my @keys = sort { $a <=> $b } keys $seg->{d}; + my $smin = $keys[0]; + my $smax = $keys[-1]; + + if ($first) + { + # first segment doesn't have a preceding segment + $seg->{offset} = 1; + $seg->{lower} = $min; + $seg->{upper} = $smax; + } + else + { + # calculate overlap and shift segment location + my $prefix = $smin - $min; + my $postfix = $max - $smax; + my $prevpostfix = $max - $prev->{upper}; + my $overlap = $prevpostfix < $prefix ? $prevpostfix : $prefix; + + $seg->{lower} = $min + $overlap; + $seg->{upper} = $smax; + $seg->{offset} = $prev->{offset} + ($max - $min + 1) - $overlap; + $prev->{upper} = $max; + } + $prev = $seg; + $first = 0; + } + + return $h; +} + +###################################################### +# make_index_link(from_table, to_table) +# +# Fills out target pointers in non-leaf index tables. 
+# +# from_table - table to set links +# to_table - target table of from_table + +sub make_index_link +{ + my ($s, $t) = @_; + return if (!defined $s->{i} || !defined $t->{i}); + + my @tkeys = sort { $a <=> $b } keys $t->{i}; + + if ($s->{attr}{segmented}) + { + foreach my $k1 (keys $s->{i}) + { + foreach my $k2 (keys $s->{i}{$k1}{d}) + { + my $tsegid = $s->{i}{$k1}{d}{$k2}{segid}; + if (!defined $tsegid) + { + die sprintf( + "segid is not set in %s{i}{%x}{d}{%x}{segid}", + $s->{attr}{name}, + $k1, $k2); + } + $s->{i}{$k1}{d}{$k2}{segoffset} = $t->{i}{$tsegid}{offset}; + } + } + } + else + { + foreach my $k (keys $s->{i}) + { + my $tsegid = $s->{i}{$k}{segid}; + if (!defined $tsegid) + { + die sprintf("segid is not set in %s{i}{%x}{segid}", + $s->{attr}{name}, $k); + } + $s->{i}{$k}{segoffset} = $t->{i}{$tsegid}{offset}; + } + } +} + +############################################### +# print_radix_table - output index table as C-struct +# +# print_radix_table(hd, table, tblname, width) +# returns 1 if the table is written +# +# hd - file handle to write +# table - ref to an index table +# tblname - C symbol name for the table +# width - width in characters of this table + +sub print_radix_table +{ + my ($hd, $table, $tblname, $width) = @_; + + return 0 if (!defined $table->{i}); + + if ($table->{attr}{chartbl}) + { + &print_chars_table($hd, $table, $tblname, $width); + } + elsif ($table->{attr}{segmented}) + { + &print_segmented_table($hd, $table, $tblname, $width); + } + else + { + &print_flat_table($hd, $table, $tblname, $width); + } + return 1; +} + +######################################### +# print_chars_table +# +# print_chars_table(hd, table, tblname, width) +# this is usually called via writ_table +# +# hd - file handle to write +# table - ref to an index table +# tblname - C symbol name for the table +# tblwidth- width in characters of this table + +sub print_chars_table +{ + my ($hd, $table, $tblname, $width) = @_; + my ($st, $ed) = ($table->{attr}{min}, 
$table->{attr}{max}); + my ($type) = $table->{attr}{is32bit} ? "uint32" : "uint16"; + + printf $hd "static const %s %s[] =\n{", $type, $tblname; + printf $hd " /* chars content - index range = [%02x, %02x] */", $st, $ed; + + # values in character table are written in fixedwidth + # hexadecimals. calculate the number of columns in a line. 13 is + # the length of line header. + + my $colwidth = $table->{attr}{width}; + my $colseplen = 4; # the length of ", 0x" + my $headerlength = 13; + my $colnum = int(($width - $headerlength) / ($colwidth + $colseplen)); + + # round down to multiples of 4. don't bother by too small table width + my $colnum = int($colnum / 4) * 4; + my $line = ""; + my $first0 = 1; + + # output all segments in segment id order + foreach my $k (sort { $a <=> $b } keys $table->{i}) + { + my $s = $table->{i}{$k}; + if (!$first0) + { + $line =~ s/\s+$//; # remove trailing space + print $hd $line, ",\n"; + $line = ""; + } + $first0 = 0; + + # write segment header + printf $hd "\n /*** %4sxx - offset 0x%05x ***/", + $s->{label}, $s->{offset}; + + # write segment content + my $first1 = 1; + my ($segstart, $segend) = ($s->{lower}, $s->{upper}); + my ($xpos, $nocomma) = (0, 0); + + foreach my $j (($segstart - ($segstart % $colnum)) .. 
$segend) + { + $line .= "," if (!$first1 && !$nocomma); + + # write the previous line and put a line header for the + # new line if this is the first time or this line is full + if ($xpos >= $colnum || $first1) + { + $line =~ s/\s+$//; # remove trailing space + print $hd $line, "\n"; + $line = sprintf(" /* %02x */ ", $j); + $xpos = 0; + } + else + { + $line .= " "; + } + $first1 = 0; + + # write each column + if ($j >= $segstart) + { + $line .= sprintf("0x%0*x", $colwidth, $s->{d}{$j}); + $nocomma = 0; + } + else + { + # adjust column position + $line .= " " x ($colwidth + 3); + $nocomma = 1; + } + $xpos++; + } + + } + + $line =~ s/\s+$//; + print $hd $line, "\n};\n"; +} + +###################################################### +# print_flat_table - output nonsegmented index table +# +# print_flat_table(hd, table, tblname, width) +# this is usually called via writ_table +# +# hd - file handle to write +# table - ref to an index table +# tblname - C symbol name for the table +# width - width in characters of this table + +sub print_flat_table +{ + my ($hd, $table, $tblname, $width) = @_; + my ($st, $ed) = ($table->{attr}{min}, $table->{attr}{max}); + + print $hd "static const $radix_node_type $tblname =\n{"; + printf $hd "\n 0x%x, 0x%x, /* table range */\n", $st, $ed; + print $hd " {"; + + my $first = 1; + my $line = ""; + + foreach my $i ($st .. $ed) + { + $line .= "," if (!$first); + my $newitem = sprintf("%d", + defined $table->{i}{$i} ? $table->{i}{$i}{segoffset} : 0); + + # flush current line and feed a line if the current line + # exceeds a limit + if ($first || length($line . 
$newitem) > $width) + { + $line =~ s/\s+$//; # remove trailing space + print $hd "$line\n"; + $line = " "; + } + else + { + $line .= " "; + } + $line .= $newitem; + $first = 0; + } + print $hd $line; + print $hd "\n }\n};\n"; +} + +###################################################### +# print_segmented_table - output segmented index table +# +# print_segmented_table(hd, table, tblname, width) +# this is usually called via writ_table +# +# hd - file handle to write +# table - ref to an index table +# tblname - C symbol name for the table +# width - width in characters of this table + +sub print_segmented_table +{ + my ($hd, $table, $tblname, $width) = @_; + my ($st, $ed) = ($table->{attr}{min}, $table->{attr}{max}); + + # write the variable definition + print $hd "static const $radix_node_type $tblname =\n{"; + printf $hd "\n 0x%02x, 0x%02x, /*index range */\n {", $st, $ed; + + my $first0 = 1; + foreach my $k (sort { $a <=> $b } keys $table->{i}) + { + print $hd ",\n" if (!$first0); + $first0 = 0; + printf $hd "\n /*** %sxxxx - offset 0x%05x ****/", + $table->{i}{$k}{label}, $table->{i}{$k}{offset}; + + my $segstart = $table->{i}{$k}{lower}; + my $segend = $table->{i}{$k}{upper}; + + my $line = ""; + my $first1 = 1; + my $newitem = ""; + + foreach my $j ($segstart .. $segend) + { + $line .= "," if (!$first1); + $newitem = sprintf("%d", + $table->{i}{$k}{d}{$j} + ? $table->{i}{$k}{d}{$j}{segoffset} + : 0); + + if ($first1 || length($line . $newitem) > $width) + { + $line =~ s/\s+$//; + print $hd "$line\n"; + $line = + sprintf(" /* %2s%02x */ ", $table->{i}{$k}{label}, $j); + } + else + { + $line .= " "; + } + $line .= $newitem; + $first1 = 0; + } + print $hd $line; + } + print $hd "\n }\n};\n"; +} + +######################################### +# make_table_refname(table, prefix) +# +# internal routine to make C reference notation for tables + +sub make_table_refname +{ + my ($table, $prefix) = @_; + + return "NULL" if (!defined $table->{i}); + return "&" . $prefix . 
$table->{attr}{name}; +} + +######################################### +# print_radix_main(hd, tblname, trie, name_prefix) +# +# write main radix tree table +# +# hd - file handle to write this table +# tblname - variable name of this struct +# trie - ref to a radix tree +# name_prefix- prefix for subtables. + +sub print_radix_main +{ + my ($hd, $tblname, $trie, $name_prefix) = @_; + my $ctblname = $name_prefix . $trie->{csegs}{attr}{name}; + my ($ctbl16name, $ctbl32name); + if ($trie->{csegs}{attr}{is32bit}) + { + $ctbl16name = "NULL"; + $ctbl32name = $ctblname; + } + else + { + $ctbl16name = $ctblname; + $ctbl32name = "NULL"; + } + + my $b2iname = make_table_refname($trie->{b2idx}[0], $name_prefix); + my $b3i1name = make_table_refname($trie->{b3idx}[0], $name_prefix); + my $b3i2name = make_table_refname($trie->{b3idx}[1], $name_prefix); + my $b4i1name = make_table_refname($trie->{b4idx}[0], $name_prefix); + my $b4i2name = make_table_refname($trie->{b4idx}[1], $name_prefix); + my $b4i3name = make_table_refname($trie->{b4idx}[2], $name_prefix); + + #<<< do not let perltidy touch this + print $hd "static const $radix_type $tblname =\n{\n"; + print $hd " /* final character table offset and body */\n"; + printf $hd " 0x%x, 0x%x, %s, %s, %s,\n", + $trie->{csegs}{attr}{min}, $trie->{csegs}{attr}{max}, + $trie->{csegs}{attr}{has0page} ? 
'true' : 'false', + $ctbl16name, $ctbl32name; + + print $hd " /* 2-byte code table */\n"; + print $hd " $b2iname,\n"; + print $hd " /* 3-byte code tables */\n"; + print $hd " {$b3i1name, $b3i2name},\n"; + print $hd " /* 4-byte code table */\n"; + print $hd " {$b4i1name, $b4i2name, $b4i3name},\n"; + print $hd "};\n"; + #>>> +} + +###################################################### +# make_charmap - convert charset table to charmap hash +# with checking duplicate source code +# +# make_charmap(\@charset, $direction) +# charset - ref to charset table : see print_tables +# direction - conversion direction + +sub make_charmap +{ + my ($charset, $direction) = @_; + + die "unacceptable direction : $direction" + if ($direction ne "to_unicode" && $direction ne "from_unicode"); + + my %charmap; + foreach my $c (@$charset) + { + next if ($c->{direction} ne $direction && $c->{direction} ne "both"); + + # don't generate entries for combined characters + next if (defined $c->{ucs_second}); + + my ($src, $dst) = + $direction eq "to_unicode" + ? 
($c->{code}, $c->{ucs}) + : ($c->{ucs}, $c->{code}); + + if (defined $c->{$src}) + { + printf STDERR + "Error: duplicate source code: 0x%04x => 0x%04x, 0x%04x\n", + $src, $c->{$src}, $dst; + exit; + } + if ($direction eq "to_unicode") + { + $charmap{$src} = ucs2utf($dst); + } + else + { + $charmap{ ucs2utf($src) } = $dst; + } + + } + + return \%charmap; +} + + +######################################### +# print_radix_map - write the whole content of the C source of a radix tree +# +# print_radix_map($this_script, $csname, $direction, \%charset, $tblwidth) +# +# this_script - the name of the *caller script* of this feature +# csname - character set name other than ucs +# direction - desired direction "to_unicode" or "from_unicode" +# charset - ref to character set array +# tblwidth - width in characters of output source file + +sub print_radix_map +{ + my ($this_script, $csname, $direction, $charset, $tblwidth) = @_; + + my $charmap = &make_charmap($charset, $direction); + my $trie = &generate_index($charmap); + my $fname = + $direction eq "to_unicode" + ? lc("${csname}_to_utf8_radix.map") + : lc("utf8_to_${csname}_radix.map"); + + my $tblname = lc("${csname}_${direction}_tree"); + my $name_prefix = lc("${csname}_${direction}_"); + + if ($direction eq "to_unicode") + { + print "- Writing ${csname}=>UTF8 conversion radix index: $fname\n"; + } + else + { + print "- Writing UTF8=>${csname} conversion radix index: $fname\n"; + } + + open(my $out, '>', $fname) || die("cannot open $fname"); + + print $out "/* src/backend/utils/mb/Unicode/$fname */\n" + . "/* This file is generated by $this_script */\n\n"; + + foreach my $t (@{ $trie->{all} }) + { + my $table_name = $name_prefix . 
$t->{attr}{name}; + + if (&print_radix_table($out, $t, $table_name, $tblwidth)) + { + print $out "\n"; + } + } + + &print_radix_main($out, $tblname, $trie, $name_prefix); + close($out); +} + + +################################################################### +# print_radix_trees - write the radix tree files for both directions +# +# print_radix_trees($this_script, $csname, \%charset) +# +# this_script - the name of the *caller script* of this feature +# csname - character set name other than ucs +# charset - ref to character set array +sub print_radix_trees +{ + my ($this_script, $csname, $charset) = @_; + + &print_radix_map($this_script, $csname, "from_unicode", $charset, 78); + &print_radix_map($this_script, $csname, "to_unicode", $charset, 78); +} + +sub dump_charset +{ + my ($list, $filt) = @_; + + foreach my $i (@$list) + { + next if (defined $filt && !&$filt($i)); + if (!defined $i->{ucs}) { $i->{ucs} = &utf2ucs($i->{utf8}); } + printf "ucs=%x, code=%x, direction=%s %s:%d %s\n", + $i->{ucs}, $i->{code}, $i->{direction}, + $i->{f}, $i->{l}, $i->{comment}; + } +} + 1; diff --git a/src/backend/utils/mb/Unicode/download_srctxts.sh b/src/backend/utils/mb/Unicode/download_srctxts.sh new file mode 100755 index 0000000..572d57e --- /dev/null +++ b/src/backend/utils/mb/Unicode/download_srctxts.sh @@ -0,0 +1,127 @@ +#! /bin/bash + +# This script downloads conversion source files from URLs as of 2016/10/27 +# These source files may be removed or changed without notice +if [ ! -e CP932.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT +fi +if [ ! -e JIS0201.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT +fi +if [ ! -e JIS0208.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT +fi +if [ ! -e JIS0212.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT +fi +if [ ! 
-e SHIFTJIS.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT +fi +if [ ! -e CP866.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP866.TXT +fi +if [ ! -e CP874.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP874.TXT +fi +if [ ! -e CP936.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT +fi +if [ ! -e CP950.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT +fi +if [ ! -e CP1250.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT +fi +if [ ! -e CP1251.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1251.TXT +fi +if [ ! -e CP1252.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT +fi +if [ ! -e CP1253.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1253.TXT +fi +if [ ! -e CP1254.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1254.TXT +fi +if [ ! -e CP1255.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1255.TXT +fi +if [ ! -e CP1256.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1256.TXT +fi +if [ ! -e CP1257.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1257.TXT +fi +if [ ! -e CP1258.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1258.TXT +fi +if [ ! -e 8859-2.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT +fi +if [ ! -e 8859-3.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-3.TXT +fi +if [ ! -e 8859-4.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-4.TXT +fi +if [ ! -e 8859-5.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-5.TXT +fi +if [ ! 
-e 8859-6.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-6.TXT +fi +if [ ! -e 8859-7.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-7.TXT +fi +if [ ! -e 8859-8.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-8.TXT +fi +if [ ! -e 8859-9.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-9.TXT +fi +if [ ! -e 8859-10.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-10.TXT +fi +if [ ! -e 8859-13.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-13.TXT +fi +if [ ! -e 8859-14.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-14.TXT +fi +if [ ! -e 8859-15.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-15.TXT +fi +if [ ! -e 8859-16.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-16.TXT +fi +if [ ! -e KOI8-R.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT +fi +if [ ! -e KOI8-U.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-U.TXT +fi +if [ ! -e CNS11643.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT +fi +if [ ! -e KSX1001.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT +fi +if [ ! -e JOHAB.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT +fi +if [ ! -e BIG5.TXT ]; then + wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT +fi +if [ ! -e windows-949-2000.xml ]; then + wget http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/windows-949-2000.xml +fi +if [ ! -e gb-18030-2000.xml ]; then + wget http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml +fi +if [ ! -e sjis-0213-2004-std.txt ]; then + wget http://x0213.org/codetable/sjis-0213-2004-std.txt +fi +if [ ! 
-e euc-jis-2004-std.txt ]; then + wget http://x0213.org/codetable/euc-jis-2004-std.txt +fi diff --git a/src/backend/utils/mb/Unicode/make_mapchecker.pl b/src/backend/utils/mb/Unicode/make_mapchecker.pl new file mode 100755 index 0000000..6de0b3f --- /dev/null +++ b/src/backend/utils/mb/Unicode/make_mapchecker.pl @@ -0,0 +1,71 @@ +#! /usr/bin/perl + +use strict; + +opendir(my $dh, ".") || die "failed to open directory: ."; +my @radixmaps = grep { /_radix\.map$/ } readdir($dh); +closedir($dh); + +my %plainmaps; + +# check that every radix map has a corresponding plain map +foreach my $rmap (@radixmaps) +{ + my $pmap = $rmap; + $pmap =~ s/_radix//; + if (!-e $pmap) + { + die("radix map \"$rmap\" has no corresponding plain map\n"); + } + $plainmaps{$rmap} = $pmap; +} + +# Generate sanity checker source +my $out; +open($out, '>', "map_checker.h") + || die "cannot open file to write: map_checker.h"; +foreach my $i (sort @radixmaps) +{ + print $out "#include \"$i\"\n"; + print $out "#include \"$plainmaps{$i}\"\n"; +} + +my @mapnames = map { my $m = $_; $m =~ s/\.map//; $m } values %plainmaps; + +print $out <<'EOF'; + +struct mappair +{ + const char *name; + int len; + const pg_local_to_utf *lu; + const pg_utf_to_local *ul; + const pg_mb_radix_tree *rt; +} mappairs[] = { +EOF + +foreach my $m (@mapnames) +{ + if ($m =~ /^utf8_to_(.*)$/) + { + my $e = uc($1); + print $out +" {\"$m\", lengthof(ULmap$e), NULL, ULmap$e, &$1_from_unicode_tree}"; + } + elsif ($m =~ /^(.*)_to_utf8$/) + { + my $e = uc($1); + print $out + " {\"$m\", lengthof(LUmap$e), LUmap$e, NULL, &$1_to_unicode_tree}"; + } + else + { + die "Unrecognizable map name: $m"; + } + print $out ",\n"; +} + +print $out " {NULL, 0, NULL, NULL, NULL}\n};\n"; + +close($out); + diff --git a/src/backend/utils/mb/Unicode/map_checker.c b/src/backend/utils/mb/Unicode/map_checker.c new file mode 100644 index 0000000..643ac10 --- /dev/null +++ b/src/backend/utils/mb/Unicode/map_checker.c @@ -0,0 +1,147 @@ +#include "postgres.h" 
+#include "mb/pg_wchar.h" + +#define lengthof(array) (sizeof (array) / sizeof ((array)[0])) + +#include "map_checker.h" + +/* + * radix tree conversion function - this should be identical to the function in + * ../conv.c with the same name + */ +const uint32 pg_mb_radix_conv(const pg_mb_radix_tree *rt, const uint32 c) +{ + uint32 off = 0; + uint32 b1 = c >> 24; + uint32 b2 = (c >> 16) & 0xff; + uint32 b3 = (c >> 8) & 0xff; + uint32 b4 = c & 0xff; + + if (b1 > 0) + { + /* 4-byte code */ + uint32 idx; + + /* check code validity - first byte */ + if (rt->b4idx[0] == NULL || + b1 < rt->b4idx[0]->lower || b1 > rt->b4idx[0]->upper) + return 0; + + idx = b1 - rt->b4idx[0]->lower; + off = rt->b4idx[0]->idx[idx]; + if (off == 0) + return 0; + + /* check code validity - second byte */ + if (b2 < rt->b4idx[1]->lower || b2 > rt->b4idx[1]->upper) + return 0; + + idx = b2 - rt->b4idx[1]->lower; + off = (rt->b4idx[1]->idx + off - 1)[idx]; + if (off == 0) + return 0; + + /* check code validity - third byte */ + if (b3 < rt->b4idx[2]->lower || b3 > rt->b4idx[2]->upper) + return 0; + + idx = b3 - rt->b4idx[2]->lower; + off = (rt->b4idx[2]->idx + off - 1)[idx]; + } + else if (b2 > 0) + { + /* 3-byte code */ + + uint32 idx; + + /* check code validity - first byte */ + if (rt->b3idx[0] == NULL || + b2 < rt->b3idx[0]->lower || b2 > rt->b3idx[0]->upper) + return 0; + + idx = b2 - rt->b3idx[0]->lower; + off = rt->b3idx[0]->idx[idx]; + if (off == 0) + return 0; + + /* check code validity - second byte */ + if (b3 < rt->b3idx[1]->lower || b3 > rt->b3idx[1]->upper) + return 0; + + idx = b3 - rt->b3idx[1]->lower; + off = (rt->b3idx[1]->idx + off - 1)[idx]; + } + else if (b3 > 0) + { + /* 2-byte code */ + uint32 idx; + + /* check code validity - first byte */ + if (rt->b2idx == NULL || + b3 < rt->b2idx->lower || b3 > rt->b2idx->upper) + return 0; + + idx = b3 - rt->b2idx->lower; + off = rt->b2idx->idx[idx]; + } + else + { + if (rt->single_byte) + off = 1; + } + + if (off == 0) + return 0; + + 
/* check code validity - last byte */ + if (b4 < rt->chars_lower || b4 > rt->chars_upper) + return 0; + + if (rt->chars32) + return (rt->chars32 + off - 1)[b4 - rt->chars_lower]; + else + return (rt->chars16 + off - 1)[b4 - rt->chars_lower]; +} + +int main(void) +{ + struct mappair *mp; + + for (mp = mappairs ; mp->name ; mp++) + { + int i; + + printf("Checking \"%s_radix.map\" against \"%s.map\"(%d chars)..", mp->name, mp->name, mp->len); + for (i = 0 ; i < mp->len ; i++) + { + uint32 s, c, d; + + if (mp->ul) + { + s = mp->ul[i].utf; + d = mp->ul[i].code; + } + else + { + s = mp->lu[i].code; + d = mp->lu[i].utf; + } + if (s < 0x80) + { + fprintf(stderr, "\nASCII character ? (%x)", s); + exit(1); + } + + c = pg_mb_radix_conv(mp->rt, s); + + if (c != d) + { + fprintf(stderr, "\nConversion failure in \"%s\": %x => %x, expected %x\n", + mp->name, s, c, d); + exit(1); + } + } + printf("Ok.\n"); + } + printf("All radix trees are perfect!\n"); +} diff --git a/src/backend/utils/mb/conv.c b/src/backend/utils/mb/conv.c index d50336b..d4fab1f 100644 --- a/src/backend/utils/mb/conv.c +++ b/src/backend/utils/mb/conv.c @@ -364,6 +364,103 @@ store_coded_char(unsigned char *dest, uint32 code) } /* + * radix tree conversion function + */ +const uint32 pg_mb_radix_conv(const pg_mb_radix_tree *rt, const uint32 c) +{ + uint32 off = 0; + uint32 b1 = c >> 24; + uint32 b2 = (c >> 16) & 0xff; + uint32 b3 = (c >> 8) & 0xff; + uint32 b4 = c & 0xff; + + if (b1 > 0) + { + /* 4-byte code */ + uint32 idx; + + /* check code validity - first byte */ + if (rt->b4idx[0] == NULL || + b1 < rt->b4idx[0]->lower || b1 > rt->b4idx[0]->upper) + return 0; + + idx = b1 - rt->b4idx[0]->lower; + off = rt->b4idx[0]->idx[idx]; + if (off == 0) + return 0; + + /* check code validity - second byte */ + if (b2 < rt->b4idx[1]->lower || b2 > rt->b4idx[1]->upper) + return 0; + + idx = b2 - rt->b4idx[1]->lower; + off = (rt->b4idx[1]->idx + off - 1)[idx]; + if (off == 0) + return 0; + + /* check code validity - third 
byte */ + if (b3 < rt->b4idx[2]->lower || b3 > rt->b4idx[2]->upper) + return 0; + + idx = b3 - rt->b4idx[2]->lower; + off = (rt->b4idx[2]->idx + off - 1)[idx]; + } + else if (b2 > 0) + { + /* 3-byte code */ + + uint32 idx; + + /* check code validity - first byte */ + if (rt->b3idx[0] == NULL || + b2 < rt->b3idx[0]->lower || b2 > rt->b3idx[0]->upper) + return 0; + + idx = b2 - rt->b3idx[0]->lower; + off = rt->b3idx[0]->idx[idx]; + if (off == 0) + return 0; + + /* check code validity - second byte */ + if (b3 < rt->b3idx[1]->lower || b3 > rt->b3idx[1]->upper) + return 0; + + idx = b3 - rt->b3idx[1]->lower; + off = (rt->b3idx[1]->idx + off - 1)[idx]; + } + else if (b3 > 0) + { + /* 2-byte code */ + uint32 idx; + + /* check code validity - first byte */ + if (rt->b2idx == NULL || + b3 < rt->b2idx->lower || b3 > rt->b2idx->upper) + return 0; + + idx = b3 - rt->b2idx->lower; + off = rt->b2idx->idx[idx]; + } + else + { + if (rt->single_byte) + off = 1; + } + + if (off == 0) + return 0; + + /* check code validity - last byte */ + if (b4 < rt->chars_lower || b4 > rt->chars_upper) + return 0; + + if (rt->chars32) + return (rt->chars32 + off - 1)[b4 - rt->chars_lower]; + else + return (rt->chars16 + off - 1)[b4 - rt->chars_lower]; +} + +/* * UTF8 ---> local code * * utf: input string in UTF8 encoding (need not be null-terminated) @@ -389,8 +486,8 @@ store_coded_char(unsigned char *dest, uint32 code) void UtfToLocal(const unsigned char *utf, int len, unsigned char *iso, - const pg_utf_to_local *map, int mapsize, - const pg_utf_to_local_combined *cmap, int cmapsize, + const void *map, int mapsize, + const void *cmap, int cmapsize, utf_local_conversion_func conv_func, int encoding) { @@ -516,13 +613,26 @@ UtfToLocal(const unsigned char *utf, int len, } /* Now check ordinary map */ - p = bsearch(&iutf, map, mapsize, - sizeof(pg_utf_to_local), compare1); + if (mapsize > 0) + { + p = bsearch(&iutf, map, mapsize, + sizeof(pg_utf_to_local), compare1); - if (p) + if (p) + { + iso = 
store_coded_char(iso, p->code); + continue; + } + } + else if (map) { - iso = store_coded_char(iso, p->code); - continue; + uint32 converted = pg_mb_radix_conv((pg_mb_radix_tree *)map, + iutf); + if (converted) + { + iso = store_coded_char(iso, converted); + continue; + } } /* if there's a conversion function, try that */ @@ -575,8 +685,8 @@ UtfToLocal(const unsigned char *utf, int len, void LocalToUtf(const unsigned char *iso, int len, unsigned char *utf, - const pg_local_to_utf *map, int mapsize, - const pg_local_to_utf_combined *cmap, int cmapsize, + const void *map, int mapsize, + const void *cmap, int cmapsize, utf_local_conversion_func conv_func, int encoding) { @@ -635,26 +745,39 @@ LocalToUtf(const unsigned char *iso, int len, iiso = 0; /* keep compiler quiet */ } - /* First check ordinary map */ - p = bsearch(&iiso, map, mapsize, - sizeof(pg_local_to_utf), compare2); - - if (p) + if (mapsize > 0) { - utf = store_coded_char(utf, p->utf); - continue; - } + /* First check ordinary map */ + p = bsearch(&iiso, map, mapsize, + sizeof(pg_local_to_utf), compare2); + + if (p) + { + utf = store_coded_char(utf, p->utf); + continue; + } - /* If there's a combined character map, try that */ - if (cmap) + /* If there's a combined character map, try that */ + if (cmap) + { + cp = bsearch(&iiso, cmap, cmapsize, + sizeof(pg_local_to_utf_combined), compare4); + + if (cp) + { + utf = store_coded_char(utf, cp->utf1); + utf = store_coded_char(utf, cp->utf2); + continue; + } + } + } + else if (map) { - cp = bsearch(&iiso, cmap, cmapsize, - sizeof(pg_local_to_utf_combined), compare4); + uint32 converted = pg_mb_radix_conv((pg_mb_radix_tree*)map, iiso); - if (cp) + if (converted) { - utf = store_coded_char(utf, cp->utf1); - utf = store_coded_char(utf, cp->utf2); + utf = store_coded_char(utf, converted); continue; } } diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_big5/utf8_and_big5.c b/src/backend/utils/mb/conversion_procs/utf8_and_big5/utf8_and_big5.c index 
3d71167..2857228 100644 --- a/src/backend/utils/mb/conversion_procs/utf8_and_big5/utf8_and_big5.c +++ b/src/backend/utils/mb/conversion_procs/utf8_and_big5/utf8_and_big5.c @@ -14,8 +14,8 @@ #include "postgres.h" #include "fmgr.h" #include "mb/pg_wchar.h" -#include "../../Unicode/big5_to_utf8.map" -#include "../../Unicode/utf8_to_big5.map" +#include "../../Unicode/big5_to_utf8_radix.map" +#include "../../Unicode/utf8_to_big5_radix.map" PG_MODULE_MAGIC; @@ -42,7 +42,7 @@ big5_to_utf8(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_BIG5, PG_UTF8); LocalToUtf(src, len, dest, - LUmapBIG5, lengthof(LUmapBIG5), + &big5_to_unicode_tree, 0, NULL, 0, NULL, PG_BIG5); @@ -60,7 +60,7 @@ utf8_to_big5(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_BIG5); UtfToLocal(src, len, dest, - ULmapBIG5, lengthof(ULmapBIG5), + &big5_from_unicode_tree, 0, NULL, 0, NULL, PG_BIG5); diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_cyrillic/utf8_and_cyrillic.c b/src/backend/utils/mb/conversion_procs/utf8_and_cyrillic/utf8_and_cyrillic.c index 6e2be74..f61f86b 100644 --- a/src/backend/utils/mb/conversion_procs/utf8_and_cyrillic/utf8_and_cyrillic.c +++ b/src/backend/utils/mb/conversion_procs/utf8_and_cyrillic/utf8_and_cyrillic.c @@ -14,10 +14,10 @@ #include "postgres.h" #include "fmgr.h" #include "mb/pg_wchar.h" -#include "../../Unicode/utf8_to_koi8r.map" -#include "../../Unicode/koi8r_to_utf8.map" -#include "../../Unicode/utf8_to_koi8u.map" -#include "../../Unicode/koi8u_to_utf8.map" +#include "../../Unicode/utf8_to_koi8r_radix.map" +#include "../../Unicode/koi8r_to_utf8_radix.map" +#include "../../Unicode/utf8_to_koi8u_radix.map" +#include "../../Unicode/koi8u_to_utf8_radix.map" PG_MODULE_MAGIC; @@ -48,7 +48,7 @@ utf8_to_koi8r(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_KOI8R); UtfToLocal(src, len, dest, - ULmapKOI8R, lengthof(ULmapKOI8R), + &koi8r_from_unicode_tree, 0, NULL, 0, NULL, PG_KOI8R); @@ -66,7 +66,7 @@ koi8r_to_utf8(PG_FUNCTION_ARGS) 
CHECK_ENCODING_CONVERSION_ARGS(PG_KOI8R, PG_UTF8); LocalToUtf(src, len, dest, - LUmapKOI8R, lengthof(LUmapKOI8R), + &koi8r_to_unicode_tree, 0, NULL, 0, NULL, PG_KOI8R); @@ -84,7 +84,7 @@ utf8_to_koi8u(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_KOI8U); UtfToLocal(src, len, dest, - ULmapKOI8U, lengthof(ULmapKOI8U), + &koi8u_from_unicode_tree, 0, NULL, 0, NULL, PG_KOI8U); @@ -102,7 +102,7 @@ koi8u_to_utf8(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_KOI8U, PG_UTF8); LocalToUtf(src, len, dest, - LUmapKOI8U, lengthof(LUmapKOI8U), + &koi8u_to_unicode_tree, 0, NULL, 0, NULL, PG_KOI8U); diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_euc2004/utf8_and_euc2004.c b/src/backend/utils/mb/conversion_procs/utf8_and_euc2004/utf8_and_euc2004.c index 4d14b26..1ad3d03 100644 --- a/src/backend/utils/mb/conversion_procs/utf8_and_euc2004/utf8_and_euc2004.c +++ b/src/backend/utils/mb/conversion_procs/utf8_and_euc2004/utf8_and_euc2004.c @@ -14,8 +14,8 @@ #include "postgres.h" #include "fmgr.h" #include "mb/pg_wchar.h" -#include "../../Unicode/euc_jis_2004_to_utf8.map" -#include "../../Unicode/utf8_to_euc_jis_2004.map" +#include "../../Unicode/euc_jis_2004_to_utf8_radix.map" +#include "../../Unicode/utf8_to_euc_jis_2004_radix.map" #include "../../Unicode/euc_jis_2004_to_utf8_combined.map" #include "../../Unicode/utf8_to_euc_jis_2004_combined.map" @@ -44,7 +44,7 @@ euc_jis_2004_to_utf8(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_EUC_JIS_2004, PG_UTF8); LocalToUtf(src, len, dest, - LUmapEUC_JIS_2004, lengthof(LUmapEUC_JIS_2004), + &euc_jis_2004_to_unicode_tree, 0, LUmapEUC_JIS_2004_combined, lengthof(LUmapEUC_JIS_2004_combined), NULL, PG_EUC_JIS_2004); @@ -62,7 +62,7 @@ utf8_to_euc_jis_2004(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_EUC_JIS_2004); UtfToLocal(src, len, dest, - ULmapEUC_JIS_2004, lengthof(ULmapEUC_JIS_2004), + &euc_jis_2004_from_unicode_tree, 0, ULmapEUC_JIS_2004_combined, lengthof(ULmapEUC_JIS_2004_combined), 
NULL, PG_EUC_JIS_2004); diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_euc_cn/utf8_and_euc_cn.c b/src/backend/utils/mb/conversion_procs/utf8_and_euc_cn/utf8_and_euc_cn.c index 953123c..be1a036 100644 --- a/src/backend/utils/mb/conversion_procs/utf8_and_euc_cn/utf8_and_euc_cn.c +++ b/src/backend/utils/mb/conversion_procs/utf8_and_euc_cn/utf8_and_euc_cn.c @@ -14,8 +14,8 @@ #include "postgres.h" #include "fmgr.h" #include "mb/pg_wchar.h" -#include "../../Unicode/euc_cn_to_utf8.map" -#include "../../Unicode/utf8_to_euc_cn.map" +#include "../../Unicode/euc_cn_to_utf8_radix.map" +#include "../../Unicode/utf8_to_euc_cn_radix.map" PG_MODULE_MAGIC; @@ -42,7 +42,7 @@ euc_cn_to_utf8(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_EUC_CN, PG_UTF8); LocalToUtf(src, len, dest, - LUmapEUC_CN, lengthof(LUmapEUC_CN), + &euc_cn_to_unicode_tree, 0, NULL, 0, NULL, PG_EUC_CN); @@ -60,7 +60,7 @@ utf8_to_euc_cn(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_EUC_CN); UtfToLocal(src, len, dest, - ULmapEUC_CN, lengthof(ULmapEUC_CN), + &euc_cn_from_unicode_tree, 0, NULL, 0, NULL, PG_EUC_CN); diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_euc_jp/utf8_and_euc_jp.c b/src/backend/utils/mb/conversion_procs/utf8_and_euc_jp/utf8_and_euc_jp.c index dd020d2..cc46003 100644 --- a/src/backend/utils/mb/conversion_procs/utf8_and_euc_jp/utf8_and_euc_jp.c +++ b/src/backend/utils/mb/conversion_procs/utf8_and_euc_jp/utf8_and_euc_jp.c @@ -14,8 +14,8 @@ #include "postgres.h" #include "fmgr.h" #include "mb/pg_wchar.h" -#include "../../Unicode/euc_jp_to_utf8.map" -#include "../../Unicode/utf8_to_euc_jp.map" +#include "../../Unicode/euc_jp_to_utf8_radix.map" +#include "../../Unicode/utf8_to_euc_jp_radix.map" PG_MODULE_MAGIC; @@ -42,7 +42,7 @@ euc_jp_to_utf8(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_EUC_JP, PG_UTF8); LocalToUtf(src, len, dest, - LUmapEUC_JP, lengthof(LUmapEUC_JP), + &euc_jp_to_unicode_tree, 0, NULL, 0, NULL, PG_EUC_JP); @@ -60,7 +60,7 @@ 
utf8_to_euc_jp(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_EUC_JP); UtfToLocal(src, len, dest, - ULmapEUC_JP, lengthof(ULmapEUC_JP), + &euc_jp_from_unicode_tree, 0, NULL, 0, NULL, PG_EUC_JP); diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_euc_kr/utf8_and_euc_kr.c b/src/backend/utils/mb/conversion_procs/utf8_and_euc_kr/utf8_and_euc_kr.c index 7b5e04e..5e83522 100644 --- a/src/backend/utils/mb/conversion_procs/utf8_and_euc_kr/utf8_and_euc_kr.c +++ b/src/backend/utils/mb/conversion_procs/utf8_and_euc_kr/utf8_and_euc_kr.c @@ -14,8 +14,8 @@ #include "postgres.h" #include "fmgr.h" #include "mb/pg_wchar.h" -#include "../../Unicode/euc_kr_to_utf8.map" -#include "../../Unicode/utf8_to_euc_kr.map" +#include "../../Unicode/euc_kr_to_utf8_radix.map" +#include "../../Unicode/utf8_to_euc_kr_radix.map" PG_MODULE_MAGIC; @@ -42,7 +42,7 @@ euc_kr_to_utf8(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_EUC_KR, PG_UTF8); LocalToUtf(src, len, dest, - LUmapEUC_KR, lengthof(LUmapEUC_KR), + &euc_kr_to_unicode_tree, 0, NULL, 0, NULL, PG_EUC_KR); @@ -60,7 +60,7 @@ utf8_to_euc_kr(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_EUC_KR); UtfToLocal(src, len, dest, - ULmapEUC_KR, lengthof(ULmapEUC_KR), + &euc_kr_from_unicode_tree, 0, NULL, 0, NULL, PG_EUC_KR); diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_euc_tw/utf8_and_euc_tw.c b/src/backend/utils/mb/conversion_procs/utf8_and_euc_tw/utf8_and_euc_tw.c index 023a279..dd3d791 100644 --- a/src/backend/utils/mb/conversion_procs/utf8_and_euc_tw/utf8_and_euc_tw.c +++ b/src/backend/utils/mb/conversion_procs/utf8_and_euc_tw/utf8_and_euc_tw.c @@ -14,8 +14,8 @@ #include "postgres.h" #include "fmgr.h" #include "mb/pg_wchar.h" -#include "../../Unicode/euc_tw_to_utf8.map" -#include "../../Unicode/utf8_to_euc_tw.map" +#include "../../Unicode/euc_tw_to_utf8_radix.map" +#include "../../Unicode/utf8_to_euc_tw_radix.map" PG_MODULE_MAGIC; @@ -42,7 +42,7 @@ euc_tw_to_utf8(PG_FUNCTION_ARGS) 
CHECK_ENCODING_CONVERSION_ARGS(PG_EUC_TW, PG_UTF8); LocalToUtf(src, len, dest, - LUmapEUC_TW, lengthof(LUmapEUC_TW), + &euc_tw_to_unicode_tree, 0, NULL, 0, NULL, PG_EUC_TW); @@ -60,7 +60,7 @@ utf8_to_euc_tw(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_EUC_TW); UtfToLocal(src, len, dest, - ULmapEUC_TW, lengthof(ULmapEUC_TW), + &euc_tw_from_unicode_tree, 0, NULL, 0, NULL, PG_EUC_TW); diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_gb18030/utf8_and_gb18030.c b/src/backend/utils/mb/conversion_procs/utf8_and_gb18030/utf8_and_gb18030.c index 5e8ec3d..3e3c74d 100644 --- a/src/backend/utils/mb/conversion_procs/utf8_and_gb18030/utf8_and_gb18030.c +++ b/src/backend/utils/mb/conversion_procs/utf8_and_gb18030/utf8_and_gb18030.c @@ -14,8 +14,8 @@ #include "postgres.h" #include "fmgr.h" #include "mb/pg_wchar.h" -#include "../../Unicode/gb18030_to_utf8.map" -#include "../../Unicode/utf8_to_gb18030.map" +#include "../../Unicode/gb18030_to_utf8_radix.map" +#include "../../Unicode/utf8_to_gb18030_radix.map" PG_MODULE_MAGIC; @@ -197,7 +197,7 @@ gb18030_to_utf8(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_GB18030, PG_UTF8); LocalToUtf(src, len, dest, - LUmapGB18030, lengthof(LUmapGB18030), + &gb18030_to_unicode_tree, 0, NULL, 0, conv_18030_to_utf8, PG_GB18030); @@ -215,7 +215,7 @@ utf8_to_gb18030(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_GB18030); UtfToLocal(src, len, dest, - ULmapGB18030, lengthof(ULmapGB18030), + &gb18030_from_unicode_tree, 0, NULL, 0, conv_utf8_to_18030, PG_GB18030); diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_gbk/utf8_and_gbk.c b/src/backend/utils/mb/conversion_procs/utf8_and_gbk/utf8_and_gbk.c index d6613a0..872f353 100644 --- a/src/backend/utils/mb/conversion_procs/utf8_and_gbk/utf8_and_gbk.c +++ b/src/backend/utils/mb/conversion_procs/utf8_and_gbk/utf8_and_gbk.c @@ -14,8 +14,8 @@ #include "postgres.h" #include "fmgr.h" #include "mb/pg_wchar.h" -#include "../../Unicode/gbk_to_utf8.map" 
-#include "../../Unicode/utf8_to_gbk.map" +#include "../../Unicode/gbk_to_utf8_radix.map" +#include "../../Unicode/utf8_to_gbk_radix.map" PG_MODULE_MAGIC; @@ -42,7 +42,7 @@ gbk_to_utf8(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_GBK, PG_UTF8); LocalToUtf(src, len, dest, - LUmapGBK, lengthof(LUmapGBK), + &gbk_to_unicode_tree, 0, NULL, 0, NULL, PG_GBK); @@ -60,7 +60,7 @@ utf8_to_gbk(PG_FUNCTION_ARGS) CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_GBK); UtfToLocal(src, len, dest, - ULmapGBK, lengthof(ULmapGBK), + &gbk_from_unicode_tree, 0, NULL, 0, NULL, PG_GBK); diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_iso8859/utf8_and_iso8859.c b/src/backend/utils/mb/conversion_procs/utf8_and_iso8859/utf8_and_iso8859.c index 9204b5f..2361528 100644 --- a/src/backend/utils/mb/conversion_procs/utf8_and_iso8859/utf8_and_iso8859.c +++ b/src/backend/utils/mb/conversion_procs/utf8_and_iso8859/utf8_and_iso8859.c @@ -14,32 +14,32 @@ #include "postgres.h" #include "fmgr.h" #include "mb/pg_wchar.h" -#include "../../Unicode/iso8859_10_to_utf8.map" -#include "../../Unicode/iso8859_13_to_utf8.map" -#include "../../Unicode/iso8859_14_to_utf8.map" -#include "../../Unicode/iso8859_15_to_utf8.map" -#include "../../Unicode/iso8859_2_to_utf8.map" -#include "../../Unicode/iso8859_3_to_utf8.map" -#include "../../Unicode/iso8859_4_to_utf8.map" -#include "../../Unicode/iso8859_5_to_utf8.map" -#include "../../Unicode/iso8859_6_to_utf8.map" -#include "../../Unicode/iso8859_7_to_utf8.map" -#include "../../Unicode/iso8859_8_to_utf8.map" -#include "../../Unicode/iso8859_9_to_utf8.map" -#include "../../Unicode/utf8_to_iso8859_10.map" -#include "../../Unicode/utf8_to_iso8859_13.map" -#include "../../Unicode/utf8_to_iso8859_14.map" -#include "../../Unicode/utf8_to_iso8859_15.map" -#include "../../Unicode/utf8_to_iso8859_16.map" -#include "../../Unicode/utf8_to_iso8859_2.map" -#include "../../Unicode/utf8_to_iso8859_3.map" -#include "../../Unicode/utf8_to_iso8859_4.map" -#include 
"../../Unicode/utf8_to_iso8859_5.map" -#include "../../Unicode/utf8_to_iso8859_6.map" -#include "../../Unicode/utf8_to_iso8859_7.map" -#include "../../Unicode/utf8_to_iso8859_8.map" -#include "../../Unicode/utf8_to_iso8859_9.map" -#include "../../Unicode/iso8859_16_to_utf8.map" +#include "../../Unicode/iso8859_10_to_utf8_radix.map" +#include "../../Unicode/iso8859_13_to_utf8_radix.map" +#include "../../Unicode/iso8859_14_to_utf8_radix.map" +#include "../../Unicode/iso8859_15_to_utf8_radix.map" +#include "../../Unicode/iso8859_2_to_utf8_radix.map" +#include "../../Unicode/iso8859_3_to_utf8_radix.map" +#include "../../Unicode/iso8859_4_to_utf8_radix.map" +#include "../../Unicode/iso8859_5_to_utf8_radix.map" +#include "../../Unicode/iso8859_6_to_utf8_radix.map" +#include "../../Unicode/iso8859_7_to_utf8_radix.map" +#include "../../Unicode/iso8859_8_to_utf8_radix.map" +#include "../../Unicode/iso8859_9_to_utf8_radix.map" +#include "../../Unicode/utf8_to_iso8859_10_radix.map" +#include "../../Unicode/utf8_to_iso8859_13_radix.map" +#include "../../Unicode/utf8_to_iso8859_14_radix.map" +#include "../../Unicode/utf8_to_iso8859_15_radix.map" +#include "../../Unicode/utf8_to_iso8859_16_radix.map" +#include "../../Unicode/utf8_to_iso8859_2_radix.map" +#include "../../Unicode/utf8_to_iso8859_3_radix.map" +#include "../../Unicode/utf8_to_iso8859_4_radix.map" +#include "../../Unicode/utf8_to_iso8859_5_radix.map" +#include "../../Unicode/utf8_to_iso8859_6_radix.map" +#include "../../Unicode/utf8_to_iso8859_7_radix.map" +#include "../../Unicode/utf8_to_iso8859_8_radix.map" +#include "../../Unicode/utf8_to_iso8859_9_radix.map" +#include "../../Unicode/iso8859_16_to_utf8_radix.map" PG_MODULE_MAGIC; @@ -60,52 +60,37 @@ PG_FUNCTION_INFO_V1(utf8_to_iso8859); typedef struct { pg_enc encoding; - const pg_local_to_utf *map1; /* to UTF8 map name */ - const pg_utf_to_local *map2; /* from UTF8 map name */ - int size1; /* size of map1 */ - int size2; /* size of map2 */ + const 
pg_mb_radix_tree *map1; /* to UTF8 map name */ + const pg_mb_radix_tree *map2; /* from UTF8 map name */ } pg_conv_map; static const pg_conv_map maps[] = { - {PG_LATIN2, LUmapISO8859_2, ULmapISO8859_2, - lengthof(LUmapISO8859_2), - lengthof(ULmapISO8859_2)}, /* ISO-8859-2 Latin 2 */ - {PG_LATIN3, LUmapISO8859_3, ULmapISO8859_3, - lengthof(LUmapISO8859_3), - lengthof(ULmapISO8859_3)}, /* ISO-8859-3 Latin 3 */ - {PG_LATIN4, LUmapISO8859_4, ULmapISO8859_4, - lengthof(LUmapISO8859_4), - lengthof(ULmapISO8859_4)}, /* ISO-8859-4 Latin 4 */ - {PG_LATIN5, LUmapISO8859_9, ULmapISO8859_9, - lengthof(LUmapISO8859_9), - lengthof(ULmapISO8859_9)}, /* ISO-8859-9 Latin 5 */ - {PG_LATIN6, LUmapISO8859_10, ULmapISO8859_10, - lengthof(LUmapISO8859_10), - lengthof(ULmapISO8859_10)}, /* ISO-8859-10 Latin 6 */ - {PG_LATIN7, LUmapISO8859_13, ULmapISO8859_13, - lengthof(LUmapISO8859_13), - lengthof(ULmapISO8859_13)}, /* ISO-8859-13 Latin 7 */ - {PG_LATIN8, LUmapISO8859_14, ULmapISO8859_14, - lengthof(LUmapISO8859_14), - lengthof(ULmapISO8859_14)}, /* ISO-8859-14 Latin 8 */ - {PG_LATIN9, LUmapISO8859_15, ULmapISO8859_15, - lengthof(LUmapISO8859_15), - lengthof(ULmapISO8859_15)}, /* ISO-8859-15 Latin 9 */ - {PG_LATIN10, LUmapISO8859_16, ULmapISO8859_16, - lengthof(LUmapISO8859_16), - lengthof(ULmapISO8859_16)}, /* ISO-8859-16 Latin 10 */ - {PG_ISO_8859_5, LUmapISO8859_5, ULmapISO8859_5, - lengthof(LUmapISO8859_5), - lengthof(ULmapISO8859_5)}, /* ISO-8859-5 */ - {PG_ISO_8859_6, LUmapISO8859_6, ULmapISO8859_6, - lengthof(LUmapISO8859_6), - lengthof(ULmapISO8859_6)}, /* ISO-8859-6 */ - {PG_ISO_8859_7, LUmapISO8859_7, ULmapISO8859_7, - lengthof(LUmapISO8859_7), - lengthof(ULmapISO8859_7)}, /* ISO-8859-7 */ - {PG_ISO_8859_8, LUmapISO8859_8, ULmapISO8859_8, - lengthof(LUmapISO8859_8), - lengthof(ULmapISO8859_8)}, /* ISO-8859-8 */ + {PG_LATIN2, &iso8859_2_to_unicode_tree, + &iso8859_2_from_unicode_tree}, /* ISO-8859-2 Latin 2 */ + {PG_LATIN3, &iso8859_3_to_unicode_tree, + 
&iso8859_3_from_unicode_tree},  /* ISO-8859-3 Latin 3 */
+    {PG_LATIN4, &iso8859_4_to_unicode_tree,
+     &iso8859_4_from_unicode_tree},  /* ISO-8859-4 Latin 4 */
+    {PG_LATIN5, &iso8859_9_to_unicode_tree,
+     &iso8859_9_from_unicode_tree},  /* ISO-8859-9 Latin 5 */
+    {PG_LATIN6, &iso8859_10_to_unicode_tree,
+     &iso8859_10_from_unicode_tree}, /* ISO-8859-10 Latin 6 */
+    {PG_LATIN7, &iso8859_13_to_unicode_tree,
+     &iso8859_13_from_unicode_tree}, /* ISO-8859-13 Latin 7 */
+    {PG_LATIN8, &iso8859_14_to_unicode_tree,
+     &iso8859_14_from_unicode_tree}, /* ISO-8859-14 Latin 8 */
+    {PG_LATIN9, &iso8859_15_to_unicode_tree,
+     &iso8859_15_from_unicode_tree}, /* ISO-8859-15 Latin 9 */
+    {PG_LATIN10, &iso8859_16_to_unicode_tree,
+     &iso8859_16_from_unicode_tree}, /* ISO-8859-16 Latin 10 */
+    {PG_ISO_8859_5, &iso8859_5_to_unicode_tree,
+     &iso8859_5_from_unicode_tree},  /* ISO-8859-5 */
+    {PG_ISO_8859_6, &iso8859_6_to_unicode_tree,
+     &iso8859_6_from_unicode_tree},  /* ISO-8859-6 */
+    {PG_ISO_8859_7, &iso8859_7_to_unicode_tree,
+     &iso8859_7_from_unicode_tree},  /* ISO-8859-7 */
+    {PG_ISO_8859_8, &iso8859_8_to_unicode_tree,
+     &iso8859_8_from_unicode_tree},  /* ISO-8859-8 */
 };

 Datum
@@ -124,7 +109,7 @@ iso8859_to_utf8(PG_FUNCTION_ARGS)
         if (encoding == maps[i].encoding)
         {
             LocalToUtf(src, len, dest,
-                       maps[i].map1, maps[i].size1,
+                       maps[i].map1, 0,
                        NULL, 0,
                        NULL,
                        encoding);
@@ -156,7 +141,7 @@ utf8_to_iso8859(PG_FUNCTION_ARGS)
         if (encoding == maps[i].encoding)
         {
             UtfToLocal(src, len, dest,
-                       maps[i].map2, maps[i].size2,
+                       maps[i].map2, 0,
                        NULL, 0,
                        NULL,
                        encoding);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_johab/utf8_and_johab.c b/src/backend/utils/mb/conversion_procs/utf8_and_johab/utf8_and_johab.c
index 2eaeae6..2d8ca18 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_johab/utf8_and_johab.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_johab/utf8_and_johab.c
@@ -14,8 +14,8 @@
 #include "postgres.h"
 #include "fmgr.h"
 #include "mb/pg_wchar.h"
-#include "../../Unicode/johab_to_utf8.map"
-#include "../../Unicode/utf8_to_johab.map"
+#include "../../Unicode/johab_to_utf8_radix.map"
+#include "../../Unicode/utf8_to_johab_radix.map"

 PG_MODULE_MAGIC;

@@ -42,7 +42,7 @@ johab_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_JOHAB, PG_UTF8);

     LocalToUtf(src, len, dest,
-               LUmapJOHAB, lengthof(LUmapJOHAB),
+               &johab_to_unicode_tree, 0,
                NULL, 0,
                NULL,
                PG_JOHAB);
@@ -60,7 +60,7 @@ utf8_to_johab(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_JOHAB);

     UtfToLocal(src, len, dest,
-               ULmapJOHAB, lengthof(ULmapJOHAB),
+               &johab_from_unicode_tree, 0,
                NULL, 0,
                NULL,
                PG_JOHAB);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_sjis/utf8_and_sjis.c b/src/backend/utils/mb/conversion_procs/utf8_and_sjis/utf8_and_sjis.c
index 204e2a0..0a4802d 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_sjis/utf8_and_sjis.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_sjis/utf8_and_sjis.c
@@ -14,8 +14,8 @@
 #include "postgres.h"
 #include "fmgr.h"
 #include "mb/pg_wchar.h"
-#include "../../Unicode/sjis_to_utf8.map"
-#include "../../Unicode/utf8_to_sjis.map"
+#include "../../Unicode/sjis_to_utf8_radix.map"
+#include "../../Unicode/utf8_to_sjis_radix.map"

 PG_MODULE_MAGIC;

@@ -42,7 +42,7 @@ sjis_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_SJIS, PG_UTF8);

     LocalToUtf(src, len, dest,
-               LUmapSJIS, lengthof(LUmapSJIS),
+               &sjis_to_unicode_tree, 0,
                NULL, 0,
                NULL,
                PG_SJIS);
@@ -60,7 +60,7 @@ utf8_to_sjis(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_SJIS);

     UtfToLocal(src, len, dest,
-               ULmapSJIS, lengthof(ULmapSJIS),
+               &sjis_from_unicode_tree, 0,
                NULL, 0,
                NULL,
                PG_SJIS);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_sjis2004/utf8_and_sjis2004.c b/src/backend/utils/mb/conversion_procs/utf8_and_sjis2004/utf8_and_sjis2004.c
index b80eb7e..7160741 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_sjis2004/utf8_and_sjis2004.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_sjis2004/utf8_and_sjis2004.c
@@ -14,8 +14,8 @@
 #include "postgres.h"
 #include "fmgr.h"
 #include "mb/pg_wchar.h"
-#include "../../Unicode/shift_jis_2004_to_utf8.map"
-#include "../../Unicode/utf8_to_shift_jis_2004.map"
+#include "../../Unicode/shift_jis_2004_to_utf8_radix.map"
+#include "../../Unicode/utf8_to_shift_jis_2004_radix.map"
 #include "../../Unicode/shift_jis_2004_to_utf8_combined.map"
 #include "../../Unicode/utf8_to_shift_jis_2004_combined.map"

@@ -44,7 +44,7 @@ shift_jis_2004_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_SHIFT_JIS_2004, PG_UTF8);

     LocalToUtf(src, len, dest,
-               LUmapSHIFT_JIS_2004, lengthof(LUmapSHIFT_JIS_2004),
+               &shift_jis_2004_to_unicode_tree, 0,
                LUmapSHIFT_JIS_2004_combined, lengthof(LUmapSHIFT_JIS_2004_combined),
                NULL,
                PG_SHIFT_JIS_2004);
@@ -62,7 +62,7 @@ utf8_to_shift_jis_2004(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_SHIFT_JIS_2004);

     UtfToLocal(src, len, dest,
-               ULmapSHIFT_JIS_2004, lengthof(ULmapSHIFT_JIS_2004),
+               &shift_jis_2004_from_unicode_tree, 0,
                ULmapSHIFT_JIS_2004_combined, lengthof(ULmapSHIFT_JIS_2004_combined),
                NULL,
                PG_SHIFT_JIS_2004);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_uhc/utf8_and_uhc.c b/src/backend/utils/mb/conversion_procs/utf8_and_uhc/utf8_and_uhc.c
index 71214d2..fb66a8a 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_uhc/utf8_and_uhc.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_uhc/utf8_and_uhc.c
@@ -14,8 +14,8 @@
 #include "postgres.h"
 #include "fmgr.h"
 #include "mb/pg_wchar.h"
-#include "../../Unicode/uhc_to_utf8.map"
-#include "../../Unicode/utf8_to_uhc.map"
+#include "../../Unicode/uhc_to_utf8_radix.map"
+#include "../../Unicode/utf8_to_uhc_radix.map"

 PG_MODULE_MAGIC;

@@ -42,7 +42,7 @@ uhc_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UHC, PG_UTF8);

     LocalToUtf(src, len, dest,
-               LUmapUHC, lengthof(LUmapUHC),
+               &uhc_to_unicode_tree, 0,
                NULL, 0,
                NULL,
                PG_UHC);
@@ -60,7 +60,7 @@ utf8_to_uhc(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_UHC);

     UtfToLocal(src, len, dest,
-               ULmapUHC, lengthof(ULmapUHC),
+               &uhc_from_unicode_tree, 0,
                NULL, 0,
                NULL,
                PG_UHC);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_win/utf8_and_win.c b/src/backend/utils/mb/conversion_procs/utf8_and_win/utf8_and_win.c
index 4d9c641..d213927 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_win/utf8_and_win.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_win/utf8_and_win.c
@@ -14,28 +14,28 @@
 #include "postgres.h"
 #include "fmgr.h"
 #include "mb/pg_wchar.h"
-#include "../../Unicode/utf8_to_win1250.map"
-#include "../../Unicode/utf8_to_win1251.map"
-#include "../../Unicode/utf8_to_win1252.map"
-#include "../../Unicode/utf8_to_win1253.map"
-#include "../../Unicode/utf8_to_win1254.map"
-#include "../../Unicode/utf8_to_win1255.map"
-#include "../../Unicode/utf8_to_win1256.map"
-#include "../../Unicode/utf8_to_win1257.map"
-#include "../../Unicode/utf8_to_win1258.map"
-#include "../../Unicode/utf8_to_win866.map"
-#include "../../Unicode/utf8_to_win874.map"
-#include "../../Unicode/win1250_to_utf8.map"
-#include "../../Unicode/win1251_to_utf8.map"
-#include "../../Unicode/win1252_to_utf8.map"
-#include "../../Unicode/win1253_to_utf8.map"
-#include "../../Unicode/win1254_to_utf8.map"
-#include "../../Unicode/win1255_to_utf8.map"
-#include "../../Unicode/win1256_to_utf8.map"
-#include "../../Unicode/win1257_to_utf8.map"
-#include "../../Unicode/win866_to_utf8.map"
-#include "../../Unicode/win874_to_utf8.map"
-#include "../../Unicode/win1258_to_utf8.map"
+#include "../../Unicode/utf8_to_win1250_radix.map"
+#include "../../Unicode/utf8_to_win1251_radix.map"
+#include "../../Unicode/utf8_to_win1252_radix.map"
+#include "../../Unicode/utf8_to_win1253_radix.map"
+#include "../../Unicode/utf8_to_win1254_radix.map"
+#include "../../Unicode/utf8_to_win1255_radix.map"
+#include "../../Unicode/utf8_to_win1256_radix.map"
+#include "../../Unicode/utf8_to_win1257_radix.map"
+#include "../../Unicode/utf8_to_win1258_radix.map"
+#include "../../Unicode/utf8_to_win866_radix.map"
+#include "../../Unicode/utf8_to_win874_radix.map"
+#include "../../Unicode/win1250_to_utf8_radix.map"
+#include "../../Unicode/win1251_to_utf8_radix.map"
+#include "../../Unicode/win1252_to_utf8_radix.map"
+#include "../../Unicode/win1253_to_utf8_radix.map"
+#include "../../Unicode/win1254_to_utf8_radix.map"
+#include "../../Unicode/win1255_to_utf8_radix.map"
+#include "../../Unicode/win1256_to_utf8_radix.map"
+#include "../../Unicode/win1257_to_utf8_radix.map"
+#include "../../Unicode/win866_to_utf8_radix.map"
+#include "../../Unicode/win874_to_utf8_radix.map"
+#include "../../Unicode/win1258_to_utf8_radix.map"

 PG_MODULE_MAGIC;

@@ -56,46 +56,22 @@ PG_FUNCTION_INFO_V1(utf8_to_win);
 typedef struct
 {
     pg_enc      encoding;
-    const pg_local_to_utf *map1;    /* to UTF8 map name */
-    const pg_utf_to_local *map2;    /* from UTF8 map name */
-    int         size1;          /* size of map1 */
-    int         size2;          /* size of map2 */
+    const pg_mb_radix_tree *map1;   /* to UTF8 map name */
+    const pg_mb_radix_tree *map2;   /* from UTF8 map name */
 } pg_conv_map;

 static const pg_conv_map maps[] = {
-    {PG_WIN866, LUmapWIN866, ULmapWIN866,
-        lengthof(LUmapWIN866),
-    lengthof(ULmapWIN866)},
-    {PG_WIN874, LUmapWIN874, ULmapWIN874,
-        lengthof(LUmapWIN874),
-    lengthof(ULmapWIN874)},
-    {PG_WIN1250, LUmapWIN1250, ULmapWIN1250,
-        lengthof(LUmapWIN1250),
-    lengthof(ULmapWIN1250)},
-    {PG_WIN1251, LUmapWIN1251, ULmapWIN1251,
-        lengthof(LUmapWIN1251),
-    lengthof(ULmapWIN1251)},
-    {PG_WIN1252, LUmapWIN1252, ULmapWIN1252,
-        lengthof(LUmapWIN1252),
-    lengthof(ULmapWIN1252)},
-    {PG_WIN1253, LUmapWIN1253, ULmapWIN1253,
-        lengthof(LUmapWIN1253),
-    lengthof(ULmapWIN1253)},
-    {PG_WIN1254, LUmapWIN1254, ULmapWIN1254,
-        lengthof(LUmapWIN1254),
-    lengthof(ULmapWIN1254)},
-    {PG_WIN1255, LUmapWIN1255, ULmapWIN1255,
-        lengthof(LUmapWIN1255),
-    lengthof(ULmapWIN1255)},
-    {PG_WIN1256, LUmapWIN1256, ULmapWIN1256,
-        lengthof(LUmapWIN1256),
-    lengthof(ULmapWIN1256)},
-    {PG_WIN1257, LUmapWIN1257, ULmapWIN1257,
-        lengthof(LUmapWIN1257),
-    lengthof(ULmapWIN1257)},
-    {PG_WIN1258, LUmapWIN1258, ULmapWIN1258,
-        lengthof(LUmapWIN1258),
-    lengthof(ULmapWIN1258)},
+    {PG_WIN866, &win866_to_unicode_tree, &win866_from_unicode_tree},
+    {PG_WIN874, &win874_to_unicode_tree, &win874_from_unicode_tree},
+    {PG_WIN1250, &win1250_to_unicode_tree, &win1250_from_unicode_tree},
+    {PG_WIN1251, &win1251_to_unicode_tree, &win1251_from_unicode_tree},
+    {PG_WIN1252, &win1252_to_unicode_tree, &win1252_from_unicode_tree},
+    {PG_WIN1253, &win1253_to_unicode_tree, &win1253_from_unicode_tree},
+    {PG_WIN1254, &win1254_to_unicode_tree, &win1254_from_unicode_tree},
+    {PG_WIN1255, &win1255_to_unicode_tree, &win1255_from_unicode_tree},
+    {PG_WIN1256, &win1256_to_unicode_tree, &win1256_from_unicode_tree},
+    {PG_WIN1257, &win1257_to_unicode_tree, &win1257_from_unicode_tree},
+    {PG_WIN1258, &win1258_to_unicode_tree, &win1258_from_unicode_tree},
 };

 Datum
@@ -114,7 +90,7 @@ win_to_utf8(PG_FUNCTION_ARGS)
         if (encoding == maps[i].encoding)
         {
             LocalToUtf(src, len, dest,
-                       maps[i].map1, maps[i].size1,
+                       maps[i].map1, 0,
                        NULL, 0,
                        NULL,
                        encoding);
@@ -146,7 +122,7 @@ utf8_to_win(PG_FUNCTION_ARGS)
         if (encoding == maps[i].encoding)
         {
             UtfToLocal(src, len, dest,
-                       maps[i].map2, maps[i].size2,
+                       maps[i].map2, 0,
                        NULL, 0,
                        NULL,
                        encoding);
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 24e8d0d..38edbff 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -384,6 +384,28 @@ typedef struct
 } pg_utf_to_local;

 /*
+ * radix tree structure for faster conversion
+ */
+typedef struct pg_mb_radix_index
+{
+    uint8       lower, upper;               /* index range of b2idx */
+    uint32      idx[FLEXIBLE_ARRAY_MEMBER]; /* index body */
+} pg_mb_radix_index;
+
+typedef struct
+{
+    const uint8 chars_lower, chars_upper;   /* index range of chars* */
+    const bool  single_byte;                /* true if the first segment is
+                                             * for single byte characters */
+    const uint16 *chars16;                  /* 16 bit character table */
+    const uint32 *chars32;                  /* 32 bit character table */
+
+    const pg_mb_radix_index *b2idx;
+    const pg_mb_radix_index *b3idx[2];
+    const pg_mb_radix_index *b4idx[3];
+} pg_mb_radix_tree;
+
+/*
  * local code to UTF-8 conversion map
  */
 typedef struct
@@ -510,14 +532,14 @@ extern unsigned short CNStoBIG5(unsigned short cns, unsigned char lc);

 extern void UtfToLocal(const unsigned char *utf, int len,
            unsigned char *iso,
-           const pg_utf_to_local *map, int mapsize,
-           const pg_utf_to_local_combined *cmap, int cmapsize,
+           const void *map, int mapsize,
+           const void *combined_map, int cmapsize,
            utf_local_conversion_func conv_func,
            int encoding);
 extern void LocalToUtf(const unsigned char *iso, int len,
            unsigned char *utf,
-           const pg_local_to_utf *map, int mapsize,
-           const pg_local_to_utf_combined *cmap, int cmapsize,
+           const void *map, int mapsize,
+           const void *combined_cmap, int cmapsize,
            utf_local_conversion_func conv_func,
            int encoding);

@@ -551,6 +573,7 @@ extern void latin2mic_with_table(const unsigned char *l, unsigned char *p,
 extern void mic2latin_with_table(const unsigned char *mic, unsigned char *p,
                      int len, int lc, int encoding,
                      const unsigned char *tab);
+extern const uint32 pg_mb_radix_conv(const pg_mb_radix_tree *rt, const uint32 c);

 extern bool pg_utf8_islegal(const unsigned char *source, int length);

--
2.9.2
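
For readers trying to picture the lookup that pg_mb_radix_tree is meant to support: the body of pg_mb_radix_conv() is not part of this excerpt, so the following is only a rough sketch of the general idea (one index level selected by the leading byte, one leaf table selected by the trailing byte), restricted to 2-byte characters. Every name here (demo_radix_tree, demo_radix_conv, demo_idx, demo_chars, demo_tree) and all table contents are made up for illustration and do not come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical two-level lookup tree for 2-byte characters.
 * This is NOT the patch's pg_mb_radix_tree; layout and names are assumptions.
 */
typedef struct
{
    uint8_t         lower, upper;               /* valid range of the first byte */
    const uint16_t *idx;                        /* first byte -> offset into chars,
                                                 * 0 means "no mapping" */
    uint8_t         chars_lower, chars_upper;   /* valid range of the second byte */
    const uint32_t *chars;                      /* leaf table, 0 means "no mapping" */
} demo_radix_tree;

/* Return the mapped code for the 2-byte character c, or 0 if there is none. */
uint32_t
demo_radix_conv(const demo_radix_tree *rt, uint32_t c)
{
    uint8_t     b1 = (uint8_t) ((c >> 8) & 0xFF);
    uint8_t     b2 = (uint8_t) (c & 0xFF);
    uint16_t    off;

    assert(rt != NULL);

    /* this sketch handles 2-byte input only */
    if (c > 0xFFFF)
        return 0;

    /* reject bytes outside the ranges the tables cover */
    if (b1 < rt->lower || b1 > rt->upper)
        return 0;
    if (b2 < rt->chars_lower || b2 > rt->chars_upper)
        return 0;

    /* level 1: the first byte selects a block in the leaf table */
    off = rt->idx[b1 - rt->lower];
    if (off == 0)
        return 0;

    /* level 2: the second byte selects the entry inside that block */
    return rt->chars[off + (b2 - rt->chars_lower)];
}

/* Made-up sample data: first bytes 0x81..0x83, second bytes 0xA1..0xA3. */
static const uint16_t demo_idx[] = {1, 0, 4};
static const uint32_t demo_chars[] = {
    0,                          /* slot 0 is reserved for "no mapping" */
    0x3042, 0x3044, 0,          /* block for first byte 0x81 */
    0x30A2, 0, 0x30A4           /* block for first byte 0x83 */
};

const demo_radix_tree demo_tree = {
    0x81, 0x83, demo_idx,
    0xA1, 0xA3, demo_chars
};
```

With the sample tables, demo_radix_conv(&demo_tree, 0x81A1) yields 0x3042, while a character whose first byte has no block (for example 0x82A1) yields 0. The point of the layout is that a lookup costs a fixed number of array accesses; the struct in the patch additionally carries separate b2idx/b3idx/b4idx levels and 16-bit and 32-bit leaf tables, which this sketch does not attempt to model.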
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers