Package: libc6
Version: 2.3.6.ds1-13
Severity: important

Problem: ~ ' \ conversion.
In short, iconv should not do smart guessing for the 7-bit section of the traditional encodings that are ASCII compatible. That 7-bit section should stay the same in all of them.

Here we go....

All popular C/perl/shell/... programs originally written in latin-1, latin-2, ..., shift-jis, euc-jp, ... encodings will break if iconv is used to convert them to UTF-8. (Of course, the 8-bit text in them is in the comments and the quoted text.) iconv does a half-smart job to please some cosmetic factors, but forgets how these encodings were originally developed and used in real life, so it is harmful to the data.

In this sense I could file this as a grave bug for breaking data, but considering the timing I stay with important. (After etch, I may raise this bug's severity.)

All these encodings (latin-1, latin-2, ..., shift-jis, euc-jp, ...) were developed so that non-ASCII characters could be expressed without breaking existing tools and code developed for ASCII. That is why they are ASCII compatible: all characters represented by 0x00-0x7f (7 bit) share the same positions. (In Japan we do use an alternative font for ASCII 0x5c = backslash = '\' which makes it look like the Japanese yen mark, but \ in ASCII and yen in shift-jis serve the same purpose in the program world. The C standard even mentions the dual nature of \.)

So when changing the encoding of a file, we expect all of 0x00-0x7f (7 bit) to remain the same. But iconv does many funny things.

The code 0x27 (single quote) is changed to something else (a long UTF-8 sequence for a single quote) when converting from any of latin-1, latin-2, shift-jis, euc-jp, ... to UTF-8. This is not expected.

For shift-jis it is even worse. iconv tries to map character 0x5c to the UTF-8 YEN mark. That mapping should be done for the yen mark code in the 16-bit (full-width character) section and not for this 7-bit one. This is very bad for any program.

Another issue is 0x7e '~'. This is translated to an upper bar.
Although some old Japanese PCs (pre-IBM-compatible NEC 98 machines, I think) had an upper-bar-shaped font for ~, converting this ~ in data to the UTF-8 upper bar breaks URL data stored on shift-jis machines.

The choice of conversion table should not be based on superficial shape comparison but should take full account of actual usage and its implications. iconv being a basic tool, it should not do these conversions on the 7-bit codes of these encodings. If anyone wants a syntactical pretty-print conversion of UTF-8 text, they should rely on some other tool. Then they can use opening and closing quotes if they wish. But we can keep C programs right. Many old C programs in each locale used these ASCII-compatible encodings, and all we want to do is convert the quoted text and comments to UTF-8.

-- System Information:
Debian Release: lenny/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing')
Architecture: amd64 (x86_64)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.6.18-mactel64
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)

Versions of packages libc6 depends on:
ii  tzdata  2007d-1  Time Zone and Daylight Saving Time

libc6 recommends no packages.

-- debconf-show failed

Conversion results are attached as diffs.
--- ascii.txt 2007-04-07 00:10:04.000000000 +0900
+++ eucj-utf8.txt 2007-04-07 00:10:26.000000000 +0900
@@ -39,7 +39,7 @@
  044 36 24 $ 144 100 64 d
  045 37 25 % 145 101 65 e
  046 38 26 & 146 102 66 f
- 047 39 27 ’ 147 103 67 g
+ 047 39 27 147 103 67 g
  050 40 28 ( 150 104 68 h
  051 41 29 ) 151 105 69 i
  052 42 2A * 152 106 6A j
--- ascii.txt 2007-04-07 00:10:04.000000000 +0900
+++ shiftj-utf8.txt 2007-04-07 00:10:45.000000000 +0900
@@ -28,7 +28,7 @@
  031 25 19 EM 131 89 59 Y
  032 26 1A SUB 132 90 5A Z
  033 27 1B ESC 133 91 5B [
- 034 28 1C FS 134 92 5C \
+ 034 28 1C FS 134 92 5C \
  035 29 1D GS 135 93 5D ]
  036 30 1E RS 136 94 5E ^
  037 31 1F US 137 95 5F _
@@ -39,7 +39,7 @@
  044 36 24 $ 144 100 64 d
  045 37 25 % 145 101 65 e
  046 38 26 & 146 102 66 f
- 047 39 27 ’ 147 103 67 g
+ 047 39 27 窶 147 103 67 g
  050 40 28 ( 150 104 68 h
  051 41 29 ) 151 105 69 i
  052 42 2A * 152 106 6A j
@@ -62,6 +62,6 @@
  073 59 3B ; 173 123 7B {
  074 60 3C < 174 124 7C |
  075 61 3D = 175 125 7D }
- 076 62 3E > 176 126 7E ~
+ 076 62 3E > 176 126 7E ~
  077 63 3F ? 177 127 7F DEL
--- ascii.txt 2007-04-07 00:10:04.000000000 +0900
+++ l1-utf8.txt 2007-04-07 00:10:59.000000000 +0900
@@ -39,7 +39,7 @@
  044 36 24 $ 144 100 64 d
  045 37 25 % 145 101 65 e
  046 38 26 & 146 102 66 f
- 047 39 27 ’ 147 103 67 g
+ 047 39 27 â 147 103 67 g
  050 40 28 ( 150 104 68 h
  051 41 29 ) 151 105 69 i
  052 42 2A * 152 106 6A j
--- ascii.txt 2007-04-07 00:10:04.000000000 +0900
+++ l2-utf8.txt 2007-04-07 00:11:05.000000000 +0900
@@ -39,7 +39,7 @@
  044 36 24 $ 144 100 64 d
  045 37 25 % 145 101 65 e
  046 38 26 & 146 102 66 f
- 047 39 27 ’ 147 103 67 g
+ 047 39 27 â 147 103 67 g
  050 40 28 ( 150 104 68 h
  051 41 29 ) 151 105 69 i
  052 42 2A * 152 106 6A j
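
The expectation stated in this report, that every byte in 0x00-0x7f keeps its code point when an ASCII-compatible encoding is converted to UTF-8, can be checked mechanically. The following is only a rough sketch under stated assumptions: it uses Python's built-in codecs rather than glibc iconv, so it does not reproduce the behaviour reported here. The names changed_ascii_bytes and yen_overline_decoder are hypothetical, and the second one merely imitates the 0x5c and 0x7e mapping complained about above, assuming "upper bar" means U+203E OVERLINE.

```python
from typing import Callable, Dict


def changed_ascii_bytes(decode: Callable[[bytes], str]) -> Dict[int, str]:
    """Return {byte value: decoded text} for every 7-bit byte that the
    given decoder does not map to the identical Unicode code point."""
    changed = {}
    for value in range(0x80):
        text = decode(bytes([value]))
        if text != chr(value):
            changed[value] = text
    return changed


def yen_overline_decoder(data: bytes) -> str:
    """Hypothetical decoder imitating the mapping described in this report:
    0x5c becomes YEN SIGN and 0x7e becomes OVERLINE (an assumption for
    'upper bar'); every other byte keeps its own code point."""
    table = {0x5C: "\u00a5", 0x7E: "\u203e"}
    return "".join(table.get(b, chr(b)) for b in data)


if __name__ == "__main__":
    # Python's latin-1 codec leaves 0x00-0x7f untouched, so nothing is listed.
    print(changed_ascii_bytes(lambda b: b.decode("latin-1")))
    # The imitation decoder changes exactly the two bytes discussed above.
    for value, text in changed_ascii_bytes(yen_overline_decoder).items():
        print(f"0x{value:02x} -> U+{ord(text):04X}")
```

Because changed_ascii_bytes accepts any callable from bytes to str, the output of a real iconv run could be wrapped in such a callable and checked the same way; an empty result would mean the 7-bit section survived conversion intact.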