[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Changes by Jakub Wilk jw...@jwilk.net: -- nosy: +jwilk ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2066 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
STINNER Victor victor.stin...@haypocalc.com added the comment:

Hyeshik Chang, who opened this issue, wrote (msg83672):

> When I asked Taiwanese developers how often they use these character sets, it appeared that they are almost useless in the usual computing environment in Taiwan. This will only serve for a historical compatibility and literal standard compliance. (...)

I don't think that Python is the right place to support such an encoding. E.g., a patch for iconv would be a better idea (if iconv doesn't support this encoding yet).

I am closing this issue as "wont fix".

--
resolution: -> wont fix
status: open -> closed
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Terry J. Reedy tjre...@udel.edu added the comment:

It seems to me that the last few messages suggest that this should be closed.

--
nosy: +terry.reedy
versions: +Python 3.2 -Python 2.7, Python 3.1
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-03-14 02:32, Antoine Pitrou wrote:
> Antoine Pitrou pit...@free.fr added the comment:
>
> Based on the feedback above, it seems this should be committed, shouldn't it?

+1

As mentioned several times on the ticket: static C data is not really something to worry about these days.

--
title: Adding new CNS11643, a *huge* charset, support in cjkcodecs -> Adding new CNS11643, a *huge* charset,support in cjkcodecs
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Changes by STINNER Victor victor.stin...@haypocalc.com: -- nosy: +haypo
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Antoine Pitrou pit...@free.fr added the comment:

On Tuesday 17 March 2009 at 10:56, Marc-Andre Lemburg wrote:
> +1
>
> As mentioned several times on the ticket: static C data is not really something to worry about these days.

Well, I suggest that someone familiar with the codec-building machinery do the committing, in order to avoid mistakes :-)

--
title: Adding new CNS11643, a *huge* charset, support in cjkcodecs -> Adding new CNS11643, a *huge* charset, support in cjkcodecs
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Hye-Shik Chang hyes...@gmail.com added the comment:

When I asked Taiwanese developers how often they use these character sets, it appeared that they are almost useless in the usual computing environment in Taiwan. This will only serve for a historical compatibility and literal standard compliance.

I'm quite neutral in adding this into python without any user's request from Taiwan (I'm from South Korea :), but I can finish committing it with pleasure if you are still fond of the codec.
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-03-17 13:30, Hye-Shik Chang wrote:
> Hye-Shik Chang hyes...@gmail.com added the comment:
>
> When I asked Taiwanese developers how often they use these character sets, it appeared that they are almost useless in the usual computing environment in Taiwan. This will only serve for a historical compatibility and literal standard compliance.
>
> I'm quite neutral in adding this into python without any user's request from Taiwan (I'm from South Korea :), but I can finish committing it with pleasure if you are still fond of the codec.

If there's no user base for it, then we should not include it.

I was under the impression that this charset is essential for the Taiwanese and Chinese (http://www.cns11643.gov.tw/).

However, the wiki page http://en.wikipedia.org/wiki/CNS_11643 says "In practice, variants of Big5 are de facto standard", so perhaps there's no real need for the codec after all.

The German version of the wiki page mentions that CNS11643 is the legal standard charset, but that it is not used much in practice because it needs 3 bytes per glyph instead of just 2 for Big5 variants.

The Chinese version of the wiki page says more or less the same:

http://translate.google.de/translate?hl=ensl=zh-TWu=http://zh.wikipedia.org/wiki/%25E5%259C%258B%25E5%25AE%25B6%25E6%25A8%2599%25E6%25BA%2596%25E4%25B8%25AD%25E6%2596%2587%25E4%25BA%25A4%25E6%258F%259B%25E7%25A2%25BCei=C52_SZepPJKTsAbw8PW5DQsa=Xoi=translateresnum=1ct=resultprev=/search%3Fq%3Dhttp://zh.wikipedia.org/wiki/%2525E5%25259C%25258B%2525E5%2525AE%2525B6%2525E6%2525A8%252599%2525E6%2525BA%252596%2525E4%2525B8%2525AD%2525E6%252596%252587%2525E4%2525BA%2525A4%2525E6%25258F%25259B%2525E7%2525A2%2525BC%26hl%3Den%26sa%3DG

--
title: Adding new CNS11643, a *huge* charset, support in cjkcodecs -> Adding new CNS11643, a *huge* charset,support in cjkcodecs
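To make the size argument a little more concrete, here is a minimal sketch of the Big5 half of that comparison, using the "big5" codec that already ships with CPython. The sample string is an arbitrary choice made for this illustration. The CNS11643 side cannot be measured the same way here, because that codec is exactly what this issue proposes to add.

```python
# Illustration only: count how many bytes Big5 spends per Han character.
# The sample text is an arbitrary assumption, written with escapes so the
# snippet does not depend on the source file encoding.
text = "\u4e2d\u6587"  # two common Han characters

encoded = text.encode("big5")
bytes_per_char = len(encoded) / len(text)

print(encoded)         # the raw Big5 bytes
print(bytes_per_char)  # average number of bytes per character
```

A round trip through decode("big5") gives back the original string, so the measurement reflects a real, lossless encoding of the sample.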
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Antoine Pitrou pit...@free.fr added the comment:

Based on the feedback above, it seems this should be committed, shouldn't it?

--
nosy: +pitrou
stage: -> commit review
type: -> feature request
versions: +Python 2.7, Python 3.1 -Python 2.6, Python 3.0
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Marc-Andre Lemburg added the comment:

Whether or not to keep placing all builtin modules into the Windows Python DLL is not really a question to be discussed on the tracker.

Given the size of the Python DLL (around 2MB) and the extra 350kB that the support for CNS11643 would cost, I think such a discussion is pretty pointless.

I'm still +1 on the basis of enhancing the Taiwanese Python experience by adding their standard character set to the default Python install.
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Giovanni Bajo added the comment:

Making the standard Windows Python DLL larger is not only a problem of disk size: it will make all packages produced by PyInstaller or py2exe larger, and that means lots of wasted bandwidth.

I see that MvL is still -1 on simply splitting the CJK codecs out, and vetoes it by asking for generalization work of insane proportions (a hard-to-define PEP, an entirely new build system for Windows, etc.).

I understand (and *agree*) that having a general rule would be a much superior solution, but CJK is already almost 50% of python.dll, so it *is* already a special case by any means. And special cases like these could be handled with special-case decisions.

Thus, I still strongly disagree with MvL and would like CJK to be split out of python.dll as soon as possible. I would not really ask this for any module other than CJK, and I understand that further actions would really require a PEP and a new build system for Windows.

So, I again ask MvL to soften his position and reconsider the CJK splitting in all its singularity. Please!

(In case it's not clear, I would prepare a patch to split CJK out any day if there were hopes that it would get accepted.)

--
nosy: +giovannibajo
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Hye-Shik Chang added the comment:

I have generated compressed mapping tables in several ways. I extracted the mapping data into individual files and reorganized them, either by translating them into Python source code or by archiving them into a zip file. The following table shows the results, in kilobytes (also available at http://spreadsheets.google.com/pub?key=pWRBaY2ZM7mRgddF0Itd2IA ):

                none  minimal  MSjk  MSall  current
Text               0      207   312    342      570
Data             904      696   592    562      333
raw-py          3006     2392  2016   1932      996
zip-py           720      496   416    384      304
raw-pyc          952      734   624    590      346
zip-pyc          560      384   336    304      240
Text+zip-pyc     560      591   648    646      810
raw-both        3954     3124  2638   2520     1340
zip-both        1248      864   736    672      512
zip-bare         560      384   336    304      240
tarbz2-bare      496      352   320    304      240

The columns show which mapping tables are separated out into external files:

  none: no mapping is left as static const C data.
  minimal: the major character set for each country stays in static C data and the others are moved out.
  MSjk: additionally keeps some more MS codepages of Japan and Korea in static const C data.
  MSall: keeps all MS codepage extensions in static const C data.
  current: only the new cns11643 mappings are extracted.

We could either fix the list of character sets that remain as C data, or let users pick the sets with a configure option.

The rows are:

  Text: the portion that remains in static const C data, which is where all the current mapping tables live. As discussed when CJKCodecs was integrated into Python, this data can be shared between processes in a system and is efficient, but it can't easily be compressed or reorganized by users for redistribution.
  Data: the externally managed mapping tables.
  raw-py: the total volume of the mapping tables as Python source code.
  raw-pyc: the compiled (pyc) version of the mapping tables.
  zip-py, zip-pyc: zip-compressed archives of raw-py and raw-pyc, respectively. These can be imported using Python's zipimport machinery.
  zip-bare, tarbz2-bare: the volume of the archived raw mapping table files, as the names suggest.

We currently have 560KB of mapping tables in the CJKCodecs part of Python. If we choose zip-pyc with the minimal set, the binary distribution will be about as big as before even with the CNS11643 character set included, and pythonXY.dll will get smaller by 363KB.

What do you think about this scheme? Or are there any other ideas for compression?
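The closing figures in this message can be rechecked from the table rows. The sketch below copies three rows from the message (sizes in kilobytes) and recomputes the 363KB DLL saving and the total for the "minimal" set with zip-pyc data. The variable names are invented for this illustration.

```python
# Sizes in kilobytes, copied from the rows of the table in this message.
COLUMNS = ["none", "minimal", "MSjk", "MSall", "current"]
TEXT = dict(zip(COLUMNS, [0, 207, 312, 342, 570]))
ZIP_PYC = dict(zip(COLUMNS, [560, 384, 336, 304, 240]))
TEXT_PLUS_ZIP_PYC = dict(zip(COLUMNS, [560, 591, 648, 646, 810]))

# The "Text+zip-pyc" row is the sum of the "Text" and "zip-pyc" rows.
for col in COLUMNS:
    assert TEXT[col] + ZIP_PYC[col] == TEXT_PLUS_ZIP_PYC[col]

# Static C data that leaves the DLL when going from "current" to "minimal".
dll_savings_kb = TEXT["current"] - TEXT["minimal"]

# Static C data plus zipped .pyc tables shipped for the "minimal" set.
minimal_total_kb = TEXT_PLUS_ZIP_PYC["minimal"]

print(dll_savings_kb, minimal_total_kb)  # 363 591
```

So the 363KB figure is the difference between the "current" and "minimal" entries of the Text row, and the 591KB total for "minimal" sits close to the roughly 560KB to 570KB of tables shipped today.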
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Marc-Andre Lemburg added the comment:

I think Martin was looking for other optimizations that still leave the data in a static C const (in order to be shared between processes and only loaded on demand), but do compress the data representation, e.g. using some form of Huffman coding.

While I don't see adding a few 100kB of static C data to a DLL as a major problem (even less so if it's possible to disable support via a configure switch, e.g. for embedded systems), it would be interesting to check whether the lookup tables can be compressed by way of their structure.
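One way to read "compressed by way of their structure" is a two-level table that stores, for each lead byte, only the span of trail bytes that is actually assigned. The sketch below is a hypothetical illustration in Python, not a description of how cjkcodecs lays out its C tables. The function names, the sentinel value, and the sample data are all assumptions made up for this example.

```python
from array import array

UNMAPPED = 0xFFFE  # sentinel for holes inside a stored span (assumption)


def build_index(mapping):
    """Turn {(lead, trail): code_point} into {lead: (bottom, top, row)}."""
    rows = {}
    for (lead, trail), code_point in mapping.items():
        rows.setdefault(lead, {})[trail] = code_point
    index = {}
    for lead, cells in rows.items():
        bottom, top = min(cells), max(cells)
        # Store only the trail bytes between the first and last assigned one.
        row = array("H", (cells.get(t, UNMAPPED) for t in range(bottom, top + 1)))
        index[lead] = (bottom, top, row)
    return index


def lookup(index, lead, trail):
    """Return the code point for (lead, trail), or None if unassigned."""
    entry = index.get(lead)
    if entry is None:
        return None
    bottom, top, row = entry
    if not bottom <= trail <= top:
        return None
    code_point = row[trail - bottom]
    return None if code_point == UNMAPPED else code_point


# Made-up sample data, not taken from any real CNS11643 table.
sample = {(0x21, 0x30): 0x4E00, (0x21, 0x32): 0x4E01, (0x45, 0x7E): 0x9FA5}
index = build_index(sample)
stored_cells = sum(len(row) for _, _, row in index.values())
print(stored_cells)  # 16-bit cells actually stored
```

A lookup stays a bounds check plus one array index, which matters because the thread's concern is that neater compression schemes cost too much performance. The saving depends entirely on how sparse the real tables are, which this sketch does not measure.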
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Hye-Shik Chang added the comment:

I couldn't find an appropriate method to implement an in situ compressed mapping table. AFAIK, Python has the smallest mapping table footprint for each charset among the major open source transcoding programs. I have thought about compression many times, but every neat method required a severe performance sacrifice.
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Marc-Andre Lemburg added the comment:

In that case, I'm +1 on adding it. The OS won't load those tables unless really needed, so it's more a question of disk space than anything else.
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Martin v. Löwis added the comment:

I would like to see whether a compression mechanism of the tables could be found. If all else fails, compressing with raw zlib might improve things, but before that, I think other compression techniques should be studied.

I'm still -1 on ad-hoc exclusion of extension modules from pythonxy.dll. If this module is to be excluded, a general policy should be established that determines what modules get compiled separately, and an automation mechanism should be established that automates generation of appropriate build infrastructure for modules built separately under this policy.

--
nosy: +loewis
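As a rough picture of the zlib fallback mentioned here, the sketch below packs a table of 16-bit code points, compresses it with the standard zlib module, and decompresses it once on first use. The table contents are synthetic stand-in data, and the helper name get_table is invented for this example, so the printed sizes say nothing about the real CNS11643 tables.

```python
import struct
import zlib

# Synthetic stand-in for a mapping table: 94 * 94 16-bit code points.
table = [0x4E00 + i for i in range(94 * 94)]
raw = struct.pack("<%dH" % len(table), *table)

# Compress once, e.g. at build time; level 9 favours size over speed.
packed = zlib.compress(raw, 9)

_cache = None


def get_table():
    """Decompress on first use and keep the decoded table in memory."""
    global _cache
    if _cache is None:
        data = zlib.decompress(packed)
        _cache = struct.unpack("<%dH" % (len(data) // 2), data)
    return _cache


print(len(raw), len(packed))  # uncompressed vs. compressed size in bytes
```

One consequence of this design is that the decompressed copy lives in each process's own memory, so it gives up the sharing between processes that static const C data provides, which is the trade-off discussed elsewhere in this thread.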
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Kuang-che Wu added the comment:

FYI, the new spec, CNS11643-2004, is available as a preview (you can search for it from http://www.cnsonline.com.tw/; the preview is at http://www.cnsonline.com.tw/preview/preview.jsp?general_no=1164300language=Cpagecount=524). Page 499 mentions the URL http://www.cnscode.org.tw/, and the version 3 mapping table can be found at http://www.cnscode.org.tw/cnscode/csic_ucs.jsp

--
nosy: +kcwu
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Martin v. Löwis added the comment:

BTW, which version of CNS11643 does that implement? AFAICT, there is CNS 11643-1986 and CNS 11643-1992. Where did you get the Unicode mapping from?
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Marc-Andre Lemburg added the comment:

Some background information:

http://www.cns11643.gov.tw/eng/word.jsp

The most recent version appears to be CNS11643-2004, sometimes also called CNS11643 version 3 or CNS11643-3 (http://docs.hp.com/en/5991-7974/5991-7974.pdf).

Here's the table for version 1 (1986):

ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT

Versions 1 and 2 (1992) are also included in the official Unicode Han character database (along with several other mappings):

http://www.unicode.org/charts/unihan.html

I couldn't find a reference to a version 3 mapping table.
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Marc-Andre Lemburg added the comment:

How often would this character set be needed?

In any case, using a (pre)compiler switch is not a good idea. Please add a configure switch to enable or disable the support.

--
nosy: +lemburg
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Changes by Hye-Shik Chang:

--
title: Adding new CNS11643 support, a *huge* charset, in cjkcodecs -> Adding new CNS11643, a *huge* charset, support in cjkcodecs
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Hye-Shik Chang added the comment:

I've generated the mapping table from ICU's CNS11643-1992 mapping.

I see that CNS11643 is quite rarely used on the Internet, but it's the only national standard character set in Taiwan. When I asked Taiwanese Python users, even they didn't think it was necessary to add it to Python.

I'll study how much compression is possible and how efficient it is, then submit a revised patch. Thank you for the comments!
[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs
Amaury Forgeot d'Arc added the comment:

In this case let's put the cjkcodecs modules in their own DLL(s) on win32.

--
nosy: +amaury.forgeotdarc