[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2013-07-24 Thread Jakub Wilk

Changes by Jakub Wilk jw...@jwilk.net:


--
nosy: +jwilk

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2066
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2010-08-12 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

Hyeshik Chang, who opened this issue, wrote (msg83672): "When I asked Taiwanese 
developers how often they use these character sets, it appeared that they are 
almost useless in the usual computing environment in Taiwan. This will only 
serve for a historical compatibility and literal standard compliance. (...)"

I don't think that Python is the right place to support such an encoding. A 
patch for iconv, for example, would be a better idea (if iconv doesn't support 
this encoding yet).

I'm closing this issue as "wont fix".

--
resolution:  -> wont fix
status: open -> closed




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2010-08-08 Thread Terry J. Reedy

Terry J. Reedy tjre...@udel.edu added the comment:

It seems to me that the last few messages suggest that this should be closed.

--
nosy: +terry.reedy
versions: +Python 3.2 -Python 2.7, Python 3.1




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2009-03-17 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-03-14 02:32, Antoine Pitrou wrote:
> Antoine Pitrou pit...@free.fr added the comment:
> 
> Based on the feedback above, it seems this should be committed,
> shouldn't it?

+1

As mentioned several times on the ticket: static C data is not really
something to worry about these days.

--
title: Adding new CNS11643, a *huge* charset, support in cjkcodecs -> Adding 
new CNS11643, a *huge* charset,support in cjkcodecs




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2009-03-17 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@haypocalc.com:


--
nosy: +haypo




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2009-03-17 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

On Tuesday 17 March 2009 at 10:56, Marc-Andre Lemburg wrote:
> +1
> 
> As mentioned several times on the ticket: static C data is not really
> something to worry about these days.

Well, I suggest that someone familiar with the codec-building machinery
do the committing, in order to avoid mistakes :-)

--
title: Adding new CNS11643, a *huge* charset,   support in cjkcodecs -> Adding 
new CNS11643, a *huge* charset, support in cjkcodecs




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2009-03-17 Thread Hye-Shik Chang

Hye-Shik Chang hyes...@gmail.com added the comment:

When I asked Taiwanese developers how often they use these character
sets, it appeared that they are almost useless in the usual computing
environment in Taiwan.  This will only serve for a historical
compatibility and literal standard compliance.  I'm quite neutral in
adding this into python without any user's request from Taiwan (I'm from
South Korea :), but I can finish committing it with pleasure if you are
still fond of the codec.

--




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2009-03-17 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-03-17 13:30, Hye-Shik Chang wrote:
> Hye-Shik Chang hyes...@gmail.com added the comment:
> 
> When I asked Taiwanese developers how often they use these character
> sets, it appeared that they are almost useless in the usual computing
> environment in Taiwan.  This will only serve for a historical
> compatibility and literal standard compliance.  I'm quite neutral in
> adding this into python without any user's request from Taiwan (I'm from
> South Korea :), but I can finish committing it with pleasure if you are
> still fond of the codec.

If there's no user base for it, then we should not include it.

I was under the impression that this charset is essential for the Taiwanese
and Chinese (http://www.cns11643.gov.tw/).

However, the wiki page http://en.wikipedia.org/wiki/CNS_11643
says "In practice, variants of Big5 are de facto standard.", so perhaps
there's no real need for the codec after all.

The German version of the wiki page mentions that CNS11643 is the legal
standard charset, but not used much in practice because it needs 3 bytes
per glyph instead of just 2 for Big5 variants.
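
The two-byte figure for Big5 is easy to check with the codecs that Python
already ships. The snippet below is only a small sketch: the sample string
is arbitrary, and the codec names probed in the loop ("cns11643",
"euc_tw") are guesses used for illustration, not names taken from the
patch.

```python
import codecs

text = "中文"  # two common Traditional Chinese characters (arbitrary sample)

# Big5 stores each of these characters in two bytes.
big5_bytes = text.encode("big5")
print(len(text), "characters ->", len(big5_bytes), "bytes in Big5")

# Probe for a CNS 11643 based codec. The names are hypothetical guesses;
# the loop reports whatever this particular interpreter provides.
for name in ("cns11643", "euc_tw"):
    try:
        codecs.lookup(name)
        print(name, "is available")
    except LookupError:
        print(name, "is not available")
```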

The Chinese version of the wiki page says more or less the same:

http://translate.google.de/translate?hl=ensl=zh-TWu=http://zh.wikipedia.org/wiki/%25E5%259C%258B%25E5%25AE%25B6%25E6%25A8%2599%25E6%25BA%2596%25E4%25B8%25AD%25E6%2596%2587%25E4%25BA%25A4%25E6%258F%259B%25E7%25A2%25BCei=C52_SZepPJKTsAbw8PW5DQsa=Xoi=translateresnum=1ct=resultprev=/search%3Fq%3Dhttp://zh.wikipedia.org/wiki/%2525E5%25259C%25258B%2525E5%2525AE%2525B6%2525E6%2525A8%252599%2525E6%2525BA%252596%2525E4%2525B8%2525AD%2525E6%252596%252587%2525E4%2525BA%2525A4%2525E6%25258F%25259B%2525E7%2525A2%2525BC%26hl%3Den%26sa%3DG

--
title: Adding new CNS11643, a *huge* charset, support in cjkcodecs -> Adding 
new CNS11643, a *huge* charset,support in cjkcodecs




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2009-03-13 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

Based on the feedback above, it seems this should be committed,
shouldn't it?

--
nosy: +pitrou
stage:  -> commit review
type:  -> feature request
versions: +Python 2.7, Python 3.1 -Python 2.6, Python 3.0




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-17 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

Whether or not to keep placing all builtin modules into the Windows
Python DLL is not really a question to be discussed on the tracker.
Given the size of the Python DLL (around 2MB) and the extra 350kB that
the support for CNS11643 would cost, I think such a discussion is pretty
pointless.

I'm still +1 on the basis of enhancing the Taiwanese Python experience
by adding their standard character set to the default Python install.




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-16 Thread Giovanni Bajo

Giovanni Bajo added the comment:

Making the standard Windows Python DLL larger is not only a problem of
disk size: it will make all packages produced by PyInstaller or py2exe
larger, and that means lots of wasted bandwidth.

I see that MvL is still -1 on simply splitting CJK codecs out, and vetoes
it by asking for generalization work of insane proportions (a
hard-to-define PEP, an entirely new build system for Windows, etc.).

I understand (and *agree*) that having a general rule would be a much
superior solution, but CJK is already almost 50% of the python.dll, so
it *is* already a special case by any means. And special cases like
these could be handled with special-case decisions.

Thus, I still strongly disagree with MvL and would like CJK to be split
out of python.dll as soon as possible. I would not really ask this for
any other modules but CJK, and understand that further actions would
really require a PEP and a new build system for Windows.

So, I again ask MvL to soften his position and reconsider the CJK
splitting in all its singularity. Please!

(in case it's not clear, I would prepare a patch to split CJK out any day
if there were hopes that it gets accepted)

--
nosy: +giovannibajo




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-14 Thread Hye-Shik Chang

Hye-Shik Chang added the comment:

I have generated compressed mapping tables in several ways.

I extracted mapping data into individual files and reorganized
them by translating into Python source code or archiving into a zip file.

The following table shows the result: (in kilobytes)
(also available at
http://spreadsheets.google.com/pub?key=pWRBaY2ZM7mRgddF0Itd2IA )

                   none  minimal     MSjk    MSall  current
Text                  0      207      312      342      570
Data                904      696      592      562      333

raw-py             3006     2392     2016     1932      996
zip-py              720      496      416      384      304

raw-pyc             952      734      624      590      346
zip-pyc             560      384      336      304      240
Text+zip-pyc        560      591      648      646      810

raw-both           3954     3124     2638     2520     1340
zip-both           1248      864      736      672      512

zip-bare            560      384      336      304      240
tarbz2-bare         496      352      320      304      240

Columns represent which mapping files are separated into external
files.  In "none", no mapping is left as static const C data, while
in the "current" column only the new cns11643 mappings are extracted.
The "minimal" set keeps the major character set for each country in
static C data and moves the others out.  "MSjk" additionally keeps some
MS codepages of Japan and Korea, and "MSall" keeps all MS codepage
extensions in static const C data.  We may fix the list of character
sets that remain as C data, or let users pick the sets using a
configure option.

"Text" is the portion that remains in static const C data, which is
where all the current mapping tables live.  As discussed when CJKCodecs
was integrated into Python, static data can be shared between processes
in a system and is efficient, but it can't be compressed or reorganized
easily by users for redistribution.  "Data" is the externally managed
mapping tables.

The "raw-py" row shows the total volume of the mapping tables as Python
source code.  "raw-pyc" shows the compiled (pyc) version of the mapping
tables.  "zip-py" and "zip-pyc" are zip-compressed archives of
"raw-py" and "raw-pyc", respectively.  Those can be imported
using the Python zipimport machinery.
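
As a rough illustration of the zipimport route, a mapping table stored as
Python source inside a zip archive becomes importable once the archive is
on sys.path. Everything named here (the module name, the archive name, and
the two sample entries) is made up for the sketch and is not taken from
the patch.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Hypothetical module name and sample table contents, for illustration only.
MODULE_NAME = "demo_mapping_table"
MODULE_SOURCE = "DECMAP = {0x2121: 0x3000, 0x2122: 0xFF0C}\n"

workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "mappings.zip")

# Store the table as Python source inside a deflate-compressed zip archive.
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr(MODULE_NAME + ".py", MODULE_SOURCE)

# With the archive on sys.path, the zipimport hook can locate the module.
sys.path.insert(0, archive)
mapping = importlib.import_module(MODULE_NAME)

print(mapping.__file__)
print(hex(mapping.DECMAP[0x2121]))
```

The table is only decompressed and turned into Python objects when the
module is first imported, so each process pays that cost separately.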

"zip-bare" and "tarbz2-bare" show the volume of the archived raw mapping
table files, as their names suggest.

We have 560KB of mapping tables in the Python CJKCodecs part.
If we choose "zip-pyc" of the "minimal" set, the binary distribution
will be just as big as before even if we include the CNS11643 character
set, and pythonXY.dll will get smaller by 363KB.
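
The arithmetic behind these two figures can be reproduced from the "Text"
and "zip-pyc" rows of the table. This is one reading of the numbers, with
the row values copied from the table above (in kilobytes); the variable
names are mine.

```python
COLUMNS = ["none", "minimal", "MSjk", "MSall", "current"]
TEXT = [0, 207, 312, 342, 570]        # static C data left in the DLL (KB)
ZIP_PYC = [560, 384, 336, 304, 240]   # zipped .pyc mapping tables (KB)

# Shipped size per layout: static C data plus the zipped external tables.
footprint = {c: t + z for c, t, z in zip(COLUMNS, TEXT, ZIP_PYC)}

# DLL shrinkage when moving from the "current" layout to the "minimal" one.
dll_saving = TEXT[COLUMNS.index("current")] - TEXT[COLUMNS.index("minimal")]

print(footprint)
print("DLL saving:", dll_saving, "KB")
```

Under this reading, "minimal" ships 591KB in total, close to the C data
shipped today, while the DLL itself loses 363KB.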

What do you think about this scheme?
Any other ideas for compression?




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-14 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

I think Martin was looking for other optimizations that still leave the
data in a static C const (in order to be shared between processes and
only loaded on demand), but do compress the data representation, e.g.
using some form of Huffman coding.

While I don't see adding a few hundred kB of static C data to a DLL as a
major problem (even less so, if it's possible to disable support via a
configure switch, e.g. for embedded systems), it would be interesting to
check whether the lookup tables can be compressed by way of their
structure.
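
One common way to shrink a sparse lookup table through its structure,
without a decompression step, is a two-level index: one row per lead byte,
with each row trimmed to the range of trail bytes actually in use. The
sketch below only illustrates that general idea in Python; it is not taken
from the patch, and the function names and sample entries are made up.

```python
def build_index(mapping):
    """Group a {code: unicode} dict by lead byte and trim each row."""
    rows = {}
    for code, uni in mapping.items():
        lead, trail = code >> 8, code & 0xFF
        rows.setdefault(lead, {})[trail] = uni
    index = {}
    for lead, cells in rows.items():
        bottom, top = min(cells), max(cells)
        # Only the span bottom..top is stored; gaps inside it hold None.
        table = [cells.get(t) for t in range(bottom, top + 1)]
        index[lead] = (bottom, top, table)
    return index


def lookup(index, code):
    """Return the mapped value for a two-byte code, or None if unmapped."""
    lead, trail = code >> 8, code & 0xFF
    entry = index.get(lead)
    if entry is None:
        return None
    bottom, top, table = entry
    if not bottom <= trail <= top:
        return None
    return table[trail - bottom]


# Made-up sample entries, for illustration only.
sample = {0x2121: 0x3000, 0x2122: 0xFF0C, 0x2124: 0x3001, 0x4421: 0x4E00}
index = build_index(sample)
stored = sum(len(table) for _, _, table in index.values())
print("cells stored:", stored, "instead of", 256 * 256)
print(hex(lookup(index, 0x2122)), lookup(index, 0x2123))
```

Lookups stay at two array accesses, so the data can remain static and
read-only; the saving comes purely from not storing unused ranges.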




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-14 Thread Hye-Shik Chang

Hye-Shik Chang added the comment:

I couldn't find an appropriate method to implement an in situ
compressed mapping table.  AFAIK, Python has the smallest
mapping table footprint for each charset among major open
source transcoding programs.  I have thought about the
compression many times, but every neat method required
a severe performance sacrifice.




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-14 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

In that case, I'm +1 on adding it.

The OS won't load those tables unless really needed, so it's more a
question of disk space than anything else.




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-11 Thread Martin v. Löwis

Martin v. Löwis added the comment:

I would like to see whether a compression mechanism of the tables could
be found. If all else fails, compressing with raw zlib might improve
things, but before that, I think other compression techniques should be
studied.
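
For reference, the "raw zlib" fallback could look roughly like the
following in Python: the table is packed as 16-bit values and written as a
raw deflate stream. The table contents are invented for the sketch (a run
of consecutive code points followed by unassigned slots); real mapping
tables will compress differently.

```python
import struct
import zlib

# Hypothetical 94 x 94 table: some consecutive code points, then slots
# filled with U+FFFD to stand in for unassigned positions.
assigned = [0x4E00 + i for i in range(4000)]
entries = assigned + [0xFFFD] * (94 * 94 - len(assigned))
raw = struct.pack("<%dH" % len(entries), *entries)

# wbits=-15 selects a raw deflate stream (no zlib header or checksum).
compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
packed = compressor.compress(raw) + compressor.flush()

# The same negative wbits value is needed to read the stream back.
restored = zlib.decompress(packed, -15)
assert restored == raw
print(len(raw), "->", len(packed), "bytes")
```

The catch is that the decompressed table has to live in each process's
private memory, so it is no longer shared between processes the way
static C data is.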

I'm still -1 on ad-hoc exclusion of extension modules from pythonxy.dll.
If this module is to be excluded, a general policy should be established
that determines what modules get compiled separately, and an automation
mechanism should be established that automates generation of appropriate
build infrastructure for modules built separately under this policy.

--
nosy: +loewis




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-11 Thread Kuang-che Wu

Kuang-che Wu added the comment:

FYI, the new spec, cns11643-2004, can be previewed by searching from 
http://www.cnsonline.com.tw/, at 
http://www.cnsonline.com.tw/preview/preview.jsp?
general_no=1164300language=Cpagecount=524.
Page 499 mentions the URL http://www.cnscode.org.tw/, and the 
version 3 mapping table can be found at 
http://www.cnscode.org.tw/cnscode/csic_ucs.jsp

--
nosy: +kcwu




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-11 Thread Martin v. Löwis

Martin v. Löwis added the comment:

BTW, which version of CNS11643 does that implement? AFAICT, there is CNS
11643-1986 and CNS 11643-1992. Where did you get the Unicode mapping from?




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-11 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

Some background information: http://www.cns11643.gov.tw/eng/word.jsp

The most recent version appears to be: CNS11643-2004, sometimes also
called CNS11643 version 3 or CNS11643-3
(http://docs.hp.com/en/5991-7974/5991-7974.pdf).

Here's the table for version 1 (1986):
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT

Versions 1 and 2 (1992) are also included in the official Unicode Han
character database (along with several other mappings):
http://www.unicode.org/charts/unihan.html

I couldn't find a reference to a version 3 mapping table.




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-11 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

How often would this character set be needed?

In any case, using a (pre)compiler switch is not a good idea. Please add
support to enable/disable the support via a configure switch.

--
nosy: +lemburg




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-11 Thread Hye-Shik Chang

Changes by Hye-Shik Chang:


--
title: Adding new CNS11643 support, a *huge* charset, in cjkcodecs -> Adding 
new CNS11643, a *huge* charset, support in cjkcodecs




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-11 Thread Hye-Shik Chang

Hye-Shik Chang added the comment:

I've generated the mapping table from ICU's CNS11643-1992 mapping.
I see that CNS11643 is quite rarely used on the internet, but it's the
only national standard character set in Taiwan.  When I asked Taiwanese
Python users, even they didn't think that it's necessary to add it to
Python.  I'll study how much compression is possible and how efficient
it is, then submit a revised patch.

Thank you for comments!




[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-11 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc added the comment:

In this case let's put the cjkcodecs modules in their own
DLL(s) on win32.

--
nosy: +amaury.forgeotdarc
