[issue1037] Ill-coded identifier crashes python when coding spec is utf-8

2007-08-27 Thread Hye-Shik Chang

New submission from Hye-Shik Chang:

An illegal identifier crashes Python when the source code is declared as UTF-8.

Python 3.0x (py3k:57555M, Aug 27 2007, 21:23:47) 
[GCC 3.4.6 [FreeBSD] 20060305] on freebsd6
>>> compile(b'#coding:utf-8\n\xfc', '', 'exec')
zsh: segmentation fault (core dumped)  ./python

The problem is that tokenizer.c:verify_identifier doesn't check the
return value of PyUnicode_DecodeUTF8, even though the input may
contain invalid UTF-8 sequences.
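
As a hedged illustration of the expected behavior (not the tokenizer fix itself): on an interpreter where the decoding failure is handled, the same call should raise an exception instead of crashing. The helper name `compile_fails_cleanly` is made up for this sketch, and the exact exception type is an assumption, so both SyntaxError and ValueError (the base class of UnicodeDecodeError) are caught.

```python
def compile_fails_cleanly(source: bytes) -> bool:
    """Return True if compile() rejects the source by raising an exception."""
    try:
        compile(source, "<test>", "exec")
    except (SyntaxError, ValueError):
        # ValueError also covers UnicodeDecodeError.
        return True
    return False


# The byte string from the report: a UTF-8 coding line, then a lone 0xFC byte.
BAD_SOURCE = b"#coding:utf-8\n\xfc"
print(compile_fails_cleanly(BAD_SOURCE))
```

A valid source such as b"x = 1\n" makes the helper return False, which gives a quick sanity check that the helper is not simply rejecting everything.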

--
components: Unicode
keywords: py3k
messages: 55335
nosy: hyeshik.chang
priority: high
severity: normal
status: open
title: Ill-coded identifier crashes python when coding spec is utf-8
type: crash
versions: Python 3.0

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1037>
__
___
Python-bugs-list mailing list 
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue2066] Adding new CNS11643 support, a *huge* charset, in cjkcodecs

2008-02-11 Thread Hye-Shik Chang

New submission from Hye-Shik Chang:

This patch adds CNS11643 support to Python's Unicode codecs.
CNS11643 is a huge character set used in EUC-TW and ISO-2022-CN.
CJKCodecs has supported CNS11643 for at least 4 years, but I dropped
it when integrating into Python because of its huge size.
EUC-TW and ISO-2022-CN aren't widely used, although they are
still regarded as major encodings.

In my patch, CNS11643 charset support can be disabled by
adding -DNO_CNS11643 to CFLAGS for light platforms. The mapping source
code for the charset is 900K, and it adds about 350K to
_codecs_tw.so (on POSIX) or python26.dll (on Win32).

What do you think about adding this code?

--
components: Unicode
files: cns11643-r1.diff.gz
messages: 62282
nosy: hyeshik.chang
priority: low
severity: normal
status: open
title: Adding new CNS11643 support, a *huge* charset, in cjkcodecs
versions: Python 2.6, Python 3.0
Added file: http://bugs.python.org/file9408/cns11643-r1.diff.gz

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2066>
__



[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-11 Thread Hye-Shik Chang

Changes by Hye-Shik Chang:


--
title: Adding new CNS11643 support, a *huge* charset, in cjkcodecs -> Adding 
new CNS11643, a *huge* charset, support in cjkcodecs

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2066>
__



[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-11 Thread Hye-Shik Chang

Hye-Shik Chang added the comment:

I've generated the mapping table from ICU's CNS11643-1992 mapping.
I see that CNS11643 is quite rarely used on the internet, but it's the
only national standard character set in Taiwan.  When I asked Taiwanese
Python users, even they didn't think it needed to be added to
Python.  I'll study how much compression is possible and how efficient
it is, then submit a revised patch.

Thank you for comments!

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2066>
__



[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-14 Thread Hye-Shik Chang

Hye-Shik Chang added the comment:

I have generated compressed mapping tables in several ways.

I extracted the mapping data into individual files and reorganized
them by translating them into Python source code or archiving them
into a zip file.

The following table shows the result: (in kilobytes)
(also available at
http://spreadsheets.google.com/pub?key=pWRBaY2ZM7mRgddF0Itd2IA )

               none  minimal  MSjk  MSall  current
Text              0      207   312    342      570
Data            904      696   592    562      333

raw-py         3006     2392  2016   1932      996
zip-py          720      496   416    384      304

raw-pyc         952      734   624    590      346
zip-pyc         560      384   336    304      240
Text+zip-pyc    560      591   648    646      810

raw-both       3954     3124  2638   2520     1340
zip-both       1248      864   736    672      512

zip-bare        560      384   336    304      240
tarbz2-bare     496      352   320    304      240

Columns represent which mapping files are separated into external
files.  In "none", no mapping is left as static const C data, while
in the "current" column only the new cns11643 mappings are extracted.
The "minimal" set keeps the major character set for each country in
static C data and moves the others out.  "MSjk" additionally keeps some
MS codepages of Japan and Korea, and "MSall" keeps all MS
codepage extensions in static const C data.  We may fix the list
of character sets that remain as C data, or let users pick the sets
with a configure option.

"Text" is the portion that remains in static const C data, which is
where all the current mapping tables live.  As discussed when CJKCodecs
was integrated into Python, it can be shared across processes in a
system and is efficient, but it can't easily be compressed or
reorganized by users for redistribution.  "Data" is the externally
managed mapping tables.

The "raw-py" row shows the total volume of the mapping tables as Python
source code.  "raw-pyc" shows the compiled (pyc) version of the mapping
tables.  "zip-py" and "zip-pyc" are zip-compressed archives of
"raw-py" and "raw-pyc", respectively.  Those can be imported
using Python's zipimport machinery.
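
A minimal sketch of how a mapping table stored in a zip archive could be imported, assuming the table is an ordinary Python module. The module name `demo_cns_table`, the helper `import_table_from_zip`, and the two-entry table are hypothetical, and the archive holds source (the "zip-py" variant) rather than pyc files to keep the example short.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def import_table_from_zip(module_name: str, source: str):
    """Write `source` into a fresh zip archive and import it from there."""
    archive = os.path.join(tempfile.mkdtemp(), "tables.zip")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(module_name + ".py", source)
    # A zip archive on sys.path is served by the zipimport machinery.
    sys.path.insert(0, archive)
    try:
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
    finally:
        sys.path.remove(archive)


# Hypothetical two-entry table; a real mapping would hold thousands of entries.
TABLE_SOURCE = "DECODE_MAP = {0xA1A1: 0x3000, 0xA1A2: 0xFF0C}\n"
demo = import_table_from_zip("demo_cns_table", TABLE_SOURCE)
print(hex(demo.DECODE_MAP[0xA1A1]))
```

The imported module stays usable after the archive is removed from sys.path, because it is already cached in sys.modules.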

"zip-bare" and "tarbz2-bare" show the volume of the archived raw mapping
table files, as their names suggest.

We have 560KB of mapping tables in the Python CJKCodecs part.
If we choose "zip-pyc" of the "minimal" set, the binary distribution
will be just as big as before even if we include the CNS11643 character
set, and pythonXY.dll will get smaller by 363KB.
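
The figures in this paragraph can be cross-checked against the table with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and reading the 363KB saving as the difference between the "current" and "minimal" Text values is an interpretation of the table, not something the message states explicitly.

```python
COLUMNS = ["none", "minimal", "MSjk", "MSall", "current"]
TEXT = [0, 207, 312, 342, 570]       # static C data left in the binary (KB)
ZIP_PYC = [560, 384, 336, 304, 240]  # zip-compressed .pyc tables (KB)

# Reproduce the "Text+zip-pyc" row: what ships in total for each layout.
combined = {name: t + z for name, t, z in zip(COLUMNS, TEXT, ZIP_PYC)}

# Static data removed from the DLL when moving from "current" to "minimal".
dll_saving = TEXT[COLUMNS.index("current")] - TEXT[COLUMNS.index("minimal")]

for name in COLUMNS:
    print(f"{name:8} {combined[name]:4d} KB")
print("DLL shrinks by", dll_saving, "KB")
```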

What do you think about this scheme?
Any other ideas for compression?

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2066>
__



[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2008-02-14 Thread Hye-Shik Chang

Hye-Shik Chang added the comment:

I couldn't find an appropriate method to implement an in situ
compressed mapping table.  AFAIK, Python has the smallest
mapping table footprint per charset among major open
source transcoding programs.  I have thought about
compression many times, but every neat method required
a severe performance sacrifice.

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2066>
__



[issue1276] LookupError: unknown encoding: X-MAC-JAPANESE

2008-02-24 Thread Hye-Shik Chang

Hye-Shik Chang added the comment:

I'll take this.

--
assignee: lemburg -> hyeshik.chang
nosy: +hyeshik.chang

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1276>
__



[issue1276] LookupError: unknown encoding: X-MAC-JAPANESE

2008-06-26 Thread Hye-Shik Chang

Hye-Shik Chang <[EMAIL PROTECTED]> added the comment:

Added a patch that implements codecs for CJK Macintosh encodings.
I tried to implement them just like the other existing CJK codecs,
but that required many inefficient mapping tables due to their odd
mappings (like this: u'ABCDE' <-> 'ab' AND u'ABCD' <-> 'ac'!).

So, I decided to implement a general extension codec wrapper that
can be easily modified by dictionaries given by Python code.
Because all Mac CJK encodings have codecs that implement their base
encodings, I just put their differences in Python codec code.
The extension mechanism may be reused in customized codecs for
in-house applications or legacy encoding support.
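
A minimal sketch of the general idea, not the code in the attached patch: a dictionary of overrides is consulted first, longest key first, and everything else falls through to the base codec. The names `encode_with_overrides` and `OVERRIDES` and the byte values are made up for illustration, and the sketch assumes a stateless base encoding so characters can be encoded one at a time.

```python
def encode_with_overrides(text: str, base_encoding: str, overrides: dict) -> bytes:
    """Encode text, preferring the longest matching key in `overrides`."""
    max_len = max((len(key) for key in overrides), default=0)
    out = bytearray()
    i = 0
    while i < len(text):
        # Try the longest candidate first so 'ABCDE' wins over 'ABCD'.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in overrides:
                out += overrides[chunk]
                i += n
                break
        else:
            # No override matched: fall back to the base codec.
            out += text[i].encode(base_encoding)
            i += 1
    return bytes(out)


# Hypothetical override table with made-up byte sequences.
OVERRIDES = {"ABCDE": b"\xfd\xa1", "ABCD": b"\xfd\xa2"}
print(encode_with_overrides("ABCDE", "ascii", OVERRIDES))
print(encode_with_overrides("ABCDX", "ascii", OVERRIDES))
```

Keeping the differences in a plain dictionary means the base codec's C tables stay untouched, which matches the motivation given above.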

The first patch was generated for 2.6 trunk.  I'm working on porting
it to 3.0.

Added file: http://bugs.python.org/file10743/maccjkcodecs-1.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1276>
___



[issue1276] LookupError: unknown encoding: X-MAC-JAPANESE

2008-06-26 Thread Hye-Shik Chang

Changes by Hye-Shik Chang <[EMAIL PROTECTED]>:


Added file: http://bugs.python.org/file10749/maccjkcodecs-1-py3k.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1276>
___



[issue1276] LookupError: unknown encoding: X-MAC-JAPANESE

2008-08-19 Thread Hye-Shik Chang

Changes by Hye-Shik Chang <[EMAIL PROTECTED]>:


Added file: http://bugs.python.org/file11170/cjkmactemporary.diff

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1276>
___



[issue1276] LookupError: unknown encoding: X-MAC-JAPANESE

2008-08-23 Thread Hye-Shik Chang

Hye-Shik Chang <[EMAIL PROTECTED]> added the comment:

Committed patch "cjkmactemporary.diff" as r65988 in the py3k branch.
I'll open another issue for cjkcodecs implementation of Mac codecs.

--
resolution:  -> fixed
status: open -> closed

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1276>
___



[issue3685] Crash while compiling Python 3000 in OpenBSD 4.4

2008-08-26 Thread Hye-Shik Chang

Hye-Shik Chang <[EMAIL PROTECTED]> added the comment:

This problem is due to a bug in OpenBSD's libc.
It was fixed 3 days ago. (http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/wcschr.c#rev1.4)
We can work around it by replacing wcschr(ws, L'\0') with
ws + wcslen(ws).

--
nosy: +hyeshik.chang

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3685>
___



[issue3594] PyTokenizer_FindEncoding() never succeeds

2008-09-03 Thread Hye-Shik Chang

Hye-Shik Chang <[EMAIL PROTECTED]> added the comment:

pitrou, that's because Python source code can't be correctly tokenized
when it's encoded in a few odd encodings like iso-2022 or shift-jis, which
use \, (, ) and " as the second byte of a two-byte character sequence.

For example, '\x81\\' is HORIZONTAL BAR in shift-jis, so

exec('print "\x81\\"')

fails, because the closing " is escaped by the second byte of '\x81\\'.
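
The byte-level overlap can be checked directly. This is a small sketch using only the standard shift_jis codec and unicodedata; the variable names are illustrative.

```python
import unicodedata

raw = b"\x81\\"                     # the two bytes from the example
char = raw.decode("shift_jis")      # one character encoded as two bytes

print(unicodedata.name(char))
print(hex(raw[1]), hex(ord("\\")))  # second byte equals the ASCII backslash

# A byte-oriented scanner that treats 0x5C as an escape sees \" at the end
# of this literal and never finds the closing quote.
literal = b'"' + raw + b'"'
print(literal.endswith(b'\\"'))
```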

--
nosy: +hyeshik.chang

___
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3594>
___



[issue2066] Adding new CNS11643, a *huge* charset, support in cjkcodecs

2009-03-17 Thread Hye-Shik Chang

Hye-Shik Chang  added the comment:

When I asked Taiwanese developers how often they use these character
sets, it appeared that they are almost useless in the usual computing
environment in Taiwan.  This would only serve historical
compatibility and literal standards compliance.  I'm quite neutral about
adding this to Python without any user request from Taiwan (I'm from
South Korea :), but I can finish committing it with pleasure if you are
still fond of the codec.

--

___
Python tracker 
<http://bugs.python.org/issue2066>
___



[issue5640] Wrong print() result when unicode error handler is not 'strict'

2009-04-01 Thread Hye-Shik Chang

Hye-Shik Chang  added the comment:

Right.
Here is a patch that fixes the reported problem in cjkcodecs.
Please test whether the patch corrects the behavior.

--
keywords: +patch
Added file: http://bugs.python.org/file13572/cjkcodecs-fix-statefulenc.diff

___
Python tracker 
<http://bugs.python.org/issue5640>
___



[issue5640] Wrong print() result when unicode error handler is not 'strict'

2009-04-01 Thread Hye-Shik Chang

Hye-Shik Chang  added the comment:

Sorry. I just found that the fix breaks a few other unit tests.
I'll check.

--

___
Python tracker 
<http://bugs.python.org/issue5640>
___