[issue23050] Add Japanese legacy encodings

2014-12-15 Thread STINNER Victor

STINNER Victor added the comment:

> These character encodings are legacy, but are still used.

Do you have an idea of how many users still have documents stored or exchanged
using these encodings? The patch is not trivial: the legacy Japanese codecs are
complex and therefore error-prone :-/

For previous requests to add new codecs, we closed the issues as wontfix and
suggested sharing the codecs on the Python Cheeseshop (PyPI). Here it's more
complex because C code is modified to implement the new encodings.

$ diffstat issue23050_13417.diff
 Doc/library/codecs.rst                   |   16
 Lib/encodings/aliases.py                 |   26
 Lib/test/test_codecencodings_iso2022.py  |   59 +
 Lib/test/test_codecs.py                  |    2
 Lib/test/test_multibytecodec.py          |    6
 Lib/test/test_xml_etree.py               |    4
 Modules/cjkcodecs/_codecs_iso2022.c      |  718 ++-
 Modules/cjkcodecs/_codecs_jp.c           |  305 +
 Modules/cjkcodecs/mappings_jp.h          |  950 ++-
 Modules/cjkcodecs/multibytecodec.h       |   11
 Python/importlib.h                       |  860 ++--
 b/Lib/encodings/cp50220.py               |   39 +
 b/Lib/encodings/cp50221.py               |   39 +
 b/Lib/encodings/cp50222.py               |   39 +
 b/Lib/encodings/cp51932.py               |   39 +
 b/Lib/encodings/eucjp_ms.py              |   39 +
 b/Lib/encodings/iso2022_jp_ms.py         |   39 +
 b/Lib/test/cjkencodings/cp50220-utf8.txt |   30
 b/Lib/test/cjkencodings/cp50220.txt      |   30
 b/Modules/cjkcodecs/mappings_cp50220_k.h |   31 +
 20 files changed, 2452 insertions(+), 830 deletions(-)
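
For context on the PyPI route mentioned above: a third-party package can add a
codec without touching CPython by registering a search function with
codecs.register(). The following is only a minimal sketch. The codec name
x_demo_jp is hypothetical, and it simply reuses the existing cp932 codec as a
stand-in for a real implementation.

```python
import codecs

# Hypothetical codec name, used only for this illustration.
DEMO_NAME = "x_demo_jp"


def _search(name):
    """Codec search function: return a CodecInfo for our name, else None."""
    # Lookup names arrive lower-cased; newer Python versions also turn
    # hyphens into underscores, so normalize before comparing.
    if name.replace("-", "_") != DEMO_NAME:
        return None
    base = codecs.lookup("cp932")  # stand-in for a real pure-Python codec
    return codecs.CodecInfo(
        name=DEMO_NAME,
        encode=base.encode,
        decode=base.decode,
        incrementalencoder=base.incrementalencoder,
        incrementaldecoder=base.incrementaldecoder,
        streamreader=base.streamreader,
        streamwriter=base.streamwriter,
    )


codecs.register(_search)

text = "日本語"
encoded = text.encode(DEMO_NAME)       # resolved through _search
roundtrip = encoded.decode(DEMO_NAME)
```

A codec that needs new C-level mapping tables, as this patch does, cannot be
shipped quite this simply, which is the complication raised above.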

--
nosy: +haypo

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23050
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23050] Add Japanese legacy encodings

2014-12-15 Thread Tetsuya Morimoto

Tetsuya Morimoto added the comment:

> These character encodings are legacy, but are still used.

> Do you have an idea of how many users still have documents stored or
> exchanged using these encodings?

Hmm, I guess the iso-2022-jp codec is still the default charset of MUAs (Mail
User Agents) on the Japanese Windows platform. But I'm not sure how many
users that is, so I'll investigate. Please wait a few days.

> The patch is not trivial, the legacy Japanese codecs are complex and so error
> prone :-/

Yes, this patch includes some refactoring. However, the existing tests
pass, and adding new codecs basically shouldn't affect the other
codecs. Why do you think it's error-prone?

> For previous requests to add new codecs, we closed issues as wontfix and we
> suggested to share the codecs at the Python Cheeseshop (PyPI). Here it's more
> complex because C code is modified to implement the new encodings.

Could you point me to the previous requests? I understand that modified
C code costs more to review. However, we have codec tests, and I think
it wouldn't affect the other codecs.

--




[issue23050] Add Japanese legacy encodings

2014-12-15 Thread STINNER Victor

STINNER Victor added the comment:

I refactored some parts of the CJK codecs for performance after PEP 393
was implemented. A blocking point was that these codecs had very few tests:
not for valid data, but for invalid data. It may be a little bit better now. I
tried to write a test for each path in if/else, to test all cases, in the
codecs that I modified.
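
To illustrate what a test for invalid data looks like (a small sketch, not
taken from the CPython test suite), the helper below checks whether a byte
string decodes under strict error handling. The function name and the sample
byte strings are chosen here for illustration only.

```python
def decodes_cleanly(data, encoding):
    """Return True if data decodes under strict error handling."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


valid = "日本語".encode("euc_jp")   # three two-byte characters
invalid = b"\xff\xff"               # not a valid EUC-JP byte pair
truncated = valid[:-1]              # last character cut in half

for label, data in [("valid", valid), ("invalid", invalid),
                    ("truncated", truncated)]:
    print(label, decodes_cleanly(data, "euc_jp"))

# With errors="replace" the decoder substitutes U+FFFD instead of raising.
replaced = invalid.decode("euc_jp", errors="replace")
```

Each branch of a decoder that rejects input deserves at least one such case,
since valid round trips alone never reach those branches.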

By error-prone, I mean that it's easy to introduce a bug or a regression,
since the code is complex and almost nobody maintains it.

I'm not strongly opposed to any change. I'm just trying to understand the
context.

--




[issue23050] Add Japanese legacy encodings

2014-12-15 Thread Martin v . Löwis

Martin v. Löwis added the comment:

Another traditional issue with Japanese codecs is that people have different 
opinions on what the encoding should do. It may be that when we release the 
codec, somebody comes up and says that the codec is incorrect, and it should do 
something different for some code points, citing some other applications which 
he considers right. In particular for the Microsoft ones, people may claim that 
some version of Windows did things differently.
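
A well-known case of this kind of disagreement (picked here as an
illustration, it is not mentioned in the thread) is the byte pair 0x81 0x60,
which the shift_jis and cp932 codecs already shipped with Python map to
different code points. A short sketch that compares the two:

```python
import unicodedata


def decode_variants(data, encodings):
    """Decode the same bytes with several codecs and collect the results."""
    return {enc: data.decode(enc) for enc in encodings}


results = decode_variants(b"\x81\x60", ["shift_jis", "cp932"])
for enc, ch in results.items():
    print(enc, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

With CPython's bundled codecs, shift_jis yields U+301C WAVE DASH and cp932
yields U+FF5E FULLWIDTH TILDE. Both choices have their advocates, which is
the kind of dispute described above.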

Now, for this set, the ones that got registered with IANA sound ok (in the 
sense that it is our bug if they fail to conform to the IANA spec, and IANA's 
fault if they fail to do what users expect). For the other ones, I wonder 
whether there is some official source that can be consulted for correctness.

On a different note: why do you claim that the code is written by Perky? (it's 
not you, is it?)

--




[issue23050] Add Japanese legacy encodings

2014-12-15 Thread Tetsuya Morimoto

Tetsuya Morimoto added the comment:

> By error-prone, I mean that it's easy to introduce a bug or a regression,
> since the code is complex and almost nobody maintains it.

Indeed. Actually, I ran into some faults when I migrated the original
patch. Character encoding is a specialty area. This patch was written
by Masayuki Moriyama, who is an expert in character encoding and has
contributed to various communities for a long time. He is also helping
me migrate the original patch (for Python 2.4.3) to Python 3.5. You can
see in the commit log that he fixed some bugs.
https://bitbucket.org/t2y/cpython/commits/all

> I'm not strongly opposed to any change. I'm just trying to understand the
> context.

Thanks. I'll help by explaining the context.

--




[issue23050] Add Japanese legacy encodings

2014-12-15 Thread Tetsuya Morimoto

Tetsuya Morimoto added the comment:

> Another traditional issue with Japanese codecs is that people have different
> opinions on what the encoding should do. It may be that when we release the
> codec, somebody comes up and says that the codec is incorrect, and it should
> do something different for some code points, citing some other applications
> which he considers right. In particular for the Microsoft ones, people may
> claim that some version of Windows did things differently.

As for e-mail encoding, Japanese users should use utf-8, which would
resolve most problems. However, for historical and compatibility
reasons, that is still not the case today. I don't think these legacy
codecs are needed within an individual application, but we sometimes
run into encoding issues when an application works with an external
system such as e-mail.

> Now, for this set, the ones that got registered with IANA sound ok (in the
> sense that it is our bug if they fail to conform to the IANA spec, and IANA's
> fault if they fail to do what users expect). For the other ones, I wonder
> whether there is some official source that can be consulted for correctness.

Exactly. I'm currently looking for euc-jp-ms and iso-2022-jp-ms specs
in English. There are, of course, volunteer-written documents in
Japanese:
http://www.wdic.org/w/WDIC/eucJP-ms
http://www.wdic.org/w/WDIC/ISO-2022-JP-MS

I could agree to dropping any character encoding for which an official
source is difficult to find.

> On a different note: why do you claim that the code is written by Perky?
> (it's not you, is it?)

Right! Because the credit belongs to him. I'm an assistant.

--




[issue23050] Add Japanese legacy encodings

2014-12-14 Thread Tetsuya Morimoto

New submission from Tetsuya Morimoto:

This patch adds the Japanese legacy encodings listed below.
https://bitbucket.org/t2y/cpython/branches/compare/japanese-legacy-encoding..default

* eucjp_ms (euc-jp compatible with cp932)
* iso2022_jp_ms (yet another iso-2022-jp compatible with cp932, similar to 
cp50220)
* cp50220 (http://www.iana.org/assignments/charset-reg/CP50220)
* cp50221 (a variant of cp50220)
* cp50222 (a variant of cp50220)
* cp51932 (http://www.iana.org/assignments/charset-reg/CP51932)
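
As a quick way to see which of these names a given interpreter already knows,
the sketch below probes them with codecs.lookup(). The helper name is made up
for this example. The results depend on the Python version and platform, and
on whether a patch like this one has been applied.

```python
import codecs

# Codec names taken from the list above.
CANDIDATES = ["eucjp_ms", "iso2022_jp_ms", "cp50220",
              "cp50221", "cp50222", "cp51932"]


def available(names):
    """Map each codec name to True if codecs.lookup() can resolve it."""
    found = {}
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            found[name] = False
        else:
            found[name] = True
    return found


for name, ok in available(CANDIDATES).items():
    print(f"{name:15} {'available' if ok else 'missing'}")
```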

Originally, the patch for these character encodings was created as the result
of an IPA project in 2005 by Masayuki Moriyama. The result was contributed to
several communities: libiconv, glibc, perl, PHP, Ruby, PostgreSQL, MySQL, nkf.
He made a patch for Python 2.4.3 at that time, but somehow no one worked on
integrating it. That's a crying shame.

These character encodings are legacy, but are still used. Many end users
don't care about character encoding. Unfortunately, for historical reasons,
e-mails are encoded with these legacy encodings on the Japanese Windows
platform. In fact, a customer of mine recently reported mojibake because their
e-mail data was apparently encoded with cp50220 (iso-2022-jp-ms).
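
One way to see the gap between the standard codec and the Windows-derived
ones is to test which characters each can encode. This is a sketch under an
assumption: U+2460 CIRCLED DIGIT ONE is picked here as a typical vendor
extension character from cp932, and the report above does not say which
characters were actually involved.

```python
def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


sample = "\u2460"  # CIRCLED DIGIT ONE, a vendor extension character in cp932

for enc in ("cp932", "iso2022_jp"):
    print(enc, can_encode(sample, enc))
```

The cp932 codec accepts the character while the strict iso2022_jp codec
rejects it. Text containing such characters therefore cannot round-trip
through the standard iso-2022-jp codec.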

References:

* About IPA: http://www.ipa.go.jp/english/about/summary.html
* Mojibake: http://en.wikipedia.org/wiki/Mojibake
* Java encoding names: 
http://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html

References in Japanese:

* Japanese Legacy Encoding Project: http://legacy-encoding.sourceforge.jp/wiki/
* Project details: 
http://www.ipa.go.jp/about/jigyoseika/05fy-pro/open/2005-1467d.pdf

--
components: Library (Lib)
files: add-japanese-legacy-encoding1.patch
hgrepos: 285
keywords: patch
messages: 232638
nosy: ishimoto, naoki, t2y
priority: normal
severity: normal
status: open
title: Add Japanese legacy encodings
type: enhancement
versions: Python 3.5
Added file: http://bugs.python.org/file37447/add-japanese-legacy-encoding1.patch




[issue23050] Add Japanese legacy encodings

2014-12-14 Thread R. David Murray

R. David Murray added the comment:

In emails these are labeled as, say, iso-2022-jp-ms?

See also issue 8898 with regards to email encodings.

--
nosy: +r.david.murray




[issue23050] Add Japanese legacy encodings

2014-12-14 Thread Tetsuya Morimoto

Tetsuya Morimoto added the comment:

On Mon, Dec 15, 2014 at 1:04 AM, R. David Murray rep...@bugs.python.org wrote:
> In emails these are labeled as, say, iso-2022-jp-ms?

No. These are labeled just 'iso-2022-jp', and we (Japanese users)
choose the proper charset to decode the encoded text. You can see that
there are several variants of iso-2022-jp. Yes, that's very strange,
but it is so for historical reasons.

http://en.wikipedia.org/wiki/ISO/IEC_2022#ISO.2FIEC_2022_character_sets
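
Choosing a decoder for text that is only labeled 'iso-2022-jp' can be sketched
as trying candidate codecs in order. The helper name and the candidate order
below are assumptions for illustration. iso2022_jp_ms and cp50220 exist only
if a patch like this one is applied, so unknown codec names are skipped.

```python
def decode_with_fallback(data, candidates, last_resort="iso2022_jp"):
    """Return (text, codec_name) for the first candidate that decodes data."""
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # Wrong variant, or codec not present in this interpreter.
            continue
    # Nothing decoded cleanly: keep what we can and mark the rest.
    return data.decode(last_resort, errors="replace"), last_resort


# Strict standard codec first, then progressively more permissive variants.
CANDIDATES = ["iso2022_jp", "iso2022_jp_ext", "iso2022_jp_ms", "cp50220"]

payload = "こんにちは".encode("iso2022_jp")
text, used = decode_with_fallback(payload, CANDIDATES)
print(used, text)
```

Trying the strict codec first keeps conforming mail on the standard path and
reaches for the vendor variants only when strict decoding fails.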

> See also issue 8898 with regards to email encodings.

Therefore, this is a different issue.

--




[issue23050] Add Japanese legacy encodings

2014-12-14 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
nosy: +lemburg, loewis, serhiy.storchaka
stage:  -> patch review
