[issue33578] cjkcodecs missing getstate and setstate implementations

2018-11-05 Thread INADA Naoki


Change by INADA Naoki :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33578] cjkcodecs missing getstate and setstate implementations

2018-11-01 Thread INADA Naoki


INADA Naoki  added the comment:


New changeset 488c0a6cdf09e21774e63c2a430ecc0de804d147 by INADA Naoki 
(Christopher Thorne) in branch 'master':
bpo-33578: Fix getstate/setstate for CJK decoder (GH-10290)
https://github.com/python/cpython/commit/488c0a6cdf09e21774e63c2a430ecc0de804d147


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33578] cjkcodecs missing getstate and setstate implementations

2018-11-01 Thread Christopher Thorne


Change by Christopher Thorne :


--
pull_requests: +9600

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33578] cjkcodecs missing getstate and setstate implementations

2018-11-01 Thread miss-islington


miss-islington  added the comment:


New changeset ac22f6aa989f18c33c12615af1c66c73cf75d5e7 by Miss Islington (bot) 
(Christopher Thorne) in branch 'master':
bpo-33578: Add getstate/setstate for CJK codec (GH-6984)
https://github.com/python/cpython/commit/ac22f6aa989f18c33c12615af1c66c73cf75d5e7


--
nosy: +miss-islington

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33578] cjkcodecs missing getstate and setstate implementations

2018-06-11 Thread INADA Naoki


Change by INADA Naoki :


--
type: behavior -> enhancement
versions:  -Python 2.7, Python 3.6, Python 3.7

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33578] cjkcodecs missing getstate and setstate implementations

2018-06-07 Thread Christopher Thorne


Christopher Thorne  added the comment:

Thanks Naoki, that simplifies things a lot.

I've updated the PR with a new test case for ISO-2022-JP and a corresponding 
implementation for the encoder state methods.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33578] cjkcodecs missing getstate and setstate implementations

2018-06-06 Thread INADA Naoki


INADA Naoki  added the comment:

> `MultibyteCodec_State` can occupy 8 bytes, and `pending` can occupy 2 bytes 
> (MAXENCPENDING), we get a total of 10 bytes which I think exceeds what a 
> PyLong can represent.

PyLong is "long integer", aka "big integer", not C's long type.
https://docs.python.org/3.6/c-api/long.html

You can encode 12 bytes into single long object.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33578] cjkcodecs missing getstate and setstate implementations

2018-06-05 Thread Christopher Thorne


Christopher Thorne  added the comment:

Ah, good find. I suppose that means `MultibyteCodec_State` and `pending` are 
both needed to fully capture state, as is done in `decoder.getstate`/`setstate` 
by returning a tuple of both. Unfortunately `encoder.getstate` is defined to 
return an integer, and because `MultibyteCodec_State` can occupy 8 bytes, and 
`pending` can occupy 2 bytes (MAXENCPENDING), we get a total of 10 bytes which 
I think exceeds what a PyLong can represent.

Returning either `pending` or `MultibyteCodec_State` seems infeasible because 
`setstate` will not know how to process it, and both may be needed together.

Some alternatives could be:

1. If we are restricted to returning an integer, perhaps this integer could be 
an index that references a state in a pool of encoder states stored internally 
(effectively a pointer). Managing this state pool seems quite complex.

2. encoder.getstate could be redefined to return a tuple, but obviously this is 
a breaking change. Backwards compatibility could be somewhat preserved by 
allowing setstate to accept either an integer or tuple.

3. Remove `PyObject *pending` from `MultibyteStatefulEncoderContext` and change 
encoders to only use `MultibyteCodec_State`. Not sure how feasible this is.

I think approach 2 would be simplest and matches the decoder interface. 

Does anyone have any opinions or further alternatives?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33578] cjkcodecs missing getstate and setstate implementations

2018-06-05 Thread INADA Naoki

INADA Naoki  added the comment:

ISO-2022-* is "stateful" encoding.  It uses escape sequence to change state.
So keeping only `pending` is not enough.

>>> enc.encode("abcあいう")
b'abc\x1b$B$"$$$&'
>>> enc.getstate()
0
>>> enc.encode("abc")
b'\x1b(Babc'

>>> enc.encode("abcあいう")
b'abc\x1b$B$"$$$&'
>>> enc.getstate()
0
>>> enc.setstate(0)
>>> enc.encode("abc")
b'abc'

I don't know much about other encodings.

--
nosy: +inada.naoki

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33578] cjkcodecs missing getstate and setstate implementations

2018-05-19 Thread Roundup Robot

Change by Roundup Robot :


--
keywords: +patch
pull_requests: +6638
stage:  -> patch review

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33578] cjkcodecs missing getstate and setstate implementations

2018-05-19 Thread Christopher Thorne

New submission from Christopher Thorne :

The cjkcodecs module provides support for various Chinese/Japanese/Korean 
codecs through its MultibyteIncrementalDecoder and MultibyteIncrementalEncoder 
classes.

While working with one of these cjkcodecs (euc-jp), I came across an issue 
where calling TextIOWrapper.tell() was causing a decode error on a valid euc-jp 
file:

>>> with open("/tmp/test", "w", encoding="euc-jp") as f:
... f.write("AB\nうえ\n")
...
6
>>> with open("/tmp/test", "r", encoding="euc-jp") as f:
... f.readline()
... f.tell()
... f.readline()
... f.readline()
...
'AB\n'
4
'うえ\n'
Traceback (most recent call last):
  File "", line 5, in 
UnicodeDecodeError: 'euc_jp' codec can't decode byte 0xa4 in position 0: 
incomplete multibyte sequence

Without the tell(), there is no problem:

>>> with open("/tmp/test", "w", encoding="euc-jp") as f:
... f.write("AB\nうえ\n")
...
6
>>> with open("/tmp/test", "r", encoding="euc-jp") as f:
... f.readline()
... f.readline()
... f.readline()
...
'AB\n'
'うえ\n'
''

After looking at _io_TextIOWrapper_tell_impl in textio.c, I understood that 
tell() is not as trivial as one might expect as it is using a "reconstruction 
algorithm" [1] to account for buffered chunk reads. By default, TextIOWrapper 
reads from its underlying stream in chunks of 8092 bytes and then decodes and 
buffers this data for speed efficiency. As a result, TextIOWrapper.tell() can't 
just call tell() on the underlying stream because the stream's tell() will 
usually be too far ahead. Thus, a reconstruction algorithm is used to calculate 
how many bytes of the buffered chunk have actually been read so far by the user.

The reconstruction algorithm could be trivial for single byte codecs - just 
track the number of characters read so far, e.g. each time read() is called. 
However, for multibyte codecs where the number of bytes representing a 
character is not uniform nor reported by the decoder, the character count alone 
isn't sufficient. What CPython does for this is jump to a "snapshot" point 
where the decoder is in a clean state (i.e. not halfway through a multibyte 
read) and feeds it bytes to generate characters, counting as it goes, until the 
number of characters it tracked from user reads are generated. From this, it 
knows the number of bytes it took to reach the point the user is at and can 
calculate the correct tell() value.

Now this is where the problem comes in: the reconstruction algorithm depends on 
a reliable way to detect when the decoder is in a "clean state". The getstate 
method [2], which returns any pending data the decoder has, is used for this. 
However, in the case of cjkcodecs, the getstate implementation is missing. This 
can be seen in the following example:

>>> import codecs
>>> decoder = codecs.getincrementaldecoder('euc_jp')()
... decoder.decode(b'\xa4\xa6') # Decode a complete input sequence
'う'
>>> decoder.getstate() # There should be no state here (working)
(b'', 0)
>>> decoder.decode(b'\xa4') # Decode first half of a partial input sequence
''
>>> decoder.getstate() # There should be state here (not working)
(b'', 0)

Internally, however, the cjkcodecs do hold state. They just don't expose it.

This leads to unexpected results in the reconstruction algorithm, such as the 
tell() bug demonstrated above.

It appears decoder.setstate [3], encoder.getstate [4], encoder.setstate [5], 
are also not implemented for the cjkcodecs. This leads to other issues in code 
that assumes their implementaton is present (e.g. TextIOWrapper.seek).

I will attach a PR shortly with some tests and proposed implementations. This 
is my first time working with the CPython codebase, so apologies if I've 
overlooked anything.

Finally, here is a somewhat related issue with a mention of the same tell() bug 
at the end: https://bugs.python.org/issue25863
However, the main problem identified in that report requires further changes 
than just adding getstate as it's caused by the more fundamental issue of 
encoder and decoder state not being kept in sync.
(Also, I have added the reporter of that issue to the nosy list for this one as 
I assume they have some familiarity with this bug)

[1] https://docs.python.org/3/library/io.html#id3
[2] 
https://docs.python.org/3/library/codecs.html#codecs.IncrementalDecoder.getstate
[3] 
https://docs.python.org/3/library/codecs.html#codecs.IncrementalDecoder.setstate
[4] 
https://docs.python.org/3/library/codecs.html#codecs.IncrementalEncoder.getstate
[5] 
https://docs.python.org/3/library/codecs.html#codecs.IncrementalEncoder.setstate

--
components: IO, Tests
messages: 317106
nosy: libcthorne, martin.panter
priority: normal
severity: normal
status: open
title: cjkcodecs missing getstate and setstate implementations
type: behavior
versions: Python 2.7, Python 3.6, Python 3.7, Python 3.8

___
Python tracker