[issue17694] Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation

2013-04-17 Thread STINNER Victor

STINNER Victor added the comment:

The commit changes the default value of min_length when overallocation is
enabled: the buffer no longer gets at least 100 characters by default. That
change did not directly introduce a bug; the regression comes from 7ed9993d53b4
(use _PyUnicodeWriter for Unicode decoders). The following commits should fix
these issues.

changeset:   83435:94d1c3bdb79c
tag:         tip
user:        Victor Stinner
date:        Thu Apr 18 00:25:28 2013 +0200
files:       Objects/unicodeobject.c
description:
Fix bug in Unicode decoders related to _PyUnicodeWriter

Bug introduced by changesets 7ed9993d53b4 and edf029fc9591.


changeset:   83434:7eb52460c999
user:        Victor Stinner
date:        Wed Apr 17 23:58:16 2013 +0200
files:       Objects/unicodeobject.c
description:
Fix typo in unicode_decode_call_errorhandler_writer()

Bug introduced by changeset 7ed9993d53b4.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17694] Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation

2013-04-17 Thread Roundup Robot

Roundup Robot added the comment:

New changeset edf029fc9591 by Victor Stinner in branch 'default':
Close #17694: Add minimum length to _PyUnicodeWriter
http://hg.python.org/cpython/rev/edf029fc9591

--
nosy: +python-dev
resolution:  -> fixed
stage:  -> committed/rejected
status: open -> closed




[issue17694] Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation

2013-04-13 Thread STINNER Victor

STINNER Victor added the comment:

PyUnicode_DecodeCharmap() still uses _PyUnicodeWriter_Prepare() (even with my
patch). It may be interesting to benchmark min_length against
_PyUnicodeWriter_Prepare(). If min_length is not slower, it should be used
instead, to avoid widening the buffer when the first written character is
non-ASCII, for example b'\x80'.decode('cp1252').
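
To illustrate the case mentioned above: byte 0x80 in cp1252 decodes to U+20AC
(the euro sign), which is outside both ASCII and Latin-1. A minimal sketch,
assuming CPython 3.3 or newer (PEP 393 string storage); bytes_per_char() is a
hypothetical helper written for this example, not part of any API:

```python
import sys


def bytes_per_char(ch):
    """Estimate CPython's storage width for strings made of `ch`.

    Hypothetical helper: it compares the sizes of two strings that differ
    by one character. The result is specific to CPython (PEP 393).
    """
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)


decoded = b"\x80".decode("cp1252")

print(hex(ord(decoded)))        # code point of the decoded character
print(bytes_per_char("a"))      # width used for pure-ASCII text
print(bytes_per_char(decoded))  # width needed once this character appears
```

A buffer that was prepared for ASCII before the first write has to be widened
as soon as such a character is written.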

--




[issue17694] Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation

2013-04-13 Thread STINNER Victor

STINNER Victor added the comment:

PyUnicode_DecodeUnicodeEscape() should set writer.min_length instead of using 
_PyUnicodeWriter_Prepare(), but the following assertion fails (because 
writer.size is zero by default):

assert(writer.pos < writer.size || (writer.pos == writer.size && c == '\n'));

I don't understand this assertion, so I don't know how to modify it.

--




[issue17694] Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation

2013-04-13 Thread STINNER Victor

STINNER Victor added the comment:

I don't see how issue17694.patch can speed up Python, because min_length is
zero when overallocation is disabled. The difference may be noise in the
benchmark script.
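
One way to judge whether a difference of a few percent is noise is to repeat
the measurement and look at the spread between runs. A minimal sketch using the
standard timeit module; the workload is a stand-in chosen for this example, not
the attached benchmark.py:

```python
import timeit

# Stand-in workload (assumption): decode a cp1252 buffer whose first
# character is non-ASCII.
data = b"\x80abc" * 1000


def measure(repeat=5, number=200):
    """Return the best time and the relative spread over `repeat` runs."""
    timings = timeit.repeat(lambda: data.decode("cp1252"),
                            repeat=repeat, number=number)
    best = min(timings)
    spread = (max(timings) - best) / best
    return best, spread


best, spread = measure()
print("best: %.6f s, spread between runs: %.1f%%" % (best, spread * 100))
```

If the spread between runs is as large as the difference between the patched
and unpatched builds, the difference cannot be attributed to the patch.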

--




[issue17694] Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation

2013-04-13 Thread STINNER Victor

STINNER Victor added the comment:

The attached patch changes the _PyUnicodeWriter_Init() API: it now takes only
one argument (the writer). Minimum length and overallocation must be configured
using attributes. The problem with the old API was that it was not possible to
configure minimum length and overallocation separately.

Disable overallocation in CJK decoders: only set the minimum length.

Other changes:

 * Add a min_char (minimum character) attribute to _PyUnicodeWriter. It is
currently unused. Using _PyUnicodeWriter_Prepare(writer, 0, min_char) is
different because it allocates the buffer immediately, and calling
_PyUnicodeWriter_Prepare() with size=0 is not supported (it does not widen the
buffer if maxchar is bigger).
 * unicode_decode_call_errorhandler_writer() only enables overallocation if the
replacement string is longer than 1 character.
 * PyUnicode_DecodeRawUnicodeEscape() and _PyUnicode_DecodeUnicodeInternal()
set the minimum length instead of preallocating the whole buffer. This avoids
the need to widen the buffer if the first written character is the biggest
character. It also avoids a useless memory allocation if the decoder fails
before the first write.
 * _PyUnicode_DecodeUnicodeInternal() checks for integer overflow when
computing the minimum length.
 * _PyUnicodeWriter_Update() is now responsible for setting size to zero if
readonly is set.

The goal is to delay the first allocation until the first real write, so that
the maximum character and the buffer size can be chosen correctly. If the
buffer is allocated before the first write, even the first write may have to
widen and/or enlarge the buffer.
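
The effect of delaying the first allocation can be sketched with a toy model in
Python. ToyWriter and its counters are hypothetical and only mimic the idea
(size, maximum character, min_length, overallocate); the 25% growth factor is an
arbitrary choice for the example, and none of this is the real C
implementation:

```python
class ToyWriter:
    """Toy model of a lazily allocated text buffer (not CPython's code)."""

    def __init__(self):
        self.min_length = 0        # minimum buffer size, in characters
        self.overallocate = False  # reserve extra room when allocating
        self.size = 0              # allocated characters (0: no buffer yet)
        self.maxchar = 0           # largest code point the buffer can hold
        self.pos = 0               # characters written so far
        self.allocations = 0       # counts allocations and enlargements
        self.widenings = 0         # counts changes of character width

    def _grow(self, needed):
        newsize = max(needed, self.min_length)
        if self.overallocate:
            newsize += newsize // 4  # illustrative factor (assumption)
        self.size = newsize
        self.allocations += 1

    def prepare(self, length, maxchar):
        """Make room for `length` more characters up to `maxchar`."""
        needed = self.pos + length
        if self.size == 0:
            # First allocation: size and width are chosen in one step.
            self._grow(needed)
            self.maxchar = maxchar
            return
        if needed > self.size:
            self._grow(needed)
        if maxchar > self.maxchar:
            self.maxchar = maxchar
            self.widenings += 1

    def write(self, text):
        if text:
            self.prepare(len(text), max(map(ord, text)))
            self.pos += len(text)


# Eager: the buffer is prepared for ASCII before anything is written.
eager = ToyWriter()
eager.prepare(10, 127)
eager.write("\u20acabc")

# Lazy: only the minimum length is set; allocation waits for the write.
lazy = ToyWriter()
lazy.min_length = 10
lazy.write("\u20acabc")

print(eager.allocations, eager.widenings)
print(lazy.allocations, lazy.widenings)
```

In this model both writers allocate once, but only the eager one has to widen
its buffer, because it committed to a character width before seeing the first
character.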

--
Added file: http://bugs.python.org/file29840/writer_minlen.patch




[issue17694] Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation

2013-04-13 Thread Vladimir Korolev

Vladimir Korolev added the comment:

For some reason I can't figure out how to attach multiple files, so here is the
benchmark module.

--
Added file: http://bugs.python.org/file29833/benchmark.py




[issue17694] Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation

2013-04-13 Thread Vladimir Korolev

Vladimir Korolev added the comment:

We triaged this issue at the CPython hackathon in Boston. Here is a patch for
the issue.

We only tested on Mac OS X 10.8.3, which has a zoned allocator, so the memory
profile is essentially the same with or without this patch. The running time
seems to be slightly better with the patch. The benchmark we used runs for
about 5.6 seconds with the patch vs. 5.9 seconds without the patch. We ran the
benchmark multiple times and the results seem to be consistent.

Here are the results of the benchmarking and memory profile testing:

      With Fix                     Without Fix

Mem   1535 nodes (6377296 bytes)   1535 nodes (6378144 bytes)
Time  5.68 sec                     5.9 sec



The memory profile was measured with the Mac OS X 'heap' command. The timings
come from the attached benchmark module. The original benchmark module was
taken from http://bugs.python.org/file25558/benchmark.py and modified to test
this issue.

--
keywords: +patch
nosy: +vladistan
Added file: http://bugs.python.org/file29832/issue17694.patch




[issue17694] Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation

2013-04-13 Thread Vladimir Korolev

Vladimir Korolev added the comment:

I'd like to note that the actual patch was written by Adam.Duston 
http://bugs.python.org/user17706

I just verified the results, measured the time/memory performance, and
submitted the patch.

--




[issue17694] Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation

2013-04-10 Thread STINNER Victor

New submission from STINNER Victor:

The _PyUnicodeWriter API is used in many functions to create Unicode strings,
especially decoders. Performance is not optimal: it is not possible to specify
the minimum length of the buffer if overallocation is disabled. Fixing this may
help #17693, for example.

--
messages: 186537
nosy: haypo, serhiy.storchaka
priority: normal
severity: normal
status: open
title: Enhance _PyUnicodeWriter API to control minimum buffer length without 
overallocation
versions: Python 3.4
