New submission from STINNER Victor:

The case_operation() function in Objects/unicodeobject.c implements the case 
operations: lower, upper, casefold, etc.

Currently, the function uses a buffer of Py_UCS4 and overallocates the buffer 
by 300%. The buffer is sized for the worst case: each character replaced with 
3 characters.
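
The worst-case factor of 3 can be checked from Python. The snippet below is 
only an illustration using str.upper(); the sample characters are chosen here 
and are not taken from the patch.

```python
# Number of output code points produced by upper() for one input character.
def expansion(ch: str) -> int:
    return len(ch.upper())

samples = {
    "a": expansion("a"),            # plain ASCII letter, maps one-to-one
    "\u00df": expansion("\u00df"),  # U+00DF sharp s -> 'SS'
    "\ufb03": expansion("\ufb03"),  # U+FB03 ffi ligature -> 'FFI'
    "\u0390": expansion("\u0390"),  # U+0390 -> U+0399 U+0308 U+0301
}
worst = max(samples.values())
print(samples, worst)
```

Only a few characters reach the factor of 3, which is why sizing every buffer 
for that case wastes memory on ordinary input.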

I propose to use the _PyUnicodeWriter API to optimize the most common case: 
each character is replaced by exactly one other character, and the output 
string uses the same unicode kind (UCS1, UCS2 or UCS4) as the input.
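
As a rough model of the idea, the pure-Python sketch below preallocates one 
output slot per input character and only enlarges the buffer when a mapping 
expands. SketchWriter and upper_with_writer are hypothetical names made up for 
this illustration; they do not reproduce the _PyUnicodeWriter C API or the 
patch, and the growth policy (25% slack) is an assumption.

```python
class SketchWriter:
    """Toy stand-in for a preallocated output buffer (not the real C API)."""

    def __init__(self, size: int) -> None:
        self.buf = [""] * size  # exactly one slot per input character
        self.pos = 0
        self.grows = 0          # how many times the buffer was enlarged

    def write(self, chars: str) -> None:
        needed = self.pos + len(chars)
        if needed > len(self.buf):
            # Slow path: enlarge, with some slack for later expansions.
            extra = needed - len(self.buf)
            self.buf.extend([""] * (extra + len(self.buf) // 4))
            self.grows += 1
        for c in chars:
            self.buf[self.pos] = c
            self.pos += 1

    def finish(self) -> str:
        return "".join(self.buf[: self.pos])


def upper_with_writer(text: str):
    """Uppercase text and report how often the buffer had to grow."""
    writer = SketchWriter(len(text))
    for ch in text:
        writer.write(ch.upper())
    return writer.finish(), writer.grows


print(upper_with_writer("hello"))        # common case: no growth
print(upper_with_writer("stra\u00dfe"))  # sharp s expands to 'SS'
```

In the common case the buffer is never enlarged, so the cost of handling 
expansion is paid only by strings that actually contain expanding characters.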

The patch preallocates the writer using the kind of the input string, but in 
some cases the result uses a lower kind (e.g. latin1 => ASCII). "Special" 
characters from the unit tests that take the slow path:

- test_capitalize: 'ﬁnnish' => 'FInnish' (ascii)
- test_casefold: 'ß' => 'ss', 'ﬁ' => 'fi'
- test_swapcase: 'ﬁ' => 'FI', 'ß' => 'SS'
- test_title: 'ﬁNNISH' => 'Finnish'
- test_upper: 'ﬁ' => 'FI', 'ß' => 'SS'
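
The change of kind in these cases can be observed from Python. The kind() 
helper below is a hypothetical approximation written for this note: it 
classifies a string by its largest code point, which mirrors how the storage 
width is chosen but is not how CPython exposes it.

```python
def kind(s: str) -> str:
    """Approximate storage class of a string from its largest code point."""
    top = max(map(ord, s), default=0)
    if top < 0x80:
        return "ascii"
    if top < 0x100:
        return "latin1"
    if top < 0x10000:
        return "ucs2"
    return "ucs4"


cases = [
    ("\u00df", str.casefold),  # sharp s -> 'ss'
    ("\ufb01", str.upper),     # fi ligature (U+FB01) -> 'FI'
]
narrowing = [(kind(s), kind(op(s))) for s, op in cases]
print(narrowing)
```

In both cases a writer preallocated with the input kind ends up holding a 
result that fits in a narrower kind.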

The writer only enables overallocation when a character is replaced by more 
than one character. Bad cases from the unit tests where the length changes:

- test_capitalize: 'ῳῳῼῼ' => 'ΩΙῳῳῳ', 'hİ' => 'Hi̇', 'ῒİ' => 'Ϊ̀i̇',
  'ﬁnnish' => 'FInnish'
- test_casefold: 'ß' => 'ss', 'ﬁ' => 'fi'
- test_lower: 'İ' => 'i̇'
- test_swapcase: 'ﬁ' => 'FI', 'İ' => 'i̇', 'ß' => 'SS', 'ῒ' => 'Ϊ̀'
- test_title: 'ﬁNNISH' => 'Finnish'
- test_upper: 'ﬁ' => 'FI', 'ß' => 'SS', 'ῒ' => 'Ϊ̀'
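
To get a feeling for how rare the length-changing cases are, the sketch below 
scans the Basic Multilingual Plane for characters whose mapping is not exactly 
one character long. The expanding() helper is a hypothetical name, and the 
exact counts depend on the Unicode version of the interpreter, so none are 
asserted here.

```python
def expanding(op, limit: int = 0x10000):
    """Characters below `limit` whose case mapping is not one character long."""
    return [chr(cp) for cp in range(limit) if len(op(chr(cp))) != 1]


upper_cases = expanding(str.upper)
lower_cases = expanding(str.lower)
print(len(upper_cases), len(lower_cases))
```

Both lists are small compared to the 65536 code points scanned, which supports 
treating expansion as the exceptional path.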

----------
files: case_writer.patch
keywords: patch
messages: 229497
nosy: haypo
priority: normal
severity: normal
status: open
title: Use _PyUnicodeWriter in case_operation()
type: performance
versions: Python 3.5
Added file: http://bugs.python.org/file36942/case_writer.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue22649>
_______________________________________