New submission from STINNER Victor:

The case_operation() function in Objects/unicodeobject.c is used for case operations: lower, upper, casefold, etc.
Currently, the function uses a Py_UCS4 buffer and overallocates it by 300%, because it assumes the worst case: one character replaced with 3 characters.

I propose to use the _PyUnicodeWriter API to be able to optimize the most common case: each character is replaced by exactly one other character, and the output string uses the same Unicode kind (UCS1, UCS2 or UCS4).

The patch preallocates the writer using the kind of the input string, but in some cases the result uses a lower kind (ex: latin1 => ASCII).

"Special" characters taking the slow path from unit tests:

- test_capitalize: 'ﬁnnish' => 'FInnish' (ascii)
- test_casefold: 'ß' => 'ss', 'ﬁ' => 'fi'
- test_swapcase: 'ﬁ' => 'FI', 'ß' => 'SS'
- test_title: 'ﬁNNISH' => 'Finnish'
- test_upper: 'ﬁ' => 'FI', 'ß' => 'SS'

The writer only uses overallocation if a character is replaced by more than one character.

Bad cases where the length changes:

- test_capitalize: 'ῳῳῼῼ' => 'ΩΙῳῳῳ', 'hİ' => 'Hi̇', 'ῒİ' => 'Ϊ̀i̇', 'ﬁnnish' => 'FInnish'
- test_casefold: 'ß' => 'ss', 'ﬁ' => 'fi'
- test_lower: 'İ' => 'i̇'
- test_swapcase: 'ﬁ' => 'FI', 'İ' => 'i̇', 'ß' => 'SS', 'ῒ' => 'Ϊ̀'
- test_title: 'ﬁNNISH' => 'Finnish'
- test_upper: 'ﬁ' => 'FI', 'ß' => 'SS', 'ῒ' => 'Ϊ̀'

----------
files: case_writer.patch
keywords: patch
messages: 229497
nosy: haypo
priority: normal
severity: normal
status: open
title: Use _PyUnicodeWriter in case_operation()
type: performance
versions: Python 3.5

Added file: http://bugs.python.org/file36942/case_writer.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue22649>
_______________________________________
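
For reference, the length-changing cases in the message can be reproduced from Python. This is only a rough sketch using the public str methods, not the C-level case_operation() or the _PyUnicodeWriter API; the helper name length_change() is made up for illustration. It sticks to upper(), lower() and casefold(), and escapes the non-ASCII inputs so that the code points are unambiguous.

```python
# Illustration only: exercises the public str methods, not the C code
# in Objects/unicodeobject.c. length_change() is a hypothetical helper.

def length_change(text, op):
    """Apply one case operation; return (result, input length, output length)."""
    result = op(text)
    return result, len(text), len(result)

samples = [
    ("abc", str.upper),        # common case: one character in, one out
    ("\u00df", str.upper),     # 'ß' becomes 'SS'
    ("\u00df", str.casefold),  # 'ß' becomes 'ss'
    ("\ufb01", str.upper),     # 'ﬁ' ligature becomes 'FI'
    ("\u0130", str.lower),     # 'İ' becomes 'i' followed by U+0307
    ("\u1fd2", str.upper),     # 'ῒ' becomes three code points
]

for text, op in samples:
    result, before, after = length_change(text, op)
    print(f"{op.__name__}({text!r}) -> {result!r}: length {before} -> {after}")
```

The last sample is an instance of the worst case mentioned above: a single input character expanding to 3 output characters.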