New submission from Catalin Gabriel Manciu:

Hi All,
This is Catalin from the Server Scripting Languages Optimization Team at Intel Corporation. I would like to submit a patch that replaces the 'malloc' allocator used by the list object (Objects/listobject.c) with the small object allocator (obmalloc.c) and simplifies the 'list_resize' function by removing a redundant check and properly handling resizing to zero.

Replacing PyMem_* calls with PyObject_* calls inside the list implementation is beneficial because many of the PyMem_* calls request sizes that are better handled by the small object allocator. For example, when running Tools/pybench.py -w 1, the list implementation makes a total of 48,295,840 allocation requests (either by using 'PyMem_MALLOC' directly or by calling 'PyMem_RESIZE'), of which 42,581,993 (88%) request sizes that the small object allocator can handle (512 bytes or less).

The changes to 'list_resize' further improve performance by removing a redundant check and handling the 'resize to zero' case separately. The 'empty' state of a list, as established by the 'PyList_New' function, has the 'ob_item' pointer set to NULL and the 'ob_size' and 'allocated' members equal to 0. Previously, when called with a size of zero, 'list_resize' would set 'ob_size' and 'allocated' to zero, but it would also call 'PyMem_RESIZE', which, by design, calls 'realloc' with a size of 1. This allocated an unnecessary 1 byte and stored the newly obtained address in 'ob_item'. The proposed implementation instead frees the buffer pointed to by 'ob_item' and sets 'ob_size', 'allocated' and 'ob_item' to zero when it receives a 'resize to zero' request.
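The 'resize to zero' handling described above can be illustrated with a small standalone sketch. This is not the submitted patch: 'toy_list' and 'toy_list_resize' are hypothetical names, the standard C 'realloc' and 'free' stand in for the CPython allocator macros, and the over-allocation logic of the real 'list_resize' is left out. It only shows the idea of returning to the empty state instead of reallocating a minimal block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the list object; only the three fields
 * discussed in the text are modelled. */
typedef struct {
    void **ob_item;    /* element buffer, NULL when the list is empty */
    size_t ob_size;    /* number of elements in use */
    size_t allocated;  /* number of slots available in ob_item */
} toy_list;

/* Resize the buffer to exactly newsize slots (no over-allocation).
 * Returns 0 on success and -1 on failure; on failure the list is
 * left unchanged. Newly added slots are not initialized. */
static int toy_list_resize(toy_list *self, size_t newsize)
{
    void **items;

    assert(self != NULL);

    if (newsize == 0) {
        /* Resize to zero: release the buffer and restore the empty
         * state rather than asking the allocator for a tiny block. */
        free(self->ob_item);
        self->ob_item = NULL;
        self->ob_size = 0;
        self->allocated = 0;
        return 0;
    }

    /* Guard against overflow in the byte count. */
    if (newsize > SIZE_MAX / sizeof(void *)) {
        return -1;
    }

    /* realloc(NULL, n) behaves like malloc(n), so this also covers
     * growing from the empty state. */
    items = realloc(self->ob_item, newsize * sizeof(void *));
    if (items == NULL) {
        return -1;
    }
    self->ob_item = items;
    self->ob_size = newsize;
    self->allocated = newsize;
    return 0;
}
```

Starting from an empty 'toy_list' (all three members zero), resizing to 4 and then back to 0 should leave the struct in the same empty state it started in, with 'ob_item' equal to NULL and no leftover allocation.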
Hardware and OS Configuration
=============================
Hardware:           Intel XEON (Haswell-EP) 36 Cores /
                    Intel XEON (Broadwell-EP) 36 Cores

BIOS settings:      Intel Turbo Boost Technology: false
                    Hyper-Threading: false

OS:                 Ubuntu 14.04.2 LTS

OS configuration:   Address Space Layout Randomization (ASLR) disabled to
                    reduce run to run variation by
                    echo 0 > /proc/sys/kernel/randomize_va_space
                    CPU frequency set fixed at 2.3GHz

GCC version:        GCC version 5.1.0

Benchmark:          Grand Unified Python Benchmark from
                    https://hg.python.org/benchmarks/

Measurements and Results
========================
A. Repository:
    GUPB Benchmark:
        hg id                            : 9923b81a1d34 tip
        hg --debug id -i                 : 9923b81a1d346891f179f57f8780f86dcf5cf3b9

    CPython3:
        hg id                            : 733a902ac816 tip
        hg id -r 'ancestors(.) and tag()': 737efcadf5a6 (3.4) v3.4.4
        hg --debug id -i                 : 733a902ac816bd5b7b88884867ae1939844ba2c5

    CPython2:
        hg id                            : 5715a6d9ff12 (2.7)
        hg id -r 'ancestors(.) and tag()': 6d1b6a68f775 (2.7) v2.7.11
        hg --debug id -i                 : 5715a6d9ff12053e81f7ad75268ac059b079b351

B. Results:
CPython2 and CPython3 sample results, measured on a Haswell and a Broadwell platform, can be viewed in Tables 1, 2, 3 and 4. The first column (Benchmark) is the benchmark name and the second (%D) is the speedup in percent compared with the unpatched version.

Table 1. CPython 3 results on Intel XEON (Haswell-EP) @ 2.3 GHz

Benchmark                       %D
----------------------------------
unpickle_list                20.27
regex_effbot                  6.07
fannkuch                      5.87
mako_v2                       5.19
meteor_contest                4.31
simple_logging                3.98
nqueens                       3.40
json_dump_v2                  3.14
fastpickle                    2.16
django_v3                     2.03
tornado_http                  1.90
pathlib                       1.84
fastunpickle                  1.81
call_simple                   1.75
nbody                         1.60
etree_process                 1.58
go                            1.54
call_method_unknown           1.53
2to3                          1.26
telco                         1.04
etree_generate                1.02
json_load                     0.85
etree_parse                   0.81
call_method_slots             0.73
etree_iterparse               0.68
call_method                   0.65
normal_startup                0.63
silent_logging                0.56
chameleon_v2                  0.56
pickle_list                   0.52
regex_compile                 0.50
hexiom2                       0.47
pidigits                      0.39
startup_nosite                0.17
pickle_dict                   0.00
unpack_sequence               0.00
formatted_logging            -0.06
raytrace                     -0.06
float                        -0.18
richards                     -0.37
spectral_norm                -0.51
chaos                        -0.65
regex_v8                     -0.72

Table 2. CPython 3 results on Intel XEON (Broadwell-EP) @ 2.3 GHz

Benchmark                       %D
----------------------------------
unpickle_list                15.75
nqueens                       5.24
mako_v2                       5.17
unpack_sequence               4.44
fannkuch                      4.42
nbody                         3.25
meteor_contest                2.86
regex_effbot                  2.45
json_dump_v2                  2.44
django_v3                     2.26
call_simple                   2.09
tornado_http                  1.74
regex_compile                 1.40
regex_v8                      1.16
spectral_norm                 0.89
2to3                          0.76
chameleon_v2                  0.70
telco                         0.70
normal_startup                0.64
etree_generate                0.61
etree_process                 0.55
hexiom2                       0.51
json_load                     0.51
call_method_slots             0.48
formatted_logging             0.33
call_method                   0.28
startup_nosite               -0.02
fastunpickle                 -0.02
pidigits                     -0.20
etree_parse                  -0.23
etree_iterparse              -0.27
richards                     -0.30
silent_logging               -0.36
pickle_list                  -0.42
simple_logging               -0.82
float                        -0.91
pathlib                      -0.99
go                           -1.16
raytrace                     -1.16
chaos                        -1.26
fastpickle                   -1.72
call_method_unknown          -2.94
pickle_dict                  -4.73

Table 3. CPython 2 results on Intel XEON (Haswell-EP) @ 2.3 GHz

Benchmark                       %D
----------------------------------
unpickle_list                15.89
json_load                    11.53
fannkuch                      7.90
mako_v2                       7.01
meteor_contest                4.21
nqueens                       3.81
fastunpickle                  3.56
django_v3                     2.91
call_simple                   2.72
call_method_slots             2.45
slowpickle                    2.23
call_method                   2.21
html5lib_warmup               1.90
chaos                         1.89
html5lib                      1.81
regex_v8                      1.81
tornado_http                  1.66
2to3                          1.56
json_dump_v2                  1.49
nbody                         1.38
rietveld                      1.26
formatted_logging             1.12
regex_compile                 0.99
spambayes                     0.92
pickle_list                   0.87
normal_startup                0.82
pybench                       0.74
slowunpickle                  0.71
raytrace                      0.67
startup_nosite                0.59
float                         0.47
hexiom2                       0.46
slowspitfire                  0.46
pidigits                      0.44
etree_process                 0.44
etree_generate                0.37
go                            0.27
telco                         0.24
regex_effbot                  0.12
etree_iterparse               0.06
bzr_startup                   0.04
richards                      0.03
etree_parse                   0.00
unpack_sequence               0.00
call_method_unknown          -0.26
pathlib                      -0.57
fastpickle                   -0.64
silent_logging               -0.94
simple_logging               -1.10
chameleon_v2                 -1.25
pickle_dict                  -1.67
spectral_norm                -3.25

Table 4. CPython 2 results on Intel XEON (Broadwell-EP) @ 2.3 GHz

Benchmark                       %D
----------------------------------
unpickle_list                15.44
json_load                    11.11
fannkuch                      7.55
meteor_contest                5.51
mako_v2                       4.94
nqueens                       3.49
html5lib_warmup               3.15
html5lib                      2.78
call_simple                   2.35
silent_logging                2.33
json_dump_v2                  2.14
startup_nosite                2.09
bzr_startup                   1.93
fastunpickle                  1.93
slowspitfire                  1.91
regex_v8                      1.79
rietveld                      1.74
pybench                       1.59
nbody                         1.57
regex_compile                 1.56
pathlib                       1.51
tornado_http                  1.33
normal_startup                1.21
2to3                          1.14
chaos                         1.00
spambayes                     0.85
etree_process                 0.73
pickle_list                   0.70
float                         0.69
hexiom2                       0.51
slowpickle                    0.44
call_method_unknown           0.42
slowunpickle                  0.37
pickle_dict                   0.25
etree_parse                   0.20
go                            0.19
django_v3                     0.12
call_method_slots             0.12
spectral_norm                 0.05
call_method                   0.01
unpack_sequence               0.00
raytrace                     -0.08
pidigits                     -0.11
richards                     -0.16
etree_generate               -0.23
regex_effbot                 -0.26
telco                        -0.28
simple_logging               -0.32
etree_iterparse              -0.38
formatted_logging            -0.50
fastpickle                   -1.08
chameleon_v2                 -1.74

----------
components: Interpreter Core
files: listobject_CPython3.patch
keywords: patch
messages: 260459
nosy: catalin.manciu
priority: normal
severity: normal
status: open
title: List object memory allocator
type: performance
versions: Python 2.7, Python 3.6
Added file: http://bugs.python.org/file41953/listobject_CPython3.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26382>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com