I've been playing a bit with this trying to collect some data and measure how useful this would be. You can take a look at the script I'm using at: https://github.com/dmoisset/pycstats
What I'm measuring is:

1. Number of objects in the pyc, and how many of those are:
   * docstrings (I'm using a heuristic here which I'm not 100% sure is
     correct)
   * lnotabs
   * Duplicate objects; these have not been discussed in this thread
     before, but they are another source of optimization I noticed while
     writing this. Essentially I'm referring to immutable constants that
     are instantiated more than once and could be shared. You can also
     measure the effect of this optimization across modules and within a
     single module[1]
2. Bytes used in memory by the categories above (sum of sys.getsizeof()
   for each category).

I'm not measuring anything related to annotations because, as I mentioned
before, they are generated piecemeal by executable bytecode so they are
hard to separate.

Running this on my python 3.6 pyc cache I get:

$ find /usr/lib/python3.6 -name '*.pyc' |xargs python3.6 pycstats.py
8645 docstrings, 1705441B
19060 lineno tables, 941702B
59382/202898 duplicate objects for 3101287/18582807 memory size

So this means ~10% of the memory used after loading is used for
docstrings, ~5% for lnotabs, and ~15% for objects that could be shared.
The sharing figure assumes we can share between modules, but even sharing
only within each module, you can get to ~7%.

In short, this could mean a 25%-35% reduction in memory use for code
objects if the stdlib is a good benchmark.

Best,
   D.

[1] Regarding duplicates, I've found some unexpected things within loaded
code objects, for example instances of the small integer "1" with a
different id() than the singleton that CPython normally uses for "1",
although most duplicates are some small strings, tuples with argument
names, or . Something that could be interesting to write is a "pyc
optimizer" that removes duplicates; this should be a gain at a minimal
preprocessing cost.
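As a rough illustration of the measurement described above (this is not the actual pycstats script, just a minimal sketch), the code below walks a code object and everything nested in its constants, sums sys.getsizeof() over the line tables, and counts constants that are equal to an earlier constant of the same type but live in a separate object. The helper names (iter_code_objects, line_table, duplicate_stats, measure) are made up for the example. It also assumes that comparing (type, value) is a good enough notion of "duplicate", and it compiles a source string instead of reading a .pyc file. Docstrings are left out because detecting them is a heuristic that depends on the Python version.

```python
import sys
import types

SOURCE = '''
def f(a, b):
    return a + b

def g(a, b):
    return a * b
'''


def iter_code_objects(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


def line_table(code):
    """Return the raw line number table of a code object."""
    # co_linetable replaced co_lnotab in Python 3.10; older versions
    # (including the 3.6 used in the thread) only have co_lnotab.
    if hasattr(code, "co_linetable"):
        return code.co_linetable
    return code.co_lnotab


def duplicate_stats(objects):
    """Return (count, bytes) for objects that equal an earlier object of
    the same type but are a distinct instance (assumed definition)."""
    objects = list(objects)  # keep everything alive so id() stays unique
    ids_by_value = {}
    count = size = 0
    for obj in objects:
        ids = ids_by_value.setdefault((type(obj), obj), set())
        if ids and id(obj) not in ids:
            count += 1
            size += sys.getsizeof(obj)
        ids.add(id(obj))
    return count, size


def measure(code):
    """Collect line table and duplicate-constant statistics for a module."""
    codes = list(iter_code_objects(code))
    consts = [c for co in codes for c in co.co_consts
              if not isinstance(c, types.CodeType)]
    dup_count, dup_bytes = duplicate_stats(consts)
    return {
        "code_objects": len(codes),
        "line_table_bytes": sum(sys.getsizeof(line_table(co)) for co in codes),
        "constants": len(consts),
        "duplicate_constants": dup_count,
        "duplicate_bytes": dup_bytes,
    }


if __name__ == "__main__":
    print(measure(compile(SOURCE, "<example>", "exec")))
```

To run the same thing on a real .pyc you would load the code object with marshal after skipping the file header, whose size depends on the Python version. Recent compilers already merge many equal constants, so the duplicate counts from a fresh compile() may be much lower than the 3.6 figures quoted above.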
On 12 April 2018 at 15:16, Daniel Moisset <dmois...@machinalis.com> wrote:

> One implementation difficulty specifically related to annotations is that
> they are quite hard to find/extract from the code objects. Both docstrings
> and lnotab are within specific fields of the code object for their
> function/class/module; annotations are spread as individual constants
> (assuming PEP 563), which are loaded in bytecode through separate
> LOAD_CONST instructions before creating the function object, and that can
> happen in the middle of bytecode for the higher level object (the module or
> class containing a function definition). So the change for achieving that
> will be more significant than just "add a couple of descriptors to function
> objects and change the module marshalling code".
>
> Probably making annotations fit a single structure that can live in
> co_consts could make this change easier, and also make startup of annotated
> modules faster (because you just load a single constant instead of one per
> argument); this might be a valuable change by itself.
>
> On 12 April 2018 at 11:48, INADA Naoki <songofaca...@gmail.com> wrote:
>
>> > Finally, loading docstrings and other optional components can be made
>> > lazy. This was not in my original idea, and this will significantly
>> > complicate the implementation, but in principle it is possible. This
>> > will require larger changes in the marshal format and bytecode.
>>
>> I'm +1 on this idea.
>>
>> * New pyc format has code section (same as current) and text section.
>>   text section stores UTF-8 strings and is not loaded at import time.
>> * Function annotation (only when PEP 563 is used) and docstring are
>>   stored as integer, pointing to an offset in the text section.
>> * When type.__doc__, PyFunction.__doc__, PyFunction.__annotations__ are
>>   integer, text is loaded from the text section lazily.
>>
>> PEP 563 will reduce some startup time, but __annotations__ is still
>> dict.
>> Memory overhead is not negligible.
>>
>> In [1]: def foo(a: int, b: int) -> int:
>>    ...:     return a + b
>>    ...:
>>
>> In [2]: import sys
>> In [3]: sys.getsizeof(foo)
>> Out[3]: 136
>>
>> In [4]: sys.getsizeof(foo.__annotations__)
>> Out[4]: 240
>>
>> When PEP 563 is used, there are no side effects while building the
>> annotation. So the annotation can be serialized in text, like
>> {"a":"int","b":"int","return":"int"}.
>>
>> This change will require new pyc format, and descriptor for
>> PyFunction.__doc__, PyFunction.__annotations__
>> and type.__doc__.
>>
>> Regards,
>>
>> --
>> INADA Naoki <songofaca...@gmail.com>
>> _______________________________________________
>> Python-ideas mailing list
>> Python-ideas@python.org
>> https://mail.python.org/mailman/listinfo/python-ideas
>> Code of Conduct: http://python.org/psf/codeofconduct/
>
> --
> Daniel F. Moisset - UK Country Manager - Machinalis Limited
> www.machinalis.co.uk <http://www.machinalis.com>
> Skype: @dmoisset T: + 44 7398 827139
>
> 1 Fore St, London, EC2Y 9DT
>
> Machinalis Limited is a company registered in England and Wales.
> Registered number: 10574987.
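The text-section idea quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions: the TextSection class, its add() and load_annotations() methods, and the use of JSON as the text encoding are all hypothetical, and nothing here reflects an actual pyc format. String annotations are written by hand to mimic what PEP 563 would store, so the example does not depend on a __future__ import.

```python
import json
import sys


# String annotations mimic what PEP 563 stores: the unevaluated source text.
def foo(a: "int", b: "int") -> "int":
    return a + b


def serialize_annotations(func):
    """Encode a function's string annotations as compact UTF-8 JSON."""
    text = json.dumps(func.__annotations__, separators=(",", ":"))
    return text.encode("utf-8")


class TextSection:
    """Hypothetical stand-in for the proposed pyc text section."""

    def __init__(self):
        self._data = bytearray()

    def add(self, payload):
        """Append payload and return the (offset, length) that locates it."""
        offset = len(self._data)
        self._data += payload
        return offset, len(payload)

    def load_annotations(self, offset, length):
        """Decode an annotations dict only when it is actually requested."""
        raw = bytes(self._data[offset:offset + length])
        return json.loads(raw.decode("utf-8"))


section = TextSection()
# Only this small (offset, length) pair would need to live on the function.
location = section.add(serialize_annotations(foo))

if __name__ == "__main__":
    print(serialize_annotations(foo).decode("utf-8"))
    print("dict container:", sys.getsizeof(foo.__annotations__), "bytes")
    print("serialized text:", len(serialize_annotations(foo)), "bytes")
    print(section.load_annotations(*location))
```

Note that sys.getsizeof() on the dict counts only the container, not its keys and values, so the two printed sizes are not a precise like-for-like comparison.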