Hi folks, as illustrated in faster-cpython#150 [1], we have implemented a 
mechanism that persists a subset of Python data types with mmap, which 
reduces package import time by caching code objects. It can be seen as a more 
eager pyc format: both serve the same purpose, but our approach tries to avoid 
[de]serialization. As a result, we get a speedup of ~15% in overall Python 
startup.
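
For comparison, here is a minimal sketch of the [de]serialization step that a regular pyc goes through, using the standard `marshal` module (a pyc is a short header followed by a marshalled code object). The `greet` function and the `<demo>` filename are only illustrative; this is not pycds code.

```python
import marshal

source = "def greet(name):\n    return 'hello ' + name\n"
code = compile(source, "<demo>", "exec")

# Writing a pyc serializes the code object into bytes.
payload = marshal.dumps(code)

# Importing from a pyc rebuilds every nested object from those bytes.
# This is the per-process work that a memory-mapped archive tries to skip.
restored = marshal.loads(payload)

namespace = {}
exec(restored, namespace)
result = namespace["greet"]("world")
print(result)
```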

Currently, we’ve packaged it as a third-party library and are working on 
open-sourcing it.

Our implementation (unofficially named “pycds”) mainly consists of two 
parts:
- importlib hooks. These implement the mechanism to dump code objects to an 
  archive, plus a `Finder` that loads code objects from mapped memory.
- Dumping and loading (a subset of) Python types with mmap. In this part, we 
  deal with 1) ASLR, by patching `ob_type` fields; 2) hash seed randomization, 
  by supporting only basic types that don’t have a hash-based layout (i.e. 
  dict is not supported); 3) interned strings, by re-interning strings while 
  loading the mmap archive; and so on.
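
To make the importlib side more concrete, here is a rough sketch of a meta path finder that serves ready-made code objects instead of reading pyc files. The names `ARCHIVE`, `ArchiveFinder` and `pycds_demo_mod` are hypothetical, and a plain dict stands in for the mapped archive; the real pycds hooks may look quite different.

```python
import importlib
import importlib.abc
import importlib.util
import sys

# Hypothetical stand-in for the mmap archive: module name -> code object.
ARCHIVE = {
    "pycds_demo_mod": compile("VALUE = 6 * 7\n", "<archive:pycds_demo_mod>", "exec"),
}


class ArchiveFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Finder and loader in one class, serving code objects from ARCHIVE."""

    def find_spec(self, fullname, path=None, target=None):
        if fullname in ARCHIVE:
            return importlib.util.spec_from_loader(fullname, self)
        return None  # let the regular finders handle everything else

    def create_module(self, spec):
        return None  # use the default module creation

    def exec_module(self, module):
        # No unmarshalling here: the code object is already usable.
        exec(ARCHIVE[module.__name__], module.__dict__)


# Put the finder first so archived modules win over on-disk ones.
sys.meta_path.insert(0, ArchiveFinder())

mod = importlib.import_module("pycds_demo_mod")
print(mod.VALUE)
```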

Once pycds is installed, the complete workflow has three steps:
1. Record the names of imported packages to heap.lst:
   `PYCDSMODE=TRACE PYCDSLIST=heap.lst python run.py`
2. Dump a memory archive of the code objects of the imported packages (this 
   step does not involve the Python script):
   `PYCDSMODE=DUMP PYCDSLIST=heap.lst PYCDSARCHIVE=heap.img python`
3. Run other Python processes with the created archive:
   `PYCDSMODE=SHARE PYCDSARCHIVE=heap.img python run.py`
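
The same three steps could be scripted roughly as follows. Only the environment variable names and values come from the commands above; the helper names `pycds_env`, `STEPS` and `run_workflow` are hypothetical, and passing `stdin=subprocess.DEVNULL` is our assumption to avoid waiting at an interactive prompt. The sketch assumes pycds is installed and `run.py` exists, so `run_workflow()` is defined but not called here.

```python
import os
import subprocess
import sys


def pycds_env(mode, **variables):
    """Return a copy of the current environment with PYCDS* variables set."""
    env = dict(os.environ)
    env["PYCDSMODE"] = mode
    env.update(variables)
    return env


# (mode, extra environment variables, interpreter arguments)
STEPS = [
    ("TRACE", {"PYCDSLIST": "heap.lst"}, ["run.py"]),
    ("DUMP", {"PYCDSLIST": "heap.lst", "PYCDSARCHIVE": "heap.img"}, []),
    ("SHARE", {"PYCDSARCHIVE": "heap.img"}, ["run.py"]),
]


def run_workflow(python=sys.executable):
    """Run trace, dump and share in order, stopping at the first failure."""
    for mode, variables, args in STEPS:
        subprocess.run(
            [python, *args],
            env=pycds_env(mode, **variables),
            stdin=subprocess.DEVNULL,  # assumption, see the note above
            check=True,
        )
```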

We could even make use of immortal objects if PEP 683 [2] were accepted, which 
could give CDS further performance improvements. Currently, any archived object 
is virtually immortal: we increment the refcount of each object copied to the 
archive by 1 so that it is never deallocated. However, without changes to 
CPython, the refcount fields of archived objects are still updated, which 
costs extra memory footprint due to CoW.
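
A small illustration of why the refcount matters here: merely taking new references to an object writes to its header, which is what dirties a privately mapped page and triggers the copy. This sketch only uses `sys.getrefcount` on an ordinary object; it does not touch a real archive.

```python
import sys

obj = object()  # an ordinary, non-immortal object

before = sys.getrefcount(obj)

# Three new references: each one increments the refcount stored in the
# object's header, i.e. a write to the memory holding the object.
holders = [obj, obj, obj]

after = sys.getrefcount(obj)
print(after - before)
```

With immortal objects the interpreter would skip these writes, so the pages backing the archive could stay shared between processes.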

More background and implementation details can be found at [1].
We think this could be an effective way to improve Python’s startup 
performance, and it could even do more, such as sharing large data between 
Python instances.
Suggestions and questions are welcome.

Best,
Yichen Yan
Alibaba Compiler Group

[1] “Faster startup -- Share code objects from memory-mapped file”, 
https://github.com/faster-cpython/ideas/discussions/150
[2] PEP 683: "Immortal Objects, Using a Fixed Refcount" (draft), 
https://mail.python.org/archives/list/python-...@python.org/message/TPLEYDCXFQ4AMTW6F6OQFINSIFYBRFCR/
