On Wed, Apr 11, 2018 at 02:21:17PM +1000, Chris Angelico wrote:

[...]
> > Yes, it will double the number of files. Actually quadruple it, if the
> > annotations and line numbers are in separate files too. But if most of
> > those extra files never need to be opened, then there's no cost to them.
> > And whatever extra cost there is, is amortized over the lifetime of the
> > interpreter.
>
> Yes, if they are actually not needed. My question was about whether
> that is truly valid.

We're never really going to know the effect on performance without
implementing and benchmarking the code. It might turn out that, to our
surprise, three quarters of the std lib relies on loading docstrings
during startup. But I doubt it.

> Consider a very common use-case: an OS-provided
> Python interpreter whose files are all owned by 'root'. Those will be
> distributed with .pyc files for performance, but you don't want to
> deprive the users of help() and anything else that needs docstrings
> etc. So... are the docstrings lazily loaded or eagerly loaded?

What relevance is it that they're owned by root?

> If eagerly, you've doubled the number of file-open calls to initialize
> the interpreter.

I do not understand why you think this is even an option. Has Serhiy
said something that I missed that makes this seem to be on the table?
That's not a rhetorical question -- I may have missed something. But I'm
sure he understands that doubling or quadrupling the number of file
operations during startup is not an optimization.

> (Or quadrupled, if you need annotations and line
> numbers and they're all separate.) If lazily, things are a lot more
> complicated than the original description suggested, and there'd need
> to be some semantic changes here.

What semantic change do you expect?

There's an implementation change, of course, but that's Serhiy's problem
to deal with and I'm sure that he has considered that. There should be
no semantic change. When you access obj.__doc__, then and only then are
the compiled docstrings for that module read from the disk.
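
As a rough illustration of that "read on first access" behaviour, here is
a toy sketch in pure Python. It is not the proposed implementation: the
names LazyDocs and LazyDoc, the sidecar file "example.docstrings", and the
use of a descriptor to stand in for the __doc__ slot are all assumptions
made up for this example.

```python
import marshal
import os
import tempfile


class LazyDocs:
    """Hypothetical per-module docstring table, read from disk on first use."""

    def __init__(self, path):
        self.path = path
        self._table = None  # not initialised until the first lookup
        self.loads = 0      # counts disk reads (for illustration only)

    def get(self, name):
        if self._table is None:
            # The one and only file open, deferred until a docstring is wanted.
            with open(self.path, "rb") as f:
                self._table = marshal.load(f)
            self.loads += 1
        return self._table.get(name)


class LazyDoc:
    """Descriptor standing in for a lazily resolved __doc__ slot."""

    def __init__(self, docs, name):
        self.docs = docs
        self.name = name

    def __get__(self, obj, objtype=None):
        return self.docs.get(self.name)


# Simulate the compiler writing docstrings to a separate (assumed) file.
docs_path = os.path.join(tempfile.mkdtemp(), "example.docstrings")
with open(docs_path, "wb") as f:
    marshal.dump({"Spam": "Spam does spammy things."}, f)

docs = LazyDocs(docs_path)


class Spam:
    __doc__ = LazyDoc(docs, "Spam")


loads_before = docs.loads   # nothing has been read at "import" time
first = Spam().__doc__      # first access triggers the disk read
second = Spam().__doc__     # later accesses reuse the in-memory table
```

In this sketch, code that never touches __doc__ never opens the extra
file, and code that does pays for one read per module no matter how many
docstrings it then looks up.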

I don't know the current implementation of .pyc files, but I like
Antoine's suggestion of laying it out in four separate areas (plus
header), each one marshalled:

    code
    docstrings
    annotations
    line numbers

Aside from code, which is mandatory, the three other sections could be
None to represent "not available", as is the case when you pass -OO to
the interpreter, or they could be some other sentinel that means "load
lazily from the appropriate file", or they could be the marshalled data
directly in place to support byte-code only libraries.

As for the in-memory data structures of objects themselves, I imagine
something like the __doc__ and __annotations__ slots pointing to a table
of strings, which is not initialised until you attempt to read from the
table. Or something -- don't pay too much attention to my wild guesses.

The bottom line is, is there some reason *aside from performance* to
avoid this? Because if the performance is worse, I'm sure Serhiy will be
the first to dump this idea.


-- 
Steve