I vote for the C implementation.

On Fri, Feb 13, 2015 at 2:07 AM, Victor Stinner <victor.stin...@gmail.com> wrote:
> Hi,
>
> TL;DR: are you ok to add 800 lines of C code for os.scandir(), 4x
> faster than os.listdir() when the file type is checked?
>
> I accepted PEP 471 (os.scandir) a few months ago, but it is not
> implemented yet in Python 3.5, because I didn't make a choice on the
> implementation.
>
> Ben Hoyt wrote different implementations:
> - full C: os.scandir() and DirEntry are written in C (no change on os.py)
> - C+Python: os._scandir() (wrapper for opendir/readdir and
>   FindFirstFileW/FindNextFileW) in C, DirEntry in Python
> - ctypes: os.scandir() and DirEntry fully implemented in Python
>
> I'm not interested in the ctypes implementation. It's useful for a
> third party project hosted at PyPI, but for CPython I prefer to wrap C
> functions using C code.
>
> In short, the C implementation is faster than the C+Python implementation.
>
> The issue #22524 (*) is full of benchmark numbers. IMO the most
> interesting benchmark is to compare os.listdir() + os.stat() versus
> os.scandir() + DirEntry.is_dir(). Let me try to summarize results of
> this benchmark:
>
> * C implementation: scandir is at least 3.5x faster than listdir, up
>   to 44.6x faster on Windows
> * C+Python implementation: scandir is not really faster than listdir,
>   between 1.3x and 1.4x faster
>
> (*) http://bugs.python.org/issue22524
>
> Ben Hoyt reminded me that os.scandir() (PEP 471) doesn't add any new
> feature: pathlib already provides a nice API on top of os and os.path
> modules. (You may even notice that DirEntry has far fewer methods ;-))
> The main (only?) purpose of the PEP is performance.
>
> If os.scandir() is "only" 1.4x faster, I don't think that it is
> interesting to use os.scandir() in an application. I guess that all
> applications/libraries will want to keep compatibility with Python 3.4
> and older and so will anyway have to duplicate the code to use
> os.listdir() + os.stat(). So is it worth duplicating code for such a
> small speedup?
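
For reference, the two patterns compared in that benchmark look roughly like the sketch below. This is only an illustration, not code from issue #22524: the function names (subdirs_listdir, subdirs_scandir, subdirs) are hypothetical, and the last one shows the kind of fallback an application would probably need in order to keep working on Python 3.4 and older.

```python
import os
import stat


def subdirs_listdir(path):
    """Return subdirectory names using os.listdir() + os.stat()."""
    names = []
    for name in os.listdir(path):
        # One extra stat() system call per directory entry.
        st = os.stat(os.path.join(path, name))
        if stat.S_ISDIR(st.st_mode):
            names.append(name)
    return names


def subdirs_scandir(path):
    """Return subdirectory names using os.scandir() + DirEntry.is_dir()."""
    names = []
    for entry in os.scandir(path):
        # is_dir() can usually be answered from data the directory
        # listing already returned, so in most cases no extra system
        # call is needed.
        if entry.is_dir():
            names.append(entry.name)
    return names


def subdirs(path):
    """Hypothetical compatibility wrapper: prefer scandir when present."""
    if hasattr(os, "scandir"):
        return subdirs_scandir(path)
    return subdirs_listdir(path)
```

Both functions follow symbolic links, so on an ordinary directory tree they should return the same names, possibly in a different order. The wrapper is the code duplication the question above is about.
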
>
> Now I see 3 choices:
>
> - take the full C implementation, because it's much faster (at least
>   3.4x faster!)
> - reject the whole PEP 471 (not nice), because it adds too much code
>   for a minor speedup (not true on Windows: up to 44x faster!)
> - take the C+Python implementation, because maintenance matters more
>   than performance (only 1.3x faster, sorry)
>
> => IMO the best option is to take the C implementation. What do you think?
>
> I'm concerned by the length of the C code: the full C implementation
> adds ~800 lines of C code to posixmodule.c. This file is already the
> longest C file in CPython. I don't want to make it longer, but I'm not
> motivated to start splitting it. Last time I proposed to split a file
> (unicodeobject.c), some developers complained that it makes search
> harder. I don't understand this, there are so many tools to navigate
> in C code. But it was enough for me to give up on this idea.
>
> An alternative is to add a new _scandir.c module to host the new C
> code, and share some code with posixmodule.c: remove the "static" keyword
> from the required C functions (functions to convert Windows attributes to
> an os.stat_result object). That's a reasonable choice. What do you
> think?
>
> FYI I ran the benchmark on different hardware (SSD, HDD, tmpfs), file
> systems (ext4, tmpfs, NFS/ext4), operating systems (Linux, Windows).
>
> Victor
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/guido%40python.org
>

--
--Guido van Rossum (python.org/~guido)