STINNER Victor added the comment:

I cloned https://github.com/benhoyt/scandir. As I understand it, the --scandir 
command line option of benchmark.py offers these choices:

- generic = call listdir() and then use "yield GenericDirEntry" which caches 
os.stat() and os.lstat() results
- python = implemented with ctypes: calls opendir/readdir and yields 
PosixDirEntry objects which use the d_type field from readdir() in is_dir(), 
is_file() and is_symlink(). Caches the results of os.stat() and os.lstat()
- c = "scandir_helper" (iterator) implemented in C (Python C API), yielding 
PosixDirEntry objects (same class as the "python" implementation)


I checked with an assertion: d_type of readdir() is never DT_UNKNOWN on my 
Linux Fedora 20. Statistics of PosixDirEntry on my /usr/share tree:

- 155544 PosixDirEntry instances
- fast-path (use d_type) taken 466632 times in is_dir/is_symlink
- slow-path (need to call os.stat or os.lstat) taken 7828 times in 
is_dir/is_symlink
- os.stat called 7832 times
- os.lstat called 0 times

7832 is the number of symbolic links in my /usr/share tree. 95% of entries 
don't need stat() in scandir.walk() when using readdir().

So is_dir() and is_symlink() are called approximately 3 times per entry: 
scandir.walk() calls is_dir() and is_symlink() on each entry, but is_dir() also 
calls is_symlink() by default (because the default value of the follow_symlinks 
parameter is True).
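
The fast path and slow path can be sketched as follows. This is a simplified 
illustration, not the PosixDirEntry code: the class DTypeEntry and its two 
counters are hypothetical, and the counters are not necessarily incremented the 
same way as in the statistics above. The DT_* values are the ones defined in 
<dirent.h> on Linux.

```python
import os
from stat import S_ISDIR, S_ISLNK

# d_type values from <dirent.h> on Linux.
DT_UNKNOWN, DT_DIR, DT_REG, DT_LNK = 0, 4, 8, 10


class DTypeEntry:
    """Hypothetical sketch of an entry that prefers d_type over stat()."""

    fast_path = 0  # class-wide counters, for illustration only
    slow_path = 0

    def __init__(self, path, d_type):
        self.path = path
        self.d_type = d_type

    def is_symlink(self):
        if self.d_type != DT_UNKNOWN:
            DTypeEntry.fast_path += 1
            return self.d_type == DT_LNK
        DTypeEntry.slow_path += 1
        return S_ISLNK(os.lstat(self.path).st_mode)

    def is_dir(self, follow_symlinks=True):
        if follow_symlinks and self.is_symlink():
            # d_type describes the link itself, so the target needs os.stat().
            DTypeEntry.slow_path += 1
            try:
                return S_ISDIR(os.stat(self.path).st_mode)
            except OSError:
                return False  # broken link
        if self.d_type != DT_UNKNOWN:
            DTypeEntry.fast_path += 1
            return self.d_type == DT_DIR
        DTypeEntry.slow_path += 1
        return S_ISDIR(os.lstat(self.path).st_mode)
```

For an entry that is not a symbolic link, one is_dir() call plus one 
is_symlink() call performs three d_type checks and no system call, which is 
where the factor of 3 comes from. Only symbolic links reach os.stat().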


I ran benchmark.py on my Linux Fedora 20 (Linux kernel 3.14). I have two HDDs 
configured as RAID0. I don't think that my disk config is relevant: I also have 
12 GB of memory, so I hope that the /usr/share tree is fully cached. For 
example, "free -m" tells me that 8.8 GB are cached.

The generic implementation looks inefficient: it is 2 times slower than 
os.walk(). Is there a bug? GenericDirEntry caches os.stat() and os.lstat() 
results, so it should be as fast as or faster than os.walk(), no? Or is it the 
cost of a generator?

The "c" implementation takes 35% less time than the "python" implementation 
(python=1.170 sec, c=0.762 sec).
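
The 35% figure is a reduction in run time, not a speed ratio. A quick check 
with the two timings quoted above (the variable names are mine):

```python
python_secs = 1.170  # scandir.walk time with the ctypes implementation
c_secs = 0.762       # scandir.walk time with the C implementation

time_saved = 1 - c_secs / python_secs  # fraction of run time removed
speed_ratio = python_secs / c_secs     # how many times faster the C walk is

print(f"time saved: {time_saved:.0%}, speed ratio: {speed_ratio:.2f}x")
# prints "time saved: 35%, speed ratio: 1.54x"
```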


Result of benchmark:

haypo@smithers$ python3 setup.py build && for scandir in generic python c; do echo; echo "=== $scandir ==="; PYTHONPATH=build/lib.linux-x86_64-3.3/ python3 benchmark.py /usr/share -c $scandir || break; done
running build
running build_py
running build_ext

=== generic ===
Using very slow generic version of scandir
Comparing against builtin version of os.walk()
Priming the system's cache...
Benchmarking walks on /usr/share, repeat 1/3...
Benchmarking walks on /usr/share, repeat 2/3...
Benchmarking walks on /usr/share, repeat 3/3...
os.walk took 1.340s, scandir.walk took 2.471s -- 0.5x as fast

=== python ===
Using slower ctypes version of scandir
Comparing against builtin version of os.walk()
Priming the system's cache...
Benchmarking walks on /usr/share, repeat 1/3...
Benchmarking walks on /usr/share, repeat 2/3...
Benchmarking walks on /usr/share, repeat 3/3...
os.walk took 1.318s, scandir.walk took 1.170s -- 1.1x as fast

=== c ===
Using fast C version of scandir
Comparing against builtin version of os.walk()
Priming the system's cache...
Benchmarking walks on /usr/share, repeat 1/3...
Benchmarking walks on /usr/share, repeat 2/3...
Benchmarking walks on /usr/share, repeat 3/3...
os.walk took 1.317s, scandir.walk took 0.762s -- 1.7x as fast

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue22524>
_______________________________________