Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On Fri, Jun 27, 2014 at 2:58 PM, Nick Coghlan ncogh...@gmail.com wrote:

* -1 on including Windows specific globbing support in the API
* -0 on including cross platform globbing support in the initial iteration of the API (that could be done later as a separate RFE instead)

Agreed. Globbing or filtering support should not hold this up. If that part isn't settled, just don't include it and work out what it should be as a future enhancement.

* +1 on a new section in the PEP covering rejected design options (calling it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)

+1. IMNSHO, one of the most important parts of PEPs: capturing the entire decision process to document the why nots.

* regarding why not a 2-tuple, we know from experience that operating systems evolve and we end up wanting to add additional info to this kind of API. A dedicated DirEntry type lets us adjust the information returned over time, without breaking backwards compatibility and without resorting to ugly hacks like those in some of the time and stat APIs (or even our own codec info APIs)
* it would be nice to see some relative performance numbers for NFS and CIFS network shares - the additional network round trips can make excessive stat calls absolutely brutal from a speed perspective when using a network drive (that's why the stat caching added to the import system in 3.3 dramatically sped up the case of having network drives on sys.path, and why I thought AJ had a point when he was complaining about the fact we didn't expose the dirent data from os.listdir)

fwiw, I wouldn't wait for benchmark numbers. A needless stat call when you've got the information from an earlier API call is already brutal. It is easy to compute from existing ballparks: remote file server / cloud access: ~100ms, local spinning disk seek+read: ~10ms, fetch of stat info cached in memory on file server on the local network: ~500us.
You can go down further to local system call overhead which can vary wildly but should likely be assumed to be at least 10us. You don't need a benchmark to tell you that adding needless >= 500us-100ms blocking operations to your program is bad. :)

-gps

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
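The ballpark argument above can be made concrete with a few lines of arithmetic. This is only a sketch: the latencies are the rough figures quoted in the message, and the tree size of 10,000 entries is a hypothetical number picked for illustration, not a measurement.

```python
# Ballpark latencies quoted in the message above, in seconds.
LATENCY = {
    "remote file server / cloud": 100e-3,
    "local spinning disk seek+read": 10e-3,
    "stat cached on LAN file server": 500e-6,
    "local system call": 10e-6,
}


def needless_stat_cost(num_entries, latency_s):
    """Total blocking time added by one avoidable stat call per entry."""
    return num_entries * latency_s


# Hypothetical tree size, chosen only for illustration.
NUM_ENTRIES = 10_000

for name, latency in LATENCY.items():
    print(f"{name}: {needless_stat_cost(NUM_ENTRIES, latency):.3f} s")
```

With the hypothetical 10,000 entries, that is roughly 5 seconds at 500us per call and roughly 1,000 seconds at 100ms per call.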
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 28 June 2014 16:17, Gregory P. Smith g...@krypto.org wrote:

On Fri, Jun 27, 2014 at 2:58 PM, Nick Coghlan ncogh...@gmail.com wrote:

* it would be nice to see some relative performance numbers for NFS and CIFS network shares - the additional network round trips can make excessive stat calls absolutely brutal from a speed perspective when using a network drive (that's why the stat caching added to the import system in 3.3 dramatically sped up the case of having network drives on sys.path, and why I thought AJ had a point when he was complaining about the fact we didn't expose the dirent data from os.listdir)

fwiw, I wouldn't wait for benchmark numbers. A needless stat call when you've got the information from an earlier API call is already brutal. It is easy to compute from existing ballparks: remote file server / cloud access: ~100ms, local spinning disk seek+read: ~10ms, fetch of stat info cached in memory on file server on the local network: ~500us. You can go down further to local system call overhead which can vary wildly but should likely be assumed to be at least 10us. You don't need a benchmark to tell you that adding needless >= 500us-100ms blocking operations to your program is bad. :)

Agreed, but walking even a moderately large tree over the network can really hammer home the point that this offers a significant performance enhancement as the latency of access increases. I've found that kind of comparison can be eye-opening for folks that are used to only operating on local disks (even spinning disks, let alone SSDs) and/or relatively small trees (distro build trees aren't *that* big, but they're big enough for this kind of difference in access overhead to start getting annoying).

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Fix Unicode-disabled build of Python 2.7
Hello,

On Thu, 26 Jun 2014 22:49:40 +1000 Chris Angelico ros...@gmail.com wrote:

On Thu, Jun 26, 2014 at 9:04 PM, Antoine Pitrou anto...@python.org wrote:

For the same reason, I agree with Victor that we should ditch the threading-disabled builds. It's too much of a hassle for no actual, practical benefit. People who want a threadless unicodeless Python can install Python 1.5.2 for all I care.

Or some other implementation of Python. It's looking like micropython will be permanently supporting a non-Unicode build

Yes.

(although I stepped away from the project after a strong disagreement over what would and would not make sense, and haven't been following it since).

Your patches with my further additions were finally merged. Unicode strings still cannot be enabled by default due to https://github.com/micropython/micropython/issues/726 . Any help with reviewing/testing what's currently available is welcome.

If someone wants a Python that doesn't have stuff that the core CPython devs treat as essential, s/he probably wants something like uPy anyway.

I hinted at it during previous discussions of MicroPython, and would like to say it again: MicroPython has already embraced a lot of ideas rejected from CPython, like GC-only operation (which alone is not something to be proud of, but can you start up and do something in 2K heap?) or tagged pointers (https://mail.python.org/pipermail/python-dev/2004-July/046139.html). So, it should be a good vehicle to try any unorthodox ideas(*) or implementations.

* MicroPython already implements intra-module constants for example.

--
Best regards, Paul mailto:pmis...@gmail.com
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
Ben Hoyt benh...@gmail.com writes:

Hi Python dev folks, I've written a PEP proposing a specific os.scandir() API for a directory iterator that returns the stat-like info from the OS, *the main advantage of which is to speed up os.walk() and similar operations between 4-20x, depending on your OS and file system.* ... http://legacy.python.org/dev/peps/pep-0471/ ...

Specifically, this PEP proposes adding a single function to the ``os`` module in the standard library, ``scandir``, that takes a single, optional string as its argument::

    scandir(path='.') -> generator of DirEntry objects

Have you considered adding support for paths relative to directory descriptors [1] via keyword only dir_fd=None parameter if it may lead to more efficient implementations on some platforms?

[1]: https://docs.python.org/3.4/library/os.html#dir-fd

--
akira
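For readers who have not used the function, here is a minimal sketch of the kind of loop the proposal is aimed at. It is written against os.scandir() as it eventually shipped in the standard library (Python 3.5, with context-manager support added in 3.6), which may differ in detail from the draft under discussion; the helper name subdirs_and_files is made up for this example.

```python
import os


def subdirs_and_files(path="."):
    """Split the entries of `path` into directory names and other names."""
    dirs, files = [], []
    with os.scandir(path) as entries:
        for entry in entries:
            # is_dir() can usually be answered from the directory scan
            # itself, so no per-entry stat call is needed in the common case.
            if entry.is_dir():
                dirs.append(entry.name)
            else:
                files.append(entry.name)
    return sorted(dirs), sorted(files)
```

This is essentially the classification step os.walk() performs for every directory it visits, which is where the avoided stat calls add up.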
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On Sat, Jun 28, 2014 at 11:05 PM, Akira Li 4kir4...@gmail.com wrote:

Have you considered adding support for paths relative to directory descriptors [1] via keyword only dir_fd=None parameter if it may lead to more efficient implementations on some platforms?

[1]: https://docs.python.org/3.4/library/os.html#dir-fd

Potentially more efficient and also potentially safer (see 'man openat')... but an enhancement that can wait, if necessary.

ChrisA
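To illustrate what "paths relative to directory descriptors" means in the existing os API, here is a small sketch using os.stat() with dir_fd. It does not show a dir_fd parameter on scandir(), which is only being suggested in this thread; the helper name stat_names_relative_to is hypothetical, and the approach assumes a platform where os.supports_dir_fd and os.supports_fd include the calls used.

```python
import os


def stat_names_relative_to(dir_path):
    """Map each name in `dir_path` to its size, statting relative to an fd."""
    if os.stat not in os.supports_dir_fd or os.listdir not in os.supports_fd:
        raise NotImplementedError("directory descriptors not supported here")
    fd = os.open(dir_path, os.O_RDONLY)
    try:
        # Names are resolved relative to the open descriptor (as with
        # openat/fstatat), so renaming the directory afterwards cannot
        # redirect the lookups.
        return {
            name: os.stat(name, dir_fd=fd, follow_symlinks=False).st_size
            for name in os.listdir(fd)
        }
    finally:
        os.close(fd)
```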
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
But the underlying system calls -- ``FindFirstFile`` / ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --

What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?

I guess it'd be better to say Windows and Unix-based OSs throughout the PEP? Because all of these (including Mac OS X) are Unix-based.

It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we should mimic stat_result recent addition: the new stat_result.file_attributes field. Add DirEntry.file_attributes which would only be available on Windows. The Windows structure also contains

    FILETIME ftCreationTime;
    FILETIME ftLastAccessTime;
    FILETIME ftLastWriteTime;
    DWORD nFileSizeHigh;
    DWORD nFileSizeLow;

It would be nice to expose them as well. I'm no more surprised that the exact API is different depending on the OS for functions of the os module.

I think you've misunderstood how DirEntry.lstat() works on Windows -- it's basically a no-op, as Windows returns the full stat information with the original FindFirst/FindNext OS calls. This is fairly explicit in the PEP, but I'm sure I could make it clearer:

    DirEntry.lstat(): like os.lstat(), but requires no system calls on Windows

So you can already get the dwFileAttributes for free by saying entry.lstat().st_file_attributes. You can also get all the other fields you mentioned for free via .lstat() with no additional OS calls on Windows, for example: entry.lstat().st_size. Feel free to suggest changes to the PEP or scandir docs if this isn't clear.

Note that is_dir()/is_file()/is_symlink() are free on all systems, but .lstat() is only free on Windows.

Does your implementation use a free list to avoid the cost of memory allocation? A short free list of 10 or maybe just 1 may help. The free list may be stored directly in the generator object.

No, it doesn't. I might add this to the PEP under possible improvements.
However, I think the speed increase by removing the extra OS call and/or disk seek is going to be way more than memory allocation improvements, so I'm not sure this would be worth it.

Does it support also bytes filenames on UNIX? Python now supports undecodable filenames thanks to the PEP 383 (surrogateescape). I prefer to use the same type for filenames on Linux and Windows, so Unicode is better. But some users might prefer bytes for other reasons.

I forget exactly now what my scandir module does, but for os.scandir() I think this should behave exactly like os.listdir() does for Unicode/bytes filenames.

Crazy idea: would it be possible to convert a DirEntry object to a pathlib.Path object without losing the cache? I guess that pathlib.Path expects a full stat_result object.

The main problem is that pathlib.Path objects explicitly don't cache stat info (and Guido doesn't want them to, for good reason I think). There's a thread on python-dev about this earlier. I'll add it to a Rejected ideas section.

I don't understand how you can build a full lstat() result without really calling stat. I see that WIN32_FIND_DATA contains the size, but here you call lstat().

See above.

Do you plan to continue to maintain your module for Python 3.5, but upgrade your module for the final PEP?

Yes, I intend to maintain the standalone scandir module for 2.6 <= Python < 3.5, at least for a good while. For integration into the Python 3.5 stdlib, the implementation will be integrated into posixmodule.c, of course.

Should there be a way to access the full path? -- Should ``DirEntry``'s have a way to get the full path without using ``os.path.join(path, entry.name)``? This is a pretty common pattern, and it may be useful to add pathlib-like ``str(entry)`` functionality. This functionality has also been requested in `issue 13`_ on GitHub.

.. _`issue 13`: https://github.com/benhoyt/scandir/issues/13

I think that it would be very convenient to store the directory name in the DirEntry.
It should be light, it's just a reference. And provide a fullname() name which would just return os.path.join(path, entry.name) without trying to resolve path to get an absolute path.

Yeah, fair suggestion. I'm still slightly on the fence about this, but I think an explicit fullname() is a good suggestion. Ideally I think it'd be better to mimic pathlib.Path.__str__() which is kind of the equivalent of fullname(). But how does pathlib deal with unicode/bytes issues if it's the str function which has to return a str object? Or at least, it'd be very weird if __str__() returned bytes. But I think it'd need to if you passed bytes into scandir(). Do others have thoughts?

Would it be hard to implement the wildcard feature on UNIX to compare performances of scandir('*.jpg') with and without the wildcard built in os.scandir?

It's a good idea, the problem with this is that the Windows wildcard implementation has a bunch of crazy edge cases where *.ext will catch more
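As a rough illustration of the two ideas raised in the message above (a joined full path, and wildcard filtering done outside scandir), here is a sketch that filters in Python with the standard fnmatch module. The helper name scan_matching is invented for this example, and this is not the built-in wildcard support being debated.

```python
import fnmatch
import os


def scan_matching(path, pattern="*"):
    """Yield joined paths of entries in `path` whose names match `pattern`."""
    with os.scandir(path) as entries:
        for entry in entries:
            if fnmatch.fnmatch(entry.name, pattern):
                # This is the value the suggested fullname() would return.
                # The API that shipped in Python 3.5 exposes the same thing
                # as the entry.path attribute.
                yield os.path.join(path, entry.name)
```

Filtering on the Python side keeps the matching rules identical on every platform, at the cost of still fetching every entry from the OS.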
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
Re is_dir etc being properties rather than methods:

I find this behaviour a bit misleading: using methods and have them return cached results. How much (implementation and/or performance and/or memory) overhead would incur by using property-like access here? I think this would underline the static nature of the data. This would break the semantics with respect to pathlib, but they're only marginally equal anyways -- and as far as I understand it, pathlib won't cache, so I think this has a fair point here.

Indeed - using properties rather than methods may help emphasise the deliberate *difference* from pathlib in this case (i.e. value when the result was retrieved from the OS, rather than the value right now). The main benefit is that switching from using the DirEntry object to a pathlib Path will require touching all the places where the performance characteristics switch from memory access to system call. This benefit is also the main downside, so I'd actually be OK with either decision on this one.

The problem with this is that properties look free, they look just like attribute access, so you wouldn't normally handle exceptions when accessing them. But .lstat() and .is_dir() etc may do an OS call, so if you're needing to be careful with error handling, you may want to handle errors on them. Hence I think it's best practice to make them functions(). Some of us discussed this on python-dev or python-ideas a while back, and I think there was general agreement with what I've stated above and therefore they should be methods. But I'll dig up the links and add to a Rejected ideas section.

* +1 on a new section in the PEP covering rejected design options (calling it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)

Great idea. I'll add a bunch of stuff, including the above, to a new section, Rejected Design Options.
* regarding why not a 2-tuple, we know from experience that operating systems evolve and we end up wanting to add additional info to this kind of API. A dedicated DirEntry type lets us adjust the information returned over time, without breaking backwards compatibility and without resorting to ugly hacks like those in some of the time and stat APIs (or even our own codec info APIs)

Fully agreed.

* it would be nice to see some relative performance numbers for NFS and CIFS network shares - the additional network round trips can make excessive stat calls absolutely brutal from a speed perspective when using a network drive (that's why the stat caching added to the import system in 3.3 dramatically sped up the case of having network drives on sys.path, and why I thought AJ had a point when he was complaining about the fact we didn't expose the dirent data from os.listdir)

Don't know if you saw, but there are actually some benchmarks, including one over NFS, on the scandir GitHub page: https://github.com/benhoyt/scandir#benchmarks

os.walk() was 23 times faster with scandir() than the current listdir() + stat() implementation on the Windows NFS file system I tried. Pretty good speedup!

-Ben
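The error-handling argument for methods can be sketched as follows. The message refers to .lstat(); the API that shipped in Python 3.5 spells this entry.stat(follow_symlinks=False), which is what the sketch uses. The function name total_size and the choice to collect failures in a list are assumptions made for this example.

```python
import os


def total_size(path):
    """Sum the sizes of regular files in `path`, recording entries that fail."""
    total = 0
    skipped = []
    with os.scandir(path) as entries:
        for entry in entries:
            try:
                # Both calls may need a system call on some platforms, so
                # either one can raise OSError (for example if the file
                # disappeared after the directory was scanned).
                if entry.is_file(follow_symlinks=False):
                    total += entry.stat(follow_symlinks=False).st_size
            except OSError as exc:
                skipped.append((entry.name, exc))
    return total, skipped
```

Because the calls look like calls, wrapping them in try/except reads naturally; the same guard around a bare attribute access would be much easier to forget.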
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 29 June 2014 05:55, Ben Hoyt benh...@gmail.com wrote:

Re is_dir etc being properties rather than methods:

I find this behaviour a bit misleading: using methods and have them return cached results. How much (implementation and/or performance and/or memory) overhead would incur by using property-like access here? I think this would underline the static nature of the data. This would break the semantics with respect to pathlib, but they're only marginally equal anyways -- and as far as I understand it, pathlib won't cache, so I think this has a fair point here.

Indeed - using properties rather than methods may help emphasise the deliberate *difference* from pathlib in this case (i.e. value when the result was retrieved from the OS, rather than the value right now). The main benefit is that switching from using the DirEntry object to a pathlib Path will require touching all the places where the performance characteristics switch from memory access to system call. This benefit is also the main downside, so I'd actually be OK with either decision on this one.

The problem with this is that properties look free, they look just like attribute access, so you wouldn't normally handle exceptions when accessing them. But .lstat() and .is_dir() etc may do an OS call, so if you're needing to be careful with error handling, you may want to handle errors on them. Hence I think it's best practice to make them functions(). Some of us discussed this on python-dev or python-ideas a while back, and I think there was general agreement with what I've stated above and therefore they should be methods. But I'll dig up the links and add to a Rejected ideas section.

Yes, only the stuff that *never* needs a system call (regardless of OS) would be a candidate for handling as a property rather than a method call.
Consistency of access would likely trump that idea anyway, but it would still be worth ensuring that the PEP is clear on which values are guaranteed to reflect the state at the time of the directory scanning and which may imply an additional stat call.

* it would be nice to see some relative performance numbers for NFS and CIFS network shares - the additional network round trips can make excessive stat calls absolutely brutal from a speed perspective when using a network drive (that's why the stat caching added to the import system in 3.3 dramatically sped up the case of having network drives on sys.path, and why I thought AJ had a point when he was complaining about the fact we didn't expose the dirent data from os.listdir)

Don't know if you saw, but there are actually some benchmarks, including one over NFS, on the scandir GitHub page: https://github.com/benhoyt/scandir#benchmarks

No, I hadn't seen those - may be worth referencing explicitly from the PEP (and if there's already a reference... oops!)

os.walk() was 23 times faster with scandir() than the current listdir() + stat() implementation on the Windows NFS file system I tried. Pretty good speedup!

Ah, nice!

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
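For anyone wanting to try the comparison on their own network share, a minimal timing harness might look like the sketch below. It is not the benchmark script from the scandir GitHub page; the function names are made up here, and it only times a single directory rather than a full os.walk().

```python
import os
import stat
import time


def count_dirs_listdir(path):
    """Count subdirectories using os.listdir() plus one os.lstat() per name."""
    count = 0
    for name in os.listdir(path):
        if stat.S_ISDIR(os.lstat(os.path.join(path, name)).st_mode):
            count += 1
    return count


def count_dirs_scandir(path):
    """Count subdirectories using the type information from os.scandir()."""
    with os.scandir(path) as entries:
        return sum(1 for e in entries if e.is_dir(follow_symlinks=False))


def time_call(func, path, repeat=5):
    """Return the best wall-clock time of `repeat` calls to func(path)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(path)
        best = min(best, time.perf_counter() - start)
    return best
```

Calling time_call(count_dirs_listdir, some_dir) and time_call(count_dirs_scandir, some_dir) on a large directory gives a rough ratio; OS-level caching will affect repeated runs, so the results are indicative at best.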