Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On Jun 28, 2014 12:49 PM, Ben Hoyt benh...@gmail.com wrote:

But the underlying system calls -- ``FindFirstFile`` / ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --

What about FreeBSD, OpenBSD, NetBSD, Solaris, etc.? Don't they provide readdir?

I guess it'd be better to say Windows and Unix-based OSs throughout the PEP? Because all of these (including Mac OS X) are Unix-based.

No, just say POSIX.

It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we should mimic the recent stat_result addition: the new stat_result.file_attributes field. Add DirEntry.file_attributes, which would only be available on Windows.

The Windows structure also contains

    FILETIME ftCreationTime;
    FILETIME ftLastAccessTime;
    FILETIME ftLastWriteTime;
    DWORD nFileSizeHigh;
    DWORD nFileSizeLow;

It would be nice to expose them as well. I'm no longer surprised that the exact API differs depending on the OS for functions of the os module.

I think you've misunderstood how DirEntry.lstat() works on Windows -- it's basically a no-op, as Windows returns the full stat information with the original FindFirst/FindNext OS calls. This is fairly explicit in the PEP, but I'm sure I could make it clearer:

    DirEntry.lstat(): like os.lstat(), but requires no system calls on Windows

So you can already get the dwFileAttributes for free by saying entry.lstat().st_file_attributes. You can also get all the other fields you mentioned for free via .lstat() with no additional OS calls on Windows, for example: entry.lstat().st_size.

Feel free to suggest changes to the PEP or scandir docs if this isn't clear. Note that is_dir()/is_file()/is_symlink() are free on all systems, but .lstat() is only free on Windows.

Does your implementation use a free list to avoid the cost of memory allocation? A short free list of 10 or maybe just 1 may help. The free list may be stored directly in the generator object.

No, it doesn't. I might add this to the PEP under possible improvements.
However, I think the speed increase from removing the extra OS call and/or disk seek is going to be way more than memory allocation improvements, so I'm not sure this would be worth it.

Does it also support bytes filenames on UNIX? Python now supports undecodable filenames thanks to PEP 383 (surrogateescape). I prefer to use the same type for filenames on Linux and Windows, so Unicode is better. But some users might prefer bytes for other reasons.

I forget exactly now what my scandir module does, but for os.scandir() I think this should behave exactly like os.listdir() does for Unicode/bytes filenames.

Crazy idea: would it be possible to convert a DirEntry object to a pathlib.Path object without losing the cache? I guess that pathlib.Path expects a full stat_result object.

The main problem is that pathlib.Path objects explicitly don't cache stat info (and Guido doesn't want them to, for good reason I think). There was an earlier thread on python-dev about this. I'll add it to a Rejected ideas section.

I don't understand how you can build a full lstat() result without really calling stat. I see that WIN32_FIND_DATA contains the size, but here you call lstat().

See above.

Do you plan to continue to maintain your module for Python < 3.5, but upgrade your module for the final PEP?

Yes, I intend to maintain the standalone scandir module for 2.6 <= Python < 3.5, at least for a good while. For integration into the Python 3.5 stdlib, the implementation will be integrated into posixmodule.c, of course.

Should there be a way to access the full path?
----------------------------------------------

Should ``DirEntry``'s have a way to get the full path without using ``os.path.join(path, entry.name)``? This is a pretty common pattern, and it may be useful to add pathlib-like ``str(entry)`` functionality. This functionality has also been requested in `issue 13`_ on GitHub.

.. _`issue 13`: https://github.com/benhoyt/scandir/issues/13

I think that it would be very convenient to store the directory name in the DirEntry.
It should be light, it's just a reference. And provide a fullname() method which would just return os.path.join(path, entry.name) without trying to resolve path to get an absolute path.

Yeah, fair suggestion. I'm still slightly on the fence about this, but I think an explicit fullname() is a good suggestion. Ideally I think it'd be better to mimic pathlib.Path.__str__(), which is kind of the equivalent of fullname(). But how does pathlib deal with unicode/bytes issues if it's the str function which has to return a str object? Or at least, it'd be very weird if __str__() returned bytes. But I think it'd need to if you passed bytes into scandir(). Do others have thoughts?

Would it be hard to implement the wildcard feature on UNIX to compare performances of scandir('*.jpg') with and without the
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 28 Jun 2014, at 21:48, Ben Hoyt wrote:

[...]

Crazy idea: would it be possible to convert a DirEntry object to a pathlib.Path object without losing the cache? I guess that pathlib.Path expects a full stat_result object.

The main problem is that pathlib.Path objects explicitly don't cache stat info (and Guido doesn't want them to, for good reason I think). There was an earlier thread on python-dev about this. I'll add it to a Rejected ideas section.

However, it would be bad to have two implementations of the concept of filename with different attribute and method names. The best way to ensure compatible APIs would be if one class was derived from the other.

[...]

Servus,
Walter

___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On Sat, Jun 28, 2014 at 03:55:00PM -0400, Ben Hoyt wrote:

Re is_dir etc being properties rather than methods: [...]

The problem with this is that properties look free, they look just like attribute access, so you wouldn't normally handle exceptions when accessing them. But .lstat() and .is_dir() etc may do an OS call, so if you're needing to be careful with error handling, you may want to handle errors on them. Hence I think it's best practice to make them functions().

I think this one could go either way. Methods look like they actually re-test the value each time you call them. I can easily see people not realising that the value is cached and writing code like this toy example:

    # Detect a file change.
    t = the_file.lstat().st_mtime
    while the_file.lstat().st_mtime == t:
        sleep(0.1)
    print("Changed!")

I know that's not the best way to detect file changes, but I'm sure people will do something like that and not realise that the call to lstat is cached.

Personally, I would prefer a property. If I forget to wrap a call in a try...except, it will fail hard and I will get an exception. But with a method call, the failure is silent and I keep getting the cached result.

Speaking of caching, is there a way to freshen the cached values?

-- Steven
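For reference, a minimal sketch of the caching behaviour described above. It is written against os.scandir() as it later shipped in the standard library (Python 3.6+ for the context-manager form), where the method is DirEntry.stat() rather than the draft lstat() discussed in this thread. The helper name cached_vs_fresh_size and the file name data.bin are made up for illustration.

```python
import os
import tempfile


def cached_vs_fresh_size(directory, name):
    """Return (cached_before, cached_after, fresh_after) sizes for one entry."""
    with os.scandir(directory) as entries:
        entry = next(e for e in entries if e.name == name)

    # First call: free on Windows, one lstat() on POSIX; the result is cached.
    cached_before = entry.stat(follow_symlinks=False).st_size

    # Grow the file after the DirEntry has cached its stat result.
    with open(entry.path, "ab") as f:
        f.write(b"xyz")

    # Second call on the same DirEntry: no new system call, same cached value.
    cached_after = entry.stat(follow_symlinks=False).st_size
    # An explicit os.lstat() always asks the OS again.
    fresh_after = os.lstat(entry.path).st_size
    return cached_before, cached_after, fresh_after


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "data.bin"), "wb") as f:
        f.write(b"a")
    print(cached_vs_fresh_size(tmp, "data.bin"))
```

The polling loop in the toy example would therefore never terminate if the_file were a DirEntry; calling os.lstat(entry.path) inside the loop is one way to get a fresh value each time.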
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 29 June 2014 20:52, Steven D'Aprano st...@pearwood.info wrote:

Speaking of caching, is there a way to freshen the cached values?

Switch to a full Path object instead of relying on the cached DirEntry data.

This is what makes me wary of including lstat, even though Windows offers it without the extra stat call. Caching behaviour is *really* hard to make intuitive, especially when it *sometimes* returns data that looks fresh (as it does on first call on POSIX systems).

Regards,
Nick.

-- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 29 June 2014 12:08, Nick Coghlan ncogh...@gmail.com wrote:

This is what makes me wary of including lstat, even though Windows offers it without the extra stat call. Caching behaviour is *really* hard to make intuitive, especially when it *sometimes* returns data that looks fresh (as it does on first call on POSIX systems).

If it matters that much we *could* simply call it cached_lstat(). It's ugly, but I really don't like the idea of throwing the information away - after all, the fact that we currently throw data away is why there's even a need for scandir. Let's not make the same mistake again...

Paul
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 29 June 2014 21:45, Paul Moore p.f.mo...@gmail.com wrote:

On 29 June 2014 12:08, Nick Coghlan ncogh...@gmail.com wrote:

This is what makes me wary of including lstat, even though Windows offers it without the extra stat call. Caching behaviour is *really* hard to make intuitive, especially when it *sometimes* returns data that looks fresh (as it does on first call on POSIX systems).

If it matters that much we *could* simply call it cached_lstat(). It's ugly, but I really don't like the idea of throwing the information away - after all, the fact that we currently throw data away is why there's even a need for scandir. Let's not make the same mistake again...

Future-proofing is the reason DirEntry is a full fledged class in the first place, though.

Effectively communicating the behavioural difference between DirEntry and pathlib.Path is the main thing that makes me nervous about adhering too closely to the Path API.

To restate the problem and the alternative proposal, these are the DirEntry methods under discussion:

    is_dir(): like os.path.isdir(), but requires no system calls on at least POSIX and Windows
    is_file(): like os.path.isfile(), but requires no system calls on at least POSIX and Windows
    is_symlink(): like os.path.islink(), but requires no system calls on at least POSIX and Windows
    lstat(): like os.lstat(), but requires no system calls on Windows

For the almost-certain-to-be-cached items, the suggestion is to make them properties (or just ordinary attributes):

    is_dir
    is_file
    is_symlink

What to do with lstat() is currently less clear, since POSIX directory scanning doesn't provide that level of detail by default.

The PEP also doesn't currently state whether the is_dir(), is_file() and is_symlink() results would be updated if a call to lstat() produced different answers than the original directory scanning process, which further suggests to me that allowing the stat call to be delayed on POSIX systems is a potentially problematic and inherently confusing design.
We would have two options:

- update them, meaning calling lstat() may change those results from being a snapshot of the setting at the time the directory was scanned
- leave them alone, meaning the DirEntry object and the DirEntry.lstat() result may give different answers

Those both sound ugly to me.

So, here's my alternative proposal: add an ensure_lstat flag to scandir() itself, and don't have *any* methods on DirEntry, only attributes.

That would make the DirEntry attributes:

    is_dir: boolean, always populated
    is_file: boolean, always populated
    is_symlink: boolean, always populated
    lstat_result: stat result, may be None on POSIX systems if ensure_lstat is False

(I'm not particularly sold on lstat_result as the name, but lstat reads as a verb to me, so doesn't sound right as an attribute name)

What this would allow:

- by default, scanning is efficient everywhere, but lstat_result may be None on POSIX systems
- if you always need the lstat result, setting ensure_lstat will trigger the extra system call implicitly
- if you only sometimes need the stat result, you can call os.lstat() explicitly when the DirEntry lstat attribute is None

Most importantly, *regardless of platform*, the cached stat result (if not None) would reflect the state of the entry at the time the directory was scanned, rather than at some arbitrary later point in time when lstat() was first called on the DirEntry object.

There'd still be a slight window of discrepancy (since the filesystem state may change between reading the directory entry and making the lstat() call), but this could be effectively eliminated from the perspective of the Python code by making the result of the lstat() call authoritative for the whole DirEntry object.

Regards,
Nick.

P.S. We'd be generating quite a few of these, so we can use __slots__ to keep the memory overhead to a minimum (that's just a general comment - it's really irrelevant to the methods-or-attributes question).
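A rough sketch of what the attribute-only proposal could look like, emulated on top of the os.scandir() that eventually shipped (Python 3.6+ for the context-manager form). Everything here is hypothetical: EntrySnapshot and scandir_snapshot are illustrative names, not part of any real API, and as a simplification lstat_result is None on every platform when ensure_lstat is False, whereas the proposal would still populate it on Windows.

```python
import os
import stat
import tempfile


class EntrySnapshot:
    """Hypothetical attribute-only directory entry (no methods, only data)."""

    __slots__ = ("name", "path", "is_dir", "is_file", "is_symlink", "lstat_result")

    def __init__(self, name, path, is_dir, is_file, is_symlink, lstat_result):
        self.name = name
        self.path = path
        self.is_dir = is_dir
        self.is_file = is_file
        self.is_symlink = is_symlink
        self.lstat_result = lstat_result


def scandir_snapshot(path=".", ensure_lstat=False):
    """Yield EntrySnapshot objects; ensure_lstat forces one lstat() per entry."""
    with os.scandir(path) as entries:
        for entry in entries:
            if ensure_lstat:
                # The lstat() result is authoritative for the whole object,
                # so the type flags can never disagree with lstat_result.
                st = os.lstat(entry.path)
                yield EntrySnapshot(
                    entry.name, entry.path,
                    stat.S_ISDIR(st.st_mode),
                    stat.S_ISREG(st.st_mode),
                    stat.S_ISLNK(st.st_mode),
                    st,
                )
            else:
                # Type information normally comes from the directory scan itself.
                yield EntrySnapshot(
                    entry.name, entry.path,
                    entry.is_dir(follow_symlinks=False),
                    entry.is_file(follow_symlinks=False),
                    entry.is_symlink(),
                    None,
                )


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    for e in scandir_snapshot(tmp, ensure_lstat=True):
        print(e.name, e.is_dir, e.lstat_result is not None)
```

A caller that only sometimes needs stat data would leave ensure_lstat at False and call os.lstat(e.path) itself for the entries it cares about.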
-- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 29.06.2014 13:08, Nick Coghlan wrote:

On 29 June 2014 20:52, Steven D'Aprano st...@pearwood.info wrote:

Speaking of caching, is there a way to freshen the cached values?

Switch to a full Path object instead of relying on the cached DirEntry data.

This is what makes me wary of including lstat, even though Windows offers it without the extra stat call. Caching behaviour is *really* hard to make intuitive, especially when it *sometimes* returns data that looks fresh (as it does on first call on POSIX systems).

This bugs me too. An idea I had was adding a keyword argument to scandir which specifies whether stat data should be added to the DirEntry or not.

If the flag is set to True, this would implicitly call lstat on POSIX before returning the DirEntry, and use the available data on Windows.

If the flag is set to False, all the fields in the DirEntry will be None, for consistency, even on Windows.

This is not optimal in cases where the stat information is needed only for some of the DirEntry objects, but would also reduce the required logic in the DirEntry object.

Thoughts?

Regards, Nick.
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 06/29/2014 05:28 AM, Nick Coghlan wrote:

So, here's my alternative proposal: add an ensure_lstat flag to scandir() itself, and don't have *any* methods on DirEntry, only attributes.

That would make the DirEntry attributes:

    is_dir: boolean, always populated
    is_file: boolean, always populated
    is_symlink: boolean, always populated
    lstat_result: stat result, may be None on POSIX systems if ensure_lstat is False

(I'm not particularly sold on lstat_result as the name, but lstat reads as a verb to me, so doesn't sound right as an attribute name)

+1

-- ~Ethan~
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 06/29/2014 04:12 AM, Jonas Wielicki wrote:

If the flag is set to False, all the fields in the DirEntry will be None, for consistency, even on Windows.

-1

This consistency is unnecessary.

-- ~Ethan~
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
Chris Angelico ros...@gmail.com writes:

On Sat, Jun 28, 2014 at 11:05 PM, Akira Li 4kir4...@gmail.com wrote:

Have you considered adding support for paths relative to directory descriptors [1] via keyword-only dir_fd=None parameter if it may lead to more efficient implementations on some platforms?

[1]: https://docs.python.org/3.4/library/os.html#dir-fd

Potentially more efficient and also potentially safer (see 'man openat')... but an enhancement that can wait, if necessary.

Introducing the feature later creates unnecessary incompatibilities. Either it should be explicitly rejected in PEP 471 and something-like `os.scandir(os.open(relative_path, dir_fd=fd))` recommended instead (assuming `os.scandir in os.supports_fd` like `os.listdir()`).

At C level it could be implemented using fdopendir/openat or scandirat.

Here's the function description using Argument Clinic DSL:

    /*[clinic input]
    os.scandir

        path : path_t(allow_fd=True, nullable=True) = '.'

            *path* can be specified as either str or bytes. On some
            platforms, *path* may also be specified as an open file
            descriptor; the file descriptor must refer to a directory.
            If this functionality is unavailable, using it raises
            NotImplementedError.

        *

        dir_fd : dir_fd = None

            If not None, it should be a file descriptor open to a
            directory, and *path* should be a relative string; path will
            then be relative to that directory. If *dir_fd* is
            unavailable, using it raises NotImplementedError.

    Yield a DirEntry object for each file and directory in *path*.

    Just like os.listdir, the '.' and '..' pseudo-directories are
    skipped, and the entries are yielded in system-dependent order.

    {parameters}

    It's an error to use *dir_fd* when specifying *path* as an open
    file descriptor.

    [clinic start generated code]*/

And corresponding tests (from test_posix:PosixTester), to show the compatibility with os.listdir argument parsing in detail:

    def test_scandir_default(self):
        # When scandir is called without argument,
        # it's the same as scandir(os.curdir).
        self.assertIn(support.TESTFN, [e.name for e in posix.scandir()])

    def _test_scandir(self, curdir):
        filenames = sorted(e.name for e in posix.scandir(curdir))
        self.assertIn(support.TESTFN, filenames)
        #NOTE: assume listdir, scandir accept the same types on the platform
        self.assertEqual(sorted(posix.listdir(curdir)), filenames)

    def test_scandir(self):
        self._test_scandir(os.curdir)

    def test_scandir_none(self):
        # it's the same as scandir(os.curdir).
        self._test_scandir(None)

    def test_scandir_bytes(self):
        # When scandir is called with a bytes object,
        # the returned entries names are still of type str.
        # Call `os.fsencode(entry.name)` to get bytes
        self.assertIn('a', {'a'})
        self.assertNotIn(b'a', {'a'})
        self._test_scandir(b'.')

    @unittest.skipUnless(posix.scandir in os.supports_fd,
                         "test needs fd support for posix.scandir()")
    def test_scandir_fd_minus_one(self):
        # it's the same as scandir(os.curdir).
        self._test_scandir(-1)

    def test_scandir_float(self):
        # invalid args
        self.assertRaises(TypeError, posix.scandir, -1.0)

    @unittest.skipUnless(posix.scandir in os.supports_fd,
                         "test needs fd support for posix.scandir()")
    def test_scandir_fd(self):
        fd = posix.open(posix.getcwd(), posix.O_RDONLY)
        self.addCleanup(posix.close, fd)
        self._test_scandir(fd)
        self.assertEqual(
            sorted(posix.scandir('.')),
            sorted(posix.scandir(fd)))
        # call 2nd time to test rewind
        self.assertEqual(
            sorted(posix.scandir('.')),
            sorted(posix.scandir(fd)))

    @unittest.skipUnless(posix.scandir in os.supports_dir_fd,
                         "test needs dir_fd support for os.scandir()")
    def test_scandir_dir_fd(self):
        relpath = 'relative_path'
        with support.temp_dir() as parent:
            fullpath = os.path.join(parent, relpath)
            with support.temp_dir(path=fullpath):
                support.create_empty_file(os.path.join(parent, 'a'))
                support.create_empty_file(os.path.join(fullpath, 'b'))
                fd = posix.open(parent, posix.O_RDONLY)
                self.addCleanup(posix.close, fd)
                self.assertEqual(
                    sorted(posix.scandir(relpath, dir_fd=fd)),
                    sorted(posix.scandir(fullpath)))
                # check that fd is still useful
                self.assertEqual(
                    sorted(posix.scandir(relpath, dir_fd=fd)),
                    sorted(posix.scandir(fullpath)))

-- Akira
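A minimal sketch of the `os.scandir(os.open(relative_path, dir_fd=fd))` pattern mentioned above, assuming a Python version and platform where os.scandir is in os.supports_fd (Python 3.7+ on typical POSIX systems) and os.open is in os.supports_dir_fd. The dir_fd keyword drafted in the Argument Clinic block is not something the released os.scandir() accepts, so the sketch goes through os.open() instead. The helper name names_relative_to is made up for illustration.

```python
import os
import tempfile


def names_relative_to(parent_fd, relative_path):
    """List entry names in relative_path, resolved against the directory fd parent_fd."""
    if os.scandir not in os.supports_fd or os.open not in os.supports_dir_fd:
        raise NotImplementedError("fd-based scandir is unavailable on this platform")
    # Resolve relative_path against parent_fd rather than the current directory.
    fd = os.open(relative_path, os.O_RDONLY, dir_fd=parent_fd)
    try:
        with os.scandir(fd) as entries:
            return sorted(entry.name for entry in entries)
    finally:
        # The caller still owns the descriptor it opened.
        os.close(fd)


if os.scandir in os.supports_fd and os.open in os.supports_dir_fd:
    with tempfile.TemporaryDirectory() as parent:
        os.mkdir(os.path.join(parent, "relative_path"))
        open(os.path.join(parent, "relative_path", "b"), "w").close()
        parent_fd = os.open(parent, os.O_RDONLY)
        try:
            print(names_relative_to(parent_fd, "relative_path"))
        finally:
            os.close(parent_fd)
```

Calling the helper twice with the same parent_fd works because each call opens and closes its own descriptor, which is roughly what the "check that fd is still useful" assertion in the draft test is after.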
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 29.06.2014 19:04, Ethan Furman wrote:

On 06/29/2014 04:12 AM, Jonas Wielicki wrote:

If the flag is set to False, all the fields in the DirEntry will be None, for consistency, even on Windows.

-1

This consistency is unnecessary.

I’m not sure -- similar to the windows_wildcard option this might be a temptation to write platform dependent code, although possibly by accident (i.e. not reading the docs carefully).

-- ~Ethan~
Re: [Python-Dev] Fix Unicode-disabled build of Python 2.7
On Sat, Jun 28, 2014 at 2:51 AM, Victor Stinner victor.stin...@gmail.com wrote:

2014-06-26 13:04 GMT+02:00 Antoine Pitrou anto...@python.org:

For the same reason, I agree with Victor that we should ditch the threading-disabled builds. It's too much of a hassle for no actual, practical benefit. People who want a threadless unicodeless Python can install Python 1.5.2 for all I care.

By the way, adding a buildbot for testing Python without thread support is not enough. The buildbot has been broken for more than a month and nobody noticed :-p

I opened http://bugs.python.org/issue21755 to fix the test a couple of weeks ago.

--Berker

http://buildbot.python.org/all/builders/AMD64%20Fedora%20without%20threads%203.x/

Ok, I noticed, but I consider that I spent too much time on this minor use case. I prefer to leave such a task to someone else :-)

Victor
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 6/29/2014 5:28 AM, Nick Coghlan wrote:

There'd still be a slight window of discrepancy (since the filesystem state may change between reading the directory entry and making the lstat() call), but this could be effectively eliminated from the perspective of the Python code by making the result of the lstat() call authoritative for the whole DirEntry object.

+1 to this in particular, but this whole refresh of the semantics sounds better overall.

Finally, for the case where someone does want to keep the DirEntry around, a .refresh() API could rerun lstat() and update all the data.

And with that (initial data potentially always populated, or None, and an explicit refresh() API), the data could all be returned as properties, implying that they aren't fetching new data themselves, because they wouldn't be.

Glenn
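To make the refresh() idea concrete, here is a small sketch of an entry whose data is exposed only through properties and changes only when refresh() is called. RefreshableEntry is a hypothetical class invented for this illustration; it always runs os.lstat() on construction, which a real scandir() would avoid where the directory scan already supplies the data.

```python
import os
import stat
import tempfile


class RefreshableEntry:
    """Hypothetical entry: properties read a snapshot, refresh() replaces it."""

    __slots__ = ("path", "_lstat")

    def __init__(self, path):
        self.path = path
        self._lstat = os.lstat(path)  # snapshot taken once, up front

    def refresh(self):
        """Re-run lstat() and replace the snapshot; the only method that hits the OS."""
        self._lstat = os.lstat(self.path)

    @property
    def lstat_result(self):
        return self._lstat

    @property
    def is_dir(self):
        return stat.S_ISDIR(self._lstat.st_mode)

    @property
    def is_file(self):
        return stat.S_ISREG(self._lstat.st_mode)

    @property
    def is_symlink(self):
        return stat.S_ISLNK(self._lstat.st_mode)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "f")
    with open(target, "wb") as fh:
        fh.write(b"a")
    entry = RefreshableEntry(target)
    with open(target, "ab") as fh:
        fh.write(b"bc")
    stale = entry.lstat_result.st_size   # size from the original snapshot
    entry.refresh()
    fresh = entry.lstat_result.st_size   # size after the explicit refresh
    print(stale, fresh)
```

Because all type flags derive from the one stored lstat result, they cannot drift apart from lstat_result, which is the consistency property asked for above.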