Hi,

You wrote a great PEP Ben, thanks :-) But it's now time for comments!
> But the underlying system calls -- ``FindFirstFile`` /
> ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --

What about FreeBSD, OpenBSD, NetBSD, Solaris, etc.? Don't they provide
readdir?

You should add a link to the FindFirstFile doc:
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364418%28v=vs.85%29.aspx

It looks like WIN32_FIND_DATA has a dwFileAttributes field. So we
should mimic the recent stat_result addition: the new
stat_result.file_attributes field. Add DirEntry.file_attributes, which
would only be available on Windows.

The Windows structure also contains:

  FILETIME ftCreationTime;
  FILETIME ftLastAccessTime;
  FILETIME ftLastWriteTime;
  DWORD    nFileSizeHigh;
  DWORD    nFileSizeLow;

It would be nice to expose them as well. I'm no longer surprised that
the exact API differs depending on the OS for functions of the os
module.

> * Instead of bare filename strings, it returns lightweight
> ``DirEntry`` objects that hold the filename string and provide
> simple methods that allow access to the stat-like data the operating
> system returned.

Does your implementation use a free list to avoid the cost of memory
allocation? A short free list of 10 or maybe just 1 may help. The free
list may be stored directly in the generator object.

> ``scandir()`` yields a ``DirEntry`` object for each file and directory
> in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'``
> pseudo-directories are skipped, and the entries are yielded in
> system-dependent order. Each ``DirEntry`` object has the following
> attributes and methods:

Does it also support bytes filenames on UNIX?

Python now supports undecodable filenames thanks to PEP 383
(surrogateescape). I prefer to use the same type for filenames on Linux
and Windows, so Unicode is better. But some users might prefer bytes
for other reasons.

> The ``DirEntry`` attribute and method names were chosen to be the same
> as those in the new ``pathlib`` module for consistency.

Great!
That's exactly what I expected :-) Consistency with other modules.

> Notes on caching
> ----------------
>
> The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
> is obviously always cached, and the ``is_X`` and ``lstat`` methods
> cache their values (immediately on Windows via ``FindNextFile``, and
> on first use on Linux / OS X via a ``stat`` call) and never refetch
> from the system.
>
> For this reason, ``DirEntry`` objects are intended to be used and
> thrown away after iteration, not stored in long-lived data structured
> and the methods called again and again.
>
> If a user wants to do that (for example, for watching a file's size
> change), they'll need to call the regular ``os.lstat()`` or
> ``os.path.getsize()`` functions which force a new system call each
> time.

Crazy idea: would it be possible to "convert" a DirEntry object to a
pathlib.Path object without losing the cache? I guess that pathlib.Path
expects a full stat_result object.

> Or, for getting the total size of files in a directory tree -- showing
> use of the ``DirEntry.lstat()`` method::
>
>     def get_tree_size(path):
>         """Return total size of files in path and subdirs."""
>         size = 0
>         for entry in scandir(path):
>             if entry.is_dir():
>                 sub_path = os.path.join(path, entry.name)
>                 size += get_tree_size(sub_path)
>             else:
>                 size += entry.lstat().st_size
>         return size
>
> Note that ``get_tree_size()`` will get a huge speed boost on Windows,
> because no extra stat call are needed, but on Linux and OS X the size
> information is not returned by the directory iteration functions, so
> this function won't gain anything there.

I don't understand how you can build a full lstat() result without
really calling stat. I see that WIN32_FIND_DATA contains the size, but
here you call lstat(). If you know that it's not a symlink, you already
know the size, but you still have to call stat() to retrieve all fields
required to build a stat_result no?
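
To make the caching discussion concrete, here is a rough Python sketch
of how I read the "first use" behaviour the PEP describes for Linux /
OS X. It is only an illustration under my own assumptions:
``CachedEntry`` and ``scan()`` are hypothetical names, not the scandir
module's code, and for simplicity is_dir() always goes through lstat()
here.

```python
import os
import stat


class CachedEntry:
    """Hypothetical stand-in for DirEntry; not the PEP's implementation."""

    def __init__(self, dirpath, name):
        self.dirpath = dirpath
        self.name = name
        self._lstat = None  # filled in by the first lstat() call

    def lstat(self):
        # One real system call on first use, then the cached result is
        # returned forever, even if the file changes on disk.
        if self._lstat is None:
            self._lstat = os.lstat(os.path.join(self.dirpath, self.name))
        return self._lstat

    def is_dir(self):
        # Simplification: derived from the cached lstat() result.
        return stat.S_ISDIR(self.lstat().st_mode)


def scan(dirpath):
    """Yield one CachedEntry per name; os.listdir() skips '.' and '..'."""
    for name in os.listdir(dirpath):
        yield CachedEntry(dirpath, name)
```

With such an object, a second lstat() call on the same entry never
sees a size change; only a fresh os.lstat() on the path does, which is
the situation the "Notes on caching" section warns about.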
> Support
> =======
>
> The scandir module on GitHub has been forked and used quite a bit (see
> "Use in the wild" in this PEP),

Do you plan to continue to maintain your module for Python < 3.5, but
upgrade your module for the final PEP?

> Should scandir be in its own module?
> ------------------------------------
>
> Should the function be included in the standard library in a new
> module, ``scandir.scandir()``, or just as ``os.scandir()`` as
> discussed? The preference of this PEP's author (Ben Hoyt) would be
> ``os.scandir()``, as it's just a single function.

Yes, put it in the os module which is already bloated :-)

> Should there be a way to access the full path?
> ----------------------------------------------
>
> Should ``DirEntry``'s have a way to get the full path without using
> ``os.path.join(path, entry.name)``? This is a pretty common pattern,
> and it may be useful to add pathlib-like ``str(entry)`` functionality.
> This functionality has also been requested in `issue 13`_ on GitHub.
>
> .. _`issue 13`: https://github.com/benhoyt/scandir/issues/13

I think that it would be very convenient to store the directory name
in the DirEntry. It should be light, it's just a reference.

And provide a fullname() method which would just return
os.path.join(path, entry.name) without trying to resolve path to get an
absolute path.

> Should it expose Windows wildcard functionality?
> ------------------------------------------------
>
> Should ``scandir()`` have a way of exposing the wildcard functionality
> in the Windows ``FindFirstFile`` / ``FindNextFile`` functions? The
> scandir module on GitHub exposes this as a ``windows_wildcard``
> keyword argument, allowing Windows power users the option to pass a
> custom wildcard to ``FindFirstFile``, which may avoid the need to use
> ``fnmatch`` or similar on the resulting names. It is named the
> unwieldly ``windows_wildcard`` to remind you you're writing power-
> user, Windows-only code if you use it.
>
> This boils down to whether ``scandir`` should be about exposing all of
> the system's directory iteration features, or simply providing a fast,
> simple, cross-platform directory iteration API.

Would it be hard to implement the wildcard feature on UNIX to compare
the performance of scandir('*.jpg') with and without the wildcard built
into os.scandir?

I implemented it in C for the tracemalloc module (Filter object):
http://hg.python.org/features/tracemalloc

Get revision 69fd2d766005 and search for match_filename_joker() in
Modules/_tracemalloc.c. The function matches the filename backward
because in most cases, the last letter is enough to reject a filename
(ex: "*.jpg" => reject filenames not ending with "g").

The filename is normalized before matching the pattern: converted to
lowercase and / is replaced with \ on Windows.

It was decided to drop the Filter object to keep the tracemalloc module
as simple as possible. Charles-François was not convinced by the
speedup.

But the tracemalloc case is different because the OS didn't provide an
API for that.

Victor
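
PS: a minimal Python sketch of the backward-matching idea, in case it
helps the discussion. This is not the match_filename_joker() code from
_tracemalloc.c: ``match_suffix_joker`` is a hypothetical name, it only
handles a single leading "*", and it leaves out the normalization step
(lowercase, / replaced with \ on Windows).

```python
def match_suffix_joker(filename, pattern):
    """Match a pattern of the form '*<suffix>' by comparing from the end.

    Hypothetical helper for illustration; patterns without a leading
    '*' are compared for plain equality.
    """
    if not pattern.startswith("*"):
        return filename == pattern
    suffix = pattern[1:]
    if len(suffix) > len(filename):
        return False
    # Walk both strings from the last character backward: most
    # non-matching names are rejected after a single comparison.
    for i in range(1, len(suffix) + 1):
        if filename[-i] != suffix[-i]:
            return False
    return True
```

For example, match_suffix_joker("photo.png", "*.jpg") stops at the
first comparison ("g" on both sides matches, so it actually needs two
here), while "notes.txt" is rejected on the very last letter.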