On 10Nov2020 10:07, Manfred Lotz <ml_n...@posteo.de> wrote:
>On Tue, 10 Nov 2020 18:37:54 +1100
>Cameron Simpson <c...@cskk.id.au> wrote:
>> Use os.walk for trees. scandir does a single directory.
>
>Perhaps better. I like to use os.scandir this way
>
>def scantree(path: str) -> Iterator[os.DirEntry[str]]:
>    """Recursively yield DirEntry objects (no directories)
>      for a given directory.
>    """
>    for entry in os.scandir(path):
>        if entry.is_dir(follow_symlinks=False):
>            yield from scantree(entry.path)
>
>        yield entry
>
>Worked fine so far. I think I coded it this way because I wanted the
>full path of the file the easy way.

Yes, that's fine and easy to read. Note that this is effectively a
recursive call though, with the associated costs:

- a scandir (or listdir, whatever) has the directory open, and holds it
  open while you scan the subdirectories; by contrast os.walk only opens
  one directory at a time

- likewise, if you're maintaining data during a scan, that is held while
  you process the subdirectories; with an os.walk you tend to do that
  and release the memory before the next iteration of the main loop
  (obviously, depending on exactly what you're doing)

However, directory trees tend not to be particularly deep, and the depth
governs the excess state you're keeping around.

>> > - check if a file is a text file
>>
>> This requires reading the entire file. You want to check that it
>> consists entirely of lines of text. In your expected text encoding -
>> these days UTF-8 is the common default, but getting this correct is
>> essential if you want to recognise text. So as a first cut, totally
>> untested:
>>
>> ...
>
>The reason I want to check if a file is a text file is that I don't
>want to try replacing patterns in binary files (executable binaries,
>archives, audio files aso).

Exactly, which is why you should not trust, say, the "file" utility. It
scans only the opening part of the file. Great for rejecting files, but
not reliable for being _sure_ about the whole file being text when it
doesn't reject.

>Of course, to make this nicely work some heuristic check would be the
>right thing (this is what file command does). I am aware that an
>heuristic check is not 100% but I think it is good enough.

Shrug. That is a risk you must evaluate yourself. I'm quite paranoid
about data loss, myself. If you've got backups or are working on copies
the risks are mitigated.

You could perhaps take a more targeted approach: do your target files
have distinctive file extensions (for example, all the .py files in a
source tree)?

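To make the two suggestions above concrete, here is a rough sketch of an
os.walk based scan restricted to chosen extensions, plus a whole-file
text check. The names walk_files and is_utf8_text, the default ".py"
suffix and the "no NUL characters" rule are my own assumptions for
illustration, not anything from the standard library, and this is not
necessarily the code elided in the quote above. Adjust the encoding to
whatever your files really use.

```python
import os
from typing import Iterator, Tuple


def walk_files(top: str, suffixes: Tuple[str, ...] = (".py",)) -> Iterator[str]:
    """Yield full paths of regular files under `top` ending in one of `suffixes`.

    Hypothetical helper: the name and the default suffixes are assumptions.
    """
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets, FIFOs and the like.
            if os.path.isfile(path) and not os.path.islink(path):
                yield path


def is_utf8_text(path: str, encoding: str = "utf-8") -> bool:
    """Return True if the whole file decodes as `encoding` and has no NULs.

    Hypothetical helper. The NUL test is an extra heuristic (an assumption),
    needed because a NUL byte is itself valid UTF-8.
    """
    try:
        # newline="" leaves line endings untranslated; we only inspect.
        with open(path, encoding=encoding, newline="") as f:
            for line in f:
                if "\0" in line:
                    return False
    except UnicodeDecodeError:
        # Some part of the file is not valid in the expected encoding.
        return False
    return True
```

You would then only touch files where both tests agree, along the lines
of: for path in walk_files(top): if is_utf8_text(path): ... That reads
every candidate file in full, which is the price of being sure.
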
Cheers,
Cameron Simpson <c...@cskk.id.au>
-- 
https://mail.python.org/mailman/listinfo/python-list