On Tue, 10 Nov 2020 22:08:54 +1100 Cameron Simpson <c...@cskk.id.au> wrote:
> On 10Nov2020 10:07, Manfred Lotz <ml_n...@posteo.de> wrote:
> >On Tue, 10 Nov 2020 18:37:54 +1100
> >Cameron Simpson <c...@cskk.id.au> wrote:
> >> Use os.walk for trees. scandir does a single directory.
> >
> >Perhaps better. I like to use os.scandir this way
> >
> >def scantree(path: str) -> Iterator[os.DirEntry[str]]:
> >    """Recursively yield DirEntry objects (no directories)
> >       for a given directory.
> >    """
> >    for entry in os.scandir(path):
> >        if entry.is_dir(follow_symlinks=False):
> >            yield from scantree(entry.path)
> >
> >        yield entry
> >
> >Worked fine so far. I think I coded it this way because I wanted the
> >full path of the file the easy way.
>
> Yes, that's fine and easy to read. Note that this is effectively a
> recursive call though, with the associated costs:
>
> - a scandir (or listdir, whatever) has the directory open, and holds
>   it open while you scan the subdirectories; by contrast os.walk only
>   opens one directory at a time
>
> - likewise, if you're maintaining data during a scan, that is held
>   while you process the subdirectories; with an os.walk you tend to do
>   that and release the memory before the next iteration of the main
>   loop (obviously, depending exactly what you're doing)
>
> However, directory trees tend not to be particularly deep, and the
> depth governs the excess state you're keeping around.
>

Very interesting information. Thanks a lot for this. I will take a
closer look at os.walk.

> >> > - check if a file is a text file
> >>
> >> This requires reading the entire file. You want to check that it
> >> consists entirely of lines of text. In your expected text encoding
> >> - these days UTF-8 is the common default, but getting this correct
> >> is essential if you want to recognise text. So as a first cut,
> >> totally untested:
> >>
> >> ...
> >
> >The reason I want to check if a file is a text file is that I don't
> >want to try replacing patterns in binary files (executable binaries,
> >archives, audio files aso).
>
> Exactly, which is why you should not trust, say, the "file" utility.
> It scans only the opening part of the file. Great for rejecting
> files, but not reliable for being _sure_ about the whole file being
> text when it doesn't reject.
>
> >Of course, to make this nicely work some heuristic check would be the
> >right thing (this is what file command does). I am aware that an
> >heuristic check is not 100% but I think it is good enough.
>
> Shrug. That is a risk you must evaluate yourself. I'm quite paranoid
> about data loss, myself. If you've got backups or are working on
> copies the risks are mitigated.
>
> You could perhaps take a more targeted approach: do your target files
> have distinctive file extensions (for example, all the .py files in a
> source tree).
>

There are some distinctive file extensions. The reason I am satisfied
with heuristics is that the string to change is pretty long, so there
is no real danger if I try to change it in a binary file, because that
string is not to be found in binary files.

The idea to skip binary files was simply to save time.

-- 
Manfred
-- 
https://mail.python.org/mailman/listinfo/python-list
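
PS: For reference, a minimal and only lightly tested sketch of the
combination discussed above: an os.walk traversal that yields full file
paths and skips files that look binary. The names looks_binary and
iter_text_files, the 8 KiB sample size and the NUL-byte test are
assumptions made for illustration; none of them comes from this thread,
and the NUL test is a common rule of thumb, not what the "file" utility
does. Like "file", it only inspects the start of each file, so it is
good for skipping obvious binaries to save time, not for being sure a
file is text. It will also classify UTF-16 text as binary.

```python
import os
from typing import Iterator

# Assumed sample size: only this many leading bytes are inspected.
SAMPLE_SIZE = 8192


def looks_binary(path: str, sample_size: int = SAMPLE_SIZE) -> bool:
    """Heuristic only: report True if the first bytes contain a NUL."""
    with open(path, "rb") as f:
        return b"\0" in f.read(sample_size)


def iter_text_files(top: str) -> Iterator[str]:
    """Yield full paths of regular files under top that do not look binary."""
    # os.walk handles one directory at a time and does not descend into
    # symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                if not looks_binary(path):
                    yield path
            except OSError:
                # Unreadable file: skip it rather than abort the scan.
                continue
```

Usage would be a plain loop, "for path in iter_text_files(top): ...",
with the replacement done inside the loop. Working on a copy of the
tree, as suggested above, remains the safer option.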