This is exactly the problem solved by set difference. E.g. `{a, b, c} - {a, c}`.
This operation is linear on the size of the removed set. On Sun, Jun 5, 2022, 11:01 AM John Carter <carter.n.j...@gmail.com> wrote: > I’d like to propose a simple addition to the 'fnmatch' module. Specificity > a function that returns a list of names that match a pattern AND a list of > those that don't. > > In a recent project I found that I wished to split a list of files and > move those that matched a pattern to one folder and those that didn't to > another folder. I was using fnmatch.filter to select the first set of files > and then a list comprehension to generate the second set. > > For a small number of files (~ 10) this was perfectly adequate. However as > I needed to process many files (>>10000) the execution time was very > significant. Profiling the code showed that the time was spent in > generating the second set. I tried a number of solutions including > designing a negative filter, walking the file system to find those files > that had not been moved and using more and more convoluted ways to improve > the second selection. Eventually I gave in and hacked a local copy of > fnmatch.py as below: > > def split(names, pat): > """Return the subset of the list NAMES that match PAT.""" > """Also returns those names not in NAMES""" > result = [] > notresult = [] > pat = os.path.normcase(pat) > pattern_match = _compile_pattern(pat) > if os.path is posixpath: > # normcase on posix is NOP. Optimize it away from the loop. > for name in names: > if not pattern_match(name): > result.append(name) > else: > notresult.append(name) > else: > for name in names: > if not pattern_match(os.path.normcase(name)): > result.append(name) > else: > notresult.append(name) > return result, notresult > > The change is the addition of else clauses to the if not pattermath > statements. This solved the problem and benchmarking showed that it only > took a very small additional time (20ms for a million strings) to generate > both lists > > Number of tests cases 1000000 > Example data ['Ba1txmKkiC', 'KlJx.f_AGj', 'Umwbw._Wa9', '4YlgA5LVpI’] > Search for '*A*' > Test Time(sec) Positive Negative > WCmatch.filter 1.953125 26211 0 > filter 0.328125 14259 0 > split 0.343750 14259 85741 > List Comp. 270.468751 14259 85741 > > The list comprehension [x for x in a if x not in b]*, was nearly 900 times > slower. > > ‘fnmatch’ was an appropriate solution to this problem as typing ‘glob’ > style search patterns was easier than having to enter regular expressions > when prompted by my code. > > I would like to propose that split, even though it is very simple, be > included in the 'fnmatch' module. > > John > > *a is the original and b is those that match. > _______________________________________________ > Python-ideas mailing list -- python-ideas@python.org > To unsubscribe send an email to python-ideas-le...@python.org > https://mail.python.org/mailman3/lists/python-ideas.python.org/ > Message archived at > https://mail.python.org/archives/list/python-ideas@python.org/message/EZEGFGJOHVHATKDBJ2SWZML62JWT2VE2/ > Code of Conduct: http://python.org/psf/codeofconduct/ >
_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-le...@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/72KFWQJHIFUJJRH32WIGEZ64BRN7UK6E/ Code of Conduct: http://python.org/psf/codeofconduct/