This is exactly the problem solved by set difference. E.g. `{a, b, c} - {a,
c}`.

This operation is linear on the size of the removed set.

On Sun, Jun 5, 2022, 11:01 AM John Carter <carter.n.j...@gmail.com> wrote:

> I’d like to propose a simple addition to the 'fnmatch' module. Specificity
> a function that returns a list of names that match a pattern AND a list of
> those that don't.
>
> In a recent project I found that I wished to split  a list of files and
> move those that matched a pattern to one folder and those that didn't to
> another folder. I was using fnmatch.filter to select the first set of files
> and then a list comprehension to generate the second set.
>
> For a small number of files (~ 10) this was perfectly adequate. However as
> I needed to process many files (>>10000) the execution time was very
> significant. Profiling the code showed that the time was spent in
> generating the second set. I tried a number of solutions including
> designing a negative filter, walking the file system to find those files
> that had not been moved and using more and more convoluted ways to improve
> the second selection. Eventually I gave in and hacked a local copy of
> fnmatch.py as below:
>
> def split(names, pat):
>     """Return the subset of the list NAMES that match PAT."""
>     """Also returns those names not in NAMES"""
>     result = []
>     notresult = []
>     pat = os.path.normcase(pat)
>     pattern_match = _compile_pattern(pat)
>     if os.path is posixpath:
>         # normcase on posix is NOP. Optimize it away from the loop.
>         for name in names:
>             if not pattern_match(name):
>                 result.append(name)
>             else:
>                 notresult.append(name)
>     else:
>         for name in names:
>             if not pattern_match(os.path.normcase(name)):
>                 result.append(name)
>             else:
>                 notresult.append(name)
>     return result, notresult
>
> The change is the addition of else clauses to the if not pattermath
> statements. This solved the problem and benchmarking showed that it only
> took a very small additional time (20ms for a million strings) to generate
> both lists
>
> Number of tests cases 1000000
> Example data ['Ba1txmKkiC', 'KlJx.f_AGj', 'Umwbw._Wa9', '4YlgA5LVpI’]
> Search for '*A*'
> Test                            Time(sec)       Positive        Negative
> WCmatch.filter          1.953125           26211          0
> filter                         0.328125    14259          0
> split                  0.343750    14259      85741
> List Comp.           270.468751     14259      85741
>
> The list comprehension [x for x in a if x not in b]*, was nearly 900 times
> slower.
>
> ‘fnmatch’ was an appropriate solution to this problem as typing ‘glob’
> style search patterns was easier than having to enter regular expressions
> when prompted by my code.
>
> I would like to propose that split, even though it is very simple, be
> included in the 'fnmatch' module.
>
> John
>
> *a is the original and b is those that match.
> _______________________________________________
> Python-ideas mailing list -- python-ideas@python.org
> To unsubscribe send an email to python-ideas-le...@python.org
> https://mail.python.org/mailman3/lists/python-ideas.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-ideas@python.org/message/EZEGFGJOHVHATKDBJ2SWZML62JWT2VE2/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/72KFWQJHIFUJJRH32WIGEZ64BRN7UK6E/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to