[Python-ideas] Re: Addition to fnmatch.py

MRAB Sun, 05 Jun 2022 09:08:48 -0700

On 2022-06-05 16:12, David Mertz, Ph.D. wrote:

This is exactly the problem solved by set difference. E.g. `{a, b, c} -{a, c}`.
This operation is linear on the size of the removed set.

You don't have to use a set difference, which might matter if the orderwas important, but just make a set of the ones found:


    s = set(b)

And then the list comprehension:

    [x for x in a if x not in s]

On Sun, Jun 5, 2022, 11:01 AM John Carter <carter.n.j...@gmail.com<mailto:carter.n.j...@gmail.com>> wrote:


    I’d like to propose a simple addition to the 'fnmatch' module.
    Specificity a function that returns a list of names that match a
    pattern AND a list of those that don't.

    In a recent project I found that I wished to split  a list of files
    and move those that matched a pattern to one folder and those that
    didn't to another folder. I was using fnmatch.filter to select the
    first set of files and then a list comprehension to generate the
    second set.

    For a small number of files (~ 10) this was perfectly adequate.
    However as I needed to process many files (>>10000) the execution
    time was very significant. Profiling the code showed that the time
    was spent in generating the second set. I tried a number of
    solutions including designing a negative filter, walking the file
    system to find those files that had not been moved and using more
    and more convoluted ways to improve the second selection. Eventually
    I gave in and hacked a local copy of fnmatch.py as below:

    def split(names, pat):
         """Return the subset of the list NAMES that match PAT."""
         """Also returns those names not in NAMES"""
         result = []
         notresult = []
         pat = os.path.normcase(pat)
         pattern_match = _compile_pattern(pat)
         if os.path is posixpath:
             # normcase on posix is NOP. Optimize it away from the loop.
             for name in names:
                 if not pattern_match(name):
                     result.append(name)
                 else:
                     notresult.append(name)
         else:
             for name in names:
                 if not pattern_match(os.path.normcase(name)):
                     result.append(name)
                 else:
                     notresult.append(name)
         return result, notresult

    The change is the addition of else clauses to the if not pattermath
    statements. This solved the problem and benchmarking showed that it
    only took a very small additional time (20ms for a million strings)
    to generate both lists

    Number of tests cases 1000000
    Example data ['Ba1txmKkiC', 'KlJx.f_AGj', 'Umwbw._Wa9', '4YlgA5LVpI’]
    Search for '*A*'
    Test                            Time(sec)       Positive        Negative
    WCmatch.filter          1.953125           26211          0
    filter                         0.328125    14259          0
    split                  0.343750    14259      85741
    List Comp.           270.468751     14259      85741

    The list comprehension [x for x in a if x not in b]*, was nearly 900
    times slower.

    ‘fnmatch’ was an appropriate solution to this problem as typing
    ‘glob’ style search patterns was easier than having to enter regular
    expressions when prompted by my code.

    I would like to propose that split, even though it is very simple,
    be included in the 'fnmatch' module.

    John

    *a is the original and b is those that match.

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/FRQJT7AWCVQXYJPC4IDWA3LZGVLX765A/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Addition to fnmatch.py

Reply via email to