[Python-ideas] Re: Addition to fnmatch.py

2022-06-06 Thread John Carter
Thanks for all the comments.
I did consider using sets, but order was important, and when partitioning a
list of general strings rather than file names there could be identical strings
that all matter (strings at different positions could mean different things,
especially when generating test and training sets).
I also tried doing it twice (generating two lists) and it took approximately
twice as long, though generating two iterables could defer the partitioning
until it was needed, i.e. lazy evaluation.
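
As a rough sketch of the two points above (a set loses duplicates and order,
while two generators defer the work), something like the following could be
used. The name lazy_partition is hypothetical and not part of fnmatch; it uses
fnmatch.fnmatchcase so the result does not depend on the platform, and it tests
every name twice, which is consistent with the doubled run time mentioned above.

```python
import fnmatch
import itertools


def lazy_partition(names, pat):
    """Return two lazy iterators: names matching pat, and names that do not.

    Hypothetical helper for illustration only. Nothing is matched until the
    iterators are consumed, and each name is tested once per iterator.
    """
    first, second = itertools.tee(names)
    matching = (n for n in first if fnmatch.fnmatchcase(n, pat))
    rest = (n for n in second if not fnmatch.fnmatchcase(n, pat))
    return matching, rest


names = ["xAy", "abc", "xAy", "qAq", "abc"]  # duplicates are deliberate
matching, rest = lazy_partition(names, "*A*")
matched_list = list(matching)  # order and duplicates are kept
rest_list = list(rest)
unique_names = set(names)      # a set drops the repeated strings
```

Note that itertools.tee buffers items internally when one iterator runs ahead of
the other, so memory use can approach that of building the lists directly.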

I’ve added some of the solutions to my timing test. They are identified by the
capital-letter labels at the end of the results.

Reference list lengths 144004  855996
Number of test cases 100
Example data ['xDy7AWbXau', 'TXlzsZV3Ra', 'YJh8uD9ovK', 'aRJ2U7nWs8', 
'geu.vHlogu']
FNmatch    0.671875 268756      0 2687560
WCmatch    1.562500 264939      0 2649390
IWCmatch   1.281250      0      0 01
Easy       0.093750 144004 855996 1001
Re         0.328125 855996 144004 1000
Positive   0.281250 144004      0 1440041
Negative   0.328125      0 855996 8559961
UpperCase  0.375000 268756      0 2687560
Null       0.00          0      0 01
Both       0.328125 144004 855996 1001
Partition  1.171875 855996 144004 1000
IBoth      0.328125 855996 144004 1000
MRAB       0.437500 144004 855996 1001
METZ       0.50     144004 855996 1001
CA         0.343750 144004 855996 1001
SJB        0.328125 144004 855996 1001

I checked for order and, interestingly, all the set solutions preserve it. I
think this is because there are no duplicates in the test data. Order is only
checked correctly if there are the same number of elements in the test and
reference lists.

I also tried more_itertools.partition. It was nearly 4 times slower.
John
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/6WFSIXUF67KU5QXOODCBG7RDDG2KSYSK/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Addition to fnmatch.py

2022-06-05 Thread John Carter
I’d like to propose a simple addition to the 'fnmatch' module. Specifically, a
function that returns a list of names that match a pattern AND a list of those
that don't.

In a recent project I needed to split a list of files, moving those that
matched a pattern to one folder and those that didn't to another folder. I was
using fnmatch.filter to select the first set of files and then a list
comprehension to generate the second set.

For a small number of files (~ 10) this was perfectly adequate. However, as I
needed to process many files (>>1), the execution time was very significant.
Profiling the code showed that the time was spent in generating the second set. 
I tried a number of solutions including designing a negative filter, walking 
the file system to find those files that had not been moved and using more and 
more convoluted ways to improve the second selection. Eventually I gave in and 
hacked a local copy of fnmatch.py as below:

def split(names, pat):
    """Return two lists: the names in NAMES that match PAT, and those that don't."""
    # os, posixpath and _compile_pattern are module-level names in fnmatch.py.
    result = []
    notresult = []
    pat = os.path.normcase(pat)
    pattern_match = _compile_pattern(pat)
    if os.path is posixpath:
        # normcase on posix is NOP. Optimize it away from the loop.
        for name in names:
            if pattern_match(name):
                result.append(name)
            else:
                notresult.append(name)
    else:
        for name in names:
            if pattern_match(os.path.normcase(name)):
                result.append(name)
            else:
                notresult.append(name)
    return result, notresult

The change is the addition of else clauses to the pattern_match tests. This
solved the problem, and benchmarking showed that it only took a very small
additional time (20ms for a million strings) to generate both lists.
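
For readers who would rather not edit fnmatch.py, the same single-pass idea can
be sketched using only public functions. This is an illustration under
assumptions, not the code that was benchmarked: it compiles the pattern with
fnmatch.translate and re.compile instead of the private _compile_pattern, and it
always calls os.path.normcase rather than optimising it away on POSIX.

```python
import fnmatch
import os
import re


def split(names, pat):
    """Return (matching, non_matching) lists for glob pattern pat, in one pass."""
    pat = os.path.normcase(pat)
    # fnmatch.translate converts the glob pattern into regular expression text.
    match = re.compile(fnmatch.translate(pat)).match
    matching, non_matching = [], []
    for name in names:
        if match(os.path.normcase(name)):
            matching.append(name)
        else:
            non_matching.append(name)
    return matching, non_matching


# Made-up file names; each one is tested against the pattern exactly once.
txt_files, other_files = split(["a.txt", "b.py", "c.txt", "b.py"], "*.txt")
```

Both lists keep the input order and any duplicates, which is the property the
set-based alternatives cannot guarantee in general.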

Number of test cases 100
Example data ['Ba1txmKkiC', 'KlJx.f_AGj', 'Umwbw._Wa9', '4YlgA5LVpI']
Search for '*A*'
Test            Time(sec)   Positive  Negative
WCmatch.filter  1.953125    26211     0
filter          0.328125    14259     0
split           0.343750    14259     85741
List Comp.      270.468751  14259     85741

The list comprehension [x for x in a if x not in b]* was nearly 900 times
slower.
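
A likely reason for the slowdown is that "x not in b" scans the list b from the
start for every x, so the total work grows with len(a) * len(b). The sketch
below, which is not the benchmark code, shows the same comprehension next to a
variant that builds a set of the matches once; fnmatch.fnmatchcase is used here
only to make the example platform independent.

```python
import fnmatch

a = ["Ba1txmKkiC", "KlJx.f_AGj", "Umwbw._Wa9", "4YlgA5LVpI"]  # the original
b = [x for x in a if fnmatch.fnmatchcase(x, "*A*")]           # those that match

# Quadratic: every membership test walks the list b.
slow = [x for x in a if x not in b]

# Roughly linear: the set is built once and each lookup is a hash probe.
b_set = set(b)
fast = [x for x in a if x not in b_set]
```

The set variant still needs a second pass over a, whereas the proposed split
collects both lists while matching each name once.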

‘fnmatch’ was an appropriate solution to this problem as typing ‘glob’ style 
search patterns was easier than having to enter regular expressions when 
prompted by my code.
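
To illustrate the difference in typing effort, fnmatch.translate shows the
regular expression that a glob pattern stands for. The exact text it returns
varies between Python versions, so this small sketch only checks that the
compiled result behaves like the glob.

```python
import fnmatch
import re

glob_pattern = "*A*"                           # what the user types
regex_text = fnmatch.translate(glob_pattern)   # the equivalent regex text
regex = re.compile(regex_text)

has_capital_a = regex.match("KlJx.f_AGj") is not None
no_capital_a = regex.match("Umwbw._Wa9") is None
```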

I would like to propose that split, even though it is very simple, be included 
in the 'fnmatch' module.

John

* a is the original list and b is the list of those that match.
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/EZEGFGJOHVHATKDBJ2SWZML62JWT2VE2/