On 2018-10-28 18:51, Karsten Hilbert wrote:
Dear list members,
I cannot figure out why my regular expression does not work as I expect it to:
#---------------------------
#!/usr/bin/python
from __future__ import print_function
import re as regex
rx_works = r'\$<[^<:]+?::.*?::\d*?>\$|\$<[^<:]+?::.*?::\d+-\d+>\$'
# it fails if switched around:
rx_fails = r'\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$'
line = 'junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk'
print ('')
print ('line:', line)
print ('expected: $<match_A::options A::4>$')
print ('expected: $<match_B::options B::4-5>$')
print ('')
placeholders_in_line = regex.findall(rx_works, line, regex.IGNORECASE)
print('found (works):')
for ph in placeholders_in_line:
    print (ph)
print ('')
placeholders_in_line = regex.findall(rx_fails, line, regex.IGNORECASE)
print('found (fails):')
for ph in placeholders_in_line:
    print (ph)
#---------------------------
I am sure I am simply not seeing the problem.
Here are some of the steps while matching the second regex. (View this
in a monospaced font.)
1:
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
2:
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
3:
The .*? matches as few characters as possible, initially none.
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
4:
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
At this point it can't match, so it backtracks.
5:
The .*? matches more characters, including the ":".
After more matching it's like the following.
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
6:
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
Again it can't match, so it backtracks.
7:
The .*? matches more characters, including the ":".
After more matching it's like the following.
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
8:
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
Success!
The first alternative has matched, but because the .*? was allowed to run
past the first ">$", the match spans both placeholders:
$<match_A::options A::4>$ junk $<match_B::options B::4-5>$
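
The result of the steps above can be checked directly, and one possible way
around it is sketched below. This is only a sketch: rx_merged is a
hypothetical rewrite, not something from the original script. It keeps a
single shared prefix and moves the alternation to the numeric tail, so both
tails are tried at each position before the .*? is allowed to grow. Note
that the .*? can still run past a ">$" if a placeholder is malformed.

```python
from __future__ import print_function
import re

line = 'junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk'

# The failing pattern from the original post (range alternative first).
rx_fails = r'\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$'

# Hypothetical rewrite (an assumption, not the poster's pattern):
# one shared prefix, with the alternation only on the numeric tail.
rx_merged = r'\$<[^<:]+?::.*?::(?:\d+-\d+|\d*)>\$'

found_fails = re.findall(rx_fails, line, re.IGNORECASE)
found_merged = re.findall(rx_merged, line, re.IGNORECASE)

print('found (fails): ', found_fails)
print('found (merged):', found_merged)
```

With rx_fails the whole span from the first "$<" to the last ">$" comes back
as one match; with rx_merged the two placeholders come back separately.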