Hello,
I am writing a script to convert from one font set (used to write in
devanagari) to another. This consists of converting certain key sequences to
others, e.g. Of --> k. To do the conversion process, I've looked at some of the
single-pass multireplacement suggestions posted in the Python Cookbook which
build a large regex string with find patterns separated with "|", i.e. a large
string of alternations (and then they use a dictionary containing the values of
the keys that were put in the regex string).
In the process, I found something unexpected about regex patterns that use
alternation: the match is not greedy. The result depends on which pattern is
listed first.
###
>>> import re
>>> pat=re.compile('this|th')
>>> s='thistle'
>>> pat.match(s).group(0)
'this'
>>> pat=re.compile('th|this')
>>> pat.match(s).group(0)
'th'
###
I read the regex docs about the 'match' function and they say that the
'leftmost non-overlapping occurrences of pattern in string' will be returned. I
can understand why the pattern 'light|sli' (or 'sli|light') will match 'sli' in
text 'slight' (because 'sli' is more to the left than 'light'). But why isn't
the greedy result of 'this' obtained for either of the patterns given in the
example above (between the ###)? Is this because it is trying to return a
pattern that overlaps as little as possible with anything else?
Basically what this is meaning for me so far is that a single-pass solution to
replacing certain text sequences with others cannot always be done unless the
patterns to be found are non-overlapping themselves. In the case of 'light|sli'
even though 'light' appears first in the alternation, 'sli' will be found with
higher priority in sequences that had 'slight' in them since 'sli' is more to
the left.
Any comments on the non-greediness of the alternation pattern or on the
multi-replacement process in general are appreciated.
/c
_______________________________________________
Tutor maillist - [email protected]
http://mail.python.org/mailman/listinfo/tutor