My problem is as follows. I want to match URLs, and therefore I have a long group of valid domain names in my regex:
.... (?:com|org|net|biz|info|ac|cc|gs|ms|sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|com\.ag|ai|off\.ai|al|an|ao|aq|com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...
However, for a URL like kuku.com.to it matches only the kuku.com part, while I want it to match the whole kuku.com.to. Note that both "com" and "com.to" are present in the group above.
1. How do I give precedence to "com.to" over "com" in the above group? Maybe I can sort it by lexicographic order and then by length, or divide it into a set of sub-groups by length?
According to the docs for re:
"As the target string is scanned, REs separated by "|" are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the "|" operator is never greedy."
So putting "com.to" before "com" does what you want.
>>> import re
>>> re.search(r'com|com\.to', 'kuku.com.to').group()
'com'
>>> re.search(r'com\.to|com', 'kuku.com.to').group()
'com.to'
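With a long group, ordering the alternatives by hand is error-prone, so the "sort by length" idea from the question can be automated. What follows is only a minimal sketch, assuming the suffixes are kept in a plain Python list. The shortened `suffixes` list, the `build_suffix_group` helper, and the `[\w-]+\.` host prefix are illustrative assumptions and are not taken from the original regex.

```python
import re

# Hypothetical, shortened suffix list; the real group is much longer.
suffixes = ["com", "org", "net", "to", "com.to", "at", "co.at"]

def build_suffix_group(names):
    """Build a non-capturing group with the longest alternatives first."""
    # Longest first, so "com.to" is tried before "com".
    ordered = sorted(names, key=len, reverse=True)
    # re.escape turns "com.to" into "com\.to" so the dot is literal.
    return "(?:" + "|".join(re.escape(name) for name in ordered) + ")"

# Illustrative host prefix: one label followed by a dot, then the suffix group.
pattern = re.compile(r"[\w-]+\." + build_suffix_group(suffixes))

match = pattern.search("kuku.com.to")
print(match.group())
```

Sorting by length alone is enough here: any alternative that is a prefix of another is necessarily shorter than it, so it always ends up later in the group.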
Kent