alexk wrote:
My problem is as follows. I want to match URLs, so I have a long group
of valid domain names in my regex:

.... (?:com|org|net|biz|info|ac|cc|gs|ms|
                         sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
                         com\.ag|ai|off\.ai|al|an|ao|aq|
                         com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...

However, for a url like kuku.com.to it matches the kuku.com part,
while I want it to match the whole kuku.com.to. Notice that both "com"
and "com.to" are present in the group above.

1. How do I give "com.to" precedence over "com" in the above group?
Maybe I can somehow sort it in lexicographic order and then by length,
or divide it into a set of sub-groups by length?

According to the docs for re:
"As the target string is scanned, REs separated by "|" are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the "|" operator is never greedy."


So putting "com.to" before "com" does what you want.

>>> import re
>>> re.search(r'com|com\.to', 'kuku.com.to').group()
'com'
>>> re.search(r'com\.to|com', 'kuku.com.to').group()
'com.to'
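
If the group is too long to reorder by hand, one option is to generate it from a list sorted longest first, which is roughly the "sort by length" idea from the question. This is only a sketch: the suffix list is a short made-up sample, and the names `suffixes`, `group` and `pattern`, as well as the `[\w-]+\.` host part, are assumptions for illustration rather than pieces of the original regex.

```python
import re

# Hypothetical, abbreviated sample; the real group in the question is much longer.
suffixes = ["com", "org", "net", "to", "com.to", "ar", "com.ar"]

# Longest alternatives first, so "com.to" is tried before "com".
ordered = sorted(suffixes, key=len, reverse=True)

# re.escape turns "com.to" into "com\.to" so the dot is matched literally.
group = "(?:" + "|".join(re.escape(s) for s in ordered) + ")"

# Assumed host part: one label of word characters or hyphens before the suffix.
pattern = re.compile(r"[\w-]+\." + group + r"\b")

print(pattern.search("kuku.com.to").group())  # kuku.com.to
print(pattern.search("kuku.com").group())     # kuku.com
```

Sorting by length alone is enough here, because a shorter alternative can only shadow a longer one that starts with it, and the longer one now always comes first.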

Kent
