On Feb 25, 7:21 pm, Christian Sonne <[EMAIL PROTECTED]> wrote:
> Long story short, I'm trying to find all ISBN-10 numbers in a multiline
> string (approximately 10 pages of a normal book), and as far as I can
> tell, the *correct* thing to match would be this:
> ".*\D*(\d{10}|\d{9}X)\D*.*"
All of those *s are making it work too hard. Starting with your

    r".*\D*(\d{10}|\d{9}X)\D*.*"

[you do have the r"..." and not just "...", don't you?]

Step 1: Lose the .* off each end -- it is meaningless in the context of a
search() or findall(), and would slow the re engine down if it doesn't
optimise it away:

    r"\D*(\d{10}|\d{9}X)\D*"

Step 2: I presume that the \D* at each (remaining) end is there to ensure
that you don't pick up a number with 11 or more digits. You only need to
test that your presumed ISBN is not preceded/followed by ONE suspect
character. Is ABC1234567890DEF OK? I think not; I'd use \b instead of \D:

    r"\b(\d{10}|\d{9}X)\b"

Step 3: Now that we have only \b (which matches 0 characters) at each end
of the ISBN, we no longer need to capture, so the group can be made
non-capturing. (Note: simply dropping the parentheses would change the
meaning, because | has the lowest precedence -- r"\b\d{10}|\d{9}X\b" means
(\b\d{10}) or (\d{9}X\b), with each \b applying to only one branch.)

    r"\b(?:\d{10}|\d{9}X)\b"

Step 4: In the case of 123456789X, the engine fails on the X and then
scans the 123456789 again -- a tiny waste compared to all the * stuff, but
still worth fixing:

    r"\b\d{9}[0-9X]\b"

Give that a whirl and let us know how correct and how fast it is.

> (it should be noted that I've removed all '-'s in the string, because
> they have a tendency to be mixed into ISBNs)
>
> however, on my 3200+ amd64, running the following:
>
> reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")

You should really get into the habit of using raw strings with re.

> isbn10s = reISBN10.findall(contents)
>
> (where contents is the string)
>
> this takes about 14 minutes - and there are only one or two matches...

How many actual matches, and how many expected matches?

A note on "and there are only one or two matches": if your data consisted
only of valid ISBNs separated by a single space or comma, it would run
much faster. It is all the quadratic/exponential mucking about with the
in-between bits that slows it down. To demonstrate this, try timing dummy
data like "1234567890 " * 1000000 and "123456789X " * 1000000 with your
various regexes, and with the step 1, step 2, etc. regexes above.
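[A quick sketch of the points above -- the sample strings are invented for
illustration, and the step 4 pattern is the one derived in the post:]

```python
import re
import time

# The original pattern, and the step 4 refinement from the post.
original = re.compile(r".*\D*(\d{10}|\d{9}X)\D*.*")
refined = re.compile(r"\b\d{9}[0-9X]\b")

# \D accepts letters, so the original pattern happily pulls ten digits
# out of ABC1234567890DEF; \b rejects it, because letter-then-digit is
# not a word boundary.
glued = "ABC1234567890DEF"
print(original.findall(glued))  # -> ['1234567890']  (false positive)
print(refined.findall(glued))   # -> []

# Dummy data as suggested above (scaled down).  Note a second defect of
# the leading .*: on newline-free data it swallows the whole string, so
# findall() reports at most one match, while the refined pattern finds
# all of them.  Timings will vary by machine.
dummy = "1234567890 " * 10000
for name, pat in (("original", original), ("refined", refined)):
    t0 = time.perf_counter()
    n = len(pat.findall(dummy))
    print(f"{name}: {n} matches in {time.perf_counter() - t0:.3f}s")
```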
> if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
> losing results, but it runs in about 0.3 seconds
>
> So what's the deal? - why would it take so long to run the correct one?

Because you have .*\D* -- in other words, 0 or more occurrences of almost
anything, followed by 0 or more occurrences of almost anything. Even
assuming it ignores the .*, the engine will find the longest possible
sequence of non-digits, then try to match the ISBN stuff. If that fails,
it will shorten the sequence of non-digits and try the ISBN stuff again,
etc. etc., until it either matches the ISBN stuff or that sequence of
non-digits is down to zero length. And it will do all of that at each
character position in the file contents.

Why is it faster when you change \D to [ ]? Presumably because in your
data, sequences of non-digits are longer than sequences of spaces -- IOW,
there is less stuff to back-track over.

> - especially when a slight modification makes it run as fast as I'd
> expect from the beginning...
>
> I'm sorry I cannot supply test data, in my case, it comes from
> copyrighted material - however if it proves needed, I can probably
> construct dummy data to illustrate the problem

What you call "dummy data", I'd call "test data". You should make a
sample of it and check that your regex is (a) finding all the ISBNs that
it should and (b) not reporting incorrect matches, *BEFORE* being
concerned with speed.

> Any and all guidance would be greatly appreciated,

For some light reading :-) borrow a copy of Friedl's book (mentioned in
the Python re docs) and read the parts on how backtracking regex engines
work.

HTH,
John
--
http://mail.python.org/mailman/listinfo/python-list
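[The "test correctness before speed" advice above can be sketched like
this -- the dummy data and the ISBN-shaped tokens in it are made up for
illustration; only the *form* of an ISBN-10 is checked, not its check
digit:]

```python
import re

isbn10 = re.compile(r"\b\d{9}[0-9X]\b")

# Made-up dummy data: two ISBN-shaped tokens that should match, plus
# decoys that should not (11 digits, X in the wrong place, digits glued
# to letters).
dummy = (
    "See 0596002815 (pbk) and 013853555X for details. "
    "Order ref 12345678901 is an 11-digit number, not an ISBN. "
    "X123456789 has the X up front, and ABC1234567890DEF is glued "
    "to letters on both sides."
)

expected = ["0596002815", "013853555X"]
found = isbn10.findall(dummy)

# (a) every ISBN-shaped token that should be found is found;
# (b) no incorrect matches are reported.
assert found == expected, found
print("OK:", found)
```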