On Feb 26, 2:01 pm, Kirk Sluder <[EMAIL PROTECTED]> wrote:
> In article <[EMAIL PROTECTED]>,
> Christian Sonne <[EMAIL PROTECTED]> wrote:
>
> > Thanks to all of you for your replies - they have been most helpful, and
> > my program is now running at a reasonable pace...
>
> > I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
> > turns out to misbehave in further testing, I'll know where to turn :-P
>
> Anything with variable-length wildcard matching (*+?) is going to
> drag your performance down. There was an earlier thread on this very
> topic. Another stupid question is how are you planning on handling
> ISBNs formatted with hyphens for readability?
According to the OP's first message, 2nd paragraph:
"""
(it should be noted that I've removed all '-'s in the string, because
they have a tendency to be mixed into ISBN's)
"""

Given a low density of ISBNs in the text, it may well be better to
avoid the preliminary pass to rip out the '-'s, and instead:

1. use an RE like r"\b\d[-\d]{8,11}[\dX]\b" (allows up to 3 '-'s
   inside the number)
2. post-process the matches: strip out any '-'s, check for remaining
   length == 10.

Another thought for the OP: Consider (irrespective of how you arrive
at a candidate ISBN) validating the ISBN check-digit.

Cheers,
John
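For anyone following along, steps 1 and 2 plus the check-digit idea might look roughly like the sketch below. The names find_isbn10s and isbn10_check_ok are made up for illustration, and the check assumes the usual ISBN-10 rule: weight the characters 10 down to 1, treat 'X' as 10 in the last position only, and require the sum to be divisible by 11. Untuned, so treat it as a starting point rather than a drop-in.

```python
import re

# RE from the post: one digit, then 8 to 11 digits or hyphens, then a
# digit or 'X'. That leaves room for 10 ISBN characters plus up to 3 '-'s.
CANDIDATE_RE = re.compile(r"\b\d[-\d]{8,11}[\dX]\b")


def isbn10_check_ok(isbn):
    """Return True if a hyphen-free, 10-character string passes the
    ISBN-10 check-digit test (assumed rule: weighted sum % 11 == 0)."""
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn):
        if char == "X" and position == 9:
            value = 10  # 'X' is only legal as the final check character
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value  # weights run 10, 9, ..., 1
    return total % 11 == 0


def find_isbn10s(text):
    """Yield hyphen-free ISBN-10 strings found in text (hypothetical helper)."""
    for match in CANDIDATE_RE.finditer(text):
        # Step 2 from the post: strip the '-'s, then check the length.
        candidate = match.group().replace("-", "")
        if len(candidate) == 10 and isbn10_check_ok(candidate):
            yield candidate
```

Something like list(find_isbn10s(text)) then collects the survivors. Candidates that match the RE but end up with the wrong number of digits, or fail the check digit, are silently dropped; whether you would rather log those is a design choice.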