Re: RegExp performance?

John Machin Sun, 25 Feb 2007 23:46:09 -0800

On Feb 26, 2:01 pm, Kirk Sluder <[EMAIL PROTECTED]> wrote:
> In article <[EMAIL PROTECTED]>,
>  Christian Sonne <[EMAIL PROTECTED]> wrote:
>
> > Thanks to all of you for your replies - they have been most helpful, and
> > my program is now running at a reasonable pace...
>
> > I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
> > turns out to misbehave in further testing, I'll know where to turn :-P
>
> Anything with variable-length wildcard matching (*+?) is going to
> drag your performance down. There was an earlier thread on this very
> topic.  Another stupid question is how are you planning on handling
> ISBNs formatted with hyphens for readability?


According to the OP's first message, 2nd paragraph:
"""
(it should be noted that I've removed all '-'s in the string, because
they have a tendency to be mixed into ISBN's)
"""

Given a low density of ISBNs in the text, it may well be better to
avoid the preliminary pass to rip out the '-'s, and instead:

1. use an RE like r"\b\d[-\d]{8,11}[\dX]\b" (the middle part covers
the 8 inner digits plus up to 3 '-'s inside the number; it can also
match longer digit runs, hence step 2)

2. post-process the matches: strip out any '-'s, check for remaining
length == 10.
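
A rough sketch of those two steps, in case it helps. The names
CANDIDATE_RE and find_isbn10_candidates are made up for illustration;
only the pattern itself comes from step 1, and this has not been tuned
or timed against the OP's data:

```python
import re

# Hyphen-tolerant candidate pattern from step 1 (hypothetical name).
CANDIDATE_RE = re.compile(r"\b\d[-\d]{8,11}[\dX]\b")

def find_isbn10_candidates(text):
    """Return hyphen-free, 10-character ISBN candidates found in text."""
    results = []
    for match in CANDIDATE_RE.finditer(text):
        # Step 2: strip out any '-'s, then keep only 10-character results.
        candidate = match.group().replace("-", "")
        if len(candidate) == 10:
            results.append(candidate)
    return results

print(find_isbn10_candidates("See 0-306-40615-2 and 0306406152, not 12345."))
# prints ['0306406152', '0306406152']
```

The length check matters because the pattern will also match runs of
11 to 13 digits with no hyphens at all; those are rejected in step 2.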

Another thought for the OP: Consider (irrespective of how you arrive
at a candidate ISBN) validating the ISBN check-digit.
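
For the check digit, a minimal sketch assuming the candidate is already
a hyphen-free 10-character string (is_valid_isbn10 is a made-up name).
It uses the standard ISBN-10 rule: weight the characters 10 down to 1,
count a final 'X' as 10, and require the sum to be divisible by 11:

```python
def is_valid_isbn10(isbn):
    """Return True if isbn is a 10-character string with a valid check digit."""
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn):
        if char == "X" and position == 9:
            value = 10          # 'X' is only legal as the check digit
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value   # weights run 10, 9, ..., 1
    return total % 11 == 0

print(is_valid_isbn10("0306406152"))   # prints True
print(is_valid_isbn10("0306406153"))   # prints False
```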

Cheers,
John

-- 
http://mail.python.org/mailman/listinfo/python-list
