On Wed, Jun 1, 2011 at 4:03 PM, Bill Janssen <[email protected]> wrote:
> I have a large RE (223613 chars) that works fine in CPython 2.6, but
That's truly horrible, but I assume you have a good reason for it.
> seems to produce an endless loop in IronPython (see below). I'm using
> Mono 2.10 (.NET 4.0.x) on Ubuntu, with IronPython 2.7. Anyone have
> pointers to the differences between them? Is
> System::Text::RegularExpressions in .NET configurable in some fashion
> that might help?
First off, is there a reason you don't use re.IGNORECASE? That would
cut the regex in half, at least.
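To make that concrete, here is a minimal sketch of the same pattern built with re.IGNORECASE instead of appending upper-cased copies of every name. The two short name lists are hypothetical stand-ins for the nltk corpora, and the re.escape call and the `[A-Z]` class for initials are my own additions rather than part of the original script.

```python
import re

# Hypothetical stand-ins for the nltk name lists (assumption: the real
# lists hold one capitalised name per entry).
FEMALE_NAMES = ["Mary", "Jane"]
MALE_NAMES = ["John", "Bill"]

# re.escape guards against names that contain regex metacharacters.
first_names = "|".join(re.escape(n) for n in FEMALE_NAMES + MALE_NAMES)

PERSON_PATTERN = re.compile(
    r"^((?P<honorific>Mr|Ms|Mrs|Dr)\.? )?"                    # honorific
    r"(?P<firstname>" + first_names + r"|[A-Z])"              # first name or initial
    r"( (?P<middlename>([A-Z]\.)|(" + first_names + r")))?"   # middle initial or name
    r" +(?P<lastname>[A-Z][A-Za-z]+)",                        # space then last name
    re.MULTILINE | re.IGNORECASE)

m = PERSON_PATTERN.match("MR. JOHN SMITH")
```

One caveat: with re.IGNORECASE the `[A-Z]` classes also match lower-case letters, so "mr. john smith" matches too. If only Title Case and ALL CAPS input should be accepted, that needs a separate check after the match.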
For the most part, CPython and IronPython regexes should be fairly
compatible. IronPython takes the regex and massages it to work with
System.Text.RegularExpressions, but the changes are pretty
straightforward and small, and I don't think the regex you provided hits
any of them. It's quite possible that the Mono version of
System.Text.RegularExpressions can't handle the expression; you could
test this by saving the full regex and building a small C# program that
runs it. The regex template has a lot of potential backtracking in it;
are you sure it's not caught in a pathological (exponential) case?
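One rough way to check for a blow-up from inside Python is to build the same template from progressively larger name lists and time each size under both interpreters. The sketch below assumes synthetic names ("Name0", "Name1", ...) in place of the nltk lists, and build_pattern and time_match are hypothetical helpers, not anything from IronPython or the original script. It leaves out the honorific part to stay short.

```python
import re
import time

def build_pattern(names):
    """Build a reduced version of the person pattern from a list of names."""
    alternatives = "|".join(re.escape(n) for n in names)
    return re.compile(
        r"^(?P<firstname>" + alternatives + r")"
        r"( (?P<middlename>([A-Z]\.)|(" + alternatives + r")))?"
        r" +(?P<lastname>[A-Z][A-Za-z]+)",
        re.MULTILINE)

def time_match(names, text):
    """Return (compile_seconds, match_seconds, matched) for one pattern size."""
    t0 = time.time()
    pattern = build_pattern(names)
    t1 = time.time()
    matched = pattern.match(text) is not None
    t2 = time.time()
    return t1 - t0, t2 - t1, matched

# Synthetic names (assumption) plus one real target at the end of the list.
all_names = ["Name%d" % i for i in range(2000)] + ["John"]

results = []
for size in (10, 100, 1000, len(all_names)):
    subset = all_names[-size:]  # the slice always keeps "John"
    results.append((size,) + time_match(subset, "John Smith"))
```

If the times grow roughly in step with the list size on CPython but shoot up or never return on IronPython, that points at the engine (or at compilation of the huge alternation) rather than at the pattern itself.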
Finally, is one ginormous regex really the best way to do this? Have you
tried other approaches?
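One alternative, sketched here under assumptions, is to keep the regex small so it only describes the shape of a name, and to check the captured words against a set of known names afterwards. The name set is a hypothetical stand-in for the nltk lists, and match_person is an illustrative helper. Its behaviour is meant to be close to the original pattern, not guaranteed identical.

```python
import re

# Hypothetical stand-ins for the nltk name lists.
FIRST_NAMES = set(["Mary", "Jane", "John", "Bill"])
FIRST_NAMES_UPPER = set(n.upper() for n in FIRST_NAMES)

# Small structural regex: optional honorific, then capitalised words.
NAME_SHAPE = re.compile(
    r"^((?P<honorific>Mr|Ms|Mrs|Dr|MR|MS|MRS|DR)\.? )?"
    r"(?P<firstname>[A-Z][A-Za-z]*)"
    r"( (?P<middlename>[A-Z]\.|[A-Z][A-Za-z]+))?"
    r" +(?P<lastname>[A-Z][A-Za-z]+)",
    re.MULTILINE)

def is_known_name(word):
    """True if the word is a known first name in Title Case or ALL CAPS."""
    return word in FIRST_NAMES or word in FIRST_NAMES_UPPER

def match_person(text):
    """Return a dict of name parts, or None if text does not start with a name."""
    m = NAME_SHAPE.match(text)
    if m is None:
        return None
    parts = m.groupdict()
    first = parts["firstname"]
    # A single capital letter counts as an initial; anything longer must be known.
    if len(first) > 1 and not is_known_name(first):
        return None
    middle = parts["middlename"]
    if middle and not middle.endswith(".") and not is_known_name(middle):
        # Not a known middle name: treat that word as the last name instead.
        parts["middlename"] = None
        parts["lastname"] = middle
    return parts
```

The set lookups are constant time, so the cost no longer depends on how many names are in the lists, and the regex engine never sees a 200K-character alternation.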
- Jeff
>
> I'm a .NET newbie.
>
> TIA,
>
> Bill
>
> --------------------------------------------------
> import sys, os, re
>
> try:
>     # we use the name lists in nltk to create person-name matching patterns
>     import nltk.data
> except ImportError:
>     sys.stderr.write("Can't import nltk; can't do name lists.\nSee http://www.nltk.org/.\n")
>     sys.exit(1)
> else:
>     __MALE_NAME_EXCLUDES = ("Hill",
>                             "Ave",
>                             )
>     __FEMALE_NAME_EXCLUDES = ()
>     __FEMALE_NAMES = [x for x in
>                       nltk.data.load("corpora/names/female.txt",
>                                      format="raw").split("\n")
>                       if (x and (x not in __FEMALE_NAME_EXCLUDES))]
>     __FEMALE_NAMES += [x.upper() for x in __FEMALE_NAMES]
>     __MALE_NAMES = [x for x in
>                     nltk.data.load("corpora/names/male.txt",
>                                    format="raw").split("\n")
>                     if (x and (x not in __MALE_NAME_EXCLUDES))]
>     __MALE_NAMES += [x.upper() for x in __MALE_NAMES]
>     __INITS = [chr(x) for x in range(ord('A'), ord('Z'))]
>
>     PERSON_PATTERN = re.compile(
>         "^((?P<honorific>Mr|Ms|Mrs|Dr|MR|MS|MRS|DR)\.? )?"  # honorific
>         "(?P<firstname>" +
>         "|".join(__FEMALE_NAMES + __MALE_NAMES + __INITS) +  # first name
>         ")"
>         "( (?P<middlename>([A-Z]\.)|(" +
>         "|".join(__FEMALE_NAMES + __MALE_NAMES) +  # middle initial or name
>         ")))?"
>         " +(?P<lastname>[A-Z][A-Za-z]+)",  # space then last name
>         re.MULTILINE)
>
>     print PERSON_PATTERN.match("Mr. John Smith")
> _______________________________________________
> Users mailing list
> [email protected]
> http://lists.ironpython.com/listinfo.cgi/users-ironpython.com
>