On Thu, 5 May 2016 11:21 pm, Random832 wrote:

> On Thu, May 5, 2016, at 03:36, Steven D'Aprano wrote:
>> Putting non-ASCII letters aside for the moment, how would you match these
>> specs as a regular expression?
>
> Well, obviously *your* language (not the OP's), given the cases you
> reject, is "one or more sequences of letters separated by
> space*-ampersand-space*", and that is actually one of the easiest kinds
> of regex to write: "[A-Z]+( *& *[A-Z]+)*".

One of the easiest kinds of regex to write incorrectly:

py> re.match("[A-Z]+( *& *[A-Z]+)*", "A----")
<_sre.SRE_Match object at 0xb7bf4aa0>

It doesn't even get the "all uppercase" part of the specification:

py> re.match("[A-Z]+( *& *[A-Z]+)*", "Azzz")
<_sre.SRE_Match object at 0xb7bf4aa0>

You failed to anchor the regex at the beginning and end of the string, an
easy mistake to make, but that's the point. It's easy to make mistakes with
regexes because the syntax is so terse and unforgiving.

But I think I just learned something important today. I learned that it's
not actually regexes that I dislike, it's regex culture that I dislike. What
I learned from this thread:

- Nobody could possibly want to support non-ASCII text. (Apart from the
  approximately 6.5 billion people in the world that don't speak English of
  course, an utterly insignificant majority.)

- Data validity doesn't matter, because there's no possible way that you
  might accidentally scrape data from the wrong part of an HTML file and end
  up with junk input.

- Even if you do somehow end up with junk, there couldn't possibly be any
  real consequences to that.

- It doesn't matter if you match too much, or too little, that just means
  the specs are too pedantic.

Hence the famous quote:

    Some people, when confronted with a problem, think "I know, I'll use
    regular expressions." Now they have two problems.

It's not really regexes that are the problem.


> However, your spec is wrong:

How can you say that? It's *my* spec, I can specify anything I want.


>> - Leading or trailing spaces, or spaces not surrounding an ampersand,
>>   must not match: "AAA BBB" must be rejected.
>
> The *very first* item in OP's list of good outputs is 'PHYSICAL FITNESS
> CONSULTANTS & TRAINERS'.

That's very nice, but irrelevant. I'm not talking about the OP's outputs.
I'm giving my own.


-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list
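
For anyone following along, here is a minimal sketch of the anchoring point made above. It assumes Python 3.4 or later (for re.fullmatch), and the helper name is_valid is made up for illustration; it appears nowhere in the thread. The quoted pattern is left unchanged, and the only difference is that the whole string is required to match:

```python
import re

# The pattern quoted in the thread, unchanged.
PATTERN = r"[A-Z]+( *& *[A-Z]+)*"


def is_valid(text):
    """Return True only if the *entire* string matches PATTERN.

    Hypothetical helper for illustration. re.fullmatch anchors the
    match at both ends, so trailing junk can no longer slip through.
    """
    return re.fullmatch(PATTERN, text) is not None


if __name__ == "__main__":
    samples = ["AAA & BBB", "AAA&BBB", "A----", "Azzz", "AAA BBB", " AAA"]
    for s in samples:
        # re.match only anchors at the start of the string, so it
        # accepts any string that merely *begins* with a valid prefix.
        unanchored = re.match(PATTERN, s) is not None
        print("%-12r unanchored=%-5s anchored=%s" % (s, unanchored, is_valid(s)))
```

On older versions without re.fullmatch, appending \Z to this particular pattern and calling re.match should have the same effect. Note that this only fixes the anchoring; it does nothing for non-ASCII letters, which are a separate question.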