Re: Whittle it on down
On 5/6/2016 11:44 AM, Peter Otten wrote: DFS wrote: There are up to 4 levels of categorization: http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories, and 390 Level 2. To get the Level 3 and 4 you have to drill-down using the hyperlinks. How to do it in python code is beyond my skills at this point. Get the hrefs and load them and parse, then get the next level and load them and parse, etc.? Yes, that should work ;) How about you do it, and I'll tell you if you did it right? ha! -- https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
DFS wrote: > There are up to 4 levels of categorization: > http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories, and 390 > Level 2. To get the Level 3 and 4 you have to drill-down using the > hyperlinks. > > How to do it in python code is beyond my skills at this point. Get the > hrefs and load them and parse, then get the next level and load them and > parse, etc.? Yes, that should work ;)
Re: Whittle it on down
On 5/6/2016 9:58 AM, DFS wrote: On 5/6/2016 3:45 AM, Peter Otten wrote: DFS wrote: Should've looked earlier. Their master list of categories http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes, and the ampersands we talked about. "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma. "AUTOMOBILE - DEALERS" gets removed because of the dash. I updated your regex and it seems to have fixed it.

orig: (r"^[A-Z\s&]+$")
new : (r"^[A-Z\s&,-]+$")

Thanks again.

If there is a "master list" compare your candidates against it instead of using a heuristic, i.e.

categories = set(master_list)
output = [category for category in input if category in categories]

You can find the categories with

>>> import urllib.request
>>> import bs4
>>> soup = bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>> categories = set()
>>> for li in soup.find_all("li"):
...     assert li.parent.parent["class"][0].startswith("category_items")
...     categories.add(li.text)
...
>>> print("\n".join(sorted(categories)[:10]))

"import urllib.request
ImportError: No module named request"

Figured it out using urllib2. Your code returns 411 categories from that first page.

There are up to 4 levels of categorization:

Level 1: Arts & Entertainment
Level 2: Newspapers
Level 3: Newspaper Brokers
Level 3: Newspaper Dealers Back Number
Level 3: Newspaper Delivery
Level 3: Newspaper Distributors
Level 3: Newsracks
Level 3: Printers Newspapers
Level 3: Newspaper Dealers
Level 3: News Dealers
Level 4: News Dealers Wholesale
Level 4: Shoppers News Publications
Level 3: News Service
Level 4: Newspaper Feature Syndicates
Level 4: Prepress Services

http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories, and 390 Level 2. To get the Level 3 and 4 you have to drill-down using the hyperlinks. How to do it in python code is beyond my skills at this point. Get the hrefs and load them and parse, then get the next level and load them and parse, etc.?
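[The drill-down loop described above ("get the hrefs and load them and parse, then get the next level...") can be sketched without touching the network. The link table and page names below are hypothetical stand-ins for usdirectory.com pages; in real use, `get_links` would be a function that fetches a URL and extracts category hrefs with bs4.]

```python
from collections import deque

# Hypothetical link graph standing in for the directory's pages:
# each key is a category page, each value the subcategory links on it.
LINKS = {
    "g0": ["arts", "auto"],
    "arts": ["newspapers"],
    "newspapers": ["news-dealers"],
    "news-dealers": [],
    "auto": [],
}

def crawl_categories(start, get_links, max_depth=4):
    """Breadth-first drill-down: visit a page, collect its category
    links, then visit each link in turn, up to max_depth levels."""
    seen = set()
    levels = {}                      # page -> depth at which it was found
    queue = deque([(start, 1)])
    while queue:
        page, depth = queue.popleft()
        if page in seen or depth > max_depth:
            continue
        seen.add(page)
        levels[page] = depth
        for href in get_links(page):
            queue.append((href, depth + 1))
    return levels

levels = crawl_categories("g0", LINKS.get)
```

Swapping `LINKS.get` for a fetch-and-parse function gives the four-level traversal the post asks about; `seen` guards against the circular links that directory sites often contain.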
Re: Whittle it on down
On 5/6/2016 3:45 AM, Peter Otten wrote: DFS wrote: Should've looked earlier. Their master list of categories http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes, and the ampersands we talked about. "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma. "AUTOMOBILE - DEALERS" gets removed because of the dash. I updated your regex and it seems to have fixed it.

orig: (r"^[A-Z\s&]+$")
new : (r"^[A-Z\s&,-]+$")

Thanks again.

If there is a "master list" compare your candidates against it instead of using a heuristic, i.e.

categories = set(master_list)
output = [category for category in input if category in categories]

You can find the categories with

>>> import urllib.request
>>> import bs4
>>> soup = bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>> categories = set()
>>> for li in soup.find_all("li"):
...     assert li.parent.parent["class"][0].startswith("category_items")
...     categories.add(li.text)
...
>>> print("\n".join(sorted(categories)[:10]))

"import urllib.request
ImportError: No module named request"

I'm on python 2.7.11

Accounting & Bookkeeping Services
Adoption Services
Adult Entertainment
Advertising
Agricultural Equipment & Supplies
Agricultural Production
Agricultural Services
Aids Resources
Aircraft Charters & Rentals
Aircraft Dealers & Services

Yeah, I actually did something like that last night. Was trying to get their full tree structure, which goes 4 levels deep: ie

Arts & Entertainment
Newspapers
News Dealers
Prepress Services

What I referred to as their 'master list' is actually just 2 levels deep. My bad. So far I haven't come across one that had anything in it but letters, dashes, commas or ampersands.

Thanks
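[For reference, the ImportError above is a Python 2/3 difference: `urlopen` lives in `urllib.request` on Python 3 but in `urllib2` on Python 2.7 (the poster's situation). A small compatibility shim covers both; module names are as in the respective standard libraries.]

```python
# Try the Python 3 location first, fall back to the Python 2 module.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2
```

With this at the top, the rest of Peter's snippet runs unchanged on either version by calling `urlopen(...)` directly.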
Re: Whittle it on down
On Thu, 05 May 2016 19:31:33 -0400, DFS wrote: > On 5/5/2016 1:39 AM, Stephen Hansen wrote: > >> Given: >> > input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs > & Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city > guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & > TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', > 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE', > 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS', > 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', > 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE > & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & > GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS > TRAINERS', '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', > 'Contact Us', 'Support', 'Terms of Use', 'Privacy Policy', > 'Advertise With Us', 'Add/Update Listing', 'Business Profile Login', > 'F.A.Q.'] >> >> Then: >> > pattern = re.compile(r"^[A-Z\s&]+$") > output = [x for x in list if pattern.match(x)] > output > >> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & >> GYMNASIUMS', >> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', >> 'GYMNASIUMS', >> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS >> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS >> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH >> CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS'] > > > Should've looked earlier. Their master list of categories > http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes, > and the ampersands we talked about. > > "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the > comma. > > "AUTOMOBILE - DEALERS" gets removed because of the dash. > > I updated your regex and it seems to have fixed it. > > orig: (r"^[A-Z\s&]+$") > new : (r"^[A-Z\s&,-]+$") > > > Thanks again. 
it looks to me like this system is trying to prevent SQL injection attacks by blacklisting certain characters. this is not the correct way to block such attacks & is probably not a good indicator of the quality of the rest of the application. -- When love is gone, there's always justice. And when justice is gone, there's always force. And when force is gone, there's always Mom. Hi, Mom! -- Laurie Anderson
Re: Whittle it on down
DFS wrote: > On 5/5/2016 1:39 AM, Stephen Hansen wrote: > >> Given: >> > input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs & > Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city > guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & > TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', > 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE', > 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS', > 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', > 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & > PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & > GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS', > '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us', > 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us', 'Add > /Update Listing', 'Business Profile Login', 'F.A.Q.'] >> >> Then: >> > pattern = re.compile(r"^[A-Z\s&]+$") > output = [x for x in list if pattern.match(x)] > output > >> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', >> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS', >> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS >> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS >> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS >> & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS'] > > > Should've looked earlier. Their master list of categories > http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes, > and the ampersands we talked about. > > "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma. > > "AUTOMOBILE - DEALERS" gets removed because of the dash. > > I updated your regex and it seems to have fixed it. > > orig: (r"^[A-Z\s&]+$") > new : (r"^[A-Z\s&,-]+$") > > > Thanks again. 
If there is a "master list" compare your candidates against it instead of using a heuristic, i.e.

categories = set(master_list)
output = [category for category in input if category in categories]

You can find the categories with

>>> import urllib.request
>>> import bs4
>>> soup = bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>> categories = set()
>>> for li in soup.find_all("li"):
...     assert li.parent.parent["class"][0].startswith("category_items")
...     categories.add(li.text)
...
>>> print("\n".join(sorted(categories)[:10]))
Accounting & Bookkeeping Services
Adoption Services
Adult Entertainment
Advertising
Agricultural Equipment & Supplies
Agricultural Production
Agricultural Services
Aids Resources
Aircraft Charters & Rentals
Aircraft Dealers & Services
Re: Whittle it on down
Steven D'Aprano writes: > On Fri, 6 May 2016 04:27 am, Jussi Piitulainen wrote: > >> Random832's pattern is fine. You need to use re.fullmatch with it. > > py> re.fullmatch > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > AttributeError: 'module' object has no attribute 'fullmatch' It's new in version 3.4 (of Python).
Re: Whittle it on down
On Fri, 6 May 2016 04:27 am, Jussi Piitulainen wrote: > Random832's pattern is fine. You need to use re.fullmatch with it.

py> re.fullmatch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'fullmatch'

-- Steven
Re: Whittle it on down
On 5/5/2016 1:39 AM, Stephen Hansen wrote:

Given:

input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs & Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE', 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS', '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us', 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us', 'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']

Then:

pattern = re.compile(r"^[A-Z\s&]+$")
output = [x for x in list if pattern.match(x)]
output

['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']

Should've looked earlier. Their master list of categories http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes, and the ampersands we talked about.

"OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.

"AUTOMOBILE - DEALERS" gets removed because of the dash.

I updated your regex and it seems to have fixed it.

orig: (r"^[A-Z\s&]+$")
new : (r"^[A-Z\s&,-]+$")

Thanks again.
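[As a sanity check, the updated character class can be run against the very strings the post says were being lost. Sample data is taken from the thread; note the hyphen sits at the end of the class so it is literal, not a range.]

```python
import re

# Updated pattern from the post: uppercase letters, whitespace,
# ampersand, comma, and a literal hyphen.
pattern = re.compile(r"^[A-Z\s&,-]+$")

data = [
    "OFFICE SERVICES, SUPPLIES & EQUIPMENT",  # comma: now kept
    "AUTOMOBILE - DEALERS",                   # dash: now kept
    "HEALTH CLUBS & GYMNASIUMS",
    "Atlanta city guide",                     # lowercase: dropped
    "www.lafitness.com",                      # lowercase/dots: dropped
]
output = [x for x in data if pattern.match(x)]
```

`output` keeps the first three entries only; the `$` anchor is what keeps a trailing-junk string like "HEALTH CLUBS (36)" from slipping through.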
Re: Whittle it on down
On 5/5/2016 2:56 PM, Stephen Hansen wrote: On Thu, May 5, 2016, at 05:31 AM, DFS wrote: You are out of your mind. Whoa, now. I might disagree with Steven D'Aprano about how to approach this problem, but there's no need to be rude.

Seriously not trying to be rude - more smart-alecky than anything. Hope D'Aprano doesn't stay butthurt...

Everyone's trying to help you, after all.

Yes, and I do appreciate it. I've only been working with python for about a month, but I feel like I'm making good progress. clp is a great resource, and I'll be hanging around for a long time, and will contribute when possible. Thanks for your help.
Re: Whittle it on down
On 5/5/2016 1:54 PM, Steven D'Aprano wrote: On Thu, 5 May 2016 10:31 pm, DFS wrote: You are out of your mind. That's twice you've tried to put me down, first by dismissing my comments about text processing with "Linguist much", and now an outright insult. The first time I laughed it off and made a joke about it. I won't do that again. You asked whether it was better to extract the matching strings into a new list, or remove them in place in the existing list. I not only showed you how to do both, but I tried to give you the mental tools to understand when you should pick one answer over the other. And your response is to insult me and question my sanity. Well, DFS, I might be crazy, but I'm not stupid. If that's really how you feel about my answers, I won't make the mistake of wasting my time answering your questions in the future. Over to you now.

heh! Relax, pal. I was just trying to be funny - no insult intended either time, of course. Look for similar responses from me in the future. Usenet brings out the smart-aleck in me. Actually, you should've accepted the 'Linguist much?' as a compliment, because I seriously thought you were.

But you ARE out of your mind if you prefer that convoluted "function" method over a simple 1-line regex method (as per S. Hansen).

def isupperalpha(string):
    return string.isalpha() and string.isupper()

def check(string):
    if isupperalpha(string):
        return True
    parts = string.split("&")
    if len(parts) < 2:
        return False
    parts[0] = parts[0].rstrip(" ")
    parts[-1] = parts[-1].lstrip(" ")
    for i in range(1, len(parts)-1):
        parts[i] = parts[i].strip(" ")
    return all(isupperalpha(part) for part in parts)

I'm sure it does the job well, but that style brings back [bad] memories of the VBA I used to write. I expected something very concise and 'pythonic' (which I'm learning is everyone's favorite mantra here in python-land). Anyway, I appreciate ALL replies to my queries. So thank you for taking the time.
Whenever I'm able, I'll try to contribute to clp as well.
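[For the record, the two approaches quoted in this subthread don't accept the same language. A runnable side-by-side (the functions as quoted from Steven's post, the regex from Hansen's) shows where they diverge; the sample strings are illustrative only.]

```python
import re

def isupperalpha(string):
    return string.isalpha() and string.isupper()

def check(string):
    # Steven's function: upper-case words joined by ampersands,
    # with optional spaces only around the ampersands.
    if isupperalpha(string):
        return True
    parts = string.split("&")
    if len(parts) < 2:
        return False
    parts[0] = parts[0].rstrip(" ")
    parts[-1] = parts[-1].lstrip(" ")
    for i in range(1, len(parts) - 1):
        parts[i] = parts[i].strip(" ")
    return all(isupperalpha(part) for part in parts)

pattern = re.compile(r"^[A-Z\s&]+$")  # Hansen's one-liner

# They agree on "AAA & BBB" (both accept) and on "aaa" (both reject),
# but diverge on multi-word phrases: the regex accepts "AAA BBB",
# while check() rejects it, since a space makes isalpha() False.
```

So the "convoluted" function really does encode a stricter spec (no bare spaces between words) than the one-line regex, which is the whole disagreement in miniature.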
Re: Whittle it on down
On Thu, May 5, 2016, at 11:03 AM, Steven D'Aprano wrote: > - Nobody could possibly want to support non-ASCII text. (Apart from the > approximately 6.5 billion people in the world that don't speak English of > course, an utterly insignificant majority.) Oh, I'd absolutely want to support non-ASCII text. If I have unicode input, though, I unfortunately have to rely on https://pypi.python.org/pypi/regex as 're' doesn't support matching on character properties. I keep hoping it'll replace "re", then we could do: pattern = regex.compile(ur"^[\p{Lu}\s&]+$") where \p{property} matches against character properties in the unicode database. > - Data validity doesn't matter, because there's no possible way that you > might accidentally scrape data from the wrong part of a HTML file and end > up with junk input. Um, no one said that. I was arguing that the *regular expression* doesn't need to be responsible for validation. > - Even if you do somehow end up with junk, there couldn't possibly be any > real consequences to that. No one said that either... > - It doesn't matter if you match too much, or too little, that just means > the specs are too pedantic. Or that... -- Stephen Hansen m e @ i x o k a i . i o
Re: Whittle it on down
On Thu, May 5, 2016, at 05:31 AM, DFS wrote: > You are out of your mind. Whoa, now. I might disagree with Steven D'Aprano about how to approach this problem, but there's no need to be rude. Everyone's trying to help you, after all. -- Stephen Hansen m e @ i x o k a i . i o
Re: Whittle it on down
On Thu, May 5, 2016, at 10:43 AM, Steven D'Aprano wrote: > On Thu, 5 May 2016 11:32 pm, Stephen Hansen wrote: > > > On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote: > >> Oh, a further thought... > >> > >> On Thursday 05 May 2016 16:46, Stephen Hansen wrote: > >> > I don't even care about faster: Its overly complicated. Sometimes a > >> > regular expression really is the clearest way to solve a problem. > >> > >> Putting non-ASCII letters aside for the moment, how would you match these > >> specs as a regular expression? > > > > I don't know, but mostly because I wouldn't even try. > > Really? Peter Otten seems to have found a solution, and Random832 almost > found it too. > > > > The requirements > > are over-specified. If you look at the OP's data (and based on previous > > conversation), he's doing web scraping and trying to pull out good data. > > I'm not talking about the OP's data. I'm talking about *my* requirements. > > I thought that this was a friendly discussion about regexes, but perhaps > I > was mistaken. Because I sure am feeling a lot of hostility to the ideas > that regexes are not necessarily the only way to solve this, and that > data > validation is a good thing. Umm, what? Hostility? I have no idea where you're getting that. I didn't say that regexs are the only way to solve problems; in fact they're something I avoid using in most cases. In the OP's case, though, I did say I thought was a natural fit. Usually, I'd go for startswith/endswith, "in", slicing and such string primitives before I go for a regular expression. "Find all upper cased phrases that may have &'s in them" is something just specific enough that the built in string primitives are awkward tools. In my experience, most of the problems with regexes is people think they're the hammer and every problem is a nail: and then they get into ever more convoluted expressions that become brittle. More specific in a regular expression is not, necessarily, a virtue. 
In fact its exactly the opposite a lot of times. > > There's no absolutely perfect way to do that because the system he's > > scraping isn't meant for data processing. The data isn't cleanly > > articulated. > > Right. Which makes it *more*, not less, important to be sure that your > regex > doesn't match too much, because your data is likely to be contaminated by > junk strings that don't belong in the data and shouldn't be accepted. > I've > done enough web scraping to realise just how easy it is to start grabbing > data from the wrong part of the file. I have nothing against data validation: I don't think it belongs in regular expressions, though. That can be a step done afterwards. > > Instead, he wants a heuristic to pull out what look like section titles. > > Good for him. I asked a different question. Does my question not count? Sure it counts, but I don't want to engage in your theoretical exercise. That's not being hostile, that's me not wanting to think about a complex set of constraints for a regular expression for purely intellectual reasons. > I was trying to teach DFS a generic programming technique, not solve his > stupid web scraping problem for him. What happens next time when he's > trying to filter a list of floats, or Widgets? Should he convert them to > strings so he can use a regex to match them, or should he learn about > general filtering techniques? Come on. This is a bit presumptuous, don't you think? > > This translates naturally into a simple regular expression: an uppercase > > string with spaces and &'s. Now, that expression doesn't 100% encode > > every detail of that rule-- it allows both Q and Q & A-- but on my own > > looking at the data, I suspect its good enough. The titles are clearly > > separate from the other data scraped by their being upper cased. We just > > need to expand our allowed character range into spaces and &'s. > > > > Nothing in the OP's request demands the kind of rigorous matching that > > your scenario does. 
Its a practical problem with a simple, practical > > answer. > > Yes, and that practical answer needs to reject: > > - the empty string, because it is easy to mistakenly get empty strings > when > scraping data, especially if you post-process the data; > > - strings that are all spaces, because " " cannot possibly be a > title; > > - strings that are all ampersands, because "&" is not a title, and it > almost surely indicates that your scraping has gone wrong and you're > reading junk from somewhere; > > - even leading and trailing spaces are suspect: " FOO " doesn't match > any > of the examples given, and it seems unlikely to be a title. Presumably > the > strings have already been filtered or post-processed to have leading and > trailing spaces removed, in which case " FOO " reveals a bug. We're going to have to agree to disagree. I find all of that unnecessary. Any validation can be easily done before or after matching, you don't need to
Re: Whittle it on down
On Thu, May 5, 2016, at 14:27, Jussi Piitulainen wrote: > Random832's pattern is fine. You need to use re.fullmatch with it. Heh, in my previous post I said "and one could easily imagine an API that implicitly anchors at the end". So easy to imagine, it turns out, that someone already did. Batteries included indeed.
Re: Whittle it on down
On Thu, May 5, 2016, at 14:03, Steven D'Aprano wrote: > You failed to anchor the string at the beginning and end of the string, > an easy mistake to make, but that's the point. I don't think anchoring is properly a concern of the regex itself - .match is anchored implicitly at the beginning, and one could easily imagine an API that implicitly anchors at the end - or you can simply check that the match length == the string length. > - Data validity doesn't matter, because there's no possible way that you > might accidentally scrape data from the wrong part of a HTML file and end > up with junk input. If you've scraped data from the wrong part of the file, then nothing you do to your regex can prevent the junk input from coincidentally matching the input format.
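[The anchoring point is easy to demonstrate. `re.fullmatch` exists from Python 3.4 on, and the pre-3.4 equivalent is the match-length check suggested above; the pattern is Random832's from earlier in the thread.]

```python
import re

pattern = re.compile(r"[A-Z]+( *& *[A-Z]+)*")  # no ^/$ anchors

# .match is only anchored at the start, so a junk suffix slips through:
assert pattern.match("Azzz") is not None        # matched just the "A"

# .fullmatch (Python 3.4+) is anchored at both ends:
assert pattern.fullmatch("Azzz") is None
assert pattern.fullmatch("AA & A & A") is not None

# Pre-3.4 equivalent: accept only if the match consumed the whole string.
def full(p, s):
    m = p.match(s)
    return m is not None and m.end() == len(s)
```

`full(pattern, "AA & A & A")` is True while `full(pattern, "Azzz")` is False, mirroring `fullmatch` without needing 3.4.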
Re: Whittle it on down
Steven D'Aprano writes: > On Fri, 6 May 2016 03:49 am, Jussi Piitulainen wrote: > >> Steven D'Aprano writes: >> >>> I get something like this: >>> >>> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)" >>> >>> but it fails on strings like "AA & A & A". What am I doing wrong? >> >> It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS) >> when the middle part is just one LETTER. That's something of a >> misanalysis anyway. I notice that the correct pattern has already been >> posted at least thrice and you have acknowledged one of them. > > Thrice? I've seen Peter's response (he made the trivial and obvious > simplification of just using A instead of [A-Z], but that was easy to > understand), and Random832 almost got it, missing only that you need to > match the entire string, not just a substring. If there was a third > response, I missed it. I think I saw another. I may be mistaken. Random832's pattern is fine. You need to use re.fullmatch with it.
Re: Whittle it on down
On Fri, 6 May 2016 03:49 am, Jussi Piitulainen wrote: > Steven D'Aprano writes: > >> I get something like this: >> >> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)" >> >> >> but it fails on strings like "AA & A & A". What am I doing wrong? > > It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS) > when the middle part is just one LETTER. That's something of a > misanalysis anyway. I notice that the correct pattern has already been > posted at least thrice and you have acknowledged one of them. Thrice? I've seen Peter's response (he made the trivial and obvious simplification of just using A instead of [A-Z], but that was easy to understand), and Random832 almost got it, missing only that you need to match the entire string, not just a substring. If there was a third response, I missed it. > But I think you are also trying to do too much with a single regex. A > more promising start is to think of the whole string as "parts" joined > with "glue", then split with a glue pattern and test the parts: > > import re > glue = re.compile(" *& *| +") > keep, drop = [], [] > for datum in data: > items = glue.split(datum) > if all(map(str.isupper, items)): > keep.append(datum) > else: > drop.append(datum) Ah, the penny drops! For a while I thought you were suggesting using this to assemble a regex, and it just wasn't making sense to me. Then I realised you were using this as a matcher: feed in the list of strings, and it splits it into strings to keep and strings to discard. Nicely done, that is a good technique to remember. Thanks for the analysis! -- Steven
Re: Whittle it on down
On Thu, 5 May 2016 11:21 pm, Random832 wrote: > On Thu, May 5, 2016, at 03:36, Steven D'Aprano wrote: >> Putting non-ASCII letters aside for the moment, how would you match these >> specs as a regular expression? > > Well, obviously *your* language (not the OP's), given the cases you > reject, is "one or more sequences of letters separated by > space*-ampersand-space*", and that is actually one of the easiest kinds > of regex to write: "[A-Z]+( *& *[A-Z]+)*".

One of the easiest kinds of regex to write incorrectly:

py> re.match("[A-Z]+( *& *[A-Z]+)*", "A")
<_sre.SRE_Match object at 0xb7bf4aa0>

It doesn't even get the "all uppercase" part of the specification:

py> re.match("[A-Z]+( *& *[A-Z]+)*", "Azzz")
<_sre.SRE_Match object at 0xb7bf4aa0>

You failed to anchor the string at the beginning and end of the string, an easy mistake to make, but that's the point. It's easy to make mistakes with regexes because the syntax is so overly terse and unforgiving. But I think I just learned something important today. I learned that it's not actually regexes that I dislike, it's regex culture that I dislike. What I learned from this thread:

- Nobody could possibly want to support non-ASCII text. (Apart from the approximately 6.5 billion people in the world that don't speak English of course, an utterly insignificant majority.)

- Data validity doesn't matter, because there's no possible way that you might accidentally scrape data from the wrong part of a HTML file and end up with junk input.

- Even if you do somehow end up with junk, there couldn't possibly be any real consequences to that.

- It doesn't matter if you match too much, or too little, that just means the specs are too pedantic.

Hence the famous quote: Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. It's not really regexes that are the problem.

> However, your spec is wrong:

How can you say that? It's *my* spec, I can specify anything I want.
>> - Leading or trailing spaces, or spaces not surrounding an ampersand, >> must not match: "AAA BBB" must be rejected. > > The *very first* item in OP's list of good outputs is 'PHYSICAL FITNESS > CONSULTANTS & TRAINERS'. That's very nice, but irrelevant. I'm not talking about the OP's outputs. I'm giving my own. -- Steven
Re: Whittle it on down
Steven D'Aprano writes: > I get something like this: > > r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)" > > > but it fails on strings like "AA & A & A". What am I doing wrong?

It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS) when the middle part is just one LETTER. That's something of a misanalysis anyway. I notice that the correct pattern has already been posted at least thrice and you have acknowledged one of them.

But I think you are also trying to do too much with a single regex. A more promising start is to think of the whole string as "parts" joined with "glue", then split with a glue pattern and test the parts:

import re
glue = re.compile(" *& *| +")
keep, drop = [], []
for datum in data:
    items = glue.split(datum)
    if all(map(str.isupper, items)):
        keep.append(datum)
    else:
        drop.append(datum)

That will cope with Greek, by the way.

It's annoying that the order of the branches of the glue pattern above matters. One _does_ have problems when one uses the usual regex engines. Capturing groups in the glue pattern would produce glue items in the split output. Either avoid them or deal with them: one could split with the underspecific "([ &]+)" and then check that each glue item contains at most one ampersand. One could also allow other punctuation, and then check afterwards.

One can use _another_ regex to test individual parts. Code above used str.isupper to test a part. The improved regex package (from PyPI, to cope with Greek) can do the same:

import regex
part = regex.compile("[[:upper:]]+")
glue = regex.compile(" *& *| *")
keep, drop = [], []
for datum in data:
    items = glue.split(datum)
    if all(map(part.fullmatch, items)):
        keep.append(datum)
    else:
        drop.append(datum)

Just "[A-Z]+" suffices for ASCII letters, and "[A-ZÄÖ]+" copes with most of Finnish; the [:upper:] class is nicer and there's much more that is nicer in the newer regex package.
The point of using a regex for this is that the part pattern can then be generalized to allow some punctuation or digits in a part, for example. Anything that the glue pattern doesn't consume. (Nothing wrong with using other techniques for this, either; str.isupper worked nicely above.)

It's also possible to swap the roles of the patterns. Split with a part pattern. Then check that the text between such parts is glue:

keep, drop = [], []
for datum in data:
    items = part.split(datum)
    if all(map(glue.fullmatch, items)):
        keep.append(datum)
    else:
        drop.append(datum)

The point is to keep the patterns simple by making them more local, or more relaxed, followed by a further test. This way they can be made to do more, but not more than they reasonably can.

Note also the use of re.fullmatch instead of re.match (let alone re.search) when a full match is required! This gets rid of all anchors in the pattern, which may in turn allow fewer parentheses inside the pattern.

The usual regex engines are not perfect, but parts of them are fantastic.
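[The "parts joined with glue" split described above runs as-is once `data` is bound; here it is wrapped in a function and fed a few made-up sample strings in the style of the thread's data.]

```python
import re

# Split on the glue (ampersands with optional surrounding spaces, or
# runs of spaces) and require every remaining part to be upper case.
glue = re.compile(" *& *| +")

def keep_or_drop(data):
    keep, drop = [], []
    for datum in data:
        items = glue.split(datum)
        if all(map(str.isupper, items)):
            keep.append(datum)
        else:
            drop.append(datum)
    return keep, drop

keep, drop = keep_or_drop([
    "HEALTH CLUBS & GYMNASIUMS",
    "AA & A & A",            # single-letter parts are fine here
    "Atlanta city guide",    # lowercase parts: dropped
    "&",                     # splits into empty parts: dropped
])
```

Note the bare "&" case: splitting it yields empty strings, and `"".isupper()` is False, so the junk string is rejected without any extra anchoring logic.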
Re: Whittle it on down
On Thu, 5 May 2016 10:31 pm, DFS wrote: > You are out of your mind. That's twice you've tried to put me down, first by dismissing my comments about text processing with "Linguist much", and now an outright insult. The first time I laughed it off and made a joke about it. I won't do that again. You asked whether it was better to extract the matching strings into a new list, or remove them in place in the existing list. I not only showed you how to do both, but I tried to give you the mental tools to understand when you should pick one answer over the other. And your response is to insult me and question my sanity. Well, DFS, I might be crazy, but I'm not stupid. If that's really how you feel about my answers, I won't make the mistake of wasting my time answering your questions in the future. Over to you now. -- Steven
Re: Whittle it on down
On Thu, 5 May 2016 11:32 pm, Stephen Hansen wrote: > On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote: >> Oh, a further thought... >> >> On Thursday 05 May 2016 16:46, Stephen Hansen wrote: >> > I don't even care about faster: Its overly complicated. Sometimes a >> > regular expression really is the clearest way to solve a problem. >> >> Putting non-ASCII letters aside for the moment, how would you match these >> specs as a regular expression? > > I don't know, but mostly because I wouldn't even try. Really? Peter Otten seems to have found a solution, and Random832 almost found it too. > The requirements > are over-specified. If you look at the OP's data (and based on previous > conversation), he's doing web scraping and trying to pull out good data. I'm not talking about the OP's data. I'm talking about *my* requirements. I thought that this was a friendly discussion about regexes, but perhaps I was mistaken. Because I sure am feeling a lot of hostility to the ideas that regexes are not necessarily the only way to solve this, and that data validation is a good thing. > There's no absolutely perfect way to do that because the system he's > scraping isn't meant for data processing. The data isn't cleanly > articulated. Right. Which makes it *more*, not less, important to be sure that your regex doesn't match too much, because your data is likely to be contaminated by junk strings that don't belong in the data and shouldn't be accepted. I've done enough web scraping to realise just how easy it is to start grabbing data from the wrong part of the file. > Instead, he wants a heuristic to pull out what look like section titles. Good for him. I asked a different question. Does my question not count? 
> The OP looked at the data and came up with a simple set of rules that
> identify these section titles:
>
>>> Want to keep all elements containing only upper case letters or upper
>>> case letters and ampersand (where ampersand is surrounded by spaces)

That simple rule doesn't match his examples, as I know too well because I
made the silly mistake of writing to the spec as written without reading
the examples as well. As I already admitted. That was a silly mistake
because I know very well that people are really bad at writing detailed
specs that neither match too much nor too little.

But you know, I was more focused on the rest of his question, namely
whether it was better to extract the matched strings into a new list, or
delete the non-matches from the existing list, and just got carried away
writing the match function. I didn't actually expect anyone to use it. It
was untested, and I hinted that a regex would probably be better.

I was trying to teach DFS a generic programming technique, not solve his
stupid web scraping problem for him. What happens next time when he's
trying to filter a list of floats, or Widgets? Should he convert them to
strings so he can use a regex to match them, or should he learn about
general filtering techniques?

> This translates naturally into a simple regular expression: an uppercase
> string with spaces and &'s. Now, that expression doesn't 100% encode
> every detail of that rule-- it allows both Q and Q & A-- but on my own
> looking at the data, I suspect its good enough. The titles are clearly
> separate from the other data scraped by their being upper cased. We just
> need to expand our allowed character range into spaces and &'s.
>
> Nothing in the OP's request demands the kind of rigorous matching that
> your scenario does. Its a practical problem with a simple, practical
> answer.
Yes, and that practical answer needs to reject:

- the empty string, because it is easy to mistakenly get empty strings
  when scraping data, especially if you post-process the data;

- strings that are all spaces, because " " cannot possibly be a title;

- strings that are all ampersands, because "&" is not a title, and it
  almost surely indicates that your scraping has gone wrong and you're
  reading junk from somewhere;

- even leading and trailing spaces are suspect: " FOO " doesn't match any
  of the examples given, and it seems unlikely to be a title. Presumably
  the strings have already been filtered or post-processed to have
  leading and trailing spaces removed, in which case " FOO " reveals
  a bug.

-- 
Steven
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On Thu, 5 May 2016 11:13 pm, Random832 wrote: > On Thu, May 5, 2016, at 04:41, Steven D'Aprano wrote: >> > There's no situation where "&" and " " will exist in the given >> > dataset, and recognizing that is important. You don't have to account >> > for every bit of nonsense. >> >> Whenever a programmer says "This case will never happen", ten thousand >> computers crash. > > What crash can including such an entry in the output list cause? How do I know? It depends what you do with that list. But if you assume that your list contains alphabetical strings, and pass it on to code that expects alphabetical strings, why is it so hard to believe that it might choke when it receives a non-alphabetical string? > Should the regex also ensure that the data only includes *english words* > separated by space-ampersand-space? That wasn't part of the specification. But for some applications, yes, you should ensure the data includes only English words. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On Thu, 5 May 2016 06:17 pm, Peter Otten wrote: >> I get something like this: >> >> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)" >> >> >> but it fails on strings like "AA & A & A". What am I doing wrong? > test("^A+( *& *A+)*$") Thanks Peter, that's nice! -- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On 5/5/2016 9:32 AM, Stephen Hansen wrote: On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote: Oh, a further thought... On Thursday 05 May 2016 16:46, Stephen Hansen wrote: I don't even care about faster: Its overly complicated. Sometimes a regular expression really is the clearest way to solve a problem. Putting non-ASCII letters aside for the moment, how would you match these specs as a regular expression? I don't know, but mostly because I wouldn't even try. The requirements are over-specified. If you look at the OP's data (and based on previous conversation), he's doing web scraping and trying to pull out good data. There's no absolutely perfect way to do that because the system he's scraping isn't meant for data processing. The data isn't cleanly articulated. Instead, he wants a heuristic to pull out what look like section titles. Assigned by a company named localeze, apparently. http://www.usdirectory.com/cat/g0 https://www.neustarlocaleze.biz/welcome/ The OP looked at the data and came up with a simple set of rules that identify these section titles: Want to keep all elements containing only upper case letters or upper case letters and ampersand (where ampersand is surrounded by spaces) This translates naturally into a simple regular expression: an uppercase string with spaces and &'s. Now, that expression doesn't 100% encode every detail of that rule-- it allows both Q and Q & A-- but on my own looking at the data, I suspect its good enough. The titles are clearly separate from the other data scraped by their being upper cased. We just need to expand our allowed character range into spaces and &'s. Nothing in the OP's request demands the kind of rigorous matching that your scenario does. Its a practical problem with a simple, practical answer. Yes. And simplicity + practicality = successfulality. 
And I do a sanity check before using the data anyway: after parse and
cleanup and regex matching, I make sure all lists have the same number of
elements:

lenData = [len(title), len(names), len(addr), len(street),
           len(city), len(state), len(zip)]
if len(set(lenData)) != 1:
    alert the media

--
https://mail.python.org/mailman/listinfo/python-list
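That check can be sketched as a runnable snippet (the list names and contents here are stand-ins for the scraped fields):

```python
# three parallel per-record lists, as produced by the scrape
title = ["GYMNASIUMS", "FITNESS CENTERS"]
names = ["LA Fitness", "Custom Built PT"]
street = ["123 Main St", "456 Oak Ave"]

# every per-record list must be the same length, i.e. the set of
# lengths collapses to a single value
lenData = [len(title), len(names), len(street)]
ok = len(set(lenData)) == 1
print(ok)  # True
```

If any list gains or loses an element during cleanup, the set grows past one element and the check fails.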
Re: Whittle it on down
On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote: > Oh, a further thought... > > On Thursday 05 May 2016 16:46, Stephen Hansen wrote: > > I don't even care about faster: Its overly complicated. Sometimes a > > regular expression really is the clearest way to solve a problem. > > Putting non-ASCII letters aside for the moment, how would you match these > specs as a regular expression? I don't know, but mostly because I wouldn't even try. The requirements are over-specified. If you look at the OP's data (and based on previous conversation), he's doing web scraping and trying to pull out good data. There's no absolutely perfect way to do that because the system he's scraping isn't meant for data processing. The data isn't cleanly articulated. Instead, he wants a heuristic to pull out what look like section titles. The OP looked at the data and came up with a simple set of rules that identify these section titles: >> Want to keep all elements containing only upper case letters or upper case letters and ampersand (where ampersand is surrounded by spaces) This translates naturally into a simple regular expression: an uppercase string with spaces and &'s. Now, that expression doesn't 100% encode every detail of that rule-- it allows both Q and Q & A-- but on my own looking at the data, I suspect its good enough. The titles are clearly separate from the other data scraped by their being upper cased. We just need to expand our allowed character range into spaces and &'s. Nothing in the OP's request demands the kind of rigorous matching that your scenario does. Its a practical problem with a simple, practical answer. -- Stephen Hansen m e @ i x o k a i . i o -- https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On Thu, May 5, 2016, at 03:36, Steven D'Aprano wrote: > Putting non-ASCII letters aside for the moment, how would you match these > specs as a regular expression? Well, obviously *your* language (not the OP's), given the cases you reject, is "one or more sequences of letters separated by space*-ampersand-space*", and that is actually one of the easiest kinds of regex to write: "[A-Z]+( *& *[A-Z]+)*". However, your spec is wrong: > - Leading or trailing spaces, or spaces not surrounding an ampersand, > must not match: "AAA BBB" must be rejected. The *very first* item in OP's list of good outputs is 'PHYSICAL FITNESS CONSULTANTS & TRAINERS'. If you want something that's extremely conservative (except for the *very odd in context* choice of allowing arbitrary numbers of spaces - why would you allow this but reject leading or trailing space?) and accepts all of OP's input: [A-Z]+(( *& *| +)[A-Z]+)* -- https://mail.python.org/mailman/listinfo/python-list
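That last pattern can be checked quickly against the OP's data and the rejection cases discussed above (using `fullmatch`, available on compiled patterns from Python 3.4, to anchor both ends):

```python
import re

pattern = re.compile(r"[A-Z]+(( *& *| +)[A-Z]+)*")

good = ["PHYSICAL FITNESS CONSULTANTS & TRAINERS",
        "HEALTH & FITNESS CLUBS",
        "GYMNASIUMS"]
bad = ["", " ", "&", " FOO ", "A&"]

# every good title matches in full; every suspect string is rejected
assert all(pattern.fullmatch(s) for s in good)
assert not any(pattern.fullmatch(s) for s in bad)
```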
Re: Whittle it on down
On Thu, May 5, 2016, at 04:41, Steven D'Aprano wrote: > > There's no situation where "&" and " " will exist in the given > > dataset, and recognizing that is important. You don't have to account > > for every bit of nonsense. > > Whenever a programmer says "This case will never happen", ten thousand > computers crash. What crash can including such an entry in the output list cause? Should the regex also ensure that the data only includes *english words* separated by space-ampersand-space? -- https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On 5/5/2016 1:53 AM, Jussi Piitulainen wrote:

> Either way is easy to approximate with a regex:
>
> import re
> upper = re.compile(r'[A-Z &]+')
> lower = re.compile(r'[^A-Z &]')
>
> print([datum for datum in data if upper.fullmatch(datum)])
> print([datum for datum in data if not lower.search(datum)])

This is similar to Hansen's solution.

> I've skipped testing that the ampersand is between spaces, and I've
> skipped the period. Adjust.

Will do.

> This considers only ASCII upper case letters. You can add individual
> letters that matter to you, or you can reach for the documentation to
> find if there is some generic notation for all upper case letters. The
> newer regex package on PyPI supports POSIX character classes like
> [:upper:], I think, and there may or may not be notation for Unicode
> character categories in re or regex - LU would be Letter, Uppercase.

Thanks.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On 5/5/2016 1:39 AM, Stephen Hansen wrote:

> pattern = re.compile(r"^[A-Z\s&]+$")
> output = [x for x in input if pattern.match(x)]

Holy Shr"^[A-Z\s&]+$"

One line of parsing! I was figuring a few list comprehensions would do
it - this is better.

(note: the reason I specified 'spaces around ampersand' is so it would
remove 'Q' if that ever came up - but some people write 'Q & A', so I'll
live with that exception, or try to tweak it myself.)

You're the man, man. Thank you!
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On 5/5/2016 2:04 AM, Steven D'Aprano wrote:

On Thursday 05 May 2016 14:58, DFS wrote:
Want to whittle a list like this: [...]
Want to keep all elements containing only upper case letters or upper
case letters and ampersand (where ampersand is surrounded by spaces)

Start by writing a function or a regex that will distinguish strings that
match your conditions from those that don't. A regex might be faster, but
here's a function version.

def isupperalpha(string):
    return string.isalpha() and string.isupper()

def check(string):
    if isupperalpha(string):
        return True
    parts = string.split("&")
    if len(parts) < 2:
        return False
    # Don't strip leading spaces from the start of the string.
    parts[0] = parts[0].rstrip(" ")
    # Or trailing spaces from the end of the string.
    parts[-1] = parts[-1].lstrip(" ")
    # But strip leading and trailing spaces from the middle parts
    # (if any).
    for i in range(1, len(parts)-1):
        parts[i] = parts[i].strip(" ")
    return all(isupperalpha(part) for part in parts)

Now you have two ways of filtering this. The obvious way is to extract
elements which meet the condition. Here are two ways:

# List comprehension.
newlist = [item for item in oldlist if check(item)]

# Filter, Python 2 version
newlist = filter(check, oldlist)

# Filter, Python 3 version
newlist = list(filter(check, oldlist))

In practice, this is the best (fastest, simplest) way. But if you fear
that you will run out of memory dealing with absolutely humongous lists
with hundreds of millions or billions of strings, you can remove items
in place:

def remove(func, alist):
    for i in range(len(alist)-1, -1, -1):
        if not func(alist[i]):
            del alist[i]

Note the magic incantation to iterate from the end of the list towards
the front. If you do it the other way, Bad Things happen. Note that this
will use less memory than extracting the items, but it will be much
slower. You can combine the best of both worlds.
Here is a version that uses a temporary list to modify the original in
place:

# works in both Python 2 and 3
def remove(func, alist):
    # Modify list in place, the fast way.
    alist[:] = filter(func, alist)

You are out of your mind.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On Thursday 05 May 2016 17:34, Stephen Hansen wrote:

> Meh. You have a pedantic definition of wrong. Given the inputs, it
> produced right output. Very often that's enough. Perfect is the enemy
> of good, it's said.

And this is a *perfect* example of why we have things like this:

http://www.bbc.com/future/story/20160325-the-names-that-break-computer-systems

"Nobody will ever be called Null."
"Nobody has quotation marks in their name."
"Nobody will have a + sign in their email address."
"Nobody has a legal gender other than Male or Female."
"Nobody will lean on the keyboard and enter gobbledygook into our form."
"Nobody will try to write more data than the space they allocated for it."

> There's no situation where "&" and " " will exist in the given
> dataset, and recognizing that is important. You don't have to account
> for every bit of nonsense.

Whenever a programmer says "This case will never happen", ten thousand
computers crash.

http://www.kr41.net/2016/05-03-shit_driven_development.html

-- 
Steven D'Aprano
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
Steven D'Aprano wrote:

> Oh, a further thought...
>
> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
>
>> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
>>> Start by writing a function or a regex that will distinguish strings
>>> that match your conditions from those that don't. A regex might be
>>> faster, but here's a function version.
>>> ... snip ...
>>
>> Yikes. I'm all for the idea that one shouldn't go to regex when
>> Python's powerful string type can answer the problem more clearly, but
>> this seems to go out of its way to do otherwise.
>>
>> I don't even care about faster: Its overly complicated. Sometimes a
>> regular expression really is the clearest way to solve a problem.
>
> Putting non-ASCII letters aside for the moment, how would you match
> these specs as a regular expression?
>
> - All uppercase ASCII letters (A to Z only), optionally separated into
>   words by either a bare ampersand ("AAA&AAA") or an ampersand with
>   leading and trailing spaces (spaces only, not arbitrary whitespace):
>   "AAA & AAA".
>
> - The number of spaces on either side of the ampersands need not be the
>   same: "AAA& BBB & CCC" should match.
>
> - Leading or trailing spaces, or spaces not surrounding an ampersand,
>   must not match: "AAA BBB" must be rejected.
>
> - Leading or trailing ampersands must also be rejected. This includes
>   the case where the string is nothing but ampersands.
>
> - Consecutive ampersands "AAA&&" and the empty string must be rejected.
>
> I get something like this:
>
> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>
> but it fails on strings like "AA & A & A". What am I doing wrong?
>
> For the record, here's my brief test suite:
>
> def test(pat):
>     for s in ("", " ", "&", "A A", "A&", "", "A&", "A& "):
>         assert re.match(pat, s) is None
>     for s in ("A", "A & A", "AA", "AA & A & A"):
>         assert re.match(pat, s)

>>> import re
>>> def test(pat):
...     for s in ("", " ", "&", "A A", "A&", "", "A&", "A& "):
...         assert re.match(pat, s) is None
...     for s in ("A", "A & A", "AA", "AA & A & A"):
...         assert re.match(pat, s)
...
>>> test("^A+( *& *A+)*$")
>>>
--
https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
Oh, a further thought...

On Thursday 05 May 2016 16:46, Stephen Hansen wrote:

> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
>> Start by writing a function or a regex that will distinguish strings
>> that match your conditions from those that don't. A regex might be
>> faster, but here's a function version.
>> ... snip ...
>
> Yikes. I'm all for the idea that one shouldn't go to regex when Python's
> powerful string type can answer the problem more clearly, but this seems
> to go out of its way to do otherwise.
>
> I don't even care about faster: Its overly complicated. Sometimes a
> regular expression really is the clearest way to solve a problem.

Putting non-ASCII letters aside for the moment, how would you match these
specs as a regular expression?

- All uppercase ASCII letters (A to Z only), optionally separated into
  words by either a bare ampersand ("AAA&AAA") or an ampersand with
  leading and trailing spaces (spaces only, not arbitrary whitespace):
  "AAA & AAA".

- The number of spaces on either side of the ampersands need not be the
  same: "AAA& BBB & CCC" should match.

- Leading or trailing spaces, or spaces not surrounding an ampersand,
  must not match: "AAA BBB" must be rejected.

- Leading or trailing ampersands must also be rejected. This includes
  the case where the string is nothing but ampersands.

- Consecutive ampersands "AAA&&" and the empty string must be rejected.

I get something like this:

r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"

but it fails on strings like "AA & A & A". What am I doing wrong?

For the record, here's my brief test suite:

def test(pat):
    for s in ("", " ", "&", "A A", "A&", "", "A&", "A& "):
        assert re.match(pat, s) is None
    for s in ("A", "A & A", "AA", "AA & A & A"):
        assert re.match(pat, s)

-- 
Steve
--
https://mail.python.org/mailman/listinfo/python-list
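For the record, a pattern that does pass this suite is the generalized form of the `^A+( *& *A+)*$` answer given elsewhere in the thread; the repetition of one group avoids the alternation entirely:

```python
import re

pattern = r"^[A-Z]+( *& *[A-Z]+)*$"

# strings the spec says to reject
for s in ("", " ", "&", "A A", "A&", "A& ", "&A", "AAA&&"):
    assert re.match(pattern, s) is None

# strings the spec says to accept
for s in ("A", "A & A", "AA", "AA & A & A", "AAA& BBB & CCC"):
    assert re.match(pattern, s)
```

Because every ampersand inside the group must be followed by at least one letter, leading, trailing, and doubled ampersands all fail without any extra clauses.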
Re: Whittle it on down
On Thu, May 5, 2016, at 12:04 AM, Steven D'Aprano wrote: > On Thursday 05 May 2016 16:46, Stephen Hansen wrote: > > > On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote: > >> Start by writing a function or a regex that will distinguish strings that > >> match your conditions from those that don't. A regex might be faster, but > >> here's a function version. > >> ... snip ... > > > > Yikes. I'm all for the idea that one shouldn't go to regex when Python's > > powerful string type can answer the problem more clearly, but this seems > > to go out of its way to do otherwise. > > > > I don't even care about faster: Its overly complicated. Sometimes a > > regular expression really is the clearest way to solve a problem. > > You're probably right, but I find it easier to reason about matching in > Python rather than the overly terse, cryptic regular expression mini- > language. > > I haven't tested my function version, but I'm 95% sure that it is > correct. > It trickiest part of it is the logic about splitting around ampersands. > And > I'll cheerfully admit that it isn't easy to extend to (say) "ampersand, > or > at signs". But your regex solution: > > r"^[A-Z\s&]+$" > > is much smaller and more compact, but *wrong*. For instance, your regex > wrongly accepts both "&" and " " as valid strings, and wrongly > rejects "ΔΣΘΛ". Your Greek customers will be sad... Meh. You have a pedantic definition of wrong. Given the inputs, it produced right output. Very often that's enough. Perfect is the enemy of good, it's said. There's no situation where "&" and " " will exist in the given dataset, and recognizing that is important. You don't have to account for every bit of nonsense. If the OP needs a unicode-aware solution that redefines "A-Z" as perhaps "\w" with an isupper call. Its still far simpler then you're suggesting. -- Stephen Hansen m e @ i x o k a i . i o -- https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On Thursday 05 May 2016 16:46, Stephen Hansen wrote:

> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
>> Start by writing a function or a regex that will distinguish strings
>> that match your conditions from those that don't. A regex might be
>> faster, but here's a function version.
>> ... snip ...
>
> Yikes. I'm all for the idea that one shouldn't go to regex when Python's
> powerful string type can answer the problem more clearly, but this seems
> to go out of its way to do otherwise.
>
> I don't even care about faster: Its overly complicated. Sometimes a
> regular expression really is the clearest way to solve a problem.

You're probably right, but I find it easier to reason about matching in
Python rather than the overly terse, cryptic regular expression
mini-language.

I haven't tested my function version, but I'm 95% sure that it is
correct. The trickiest part of it is the logic about splitting around
ampersands. And I'll cheerfully admit that it isn't easy to extend to
(say) "ampersand, or at signs". But your regex solution:

r"^[A-Z\s&]+$"

is much smaller and more compact, but *wrong*. For instance, your regex
wrongly accepts both "&" and " " as valid strings, and wrongly rejects
"ΔΣΘΛ". Your Greek customers will be sad...

Oh, I just realised, I should have looked more closely at the examples
given, because the specification given by DFS does not match the
examples. DFS says that only uppercase letters and ampersands are
allowed, but their examples include strings with spaces, e.g.
'FITNESS CENTERS' despite the lack of ampersands. (I read the spec
literally as spaces only allowed if they surround an ampersand.)

Oops, mea culpa. That makes the check function much simpler and easier
to extend:

def check(string):
    string = string.replace("&", "").replace(" ", "")
    return string.isalpha() and string.isupper()

and now I'm 95% confident it is correct without testing, this time for
sure! ;-)

-- 
Steve
--
https://mail.python.org/mailman/listinfo/python-list
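A few spot checks of that simpler version (note, as raised elsewhere in the thread, that it still accepts strings with leading or trailing spaces):

```python
def check(string):
    string = string.replace("&", "").replace(" ", "")
    return string.isalpha() and string.isupper()

assert check("HEALTH & FITNESS CLUBS")
assert not check("Health Clubs & Gymnasiums (42)")
assert not check("&")       # collapses to "", and "".isalpha() is False
assert check(" FOO ")       # leading/trailing spaces still slip through
```

Stripping out every ampersand and space before testing also means the all-ampersand and all-space strings reduce to the empty string, which `isalpha` conveniently rejects.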
Re: Whittle it on down
On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote: > Start by writing a function or a regex that will distinguish strings that > match your conditions from those that don't. A regex might be faster, but > here's a function version. > ... snip ... Yikes. I'm all for the idea that one shouldn't go to regex when Python's powerful string type can answer the problem more clearly, but this seems to go out of its way to do otherwise. I don't even care about faster: Its overly complicated. Sometimes a regular expression really is the clearest way to solve a problem. -- Stephen Hansen m e @ i x o k a i . i o -- https://mail.python.org/mailman/listinfo/python-list
Re: Whittle it on down
On Thursday 05 May 2016 14:58, DFS wrote:

> Want to whittle a list like this: [...]
> Want to keep all elements containing only upper case letters or upper
> case letters and ampersand (where ampersand is surrounded by spaces)

Start by writing a function or a regex that will distinguish strings that
match your conditions from those that don't. A regex might be faster, but
here's a function version.

def isupperalpha(string):
    return string.isalpha() and string.isupper()

def check(string):
    if isupperalpha(string):
        return True
    parts = string.split("&")
    if len(parts) < 2:
        return False
    # Don't strip leading spaces from the start of the string.
    parts[0] = parts[0].rstrip(" ")
    # Or trailing spaces from the end of the string.
    parts[-1] = parts[-1].lstrip(" ")
    # But strip leading and trailing spaces from the middle parts
    # (if any).
    for i in range(1, len(parts)-1):
        parts[i] = parts[i].strip(" ")
    return all(isupperalpha(part) for part in parts)

Now you have two ways of filtering this. The obvious way is to extract
elements which meet the condition. Here are two ways:

# List comprehension.
newlist = [item for item in oldlist if check(item)]

# Filter, Python 2 version
newlist = filter(check, oldlist)

# Filter, Python 3 version
newlist = list(filter(check, oldlist))

In practice, this is the best (fastest, simplest) way. But if you fear
that you will run out of memory dealing with absolutely humongous lists
with hundreds of millions or billions of strings, you can remove items
in place:

def remove(func, alist):
    for i in range(len(alist)-1, -1, -1):
        if not func(alist[i]):
            del alist[i]

Note the magic incantation to iterate from the end of the list towards
the front. If you do it the other way, Bad Things happen. Note that this
will use less memory than extracting the items, but it will be much
slower. You can combine the best of both worlds.
Here is a version that uses a temporary list to modify the original in
place:

# works in both Python 2 and 3
def remove(func, alist):
    # Modify list in place, the fast way.
    alist[:] = filter(func, alist)

-- 
Steve
--
https://mail.python.org/mailman/listinfo/python-list
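A small sketch of why the slice assignment matters here: `alist[:] = ...` replaces the contents of the existing list object rather than rebinding a local name, so the caller's list really is modified in place (the predicate and data below are stand-ins):

```python
def remove(func, alist):
    # slice assignment mutates the list object itself
    alist[:] = filter(func, alist)

data = ["GYMNASIUMS", "edit address", "FITNESS CENTERS", "5"]
alias = data
remove(str.isupper, data)

assert alias is data    # still the same list object
assert data == ["GYMNASIUMS", "FITNESS CENTERS"]
```

Writing `alist = list(filter(func, alist))` instead would only rebind the function's local name and leave the caller's list untouched.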
Re: Whittle it on down
DFS writes:

. .

> Want to keep all elements containing only upper case letters or upper
> case letters and ampersand (where ampersand is surrounded by spaces)
>
> Is it easier to extract elements meeting those conditions, or remove
> elements meeting the following conditions:
>
> * elements with a lower-case letter in them
> * elements with a number in them
> * elements with a period in them
>
> ?
>
> So far all I figured out is remove items with a period:
> newlist = [ x for x in oldlist if "." not in x ]

Either way is easy to approximate with a regex:

import re
upper = re.compile(r'[A-Z &]+')
lower = re.compile(r'[^A-Z &]')

print([datum for datum in data if upper.fullmatch(datum)])
print([datum for datum in data if not lower.search(datum)])

I've skipped testing that the ampersand is between spaces, and I've
skipped the period. Adjust.

This considers only ASCII upper case letters. You can add individual
letters that matter to you, or you can reach for the documentation to
find if there is some generic notation for all upper case letters. The
newer regex package on PyPI supports POSIX character classes like
[:upper:], I think, and there may or may not be notation for Unicode
character categories in re or regex - LU would be Letter, Uppercase.
--
https://mail.python.org/mailman/listinfo/python-list
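Run on a small sample, the two approximations agree (the `data` list here is illustrative, and `fullmatch` on a compiled pattern requires Python 3.4+):

```python
import re

upper = re.compile(r'[A-Z &]+')     # the whole string must be these
lower = re.compile(r'[^A-Z &]')     # any character outside the set

data = ["HEALTH & FITNESS CLUBS", "edit address", "GYMNASIUMS", "F.A.Q."]

by_fullmatch = [d for d in data if upper.fullmatch(d)]
by_search = [d for d in data if not lower.search(d)]

assert by_fullmatch == by_search == ["HEALTH & FITNESS CLUBS", "GYMNASIUMS"]
```

The equivalence holds because "every character is in the set" and "no character is outside the set" describe the same strings; as noted later in the thread, both versions still accept strings like "&" and " ".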
Re: Whittle it on down
On Wed, May 4, 2016, at 09:58 PM, DFS wrote:
> Want to whittle a list like this:
>
> [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs &
> Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city guide',
> 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS',
> 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE &
> PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS',
> '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us',
> 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us',
> 'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']
>
> down to
>
> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
> & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']

Sometimes regular expressions are the tool to do the job:

Given:

>>> import re
>>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)',
...     'Health Clubs & Gymnasiums (42)', 'Health Fitness Clubs', 'Name',
...     'Atlanta city guide', 'edit address', 'Tweet',
...     'PHYSICAL FITNESS CONSULTANTS & TRAINERS',
...     'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
...     'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
...     'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
...     'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
...     'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
...     'EXERCISE & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS',
...     'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
...     'PERSONAL FITNESS TRAINERS', '5', '4', '3', '2', '1',
...     'Yellow Pages', 'About Us', 'Contact Us', 'Support',
...     'Terms of Use', 'Privacy Policy', 'Advertise With Us',
...     'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']

Then:

>>> pattern = re.compile(r"^[A-Z\s&]+$")
>>> output = [x for x in input if pattern.match(x)]
>>> output
['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
& GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']

-- 
Stephen Hansen
m e @ i x o k a i . i o
--
https://mail.python.org/mailman/listinfo/python-list