Re: Whittle it on down

2016-05-06 Thread DFS
On 5/6/2016 11:44 AM, Peter Otten wrote: DFS wrote: There are up to 4 levels of categorization: http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories, and 390 Level 2. To get the Level 3 and 4 you have to drill-down using the hyperlinks. How to do it in python code is beyond my ski

Re: Whittle it on down

2016-05-06 Thread Peter Otten
DFS wrote: > There are up to 4 levels of categorization: > http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories, and 390 > Level 2. To get the Level 3 and 4 you have to drill-down using the > hyperlinks. > > How to do it in python code is beyond my skills at this point. Get the > hre

Re: Whittle it on down

2016-05-06 Thread DFS
On 5/6/2016 9:58 AM, DFS wrote: On 5/6/2016 3:45 AM, Peter Otten wrote: DFS wrote: Should've looked earlier. Their master list of categories http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes, and the ampersands we talked about. "OFFICE SERVICES, SUPPLIES & EQUIPMENT" g

Re: Whittle it on down

2016-05-06 Thread DFS
On 5/6/2016 3:45 AM, Peter Otten wrote: DFS wrote: Should've looked earlier. Their master list of categories http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes, and the ampersands we talked about. "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.

Re: Whittle it on down

2016-05-06 Thread alister
On Thu, 05 May 2016 19:31:33 -0400, DFS wrote: > On 5/5/2016 1:39 AM, Stephen Hansen wrote: > >> Given: >> > input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs > & Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city > guide', 'edit address', 'Tweet', 'PHY

Re: Whittle it on down

2016-05-06 Thread Peter Otten
DFS wrote: > On 5/5/2016 1:39 AM, Stephen Hansen wrote: > >> Given: >> > input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs & > Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city > guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & > TR

Re: Whittle it on down

2016-05-05 Thread Jussi Piitulainen
Steven D'Aprano writes: > On Fri, 6 May 2016 04:27 am, Jussi Piitulainen wrote: > >> Random832's pattern is fine. You need to use re.fullmatch with it. > > py> re.fullmatch > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > AttributeError: 'module' object has no attribute 'fullmatch'

Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Fri, 6 May 2016 04:27 am, Jussi Piitulainen wrote: > Random832's pattern is fine. You need to use re.fullmatch with it. py> re.fullmatch Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'module' object has no attribute 'fullmatch' -- Steven -- https://mail.python
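For readers who hit the same AttributeError: re.fullmatch only exists from Python 3.4 onward. The following is a minimal sketch of one way to get equivalent behaviour on an older interpreter; the helper names anchored_match and fullmatch_compat are hypothetical and do not come from the thread.

```python
import re


def anchored_match(pattern, string, flags=0):
    """Approximate re.fullmatch using re.match (hypothetical helper).

    The pattern is wrapped in a non-capturing group so that a top-level
    alternation is anchored as a whole, and \\Z is appended so the match
    must extend to the very end of the string.
    """
    return re.match(r"(?:%s)\Z" % pattern, string, flags)


def fullmatch_compat(pattern, string, flags=0):
    """Use the native re.fullmatch when present, else the emulation."""
    native = getattr(re, "fullmatch", None)
    if native is not None:
        return native(pattern, string, flags)
    return anchored_match(pattern, string, flags)


print(bool(fullmatch_compat(r"[A-Z &]+", "HEALTH CLUBS & GYMNASIUMS")))
print(bool(fullmatch_compat(r"[A-Z &]+", "Health Clubs & Gymnasiums (42)")))
```

The sketch uses \Z rather than $ because $ also matches just before a trailing newline, which re.fullmatch does not allow.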

Re: Whittle it on down

2016-05-05 Thread DFS
On 5/5/2016 1:39 AM, Stephen Hansen wrote: Given: input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs & Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'H

Re: Whittle it on down

2016-05-05 Thread DFS
On 5/5/2016 2:56 PM, Stephen Hansen wrote: On Thu, May 5, 2016, at 05:31 AM, DFS wrote: You are out of your mind. Whoa, now. I might disagree with Steven D'Aprano about how to approach this problem, but there's no need to be rude. Seriously not trying to be rude - more smart-alecky than anyt

Re: Whittle it on down

2016-05-05 Thread DFS
On 5/5/2016 1:54 PM, Steven D'Aprano wrote: On Thu, 5 May 2016 10:31 pm, DFS wrote: You are out of your mind. That's twice you've tried to put me down, first by dismissing my comments about text processing with "Linguist much", and now an outright insult. The first time I laughed it off and m

Re: Whittle it on down

2016-05-05 Thread Stephen Hansen
On Thu, May 5, 2016, at 11:03 AM, Steven D'Aprano wrote: > - Nobody could possibly want to support non-ASCII text. (Apart from the > approximately 6.5 billion people in the world that don't speak English of > course, an utterly insignificant majority.) Oh, I'd absolutely want to support non-ASCII

Re: Whittle it on down

2016-05-05 Thread Stephen Hansen
On Thu, May 5, 2016, at 05:31 AM, DFS wrote: > You are out of your mind. Whoa, now. I might disagree with Steven D'Aprano about how to approach this problem, but there's no need to be rude. Everyone's trying to help you, after all. -- Stephen Hansen m e @ i x o k a i . i o -- https://mail.pyt

Re: Whittle it on down

2016-05-05 Thread Stephen Hansen
On Thu, May 5, 2016, at 10:43 AM, Steven D'Aprano wrote: > On Thu, 5 May 2016 11:32 pm, Stephen Hansen wrote: > > > On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote: > >> Oh, a further thought... > >> > >> On Thursday 05 May 2016 16:46, Stephen Hansen wrote: > >> > I don't even care about

Re: Whittle it on down

2016-05-05 Thread Random832
On Thu, May 5, 2016, at 14:27, Jussi Piitulainen wrote: > Random832's pattern is fine. You need to use re.fullmatch with it. Heh, in my previous post I said "and one could easily imagine an API that implicitly anchors at the end". So easy to imagine it turns out that someone already did, as it tur

Re: Whittle it on down

2016-05-05 Thread Random832
On Thu, May 5, 2016, at 14:03, Steven D'Aprano wrote: > You failed to anchor the string at the beginning and end of the string, > an easy mistake to make, but that's the point. I don't think anchoring is properly a concern of the regex itself - .match is anchored implicitly at the beginning, and o
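The anchoring point can be illustrated with a short sketch (Python 3.4 or later, since it uses re.fullmatch). The sample strings are made up for illustration and are not taken from the thread.

```python
import re

letters = r"[A-Z]+"  # deliberately written without ^ or $

# re.match is anchored at the start only: trailing text is ignored.
print(bool(re.match(letters, "ABC def")))      # True
# re.search is not anchored at all: it finds "DEF" in the middle.
print(bool(re.search(letters, "abc DEF")))     # True
# re.match fails when the string does not start with a match.
print(bool(re.match(letters, "abc DEF")))      # False
# re.fullmatch is anchored at both ends.
print(bool(re.fullmatch(letters, "ABC def")))  # False
print(bool(re.fullmatch(letters, "ABC")))      # True
```

The same unanchored pattern gives three different answers depending on which function applies it, which is why the choice of function matters as much as the pattern.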

Re: Whittle it on down

2016-05-05 Thread Jussi Piitulainen
Steven D'Aprano writes: > On Fri, 6 May 2016 03:49 am, Jussi Piitulainen wrote: > >> Steven D'Aprano writes: >> >>> I get something like this: >>> >>> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)" >>> >>> >>> but it fails on strings like "AA & A & A". What am I doing wrong? >> >> It cannot spl

Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Fri, 6 May 2016 03:49 am, Jussi Piitulainen wrote: > Steven D'Aprano writes: > >> I get something like this: >> >> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)" >> >> >> but it fails on strings like "AA & A & A". What am I doing wrong? > > It cannot split the string as (LETTERS & LETTERS)(L

Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thu, 5 May 2016 11:21 pm, Random832 wrote: > On Thu, May 5, 2016, at 03:36, Steven D'Aprano wrote: >> Putting non-ASCII letters aside for the moment, how would you match these >> specs as a regular expression? > > Well, obviously *your* language (not the OP's), given the cases you > reject, is

Re: Whittle it on down

2016-05-05 Thread Jussi Piitulainen
Steven D'Aprano writes: > I get something like this: > > r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)" > > > but it fails on strings like "AA & A & A". What am I doing wrong? It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS) when the middle part is just one LETTER. That's som
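The failure described here can be reproduced with a short sketch. The pattern is the one quoted above; the extra test strings are assumptions chosen to show when the repeated group can and cannot split the input.

```python
import re

# Pattern exactly as quoted in the post.
pattern = re.compile(r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)")

for text in ["AA", "AA & A", "AA & A & A", "A & AA & A", "AB & CD & EF"]:
    print(repr(text), bool(pattern.match(text)))
```

Each repetition of the inner group needs its own letters on both sides of the ampersand, so a middle run of letters has to be shared out between two repetitions. "A & AA & A" works because "AA" can be split into "A" + "A"; "AA & A & A" does not, because the single middle "A" cannot serve both repetitions.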

Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thu, 5 May 2016 10:31 pm, DFS wrote: > You are out of your mind. That's twice you've tried to put me down, first by dismissing my comments about text processing with "Linguist much", and now an outright insult. The first time I laughed it off and made a joke about it. I won't do that again. Y

Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thu, 5 May 2016 11:32 pm, Stephen Hansen wrote: > On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote: >> Oh, a further thought... >> >> On Thursday 05 May 2016 16:46, Stephen Hansen wrote: >> > I don't even care about faster: It's overly complicated. Sometimes a >> > regular expression rea

Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thu, 5 May 2016 11:13 pm, Random832 wrote: > On Thu, May 5, 2016, at 04:41, Steven D'Aprano wrote: >> > There's no situation where "&" and " " will exist in the given >> > dataset, and recognizing that is important. You don't have to account >> > for every bit of nonsense. >> >> Whenev

Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thu, 5 May 2016 06:17 pm, Peter Otten wrote: >> I get something like this: >> >> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)" >> >> >> but it fails on strings like "AA & A & A". What am I doing wrong? > test("^A+( *& *A+)*$") Thanks Peter, that's nice! -- Steven -- https://mail.pyt
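A sketch of how the suggested shape behaves. The test helper is not shown in the excerpt, so plain re calls are used here, and writing [A-Z] in place of the literal A is an assumption about how the pattern would be generalised.

```python
import re

# Peter Otten's shape "^A+( *& *A+)*$" with A generalised to [A-Z]
# (the generalisation is an assumption, not taken from the post).
pattern = re.compile(r"^[A-Z]+( *& *[A-Z]+)*$")

for text in ["AA", "A&A", "AA & A & A", "A &", "& A", "", "HEALTH CLUBS & GYMNASIUMS"]:
    print(repr(text), bool(pattern.match(text)))
```

The structure is "one run of letters, then zero or more groups of ampersand plus letters", so every ampersand is guaranteed a run of letters on each side and a single middle letter is no longer a problem. Note that this shape follows the spec discussed in this sub-thread: it rejects multi-word entries such as "HEALTH CLUBS & GYMNASIUMS", because a space is only allowed next to an ampersand.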

Re: Whittle it on down

2016-05-05 Thread DFS
On 5/5/2016 9:32 AM, Stephen Hansen wrote: On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote: Oh, a further thought... On Thursday 05 May 2016 16:46, Stephen Hansen wrote: I don't even care about faster: It's overly complicated. Sometimes a regular expression really is the clearest way to

Re: Whittle it on down

2016-05-05 Thread Stephen Hansen
On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote: > Oh, a further thought... > > On Thursday 05 May 2016 16:46, Stephen Hansen wrote: > > I don't even care about faster: It's overly complicated. Sometimes a > > regular expression really is the clearest way to solve a problem. > > Putting no

Re: Whittle it on down

2016-05-05 Thread Random832
On Thu, May 5, 2016, at 03:36, Steven D'Aprano wrote: > Putting non-ASCII letters aside for the moment, how would you match these > specs as a regular expression? Well, obviously *your* language (not the OP's), given the cases you reject, is "one or more sequences of letters separated by space*-a

Re: Whittle it on down

2016-05-05 Thread Random832
On Thu, May 5, 2016, at 04:41, Steven D'Aprano wrote: > > There's no situation where "&" and " " will exist in the given > > dataset, and recognizing that is important. You don't have to account > > for every bit of nonsense. > > Whenever a programmer says "This case will never happen",

Re: Whittle it on down

2016-05-05 Thread DFS
On 5/5/2016 1:53 AM, Jussi Piitulainen wrote: Either way is easy to approximate with a regex: import re upper = re.compile(r'[A-Z &]+') lower = re.compile(r'[^A-Z &]') print([datum for datum in data if upper.fullmatch(datum)]) print([datum for datum in data if not lower.search(datum)]) This
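The quoted snippet, laid out as a runnable sketch (Python 3.4 or later for fullmatch). The data list is an assumption: it holds only the items visible in the truncated list from the original post.

```python
import re

# Items copied from the list quoted in the thread; the full list is
# truncated in the archive, so this is only a sample.
data = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)',
        'Health Clubs & Gymnasiums (42)', 'Health Fitness Clubs', 'Name',
        'Atlanta city guide', 'edit address', 'Tweet',
        'PHYSICAL FITNESS CONSULTANTS & TRAINERS',
        'HEALTH CLUBS & GYMNASIUMS']

upper = re.compile(r'[A-Z &]+')  # the whole string must be made of these
lower = re.compile(r'[^A-Z &]')  # any one character outside the set

kept_by_upper = [datum for datum in data if upper.fullmatch(datum)]
kept_by_lower = [datum for datum in data if not lower.search(datum)]
print(kept_by_upper)
print(kept_by_lower)
```

On this sample both filters keep the same two upper-case entries. They are not identical in general: an empty string is rejected by the first (the + needs at least one character) but kept by the second (there is no offending character to find).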

Re: Whittle it on down

2016-05-05 Thread DFS
On 5/5/2016 1:39 AM, Stephen Hansen wrote: pattern = re.compile(r"^[A-Z\s&]+$") output = [x for x in list if pattern.match(x)] Holy Shr"^[A-Z\s&]+$" One line of parsing! I was figuring a few list comprehensions would do it - this is better. (note: the reason I specified 'spaces aroun
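A sketch of the one-liner applied to a sample of the list. The name items replaces list from the quoted code to avoid shadowing the built-in, and the sample data and the extra "loose" strings are assumptions added for illustration.

```python
import re

pattern = re.compile(r"^[A-Z\s&]+$")

# Sample taken from the visible part of the list in the original post.
items = ['Health & Fitness Clubs (36)', 'Health Fitness Clubs', 'Name',
         'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS',
         'HEALTH CLUBS & GYMNASIUMS']

output = [x for x in items if pattern.match(x)]
print(output)

# Strings the character-class approach also accepts, which is the
# objection raised elsewhere in the thread.
loose = [s for s in ["&", "   ", "A&&B", "A\tB"] if pattern.match(s)]
print(loose)
```

The pattern only checks which characters appear, not how they are arranged, so it is enough for this dataset but also lets through a lone ampersand, blank strings, and tabs (because \s covers all whitespace).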

Re: Whittle it on down

2016-05-05 Thread DFS
On 5/5/2016 2:04 AM, Steven D'Aprano wrote: On Thursday 05 May 2016 14:58, DFS wrote: Want to whittle a list like this: [...] Want to keep all elements containing only upper case letters or upper case letters and ampersand (where ampersand is surrounded by spaces) Start by writing a functi

Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thursday 05 May 2016 17:34, Stephen Hansen wrote: > Meh. You have a pedantic definition of wrong. Given the inputs, it > produced right output. Very often that's enough. Perfect is the enemy of > good, it's said. And this is a *perfect* example of why we have things like this: http://www.bbc

Re: Whittle it on down

2016-05-05 Thread Peter Otten
Steven D'Aprano wrote: > Oh, a further thought... > > > On Thursday 05 May 2016 16:46, Stephen Hansen wrote: > >> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote: >>> Start by writing a function or a regex that will distinguish strings >>> that match your conditions from those that don'

Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
Oh, a further thought... On Thursday 05 May 2016 16:46, Stephen Hansen wrote: > On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote: >> Start by writing a function or a regex that will distinguish strings that >> match your conditions from those that don't. A regex might be faster, but >> he

Re: Whittle it on down

2016-05-05 Thread Stephen Hansen
On Thu, May 5, 2016, at 12:04 AM, Steven D'Aprano wrote: > On Thursday 05 May 2016 16:46, Stephen Hansen wrote: > > > On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote: > >> Start by writing a function or a regex that will distinguish strings that > >> match your conditions from those that do

Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thursday 05 May 2016 16:46, Stephen Hansen wrote: > On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote: >> Start by writing a function or a regex that will distinguish strings that >> match your conditions from those that don't. A regex might be faster, but >> here's a function version. >>

Re: Whittle it on down

2016-05-04 Thread Stephen Hansen
On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote: > Start by writing a function or a regex that will distinguish strings that > match your conditions from those that don't. A regex might be faster, but > here's a function version. > ... snip ... Yikes. I'm all for the idea that one should

Re: Whittle it on down

2016-05-04 Thread Steven D'Aprano
On Thursday 05 May 2016 14:58, DFS wrote: > Want to whittle a list like this: [...] > Want to keep all elements containing only upper case letters or upper > case letters and ampersand (where ampersand is surrounded by spaces) Start by writing a function or a regex that will distinguish strings
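The function from this post is cut off in the archive, so the following is not that function. It is a hypothetical sketch of what a non-regex predicate for the stated rule (only upper-case ASCII letters, with any ampersand surrounded by single spaces) could look like; the name is_wanted and the single-space tokenising are assumptions.

```python
def is_wanted(text):
    """Return True if text is upper-case words, optionally joined by ' & '.

    Hypothetical helper: words are separated by single spaces, and an
    ampersand must stand alone between two words.
    """
    tokens = text.split(" ")
    for position, token in enumerate(tokens):
        if token == "&":
            # An ampersand needs a word on each side.
            if (position == 0 or position == len(tokens) - 1
                    or tokens[position - 1] == "&"):
                return False
        elif not token or not all("A" <= ch <= "Z" for ch in token):
            # Empty tokens come from leading, trailing or doubled spaces.
            return False
    return True


samples = ['Health Fitness Clubs', 'HEALTH CLUBS & GYMNASIUMS', 'A&B', '& A', '']
print([s for s in samples if is_wanted(s)])
```

Whether doubled spaces or a missing space around the ampersand should be accepted is a judgment call; this sketch rejects both.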

Re: Whittle it on down

2016-05-04 Thread Jussi Piitulainen
DFS writes: . . > Want to keep all elements containing only upper case letters or upper > case letters and ampersand (where ampersand is surrounded by spaces) > > Is it easier to extract elements meeting those conditions, or remove > elements meeting the following conditions: > > * elements with

Re: Whittle it on down

2016-05-04 Thread Stephen Hansen
On Wed, May 4, 2016, at 09:58 PM, DFS wrote: > Want to whittle a list like this: > > [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs & > Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city guide', > 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS', > '

Whittle it on down

2016-05-04 Thread DFS
Want to whittle a list like this: [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs & Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',