Re: Whittle it on down

2016-05-06 Thread DFS

On 5/6/2016 11:44 AM, Peter Otten wrote:

DFS wrote:


There are up to 4 levels of categorization:



http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories, and 390
Level 2.  To get the Level 3 and 4 you have to drill-down using the
hyperlinks.

How to do it in python code is beyond my skills at this point.  Get the
hrefs and load them and parse, then get the next level and load them and
parse, etc.?


Yes, that should work ;)



How about you do it, and I'll tell you if you did it right?

ha!




--
https://mail.python.org/mailman/listinfo/python-list


Re: Whittle it on down

2016-05-06 Thread Peter Otten
DFS wrote:

> There are up to 4 levels of categorization:
 
> http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories, and 390
> Level 2.  To get the Level 3 and 4 you have to drill-down using the
> hyperlinks.
> 
> How to do it in python code is beyond my skills at this point.  Get the
> hrefs and load them and parse, then get the next level and load them and
> parse, etc.?

Yes, that should work ;)
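
A minimal sketch of that drill-down for Python 3, using only the standard library. The `fetch` argument is a hypothetical callable that returns the HTML text for a URL, so nothing here assumes how the real site is laid out; in practice you would pass a wrapper around urlopen and filter the links down to category pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(url, fetch, max_depth=4, depth=1, seen=None):
    """Return {url: depth at which it was first reached}.

    fetch is a hypothetical callable mapping a URL to its HTML text.
    Pages deeper than max_depth are not loaded.
    """
    seen = {} if seen is None else seen
    if depth > max_depth or url in seen:
        return seen
    seen[url] = depth
    parser = LinkCollector()
    parser.feed(fetch(url))
    for href in parser.hrefs:
        # Resolve relative links against the page they were found on.
        crawl(urljoin(url, href), fetch, max_depth, depth + 1, seen)
    return seen
```

The `seen` dictionary keeps the crawl from loading the same page twice, which matters on sites where every page links back to the top level.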



Re: Whittle it on down

2016-05-06 Thread DFS

On 5/6/2016 9:58 AM, DFS wrote:

On 5/6/2016 3:45 AM, Peter Otten wrote:

DFS wrote:



Should've looked earlier.  Their master list of categories
http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
and the ampersands we talked about.

"OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the
comma.

"AUTOMOBILE - DEALERS" gets removed because of the dash.

I updated your regex and it seems to have fixed it.

orig: (r"^[A-Z\s&]+$")
new : (r"^[A-Z\s&,-]+$")


Thanks again.


If there is a "master list" compare your candidates against it instead of
using a heuristic, i. e.

categories = set(master_list)
output = [category for category in input if category in categories]

You can find the categories with


>>> import urllib.request
>>> import bs4
>>> soup = bs4.BeautifulSoup(
...     urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>> categories = set()
>>> for li in soup.find_all("li"):
...     assert li.parent.parent["class"][0].startswith("category_items")
...     categories.add(li.text)
...
>>> print("\n".join(sorted(categories)[:10]))




"import urllib.request
ImportError: No module named request"



Figured it out using urllib2.  Your code returns 411 categories from 
that first page.
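
The urllib2 workaround can be written so that the same script runs on Python 2 and Python 3. This is only a sketch; `fetch_html` is a hypothetical helper name, not something from the thread.

```python
try:
    # Python 3: urlopen lives in urllib.request
    from urllib.request import urlopen
except ImportError:
    # Python 2: the same function is in urllib2
    from urllib2 import urlopen


def fetch_html(url):
    """Return the raw bytes of the page at url (hypothetical helper)."""
    return urlopen(url).read()
```

The result of `fetch_html(...)` can be handed to bs4.BeautifulSoup in place of the `urllib.request.urlopen(...).read()` call quoted above.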


There are up to 4 levels of categorization:


Level 1: Arts & Entertainment
Level 2:   Newspapers

Level 3: Newspaper Brokers
Level 3: Newspaper Dealers Back Number
Level 3: Newspaper Delivery
Level 3: Newspaper Distributors
Level 3: Newsracks
Level 3: Printers Newspapers
Level 3: Newspaper Dealers

Level 3: News Dealers
Level 4:   News Dealers Wholesale
Level 4:   Shoppers News Publications

Level 3: News Service
Level 4:   Newspaper Feature Syndicates
Level 4:   Prepress Services




http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories, and 390 
Level 2.  To get the Level 3 and 4 you have to drill-down using the 
hyperlinks.


How to do it in python code is beyond my skills at this point.  Get the 
hrefs and load them and parse, then get the next level and load them and 
parse, etc.?







Re: Whittle it on down

2016-05-06 Thread DFS

On 5/6/2016 3:45 AM, Peter Otten wrote:

DFS wrote:



Should've looked earlier.  Their master list of categories
http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
and the ampersands we talked about.

"OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.

"AUTOMOBILE - DEALERS" gets removed because of the dash.

I updated your regex and it seems to have fixed it.

orig: (r"^[A-Z\s&]+$")
new : (r"^[A-Z\s&,-]+$")


Thanks again.


If there is a "master list" compare your candidates against it instead of
using a heuristic, i. e.

categories = set(master_list)
output = [category for category in input if category in categories]

You can find the categories with


>>> import urllib.request
>>> import bs4
>>> soup = bs4.BeautifulSoup(
...     urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>> categories = set()
>>> for li in soup.find_all("li"):
...     assert li.parent.parent["class"][0].startswith("category_items")
...     categories.add(li.text)
...
>>> print("\n".join(sorted(categories)[:10]))




"import urllib.request
ImportError: No module named request"


I'm on python 2.7.11






Accounting & Bookkeeping Services
Adoption Services
Adult Entertainment
Advertising
Agricultural Equipment & Supplies
Agricultural Production
Agricultural Services
Aids Resources
Aircraft Charters & Rentals
Aircraft Dealers & Services





Yeah, I actually did something like that last night.  Was trying to get
their full tree structure, which goes 4 levels deep: ie

Arts & Entertainment
  Newspapers
    News Dealers
      Prepress Services


What I referred to as their 'master list' is actually just 2 levels 
deep.  My bad.


So far I haven't come across one that had anything in it but letters, 
dashes, commas or ampersands.


Thanks


Re: Whittle it on down

2016-05-06 Thread alister
On Thu, 05 May 2016 19:31:33 -0400, DFS wrote:

> On 5/5/2016 1:39 AM, Stephen Hansen wrote:
> 
>> Given:
>>
> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs
> & Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city
> guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS &
> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE
> & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS
> TRAINERS', '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us',
> 'Contact Us', 'Support', 'Terms of Use', 'Privacy Policy',
> 'Advertise With Us', 'Add/Update Listing', 'Business Profile Login',
> 'F.A.Q.']
>>
>> Then:
>>
> pattern = re.compile(r"^[A-Z\s&]+$")
> output = [x for x in input if pattern.match(x)]
> output
> 
>> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS &
>> GYMNASIUMS',
>> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE',
>> 'GYMNASIUMS',
>> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
>> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH
>> CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']
> 
> 
> Should've looked earlier.  Their master list of categories
> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
> and the ampersands we talked about.
> 
> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the
> comma.
> 
> "AUTOMOBILE - DEALERS" gets removed because of the dash.
> 
> I updated your regex and it seems to have fixed it.
> 
> orig: (r"^[A-Z\s&]+$")
> new : (r"^[A-Z\s&,-]+$")
> 
> 
> Thanks again.

it looks to me like this system is trying to prevent SQL injection 
attacks by blacklisting certain characters.
this is not the correct way to block such attacks & is probably not a 
good indicator to the quality of the rest of the application.



-- 
When love is gone, there's always justice.
And when justice is gone, there's always force.
And when force is gone, there's always Mom.
Hi, Mom!
-- Laurie Anderson


Re: Whittle it on down

2016-05-06 Thread Peter Otten
DFS wrote:

> On 5/5/2016 1:39 AM, Stephen Hansen wrote:
> 
>> Given:
>>
> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs &
> Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city
> guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS &
> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE &
> PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS',
> '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us',
> 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us', 'Add
>  /Update Listing', 'Business Profile Login', 'F.A.Q.']
>>
>> Then:
>>
> pattern = re.compile(r"^[A-Z\s&]+$")
> output = [x for x in input if pattern.match(x)]
> output
> 
>> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
>> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
>> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
>> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
>> & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']
> 
> 
> Should've looked earlier.  Their master list of categories
> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
> and the ampersands we talked about.
> 
> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.
> 
> "AUTOMOBILE - DEALERS" gets removed because of the dash.
> 
> I updated your regex and it seems to have fixed it.
> 
> orig: (r"^[A-Z\s&]+$")
> new : (r"^[A-Z\s&,-]+$")
> 
> 
> Thanks again.

If there is a "master list" compare your candidates against it instead of 
using a heuristic, i. e.

categories = set(master_list)
output = [category for category in input if category in categories]

You can find the categories with

>>> import urllib.request
>>> import bs4
>>> soup = bs4.BeautifulSoup(
...     urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>> categories = set()
>>> for li in soup.find_all("li"):
...     assert li.parent.parent["class"][0].startswith("category_items")
...     categories.add(li.text)
... 
>>> print("\n".join(sorted(categories)[:10]))
Accounting & Bookkeeping Services
Adoption Services
Adult Entertainment
Advertising
Agricultural Equipment & Supplies
Agricultural Production
Agricultural Services
Aids Resources
Aircraft Charters & Rentals
Aircraft Dealers & Services





Re: Whittle it on down

2016-05-05 Thread Jussi Piitulainen
Steven D'Aprano writes:

> On Fri, 6 May 2016 04:27 am, Jussi Piitulainen wrote:
>
>> Random832's pattern is fine. You need to use re.fullmatch with it.
>
> py> re.fullmatch
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'module' object has no attribute 'fullmatch'

It's new in version 3.4 (of Python).
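
On older Pythons the same effect can be approximated by hand. A minimal sketch, assuming the pattern is passed as a string; the wrapper name `fullmatch` is chosen only for illustration.

```python
import re


def fullmatch(pattern, string, flags=0):
    """Emulate re.fullmatch on Pythons older than 3.4."""
    if hasattr(re, "fullmatch"):
        return re.fullmatch(pattern, string, flags)
    # Wrap the pattern in a non-capturing group so that alternations are
    # anchored as a whole, then require the end of the string with \Z.
    return re.match("(?:" + pattern + r")\Z", string, flags)
```

The non-capturing group matters: without it, a pattern such as "a|ab" would only have its last alternative anchored.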


Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Fri, 6 May 2016 04:27 am, Jussi Piitulainen wrote:

> Random832's pattern is fine. You need to use re.fullmatch with it.

py> re.fullmatch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'fullmatch'



-- 
Steven



Re: Whittle it on down

2016-05-05 Thread DFS

On 5/5/2016 1:39 AM, Stephen Hansen wrote:


Given:


input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs & Gymnasiums (42)', 'Health Fitness Clubs', 
'Name', 'Atlanta city guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & 
GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE', 'www.lafitness.com', 
'GYMNASIUMS', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', 
'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS PROGRAMS', 'FITNESS 
CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS', '5', '4', '3', 
'2', '1', 'Yellow Pages', 'About Us', 'Contact Us', 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us', 
'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']


Then:


pattern = re.compile(r"^[A-Z\s&]+$")
output = [x for x in input if pattern.match(x)]
output



['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
& GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']



Should've looked earlier.  Their master list of categories 
http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes, 
and the ampersands we talked about.


"OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.

"AUTOMOBILE - DEALERS" gets removed because of the dash.

I updated your regex and it seems to have fixed it.

orig: (r"^[A-Z\s&]+$")
new : (r"^[A-Z\s&,-]+$")


Thanks again.
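
For reference, a small sketch comparing the two patterns on a handful of strings taken from this thread. The sample list is a shortened stand-in for the scraped data, not the full input.

```python
import re

orig = re.compile(r"^[A-Z\s&]+$")
# '-' is placed last in the character class, so it is a literal dash
# rather than the start of a range.
new = re.compile(r"^[A-Z\s&,-]+$")

samples = [
    "HEALTH CLUBS & GYMNASIUMS",
    "OFFICE SERVICES, SUPPLIES & EQUIPMENT",
    "AUTOMOBILE - DEALERS",
    "Health & Fitness Clubs (36)",
    "www.lafitness.com",
]

kept_orig = [s for s in samples if orig.match(s)]
kept_new = [s for s in samples if new.match(s)]

print(kept_orig)  # only the first sample
print(kept_new)   # the first three samples
```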




Re: Whittle it on down

2016-05-05 Thread DFS

On 5/5/2016 2:56 PM, Stephen Hansen wrote:

On Thu, May 5, 2016, at 05:31 AM, DFS wrote:

You are out of your mind.


Whoa, now. I might disagree with Steven D'Aprano about how to approach
this problem, but there's no need to be rude.


Seriously not trying to be rude - more smart-alecky than anything.

Hope D'Aprano doesn't stay butthurt...




Everyone's trying to help you, after all.


Yes, and I do appreciate it.

I've only been working with python for about a month, but I feel like 
I'm making good progress.  clp is a great resource, and I'll be hanging 
around for a long time, and will contribute when possible.


Thanks for your help.


Re: Whittle it on down

2016-05-05 Thread DFS

On 5/5/2016 1:54 PM, Steven D'Aprano wrote:

On Thu, 5 May 2016 10:31 pm, DFS wrote:


You are out of your mind.


That's twice you've tried to put me down, first by dismissing my comments
about text processing with "Linguist much", and now an outright insult. The
first time I laughed it off and made a joke about it. I won't do that
again.


You asked whether it was better to extract the matching strings into a new
list, or remove them in place in the existing list. I not only showed you
how to do both, but I tried to give you the mental tools to understand when
you should pick one answer over the other. And your response is to insult
me and question my sanity.

Well, DFS, I might be crazy, but I'm not stupid. If that's really how you
feel about my answers, I won't make the mistake of wasting my time
answering your questions in the future.

Over to you now.



heh!  Relax, pal.

I was just trying to be funny - no insult intended either time, of 
course.  Look for similar responses from me in the future.  Usenet 
brings out the smart-aleck in me.


Actually, you should've accepted the 'Linguist much?' as a compliment, 
because I seriously thought you were.


But you ARE out of your mind if you prefer that convoluted "function" 
method over a simple 1-line regex method (as per S. Hansen).


def isupperalpha(string):
    return string.isalpha() and string.isupper()

def check(string):
    if isupperalpha(string):
        return True
    parts = string.split("&")
    if len(parts) < 2:
        return False
    parts[0] = parts[0].rstrip(" ")
    parts[-1] = parts[-1].lstrip(" ")
    for i in range(1, len(parts)-1):
        parts[i] = parts[i].strip(" ")
    return all(isupperalpha(part) for part in parts)


I'm sure it does the job well, but that style brings back [bad] memories 
of the VBA I used to write.  I expected something very concise and 
'pythonic' (which I'm learning is everyone's favorite mantra here in 
python-land).


Anyway, I appreciate ALL replies to my queries.  So thank you for taking 
the time.


Whenever I'm able, I'll try to contribute to clp as well.






Re: Whittle it on down

2016-05-05 Thread Stephen Hansen
On Thu, May 5, 2016, at 11:03 AM, Steven D'Aprano wrote:
> - Nobody could possibly want to support non-ASCII text. (Apart from the
> approximately 6.5 billion people in the world that don't speak English of
> course, an utterly insignificant majority.)

Oh, I'd absolutely want to support non-ASCII text. If I have unicode
input, though, I unfortunately have to rely on
https://pypi.python.org/pypi/regex as 're' doesn't support matching on
character properties. 

I keep hoping it'll replace "re", then we could do:

pattern = regex.compile(r"^[\p{Lu}\s&]+$")

where \p{property} matches against character properties in the unicode
database.
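
If installing the third-party regex package is not an option, a similar test can be sketched with the standard library's unicodedata module. This swaps the regular expression for a per-character check, so it is an alternative technique rather than an equivalent pattern; the function name is hypothetical.

```python
import unicodedata


def is_upper_title(text):
    """True if text is non-empty and every character is an uppercase
    letter (Unicode category Lu), whitespace, or an ampersand."""
    return bool(text) and all(
        ch == "&" or ch.isspace() or unicodedata.category(ch) == "Lu"
        for ch in text
    )
```

Like the character-class pattern, this accepts strings made only of spaces or ampersands, so any stricter validation still has to happen elsewhere.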

> - Data validity doesn't matter, because there's no possible way that you
> might accidentally scrape data from the wrong part of a HTML file and end
> up with junk input.

Um, no one said that. I was arguing that the *regular expression*
doesn't need to be responsible for validation.

> - Even if you do somehow end up with junk, there couldn't possibly be any
> real consequences to that.

No one said that either...

> - It doesn't matter if you match too much, or too little, that just means
> the specs are too pedantic.

Or that...

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: Whittle it on down

2016-05-05 Thread Stephen Hansen
On Thu, May 5, 2016, at 05:31 AM, DFS wrote:
> You are out of your mind.

Whoa, now. I might disagree with Steven D'Aprano about how to approach
this problem, but there's no need to be rude. Everyone's trying to help
you, after all.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: Whittle it on down

2016-05-05 Thread Stephen Hansen
On Thu, May 5, 2016, at 10:43 AM, Steven D'Aprano wrote:
> On Thu, 5 May 2016 11:32 pm, Stephen Hansen wrote:
> 
> > On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
> >> Oh, a further thought...
> >> 
> >> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> >> > I don't even care about faster: Its overly complicated. Sometimes a
> >> > regular expression really is the clearest way to solve a problem.
> >> 
> >> Putting non-ASCII letters aside for the moment, how would you match these
> >> specs as a regular expression?
> > 
> > I don't know, but mostly because I wouldn't even try. 
> 
> Really? Peter Otten seems to have found a solution, and Random832 almost
> found it too.
> 
> 
> > The requirements 
> > are over-specified. If you look at the OP's data (and based on previous
> > conversation), he's doing web scraping and trying to pull out good data.
> 
> I'm not talking about the OP's data. I'm talking about *my* requirements.
> 
> I thought that this was a friendly discussion about regexes, but perhaps
> I was mistaken. Because I sure am feeling a lot of hostility to the ideas
> that regexes are not necessarily the only way to solve this, and that
> data validation is a good thing.

Umm, what? Hostility? I have no idea where you're getting that.

I didn't say that regexes are the only way to solve problems; in fact
they're something I avoid using in most cases. In the OP's case, though,
I did say I thought it was a natural fit. Usually, I'd go for
startswith/endswith, "in", slicing and such string primitives before I
go for a regular expression.

"Find all upper cased phrases that may have &'s in them" is something
just specific enough that the built in string primitives are awkward
tools.

In my experience, most of the problems with regexes come from people
thinking they're the hammer and every problem is a nail: and then they get
into ever more convoluted expressions that become brittle.  More specific in
a regular expression is not, necessarily, a virtue. In fact it's exactly
the opposite a lot of times.

> > There's no absolutely perfect way to do that because the system he's
> > scraping isn't meant for data processing. The data isn't cleanly
> > articulated.
> 
> Right. Which makes it *more*, not less, important to be sure that your
> regex doesn't match too much, because your data is likely to be
> contaminated by junk strings that don't belong in the data and shouldn't
> be accepted. I've done enough web scraping to realise just how easy it is
> to start grabbing data from the wrong part of the file.

I have nothing against data validation: I don't think it belongs in
regular expressions, though. That can be a step done afterwards.
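
A minimal sketch of that split for Python 3: a loose pattern picks candidates, and a separate function rejects the junk cases listed further down in this message (empty strings, all spaces, all ampersands, stray leading or trailing spaces). The function names are hypothetical.

```python
import re

LOOSE = re.compile(r"^[A-Z\s&]+$")


def looks_like_title(text):
    """Loose heuristic: only uppercase ASCII letters, whitespace and '&'."""
    return LOOSE.match(text) is not None


def is_valid_title(text):
    """Separate validation step, run after the loose match."""
    if text != text.strip():
        return False  # leading or trailing whitespace
    if not any(ch.isalpha() for ch in text):
        return False  # nothing but spaces and/or ampersands
    return True


def extract_titles(candidates):
    return [c for c in candidates if looks_like_title(c) and is_valid_title(c)]
```

Keeping the two steps apart means the validation rules can grow without the pattern becoming harder to read.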

> > Instead, he wants a heuristic to pull out what look like section titles.
> 
> Good for him. I asked a different question. Does my question not count?

Sure it counts, but I don't want to engage in your theoretical exercise.
That's not being hostile, that's me not wanting to think about a complex
set of constraints for a regular expression for purely intellectual
reasons.

> I was trying to teach DFS a generic programming technique, not solve his
> stupid web scraping problem for him. What happens next time when he's
> trying to filter a list of floats, or Widgets? Should he convert them to
> strings so he can use a regex to match them, or should he learn about
> general filtering techniques?

Come on. This is a bit presumptuous, don't you think?

> > This translates naturally into a simple regular expression: an uppercase
> > string with spaces and &'s. Now, that expression doesn't 100% encode
> > every detail of that rule-- it allows both Q and Q & A-- but on my own
> > looking at the data, I suspect its good enough. The titles are clearly
> > separate from the other data scraped by their being upper cased. We just
> > need to expand our allowed character range into spaces and &'s.
> > 
> > Nothing in the OP's request demands the kind of rigorous matching that
> > your scenario does. Its a practical problem with a simple, practical
> > answer.
> 
> Yes, and that practical answer needs to reject:
> 
> - the empty string, because it is easy to mistakenly get empty strings
> when scraping data, especially if you post-process the data;
> 
> - strings that are all spaces, because "   " cannot possibly be a
> title;
> 
> - strings that are all ampersands, because "&" is not a title, and it
> almost surely indicates that your scraping has gone wrong and you're
> reading junk from somewhere;
> 
> - even leading and trailing spaces are suspect: "  FOO  " doesn't match
> any of the examples given, and it seems unlikely to be a title.
> Presumably the strings have already been filtered or post-processed to
> have leading and trailing spaces removed, in which case "  FOO  " reveals
> a bug.

We're going to have to agree to disagree. I find all of that
unnecessary.  Any validation can be easily done before or after
matching, you don't need to 

Re: Whittle it on down

2016-05-05 Thread Random832
On Thu, May 5, 2016, at 14:27, Jussi Piitulainen wrote:
> Random832's pattern is fine. You need to use re.fullmatch with it.

Heh, in my previous post I said "and one could easily imagine an API
that implicitly anchors at the end". So easy to imagine that someone
already did, as it turns out. Batteries included indeed.


Re: Whittle it on down

2016-05-05 Thread Random832
On Thu, May 5, 2016, at 14:03, Steven D'Aprano wrote:
> You failed to anchor the string at the beginning and end of the string,
> an easy mistake to make, but that's the point.

I don't think anchoring is properly a concern of the regex itself -
.match is anchored implicitly at the beginning, and one could easily
imagine an API that implicitly anchors at the end - or you can simply
check that the match length == the string length.
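
The length check could look something like the sketch below, using the pattern from earlier in the thread. The helper name `matches_whole` is hypothetical.

```python
import re

pattern = re.compile(r"[A-Z]+( *& *[A-Z]+)*")


def matches_whole(pattern, text):
    """True if pattern.match succeeds and consumes the entire string."""
    m = pattern.match(text)
    return m is not None and m.end() == len(text)


print(matches_whole(pattern, "AA   &  A &  A"))
print(matches_whole(pattern, "Azzz"))
```

One caveat: .match returns the first match the engine finds, not the longest one. With an alternation such as "a|ab" against "ab", the match stops after "a", so the length check reports failure where an end anchor would have forced the engine to try the second branch.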

> - Data validity doesn't matter, because there's no possible way that you
> might accidentally scrape data from the wrong part of a HTML file and end
> up with junk input.

If you've scraped data from the wrong part of the file, then nothing you
do to your regex can prevent the junk input from coincidentally matching
the input format.


Re: Whittle it on down

2016-05-05 Thread Jussi Piitulainen
Steven D'Aprano writes:

> On Fri, 6 May 2016 03:49 am, Jussi Piitulainen wrote:
>
>> Steven D'Aprano writes:
>> 
>>> I get something like this:
>>>
>>> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>>>
>>>
>>> but it fails on strings like "AA   &  A &  A". What am I doing wrong?
>> 
>> It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS)
>> when the middle part is just one LETTER. That's something of a
>> misanalysis anyway. I notice that the correct pattern has already been
>> posted at least thrice and you have acknowledged one of them.
>
> Thrice? I've seen Peter's response (he made the trivial and obvious
> simplification of just using A instead of [A-Z], but that was easy to
> understand), and Random832 almost got it, missing only that you need to
> match the entire string, not just a substring. If there was a third
> response, I missed it.

I think I saw another. I may be mistaken.

Random832's pattern is fine. You need to use re.fullmatch with it.

. .


Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Fri, 6 May 2016 03:49 am, Jussi Piitulainen wrote:

> Steven D'Aprano writes:
> 
>> I get something like this:
>>
>> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>>
>>
>> but it fails on strings like "AA   &  A &  A". What am I doing wrong?
> 
> It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS)
> when the middle part is just one LETTER. That's something of a
> misanalysis anyway. I notice that the correct pattern has already been
> posted at least thrice and you have acknowledged one of them.

Thrice? I've seen Peter's response (he made the trivial and obvious
simplification of just using A instead of [A-Z], but that was easy to
understand), and Random832 almost got it, missing only that you need to
match the entire string, not just a substring. If there was a third
response, I missed it.


> But I think you are also trying to do too much with a single regex. A
> more promising start is to think of the whole string as "parts" joined
> with "glue", then split with a glue pattern and test the parts:
> 
> import re
> glue = re.compile(" *& *| +")
> keep, drop = [], []
> for datum in data:
>     items = glue.split(datum)
>     if all(map(str.isupper, items)):
>         keep.append(datum)
>     else:
>         drop.append(datum)

Ah, the penny drops! For a while I thought you were suggesting using this to
assemble a regex, and it just wasn't making sense to me. Then I realised
you were using this as a matcher: feed in the list of strings, and it
splits it into strings to keep and strings to discard. Nicely done, that is
a good technique to remember.

Thanks for the analysis!



-- 
Steven



Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thu, 5 May 2016 11:21 pm, Random832 wrote:

> On Thu, May 5, 2016, at 03:36, Steven D'Aprano wrote:
>> Putting non-ASCII letters aside for the moment, how would you match these
>> specs as a regular expression?
> 
> Well, obviously *your* language (not the OP's), given the cases you
> reject, is "one or more sequences of letters separated by
> space*-ampersand-space*", and that is actually one of the easiest kinds
> of regex to write: "[A-Z]+( *& *[A-Z]+)*".

One of the easiest kinds of regex to write incorrectly:

py> re.match("[A-Z]+( *& *[A-Z]+)*", "A")
<_sre.SRE_Match object at 0xb7bf4aa0>


It doesn't even get the "all uppercase" part of the specification:

py> re.match("[A-Z]+( *& *[A-Z]+)*", "Azzz")
<_sre.SRE_Match object at 0xb7bf4aa0>

You failed to anchor the string at the beginning and end of the string, an
easy mistake to make, but that's the point. It's easy to make mistakes with
regexes because the syntax is so overly terse and unforgiving.

But I think I just learned something important today. I learned that it's
not actually regexes that I dislike, it's regex culture that I dislike.
What I learned from this thread:


- Nobody could possibly want to support non-ASCII text. (Apart from the
approximately 6.5 billion people in the world that don't speak English of
course, an utterly insignificant majority.)

- Data validity doesn't matter, because there's no possible way that you
might accidentally scrape data from the wrong part of a HTML file and end
up with junk input.

- Even if you do somehow end up with junk, there couldn't possibly be any
real consequences to that.

- It doesn't matter if you match too much, or too little, that just means the
specs are too pedantic.


Hence the famous quote:

Some people, when confronted with a problem, think 
"I know, I'll use regular expressions." Now they 
have two problems.


It's not really regexes that are the problem.


> However, your spec is wrong:

How can you say that? It's *my* spec, I can specify anything I want.


>> - Leading or trailing spaces, or spaces not surrounding an ampersand,
>> must not match: "AAA BBB" must be rejected.
> 
> The *very first* item in OP's list of good outputs is 'PHYSICAL FITNESS
> CONSULTANTS & TRAINERS'.

That's very nice, but irrelevant. I'm not talking about the OP's outputs.
I'm giving my own.




-- 
Steven



Re: Whittle it on down

2016-05-05 Thread Jussi Piitulainen
Steven D'Aprano writes:

> I get something like this:
>
> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>
>
> but it fails on strings like "AA   &  A &  A". What am I doing wrong?

It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS)
when the middle part is just one LETTER. That's something of a
misanalysis anyway. I notice that the correct pattern has already been
posted at least thrice and you have acknowledged one of them.

But I think you are also trying to do too much with a single regex. A
more promising start is to think of the whole string as "parts" joined
with "glue", then split with a glue pattern and test the parts:

import re
glue = re.compile(" *& *| +")
keep, drop = [], []
for datum in data:
    items = glue.split(datum)
    if all(map(str.isupper, items)):
        keep.append(datum)
    else:
        drop.append(datum)

That will cope with Greek, by the way.
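
A runnable version of the same idea for Python 3, wrapped in a function and fed a few sample strings. The function name and the sample data (including the Greek entry) are made up for illustration.

```python
import re

glue = re.compile(" *& *| +")


def partition(data):
    """Split data into (keep, drop) using the parts-and-glue test."""
    keep, drop = [], []
    for datum in data:
        items = glue.split(datum)
        # str.isupper is False for '', so an empty part (for example
        # from a leading space) causes the datum to be dropped.
        if all(map(str.isupper, items)):
            keep.append(datum)
        else:
            drop.append(datum)
    return keep, drop


data = [
    "HEALTH CLUBS & GYMNASIUMS",
    "ΑΘΗΝΑ & ΣΠΑΡΤΗ",
    "Health Fitness Clubs",
    "www.lafitness.com",
    " FOO",
    "5",
]
keep, drop = partition(data)
print(keep)
print(drop)
```

Note that str.isupper only requires at least one cased character and no lowercase ones, so a part like "A1" passes; a stricter part test is needed if digits should be rejected.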

It's annoying that the order of the branches of the glue pattern above
matters. One _does_ have problems when one uses the usual regex engines.

Capturing groups in the glue pattern would produce glue items in the
split output. Either avoid them or deal with them: one could split with
the underspecific "([ &]+)" and then check that each glue item contains
at most one ampersand. One could also allow other punctuation, and then
check afterwards.
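
The capturing-group variant might be sketched like this in Python 3. With one capturing group, re.split returns parts and glue runs alternately, so they can be separated by slicing. The function name is hypothetical.

```python
import re

glue = re.compile("([ &]+)")


def is_title(datum):
    """Split on runs of spaces/ampersands, then check parts and glue."""
    pieces = glue.split(datum)
    parts = pieces[0::2]  # text between the glue runs
    glues = pieces[1::2]  # the captured glue runs themselves
    return (all(part.isupper() for part in parts)
            and all(g.count("&") <= 1 for g in glues))
```

A string such as "A && B" splits into one glue run containing two ampersands, so it is rejected by the second test even though the split pattern itself is permissive.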

One can use _another_ regex to test individual parts. Code above used
str.isupper to test a part. The improved regex package (from PyPI, to
cope with Greek) can do the same:

import regex
part = regex.compile("[[:upper:]]+")
glue = regex.compile(" *& *| *")

keep, drop = [], []
for datum in data:
    items = glue.split(datum)
    if all(map(part.fullmatch, items)):
        keep.append(datum)
    else:
        drop.append(datum)

Just "[A-Z]+" suffices for ASCII letters, and "[A-ZÄÖ]+" copes with most
of Finnish; the [:upper:] class is nicer and there's much more that is
nicer in the newer regex package.

The point of using a regex for this is that the part pattern can then be
generalized to allow some punctuation or digits in a part, for example.
Anything that the glue pattern doesn't consume. (Nothing wrong with
using other techniques for this, either; str.isupper worked nicely
above.)

It's also possible to swap the roles of the patterns. Split with a part
pattern. Then check that the text between such parts is glue:

keep, drop = [], []
for datum in data:
    items = part.split(datum)
    if all(map(glue.fullmatch, items)):
        keep.append(datum)
    else:
        drop.append(datum)

The point is to keep the patterns simple by making them more local, or
more relaxed, followed by a further test. This way they can be made to
do more, but not more than they reasonably can.

Note also the use of re.fullmatch instead of re.match (let alone
re.search) when a full match is required! This gets rid of all anchors
in the pattern, which may in turn allow fewer parentheses inside the
pattern.
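
To illustrate the fullmatch point, here is a small sketch using the stdlib re module (fullmatch is available from Python 3.4). The sample strings are made up for the demonstration.

```python
import re

pattern = re.compile("[A-Z &]+")
text = "HEALTH CLUBS (42)"

# match() is satisfied by a matching prefix of the string...
prefix = pattern.match(text)
# ...while fullmatch() requires the whole string to match.
whole = pattern.fullmatch(text)

# The anchored spelling needs both ^ and $, and $ also matches just
# before a trailing newline, which fullmatch() does not allow.
anchored = re.compile("^[A-Z &]+$")
with_newline = "GYMNASIUMS\n"
```

Here prefix matches "HEALTH CLUBS " while whole is None; anchored.match(with_newline) succeeds although pattern.fullmatch(with_newline) does not.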

The usual regex engines are not perfect, but parts of them are
fantastic.


Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thu, 5 May 2016 10:31 pm, DFS wrote:

> You are out of your mind.

That's twice you've tried to put me down, first by dismissing my comments
about text processing with "Linguist much", and now an outright insult. The
first time I laughed it off and made a joke about it. I won't do that
again.

You asked whether it was better to extract the matching strings into a new
list, or remove them in place in the existing list. I not only showed you
how to do both, but I tried to give you the mental tools to understand when
you should pick one answer over the other. And your response is to insult
me and question my sanity.

Well, DFS, I might be crazy, but I'm not stupid. If that's really how you
feel about my answers, I won't make the mistake of wasting my time
answering your questions in the future.

Over to you now.


-- 
Steven



Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thu, 5 May 2016 11:32 pm, Stephen Hansen wrote:

> On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
>> Oh, a further thought...
>> 
>> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
>> > I don't even care about faster: It's overly complicated. Sometimes a
>> > regular expression really is the clearest way to solve a problem.
>> 
>> Putting non-ASCII letters aside for the moment, how would you match these
>> specs as a regular expression?
> 
> I don't know, but mostly because I wouldn't even try. 

Really? Peter Otten seems to have found a solution, and Random832 almost
found it too.


> The requirements 
> are over-specified. If you look at the OP's data (and based on previous
> conversation), he's doing web scraping and trying to pull out good data.

I'm not talking about the OP's data. I'm talking about *my* requirements.

I thought that this was a friendly discussion about regexes, but perhaps I
was mistaken. Because I sure am feeling a lot of hostility to the ideas
that regexes are not necessarily the only way to solve this, and that data
validation is a good thing.


> There's no absolutely perfect way to do that because the system he's
> scraping isn't meant for data processing. The data isn't cleanly
> articulated.

Right. Which makes it *more*, not less, important to be sure that your regex
doesn't match too much, because your data is likely to be contaminated by
junk strings that don't belong in the data and shouldn't be accepted. I've
done enough web scraping to realise just how easy it is to start grabbing
data from the wrong part of the file.


> Instead, he wants a heuristic to pull out what look like section titles.

Good for him. I asked a different question. Does my question not count?


> The OP looked at the data and came up with a simple set of rules that
> identify these section titles:
> 
>>> Want to keep all elements containing only upper case letters or upper
>>> case letters and ampersand (where ampersand is surrounded by spaces)

That simple rule doesn't match his examples, as I know too well because I
made the silly mistake of writing to the written spec as written without
reading the examples as well. As I already admitted. That was a silly
mistake because I know very well that people are really bad at writing
detailed specs that neither match too much nor too little.

But you know, I was more focused on the rest of his question, namely whether
it was better to extract the matching strings into a new list, or delete the
non-matches from the existing list, and just got carried away writing the
match function. I didn't actually expect anyone to use it. It was untested,
and I hinted that a regex would probably be better.

I was trying to teach DFS a generic programming technique, not solve his
stupid web scraping problem for him. What happens next time when he's
trying to filter a list of floats, or Widgets? Should he convert them to
strings so he can use a regex to match them, or should he learn about
general filtering techniques?
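
A short sketch of what "general filtering" could look like: the filter takes any predicate, so the same code serves strings, floats, or other objects. The names keep_matching and is_probability, and the sample data, are hypothetical.

```python
def keep_matching(predicate, items):
    # Extract the items for which predicate(item) is true.
    return [item for item in items if predicate(item)]

def is_probability(x):
    # Example predicate for floats; no string conversion or regex needed.
    return 0.0 <= x <= 1.0

readings = [0.25, 1.5, -0.1, 1.0]
good_readings = keep_matching(is_probability, readings)

# The same filter works unchanged with a string predicate.
titles = keep_matching(str.isupper, ["GYMNASIUMS", "Tweet"])
```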


> This translates naturally into a simple regular expression: an uppercase
> string with spaces and &'s. Now, that expression doesn't 100% encode
> every detail of that rule-- it allows both Q and Q & A-- but on my own
> looking at the data, I suspect it's good enough. The titles are clearly
> separate from the other data scraped by their being upper cased. We just
> need to expand our allowed character range into spaces and &'s.
> 
> Nothing in the OP's request demands the kind of rigorous matching that
> your scenario does. It's a practical problem with a simple, practical
> answer.

Yes, and that practical answer needs to reject:

- the empty string, because it is easy to mistakenly get empty strings when
scraping data, especially if you post-process the data;

- strings that are all spaces, because "   " cannot possibly be a title;

- strings that are all ampersands, because "&" is not a title, and it
almost surely indicates that your scraping has gone wrong and you're
reading junk from somewhere;

- even leading and trailing spaces are suspect: "  FOO  " doesn't match any
of the examples given, and it seems unlikely to be a title. Presumably the
strings have already been filtered or post-processed to have leading and
trailing spaces removed, in which case "  FOO  " reveals a bug.
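
As a sketch only: the four kinds of junk listed above can be rejected with the pattern Random832 posted in this thread, written here with non-capturing groups and applied with fullmatch. The names TITLE and is_clean_title are made up for illustration.

```python
import re

# Random832's pattern from this thread, with non-capturing groups:
# words of A-Z joined by either spaces or a single ampersand.
TITLE = re.compile(r"[A-Z]+(?:(?: *& *| +)[A-Z]+)*")

def is_clean_title(text):
    # fullmatch means no anchors are needed in the pattern.
    return TITLE.fullmatch(text) is not None

# Empty string, all spaces, lone ampersand, stray surrounding spaces.
junk = ["", "   ", "&", "  FOO  "]
rejected = [s for s in junk if not is_clean_title(s)]
```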

 

-- 
Steven



Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thu, 5 May 2016 11:13 pm, Random832 wrote:

> On Thu, May 5, 2016, at 04:41, Steven D'Aprano wrote:
>> > There's no situation where "&" and " " will exist in the given
>> > dataset, and recognizing that is important. You don't have to account
>> > for every bit of nonsense.
>> 
>> Whenever a programmer says "This case will never happen", ten thousand
>> computers crash.
> 
> What crash can including such an entry in the output list cause?

How do I know? It depends what you do with that list.

But if you assume that your list contains alphabetical strings, and pass it
on to code that expects alphabetical strings, why is it so hard to believe
that it might choke when it receives a non-alphabetical string?


> Should the regex also ensure that the data only includes *english words*
> separated by space-ampersand-space?

That wasn't part of the specification. But for some applications, yes, you
should ensure the data includes only English words.



-- 
Steven



Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thu, 5 May 2016 06:17 pm, Peter Otten wrote:

>> I get something like this:
>> 
>> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>> 
>> 
>> but it fails on strings like "AA   &  A &  A". What am I doing wrong?

> test("^A+( *& *A+)*$")

Thanks Peter, that's nice!


-- 
Steven



Re: Whittle it on down

2016-05-05 Thread DFS

On 5/5/2016 9:32 AM, Stephen Hansen wrote:
> On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
>> Oh, a further thought...
>>
>> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
>>> I don't even care about faster: It's overly complicated. Sometimes a
>>> regular expression really is the clearest way to solve a problem.
>>
>> Putting non-ASCII letters aside for the moment, how would you match these
>> specs as a regular expression?
>
> I don't know, but mostly because I wouldn't even try. The requirements
> are over-specified. If you look at the OP's data (and based on previous
> conversation), he's doing web scraping and trying to pull out good data.
> There's no absolutely perfect way to do that because the system he's
> scraping isn't meant for data processing. The data isn't cleanly
> articulated.
>
> Instead, he wants a heuristic to pull out what look like section titles.

Assigned by a company named localeze, apparently.

http://www.usdirectory.com/cat/g0

https://www.neustarlocaleze.biz/welcome/

> The OP looked at the data and came up with a simple set of rules that
> identify these section titles:
>
>>> Want to keep all elements containing only upper case letters or upper
>>> case letters and ampersand (where ampersand is surrounded by spaces)
>
> This translates naturally into a simple regular expression: an uppercase
> string with spaces and &'s. Now, that expression doesn't 100% encode
> every detail of that rule-- it allows both Q and Q & A-- but on my own
> looking at the data, I suspect it's good enough. The titles are clearly
> separate from the other data scraped by their being upper cased. We just
> need to expand our allowed character range into spaces and &'s.
>
> Nothing in the OP's request demands the kind of rigorous matching that
> your scenario does. It's a practical problem with a simple, practical
> answer.



Yes.  And simplicity + practicality = successfulality.

And I do a sanity check before using the data anyway: after parse and 
cleanup and regex matching, I make sure all lists have the same number 
of elements:


lenData = [len(title), len(names), len(addr), len(street), len(city),
           len(state), len(zip)]

if len(set(lenData)) != 1:
    raise ValueError("alert the media")
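
A self-contained sketch of the same sanity check, written as a reusable function. The name check_same_length and the keyword-argument interface are assumptions for illustration, not part of the original script.

```python
def check_same_length(**columns):
    # Hypothetical helper: map each column name to its length.
    lengths = {name: len(values) for name, values in columns.items()}
    # More than one distinct length means the parsed lists are out of step.
    if len(set(lengths.values())) > 1:
        raise ValueError("column lengths differ: %r" % lengths)

# Usage with made-up data: two columns of equal length pass silently.
check_same_length(title=["GYMNASIUMS", "FITNESS CENTERS"],
                  names=["First Gym", "Second Gym"])
```

Reporting the per-column lengths in the error makes it easier to see which scraped list went wrong.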




Re: Whittle it on down

2016-05-05 Thread Stephen Hansen
On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
> Oh, a further thought...
> 
> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> > I don't even care about faster: It's overly complicated. Sometimes a
> > regular expression really is the clearest way to solve a problem.
> 
> Putting non-ASCII letters aside for the moment, how would you match these 
> specs as a regular expression?

I don't know, but mostly because I wouldn't even try. The requirements
are over-specified. If you look at the OP's data (and based on previous
conversation), he's doing web scraping and trying to pull out good data.
There's no absolutely perfect way to do that because the system he's
scraping isn't meant for data processing. The data isn't cleanly
articulated.

Instead, he wants a heuristic to pull out what look like section titles. 

The OP looked at the data and came up with a simple set of rules that
identify these section titles:

>> Want to keep all elements containing only upper case letters or upper
>> case letters and ampersand (where ampersand is surrounded by spaces)

This translates naturally into a simple regular expression: an uppercase
string with spaces and &'s. Now, that expression doesn't 100% encode
every detail of that rule-- it allows both Q and Q & A-- but on my own
looking at the data, I suspect it's good enough. The titles are clearly
separate from the other data scraped by their being upper cased. We just
need to expand our allowed character range into spaces and &'s.

Nothing in the OP's request demands the kind of rigorous matching that
your scenario does. It's a practical problem with a simple, practical
answer.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: Whittle it on down

2016-05-05 Thread Random832
On Thu, May 5, 2016, at 03:36, Steven D'Aprano wrote:
> Putting non-ASCII letters aside for the moment, how would you match these 
> specs as a regular expression?

Well, obviously *your* language (not the OP's), given the cases you
reject, is "one or more sequences of letters separated by
space*-ampersand-space*", and that is actually one of the easiest kinds
of regex to write: "[A-Z]+( *& *[A-Z]+)*".

However, your spec is wrong:

> - Leading or trailing spaces, or spaces not surrounding an ampersand,
> must not match: "AAA BBB" must be rejected.

The *very first* item in OP's list of good outputs is 'PHYSICAL FITNESS
CONSULTANTS & TRAINERS'.

If you want something that's extremely conservative (except for the
*very odd in context* choice of allowing arbitrary numbers of spaces -
why would you allow this but reject leading or trailing space?) and
accepts all of OP's input:

[A-Z]+(( *& *| +)[A-Z]+)*
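
The two patterns above can be compared side by side. This is only a sketch; the variable names strict and lenient are made up, and fullmatch is used so that no anchors are needed.

```python
import re

# Words separated only by (optionally spaced) ampersands.
strict = re.compile(r"[A-Z]+( *& *[A-Z]+)*")
# Words separated by ampersands or by plain spaces.
lenient = re.compile(r"[A-Z]+(( *& *| +)[A-Z]+)*")

first_item = "PHYSICAL FITNESS CONSULTANTS & TRAINERS"

# The strict pattern rejects the OP's first good item because of the
# plain spaces between words; the lenient one accepts it.
strict_ok = strict.fullmatch(first_item) is not None
lenient_ok = lenient.fullmatch(first_item) is not None
```

Both patterns still reject strings with leading spaces or a trailing ampersand.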


Re: Whittle it on down

2016-05-05 Thread Random832


On Thu, May 5, 2016, at 04:41, Steven D'Aprano wrote:
> > There's no situation where "&" and " " will exist in the given
> > dataset, and recognizing that is important. You don't have to account
> > for every bit of nonsense.
> 
> Whenever a programmer says "This case will never happen", ten thousand 
> computers crash.

What crash can including such an entry in the output list cause?

Should the regex also ensure that the data only includes *english words*
separated by space-ampersand-space?


Re: Whittle it on down

2016-05-05 Thread DFS

On 5/5/2016 1:53 AM, Jussi Piitulainen wrote:

> Either way is easy to approximate with a regex:
>
> import re
> upper = re.compile(r'[A-Z &]+')
> lower = re.compile(r'[^A-Z &]')
> print([datum for datum in data if upper.fullmatch(datum)])
> print([datum for datum in data if not lower.search(datum)])

This is similar to Hansen's solution.

> I've skipped testing that the ampersand is between spaces, and I've
> skipped the period. Adjust.

Will do.

> This considers only ASCII upper case letters. You can add individual
> letters that matter to you, or you can reach for the documentation to
> find if there is some generic notation for all upper case letters.
>
> The newer regex package on PyPI supports POSIX character classes like
> [:upper:], I think, and there may or may not be notation for Unicode
> character categories in re or regex - LU would be Letter, Uppercase.

Thanks.



Re: Whittle it on down

2016-05-05 Thread DFS

On 5/5/2016 1:39 AM, Stephen Hansen wrote:

> pattern = re.compile(r"^[A-Z\s&]+$")
> output = [x for x in input if pattern.match(x)]

Holy Shr"^[A-Z\s&]+$"  One line of parsing!

I was figuring a few list comprehensions would do it - this is better.

(note: the reason I specified 'spaces around ampersand' is so it would
remove 'Q' if that ever came up - but some people write 'Q & A', so
I'll live with that exception, or try to tweak it myself.)

You're the man, man.

Thank you!






Re: Whittle it on down

2016-05-05 Thread DFS

On 5/5/2016 2:04 AM, Steven D'Aprano wrote:
> On Thursday 05 May 2016 14:58, DFS wrote:
>
>> Want to whittle a list like this:
> [...]
>> Want to keep all elements containing only upper case letters or upper
>> case letters and ampersand (where ampersand is surrounded by spaces)
>
> Start by writing a function or a regex that will distinguish strings that
> match your conditions from those that don't. A regex might be faster, but
> here's a function version.
>
> def isupperalpha(string):
>     return string.isalpha() and string.isupper()
>
> def check(string):
>     if isupperalpha(string):
>         return True
>     parts = string.split("&")
>     if len(parts) < 2:
>         return False
>     # Don't strip leading spaces from the start of the string.
>     parts[0] = parts[0].rstrip(" ")
>     # Or trailing spaces from the end of the string.
>     parts[-1] = parts[-1].lstrip(" ")
>     # But strip leading and trailing spaces from the middle parts
>     # (if any).
>     for i in range(1, len(parts)-1):
>         parts[i] = parts[i].strip(" ")
>     return all(isupperalpha(part) for part in parts)
>
> Now you have two ways of filtering this. The obvious way is to extract
> elements which meet the condition. Here are two ways:
>
> # List comprehension.
> newlist = [item for item in oldlist if check(item)]
>
> # Filter, Python 2 version
> newlist = filter(check, oldlist)
>
> # Filter, Python 3 version
> newlist = list(filter(check, oldlist))
>
> In practice, this is the best (fastest, simplest) way. But if you fear that
> you will run out of memory dealing with absolutely humongous lists with
> hundreds of millions or billions of strings, you can remove items in place:
>
> def remove(func, alist):
>     for i in range(len(alist)-1, -1, -1):
>         if not func(alist[i]):
>             del alist[i]
>
> Note the magic incantation to iterate from the end of the list towards the
> front. If you do it the other way, Bad Things happen. Note that this will
> use less memory than extracting the items, but it will be much slower.
>
> You can combine the best of both worlds. Here is a version that uses a
> temporary list to modify the original in place:
>
> # works in both Python 2 and 3
> def remove(func, alist):
>     # Modify list in place, the fast way.
>     alist[:] = filter(func, alist)

You are out of your mind.







Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thursday 05 May 2016 17:34, Stephen Hansen wrote:


> Meh. You have a pedantic definition of wrong. Given the inputs, it
> produced right output. Very often that's enough. Perfect is the enemy of
> good, it's said.

And this is a *perfect* example of why we have things like this:

http://www.bbc.com/future/story/20160325-the-names-that-break-computer-systems

"Nobody will ever be called Null."

"Nobody has quotation marks in their name."

"Nobody will have a + sign in their email address."

"Nobody has a legal gender other than Male or Female."

"Nobody will lean on the keyboard and enter gobbledygook into our form."

"Nobody will try to write more data than the space they allocated for it."


> There's no situation where "&" and " " will exist in the given
> dataset, and recognizing that is important. You don't have to account
> for every bit of nonsense.

Whenever a programmer says "This case will never happen", ten thousand 
computers crash.

http://www.kr41.net/2016/05-03-shit_driven_development.html


-- 
Steven D'Aprano



Re: Whittle it on down

2016-05-05 Thread Peter Otten
Steven D'Aprano wrote:

> Oh, a further thought...
> 
> 
> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> 
>> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
>>> Start by writing a function or a regex that will distinguish strings
>>> that match your conditions from those that don't. A regex might be
>>> faster, but here's a function version.
>>> ... snip ...
>> 
>> Yikes. I'm all for the idea that one shouldn't go to regex when Python's
>> powerful string type can answer the problem more clearly, but this seems
>> to go out of its way to do otherwise.
>> 
>> I don't even care about faster: It's overly complicated. Sometimes a
>> regular expression really is the clearest way to solve a problem.
> 
> Putting non-ASCII letters aside for the moment, how would you match these
> specs as a regular expression?
> 
> - All uppercase ASCII letters (A to Z only), optionally separated into
> words by either a bare ampersand (e.g. "AAA") or an ampersand with
> leading and trailing spaces (spaces only, not arbitrary whitespace):
> "AAA   & AAA".
> 
> - The number of spaces on either side of the ampersands need not be the
> same: "AAA&   BBB &   CCC" should match.
> 
> - Leading or trailing spaces, or spaces not surrounding an ampersand, must
> not match: "AAA BBB" must be rejected.
> 
> - Leading or trailing ampersands must also be rejected. This includes the
> case where the string is nothing but ampersands.
> 
> - Consecutive ampersands "AAA&&" and the empty string must be
> rejected.
> 
> 
> I get something like this:
> 
> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
> 
> 
> but it fails on strings like "AA   &  A &  A". What am I doing wrong?
> 
> 
> For the record, here's my brief test suite:
> 
> 
> def test(pat):
>     for s in ("", " ", "&", "A A", "A&", "", "A&", "A& "):
>         assert re.match(pat, s) is None
>     for s in ("A", "A & A", "AA", "AA   &  A &  A"):
>         assert re.match(pat, s)

>>> def test(pat):
...     for s in ("", " ", "&", "A A", "A&", "", "A&", "A& "):
...         assert re.match(pat, s) is None
...     for s in ("A", "A & A", "AA", "AA   &  A &  A"):
...         assert re.match(pat, s)
...
>>> test("^A+( *& *A+)*$")
>>>




Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
Oh, a further thought...


On Thursday 05 May 2016 16:46, Stephen Hansen wrote:

> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
>> Start by writing a function or a regex that will distinguish strings that
>> match your conditions from those that don't. A regex might be faster, but
>> here's a function version.
>> ... snip ...
> 
> Yikes. I'm all for the idea that one shouldn't go to regex when Python's
> powerful string type can answer the problem more clearly, but this seems
> to go out of its way to do otherwise.
> 
> I don't even care about faster: It's overly complicated. Sometimes a
> regular expression really is the clearest way to solve a problem.

Putting non-ASCII letters aside for the moment, how would you match these 
specs as a regular expression?

- All uppercase ASCII letters (A to Z only), optionally separated into words 
by either a bare ampersand (e.g. "AAA") or an ampersand with leading and 
trailing spaces (spaces only, not arbitrary whitespace): "AAA   & AAA".

- The number of spaces on either side of the ampersands need not be the 
same: "AAA&   BBB &   CCC" should match.

- Leading or trailing spaces, or spaces not surrounding an ampersand, must 
not match: "AAA BBB" must be rejected.

- Leading or trailing ampersands must also be rejected. This includes the 
case where the string is nothing but ampersands.

- Consecutive ampersands "AAA&&" and the empty string must be rejected.


I get something like this:

r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"


but it fails on strings like "AA   &  A &  A". What am I doing wrong?


For the record, here's my brief test suite:


import re

def test(pat):
    # Strings that must be rejected.
    for s in ("", " ", "&", "A A", "A&", "", "A&", "A& "):
        assert re.match(pat, s) is None
    # Strings that must be accepted.
    for s in ("A", "A & A", "AA", "AA   &  A &  A"):
        assert re.match(pat, s)




-- 
Steve



Re: Whittle it on down

2016-05-05 Thread Stephen Hansen
On Thu, May 5, 2016, at 12:04 AM, Steven D'Aprano wrote:
> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> > On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
> >> Start by writing a function or a regex that will distinguish strings that
> >> match your conditions from those that don't. A regex might be faster, but
> >> here's a function version.
> >> ... snip ...
> > 
> > Yikes. I'm all for the idea that one shouldn't go to regex when Python's
> > powerful string type can answer the problem more clearly, but this seems
> > to go out of its way to do otherwise.
> > 
> > I don't even care about faster: It's overly complicated. Sometimes a
> > regular expression really is the clearest way to solve a problem.
> 
> You're probably right, but I find it easier to reason about matching in 
> Python rather than the overly terse, cryptic regular expression mini-
> language.
> 
> I haven't tested my function version, but I'm 95% sure that it is
> correct. The trickiest part of it is the logic about splitting around
> ampersands. And I'll cheerfully admit that it isn't easy to extend to
> (say) "ampersand, or at signs". But your regex solution:
> 
> r"^[A-Z\s&]+$"
> 
> is much smaller and more compact, but *wrong*. For instance, your regex 
> wrongly accepts both "&" and "  " as valid strings, and wrongly 
> rejects "ΔΣΘΛ". Your Greek customers will be sad...

Meh. You have a pedantic definition of wrong. Given the inputs, it
produced right output. Very often that's enough. Perfect is the enemy of
good, it's said. 

There's no situation where "&" and " " will exist in the given
dataset, and recognizing that is important. You don't have to account
for every bit of nonsense. 

If the OP needs a unicode-aware solution, that redefines "A-Z" as perhaps
"\w" with an isupper call. It's still far simpler than you're suggesting.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thursday 05 May 2016 16:46, Stephen Hansen wrote:

> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
>> Start by writing a function or a regex that will distinguish strings that
>> match your conditions from those that don't. A regex might be faster, but
>> here's a function version.
>> ... snip ...
> 
> Yikes. I'm all for the idea that one shouldn't go to regex when Python's
> powerful string type can answer the problem more clearly, but this seems
> to go out of its way to do otherwise.
> 
> I don't even care about faster: It's overly complicated. Sometimes a
> regular expression really is the clearest way to solve a problem.

You're probably right, but I find it easier to reason about matching in 
Python rather than the overly terse, cryptic regular expression mini-
language.

I haven't tested my function version, but I'm 95% sure that it is correct. 
The trickiest part of it is the logic about splitting around ampersands. And
I'll cheerfully admit that it isn't easy to extend to (say) "ampersand, or 
at signs". But your regex solution:

r"^[A-Z\s&]+$"

is much smaller and more compact, but *wrong*. For instance, your regex 
wrongly accepts both "&" and "  " as valid strings, and wrongly 
rejects "ΔΣΘΛ". Your Greek customers will be sad...

Oh, I just realised, I should have looked more closely at the examples
given, because the specification given by DFS does not match the examples.
DFS says that only uppercase letters and ampersands are allowed, but their 
examples include strings with spaces, e.g. 'FITNESS CENTERS' despite the 
lack of ampersands. (I read the spec literally as spaces only allowed if 
they surround an ampersand.) Oops, mea culpa. That makes the check function 
much simpler and easier to extend:


def check(string):
    string = string.replace("&", "").replace(" ", "")
    return string.isalpha() and string.isupper()


and now I'm 95% confident it is correct without testing, this time for sure!

;-)
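
A quick sketch of how the simplified check behaves on the edge cases raised elsewhere in this thread. The function body is the one above; the sample strings are chosen for illustration.

```python
def check(string):
    # Drop ampersands and spaces, then require uppercase letters only.
    string = string.replace("&", "").replace(" ", "")
    return string.isalpha() and string.isupper()

# Accepted: plain titles, titles with ampersands, and non-ASCII capitals.
accepted = [s for s in ("FITNESS CENTERS", "HEALTH & FITNESS CLUBS", "ΔΣΘΛ")
            if check(s)]

# Rejected: nothing is left once ampersands and spaces are removed.
rejected = [s for s in ("", "  ", "&", "Health Clubs (42)") if not check(s)]

# Still accepted, because position is not checked at all.
lenient = [s for s in ("& A", " FOO ") if check(s)]
```

So this version handles the empty, all-space and all-ampersand cases, but it does not look at where the spaces and ampersands sit.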


-- 
Steve



Re: Whittle it on down

2016-05-05 Thread Stephen Hansen
On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
> Start by writing a function or a regex that will distinguish strings that 
> match your conditions from those that don't. A regex might be faster, but 
> here's a function version.
> ... snip ...

Yikes. I'm all for the idea that one shouldn't go to regex when Python's
powerful string type can answer the problem more clearly, but this seems
to go out of its way to do otherwise.

I don't even care about faster: It's overly complicated. Sometimes a
regular expression really is the clearest way to solve a problem.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: Whittle it on down

2016-05-05 Thread Steven D'Aprano
On Thursday 05 May 2016 14:58, DFS wrote:

> Want to whittle a list like this:
[...]
> Want to keep all elements containing only upper case letters or upper
> case letters and ampersand (where ampersand is surrounded by spaces)


Start by writing a function or a regex that will distinguish strings that 
match your conditions from those that don't. A regex might be faster, but 
here's a function version.

def isupperalpha(string):
    return string.isalpha() and string.isupper()

def check(string):
    if isupperalpha(string):
        return True
    parts = string.split("&")
    if len(parts) < 2:
        return False
    # Don't strip leading spaces from the start of the string.
    parts[0] = parts[0].rstrip(" ")
    # Or trailing spaces from the end of the string.
    parts[-1] = parts[-1].lstrip(" ")
    # But strip leading and trailing spaces from the middle parts
    # (if any).
    for i in range(1, len(parts)-1):
        parts[i] = parts[i].strip(" ")
    return all(isupperalpha(part) for part in parts)


Now you have two ways of filtering this. The obvious way is to extract 
elements which meet the condition. Here are two ways:

# List comprehension.
newlist = [item for item in oldlist if check(item)]

# Filter, Python 2 version
newlist = filter(check, oldlist)

# Filter, Python 3 version
newlist = list(filter(check, oldlist))


In practice, this is the best (fastest, simplest) way. But if you fear that 
you will run out of memory dealing with absolutely humongous lists with 
hundreds of millions or billions of strings, you can remove items in place:


def remove(func, alist):
    for i in range(len(alist)-1, -1, -1):
        if not func(alist[i]):
            del alist[i]


Note the magic incantation to iterate from the end of the list towards the 
front. If you do it the other way, Bad Things happen. Note that this will 
use less memory than extracting the items, but it will be much slower.
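
One of the Bad Things can be shown with a small sketch. The function names and the sample list are made up; the forwards version is buggy on purpose.

```python
def remove_forwards(func, alist):
    # Buggy on purpose: deleting while iterating forwards shifts the
    # remaining items left, so the item after each deletion is skipped.
    for item in alist:
        if not func(item):
            alist.remove(item)

def remove_backwards(func, alist):
    # Deleting from the end never disturbs the indexes still to visit.
    for i in range(len(alist) - 1, -1, -1):
        if not func(alist[i]):
            del alist[i]

forwards = ["a", "b", "C", "d"]
remove_forwards(str.isupper, forwards)    # leaves ["b", "C"]

backwards = ["a", "b", "C", "d"]
remove_backwards(str.isupper, backwards)  # leaves ["C"]
```

The lowercase "b" survives the forwards loop because it slides into the slot the iterator has already passed.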

You can combine the best of both worlds. Here is a version that uses a
temporary list to modify the original in place:

# works in both Python 2 and 3
def remove(func, alist):
    # Modify list in place, the fast way.
    alist[:] = filter(func, alist)
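
A short sketch of why the slice assignment matters: it changes the existing list object, so every other reference to that list sees the result. The sample data and the name alias are made up, and a list comprehension stands in for filter here.

```python
def remove(func, alist):
    # Slice assignment replaces the contents of the same list object.
    alist[:] = [item for item in alist if func(item)]

titles = ["GYMNASIUMS", "Tweet", "FITNESS CENTERS"]
alias = titles            # a second reference to the same list
remove(str.isupper, titles)
```

After the call, alias is still the same object as titles and holds only the two uppercase entries. A plain "alist = ..." inside the function would merely rebind the local name and leave the caller's list untouched.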




-- 
Steve



Re: Whittle it on down

2016-05-04 Thread Jussi Piitulainen
DFS writes:

. .

> Want to keep all elements containing only upper case letters or upper
> case letters and ampersand (where ampersand is surrounded by spaces)
>
> Is it easier to extract elements meeting those conditions, or remove
> elements meeting the following conditions:
>
> * elements with a lower-case letter in them
> * elements with a number in them
> * elements with a period in them
>
> ?
>
>
> So far all I figured out is remove items with a period:
> newlist = [ x for x in oldlist if "." not in x ]
>

Either way is easy to approximate with a regex:

import re
upper = re.compile(r'[A-Z &]+')
lower = re.compile(r'[^A-Z &]')
print([datum for datum in data if upper.fullmatch(datum)])
print([datum for datum in data if not lower.search(datum)])

I've skipped testing that the ampersand is between spaces, and I've
skipped the period. Adjust.

This considers only ASCII upper case letters. You can add individual
letters that matter to you, or you can reach for the documentation to
find if there is some generic notation for all upper case letters.

The newer regex package on PyPI supports POSIX character classes like
[:upper:], I think, and there may or may not be notation for Unicode
character categories in re or regex - LU would be Letter, Uppercase.
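
For what it's worth, the Unicode category test can also be sketched without any regex, using the stdlib unicodedata module, where "Lu" is the category code for Letter, Uppercase. The function name all_upper_letters is made up for illustration.

```python
import unicodedata

def all_upper_letters(word):
    # True only for a non-empty string made entirely of characters in
    # Unicode category Lu (Letter, Uppercase).
    return bool(word) and all(unicodedata.category(ch) == "Lu" for ch in word)

samples = ["GYMNASIUMS", "ΔΣΘΛ", "ÄÖ", "Tweet", "A1", ""]
upper_only = [s for s in samples if all_upper_letters(s)]
```

This is stricter than str.isupper, which ignores uncased characters such as digits: "A1".isupper() is True, while all_upper_letters("A1") is False.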


Re: Whittle it on down

2016-05-04 Thread Stephen Hansen
On Wed, May 4, 2016, at 09:58 PM, DFS wrote:
> Want to whittle a list like this:
> 
> [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs & 
> Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city guide', 
> 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS', 
> 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 
> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE', 
> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS', 
> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', 
> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & 
> PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & 
> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS', 
> '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us', 
> 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us', 
> 'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']
> 
> down to
> 
> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 
> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS', 
> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS',  'PERSONAL FITNESS 
> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS 
> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS 
> & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']

Sometimes regular expressions are the tool to do the job:

Given:

>>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)',
...     'Health Clubs & Gymnasiums (42)', 'Health Fitness Clubs', 'Name',
...     'Atlanta city guide', 'edit address', 'Tweet',
...     'PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
...     'HEALTH CLUBS & GYMNASIUMS', 'www.custombuiltpt.com/',
...     'RACQUETBALL COURTS PRIVATE', 'www.lafitness.com', 'GYMNASIUMS',
...     'HEALTH & FITNESS CLUBS', 'www.lafitness.com', 'HEALTH & FITNESS CLUBS',
...     'www.lafitness.com', 'PERSONAL FITNESS TRAINERS',
...     'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS PROGRAMS',
...     'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS',
...     'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS',
...     '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us',
...     'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us',
...     'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']

Then:

>>> import re
>>> pattern = re.compile(r"^[A-Z\s&]+$")
>>> output = [x for x in input if pattern.match(x)]
>>> output
['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
& GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']
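
For use outside the interactive prompt, the same filter might be wrapped in a small function. This is a sketch rather than the definitive form: the function name whittle and the shortened sample list are made up here, and fullmatch() (Python 3.4 or later) is swapped in for the ^...$ anchors, which should accept the same strings for this character class.

```python
import re

# Same character class as in the session above; fullmatch() stands in
# for the explicit ^ and $ anchors.
pattern = re.compile(r"[A-Z\s&]+")

def whittle(items):
    """Keep only the items made up entirely of A-Z, whitespace and '&'."""
    return [x for x in items if pattern.fullmatch(x)]

# Shortened, hypothetical sample drawn from the list in the thread.
sample = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Name',
          'HEALTH CLUBS & GYMNASIUMS', 'www.lafitness.com', 'GYMNASIUMS',
          '5', 'Yellow Pages', 'F.A.Q.']
print(whittle(sample))
```

Note that a string consisting only of spaces or of a single '&' would also pass, so a stricter pattern may be worth it if such items can occur.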

-- 
Stephen Hansen
  m e @ i x o k a i . i o