Re: Whittle it on down

DFS Thu, 05 May 2016 07:43:41 -0700

On 5/5/2016 9:32 AM, Stephen Hansen wrote:

On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:

Oh, a further thought...


On Thursday 05 May 2016 16:46, Stephen Hansen wrote:

I don't even care about faster: It's overly complicated. Sometimes a
regular expression really is the clearest way to solve a problem.


Putting non-ASCII letters aside for the moment, how would you match these
specs as a regular expression?


I don't know, but mostly because I wouldn't even try. The requirements
are over-specified. If you look at the OP's data (and based on previous
conversation), he's doing web scraping and trying to pull out good data.
There's no absolutely perfect way to do that because the system he's
scraping isn't meant for data processing. The data isn't cleanly
articulated.

Instead, he wants a heuristic to pull out what look like section titles.



Assigned by a company named localeze, apparently.

http://www.usdirectory.com/cat/g0

https://www.neustarlocaleze.biz/welcome/

The OP looked at the data and came up with a simple set of rules that
identify these section titles:

Want to keep all elements containing only upper case letters or upper
case letters and ampersand (where ampersand is surrounded by spaces)

This translates naturally into a simple regular expression: an uppercase
string with spaces and &'s. Now, that expression doesn't 100% encode
every detail of that rule-- it allows both Q&A and Q & A-- but on my own
looking at the data, I suspect it's good enough. The titles are clearly
separate from the other data scraped by their being upper cased. We just
need to expand our allowed character range into spaces and &'s.
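For illustration, here is a minimal sketch of the kind of expression described above. The exact pattern used in the thread is not shown here, so the pattern, the helper name keep_titles and the sample strings are all assumptions. Requiring a leading letter is an extra assumption that keeps blank or "&"-only strings out.

```python
import re

# Assumed pattern: an uppercase ASCII letter followed by any mix of
# uppercase letters, spaces and ampersands. Not the thread's exact regex.
TITLE_RE = re.compile(r"[A-Z][A-Z &]*")

def keep_titles(elements):
    """Return only the elements that look like section titles."""
    # fullmatch() requires the whole string to fit the pattern.
    return [e for e in elements if TITLE_RE.fullmatch(e)]

# Made-up sample data, not taken from the scraped site.
sample = ["PLUMBING & HEATING", "Q&A", "Joe's Diner", "ACCOUNTANTS", "123 Main St"]
titles = keep_titles(sample)
```

As noted, a pattern like this accepts both Q&A and Q & A, so it is looser than the stated rule.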

Nothing in the OP's request demands the kind of rigorous matching that
your scenario does. It's a practical problem with a simple, practical
answer.



Yes.  And simplicity + practicality = successfulality.

And I do a sanity check before using the data anyway: after parse and cleanup and regex matching, I make sure all lists have the same number of elements:

lenData = [len(title), len(names), len(addr), len(street), len(city), len(state), len(zip)]

if len(set(lenData)) != 1:
    # alert the media
    raise ValueError("list lengths differ: %s" % lenData)
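If it helps to see which list is off, the same length check could be wrapped in a small helper. This is only a sketch: the function name mismatched_lengths and the sample lists are hypothetical, not part of the actual scraper.

```python
def mismatched_lengths(**columns):
    """Return {name: length} for every list if the lengths differ, else {}."""
    lengths = {name: len(values) for name, values in columns.items()}
    # One distinct length (or no lists at all) means everything lines up.
    if len(set(lengths.values())) <= 1:
        return {}
    return lengths

# Made-up data: the city list is one element short.
bad = mismatched_lengths(title=["A", "B"], names=["x", "y"], city=["p"])
ok = mismatched_lengths(title=["A", "B"], names=["x", "y"])
```

An empty result means the lists are safe to combine; otherwise the returned lengths point at the list that came up short.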


--
https://mail.python.org/mailman/listinfo/python-list