On 5/6/2016 3:45 AM, Peter Otten wrote:
DFS wrote:
Should've looked earlier. Their master list of categories
http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
and the ampersands we talked about.
"OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.
"AUTOMOBILE - DEALERS" gets removed because of the dash.
I updated your regex and it seems to have fixed it.
orig: (r"^[A-Z\s&]+$")
new : (r"^[A-Z\s&,-]+$")
Thanks again.
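For anyone following along, here is a small sketch comparing the two patterns on the names quoted above. The sample list is made up for illustration ("ARTS & ENTERTAINMENT" and "Not a category" are not taken from the site's data), and matches() is just a hypothetical helper.

```python
import re

# The two patterns from the message above.
ORIG = re.compile(r"^[A-Z\s&]+$")
NEW = re.compile(r"^[A-Z\s&,-]+$")  # trailing "-" in a class is a literal dash

# Illustrative sample lines (assumed, not scraped from the site).
samples = [
    "OFFICE SERVICES, SUPPLIES & EQUIPMENT",
    "AUTOMOBILE - DEALERS",
    "ARTS & ENTERTAINMENT",
    "Not a category",
]


def matches(pattern, lines):
    """Return the lines that the compiled pattern accepts in full."""
    return [line for line in lines if pattern.match(line)]


orig_hits = matches(ORIG, samples)
new_hits = matches(NEW, samples)

print("orig:", orig_hits)
print("new :", new_hits)
```

The original pattern only keeps the line without a comma or dash; the updated one keeps all three upper-case lines and still rejects the mixed-case one.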
If there is a "master list" compare your candidates against it instead of
using a heuristic, i.e.
categories = set(master_list)
output = [category for category in input if category in categories]
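A self-contained version of that idea might look like the following. The master_list and candidates values are placeholders chosen for illustration; in practice master_list would come from the site. The name candidates is used instead of input only to avoid shadowing the built-in.

```python
# Placeholder data (assumed for illustration).
master_list = [
    "Accounting & Bookkeeping Services",
    "Adoption Services",
    "Advertising",
]
candidates = ["Advertising", "Call now!", "Adoption Services", "123 Main St"]

# A set gives constant-time membership tests on average.
categories = set(master_list)

# Keep only candidates that are known categories, preserving input order.
output = [c for c in candidates if c in categories]

print(output)
```

This is an exact-match test, so differences in case or surrounding whitespace would need to be normalised on both sides first.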
You can find the categories with
import urllib.request
import bs4
soup = bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
categories = set()
for li in soup.find_all("li"):
    assert li.parent.parent["class"][0].startswith("category_items")
    categories.add(li.text)
print("\n".join(sorted(categories)[:10]))
"import urllib.request
ImportError: No module named request"
I'm on Python 2.7.11.
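The ImportError is expected: urllib.request only exists in Python 3, and on Python 2.7 urlopen lives in urllib2. One common workaround, sketched here, is to try the Python 3 import and fall back. The fetch() helper is a hypothetical name used only for this example.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7


def fetch(url):
    """Return the raw bytes of the page at url."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        # Close explicitly; Python 2 responses are not context managers.
        response.close()
```

With that in place, the bytes returned by fetch("http://www.usdirectory.com/cat/g0") could be passed to bs4.BeautifulSoup in the snippet above, in place of the urllib.request call.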
Accounting & Bookkeeping Services
Adoption Services
Adult Entertainment
Advertising
Agricultural Equipment & Supplies
Agricultural Production
Agricultural Services
Aids Resources
Aircraft Charters & Rentals
Aircraft Dealers & Services
Yeah, I actually did something like that last night. Was trying to get
their full tree structure, which goes 4 levels deep, i.e.
Arts & Entertainment
Newspapers
News Dealers
Prepress Services
What I referred to as their 'master list' is actually just 2 levels
deep. My bad.
So far I haven't come across one that had anything in it but letters,
dashes, commas or ampersands.
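One way to check that claim rather than eyeball it is to count every character that is neither a letter nor whitespace. This is only a sketch: non_letter_chars() is a hypothetical helper and the sample names are a small hand-picked subset, not the full scraped list.

```python
from collections import Counter


def non_letter_chars(names):
    """Count characters that are neither letters nor whitespace."""
    counts = Counter()
    for name in names:
        for ch in name:
            if not (ch.isalpha() or ch.isspace()):
                counts[ch] += 1
    return counts


# Small hand-picked sample (assumed); swap in the real category set.
sample = [
    "Accounting & Bookkeeping Services",
    "AUTOMOBILE - DEALERS",
    "OFFICE SERVICES, SUPPLIES & EQUIPMENT",
]

found = non_letter_chars(sample)
print(sorted(found.items()))
```

Any key other than "&", "-" or "," in the result would point to a character the regex character class does not yet allow.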
Thanks