On 5/6/2016 3:45 AM, Peter Otten wrote:
DFS wrote:

Should've looked earlier.  Their master list of categories
http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
and the ampersands we talked about.

"OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.

"AUTOMOBILE - DEALERS" gets removed because of the dash.

I updated your regex and it seems to have fixed it.

orig: (r"^[A-Z\s&]+$")
new : (r"^[A-Z\s&,-]+$")
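
For anyone following along, here is a small sketch of how the two patterns behave on the category names mentioned above. The third sample string and the helper name `matches` are made up for illustration and are not from the site.

```python
import re

# The original pattern and the updated one from this message.
OLD = re.compile(r"^[A-Z\s&]+$")
NEW = re.compile(r"^[A-Z\s&,-]+$")  # a trailing "-" in a class is a literal dash

samples = [
    "OFFICE SERVICES, SUPPLIES & EQUIPMENT",  # contains a comma
    "AUTOMOBILE - DEALERS",                   # contains a dash
    "Some ordinary listing text",             # hypothetical non-category line
]

def matches(pattern, text):
    """Return True if the whole string fits the pattern."""
    return pattern.match(text) is not None

old_results = [matches(OLD, s) for s in samples]
new_results = [matches(NEW, s) for s in samples]

for s, old, new in zip(samples, old_results, new_results):
    print("%-40s old=%-5s new=%s" % (s, old, new))
```

The dash is placed last inside the character class so it is taken literally rather than as a range.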


Thanks again.

If there is a "master list" compare your candidates against it instead of
using a heuristic, i.e.

categories = set(master_list)
output = [category for category in input if category in categories]
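
A slightly fuller sketch of the same idea, in case it helps. The function name `keep_known` and the sample data are hypothetical. The upper-casing step is an assumption: the scraped candidates in this thread are upper case while the site's list is title case, so some normalization is probably needed.

```python
def keep_known(candidates, master_list):
    """Keep only the candidates that appear in the master list.

    Comparison ignores case and surrounding whitespace (an assumption;
    adjust if the real data needs something stricter or looser).
    """
    known = {name.strip().upper() for name in master_list}
    return [c for c in candidates if c.strip().upper() in known]

# Made-up sample data; the category names are taken from the site's list.
master_list = ["Adoption Services", "Advertising", "Agricultural Services"]
candidates = ["ADVERTISING", "123 MAIN ST", "ADOPTION SERVICES"]

kept = keep_known(candidates, master_list)
print(kept)
```

Because the lookup is against a set, each membership test is constant time on average, so this stays fast even with a few thousand categories.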

You can find the categories with

import urllib.request
import bs4
soup = bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
categories = set()
for li in soup.find_all("li"):
    assert li.parent.parent["class"][0].startswith("category_items")
    categories.add(li.text)
print("\n".join(sorted(categories)[:10]))



"import urllib.request
ImportError: No module named request"


I'm on Python 2.7.11
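
The error comes from `urllib.request` being a Python 3 module; in Python 2 the equivalent `urlopen` lives in `urllib2`. A minimal sketch of an import that works on both follows. The helper name `fetch_html` is made up for illustration, and the snippet does not call it, so nothing is downloaded when it is run.

```python
# Pick whichever urlopen the running interpreter provides.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

def fetch_html(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()
```

The bytes returned by `fetch_html("http://www.usdirectory.com/cat/g0")` could then be passed to `bs4.BeautifulSoup` in place of the `urllib.request.urlopen(...).read()` call.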





Accounting & Bookkeeping Services
Adoption Services
Adult Entertainment
Advertising
Agricultural Equipment & Supplies
Agricultural Production
Agricultural Services
Aids Resources
Aircraft Charters & Rentals
Aircraft Dealers & Services




Yeah, I actually did something like that last night.  Was trying to get
their full tree structure, which goes 4 levels deep, e.g.:

Arts & Entertainment
  Newspapers
   News Dealers
    Prepress Services


What I referred to as their 'master list' is actually just 2 levels deep. My bad.

So far I haven't come across one that had anything in it but letters, dashes, commas or ampersands.

Thanks
--
https://mail.python.org/mailman/listinfo/python-list
