On 5/6/2016 9:58 AM, DFS wrote:
On 5/6/2016 3:45 AM, Peter Otten wrote:
DFS wrote:

Should've looked earlier.  Their master list of categories
http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
and the ampersands we talked about.

"OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the
comma.

"AUTOMOBILE - DEALERS" gets removed because of the dash.

I updated your regex and it seems to have fixed it.

orig: (r"^[A-Z\s&]+$")
new : (r"^[A-Z\s&,-]+$")
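A quick way to check the two patterns side by side. This is only a sketch: the first two sample lines are the headings quoted above, and the third is a made-up mixed-case line standing in for a row that is not a category.

```python
import re

# The two patterns from the message above.
orig = re.compile(r"^[A-Z\s&]+$")
new = re.compile(r"^[A-Z\s&,-]+$")  # a trailing "-" in a class is a literal dash

samples = [
    "OFFICE SERVICES, SUPPLIES & EQUIPMENT",
    "AUTOMOBILE - DEALERS",
    "Joe's Diner",  # hypothetical non-category line
]

# Keep only the lines each pattern accepts.
kept_orig = [s for s in samples if orig.match(s)]
kept_new = [s for s in samples if new.match(s)]

print(kept_orig)
print(kept_new)
```

With the original pattern both headings are dropped; with the new one they are kept, and the mixed-case line is rejected either way.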


Thanks again.

If there is a "master list", compare your candidates against it instead of
using a heuristic, i.e.

categories = set(master_list)
output = [category for category in input if category in categories]
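A small worked version of the same idea, as a sketch only. The sample data is made up, and normalize() is a hypothetical helper added on the assumption that the master list and the input may differ in case or spacing; drop it if they match exactly.

```python
def normalize(name):
    # Hypothetical helper: collapse runs of whitespace and ignore case.
    return " ".join(name.split()).casefold()


# Made-up sample data; the real master list would come from the site.
master_list = ["Automobile - Dealers", "Office Services, Supplies & Equipment"]
candidates = [
    "AUTOMOBILE - DEALERS",
    "ACME MOTORS INC",
    "OFFICE SERVICES,  SUPPLIES & EQUIPMENT",
]

categories = {normalize(name) for name in master_list}
output = [c for c in candidates if normalize(c) in categories]

print(output)
```

Membership tests against a set are constant time on average, so this stays fast even with thousands of candidate lines.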

You can find the categories with

import urllib.request
import bs4
soup = bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())

categories = set()
for li in soup.find_all("li"):
    assert li.parent.parent["class"][0].startswith("category_items")
    categories.add(li.text)

print("\n".join(sorted(categories)[:10]))



"import urllib.request
ImportError: No module named request"


Figured it out using urllib2. Your code returns 411 categories from that first page.
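For reference, the ImportError comes from running Python 3 code on Python 2, where urlopen lives in urllib2. One common way to support both is a fallback import; a minimal sketch, where fetch() is a hypothetical helper name and not something from the earlier code:

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def fetch(url):
    """Return the raw bytes of the page at url (hypothetical helper)."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()
```

The rest of the script can then call fetch("http://www.usdirectory.com/cat/g0") without caring which Python version is running.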

There are up to 4 levels of categorization:


Level 1: Arts & Entertainment
Level 2:   Newspapers

Level 3:     Newspaper Brokers
Level 3:     Newspaper Dealers Back Number
Level 3:     Newspaper Delivery
Level 3:     Newspaper Distributors
Level 3:     Newsracks
Level 3:     Printers Newspapers
Level 3:     Newspaper Dealers

Level 3:     News Dealers
Level 4:       News Dealers Wholesale
Level 4:       Shoppers News Publications

Level 3:     News Service
Level 4:       Newspaper Feature Syndicates
Level 4:       Prepress Services
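One possible way to hold such a hierarchy in Python is a dict of dicts, where each name maps to its subcategories and an empty dict marks a leaf. This is just a sketch of a data structure using the example above; walk() is a hypothetical helper, and it relies on dicts keeping insertion order (Python 3.7+).

```python
# The example hierarchy above as nested dicts; {} means "no subcategories".
tree = {
    "Arts & Entertainment": {
        "Newspapers": {
            "Newspaper Brokers": {},
            "Newspaper Dealers Back Number": {},
            "Newspaper Delivery": {},
            "Newspaper Distributors": {},
            "Newsracks": {},
            "Printers Newspapers": {},
            "Newspaper Dealers": {},
            "News Dealers": {
                "News Dealers Wholesale": {},
                "Shoppers News Publications": {},
            },
            "News Service": {
                "Newspaper Feature Syndicates": {},
                "Prepress Services": {},
            },
        },
    },
}


def walk(node, level=1):
    """Yield (level, name) pairs depth first."""
    for name, children in node.items():
        yield level, name
        yield from walk(children, level + 1)


rows = list(walk(tree))
for level, name in rows:
    print("Level %d: %s%s" % (level, "  " * (level - 1), name))
```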




http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories and 390 Level 2. To get Levels 3 and 4 you have to drill down using the hyperlinks.

How to do that in Python is beyond my skills at this point. Do I get the hrefs, load and parse each one, then get the next level's hrefs, load and parse those, and so on?
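That approach (collect the hrefs, load each page, repeat) can be written as a short recursive function. The following is a rough sketch, not tested against the real site. Assumptions: Python 3; every category is an <a> inside an <li>; subpages use the same layout as the first page. It uses the standard library's html.parser rather than bs4 so that it has no dependencies, and LinkCollector, fetch and crawl are names made up for this example. Note that "depth" counts pages followed, which need not equal the site's level numbers, since the first page already holds Levels 1 and 2.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> found inside an <li>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._li_depth = 0
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._li_depth += 1
        elif tag == "a" and self._li_depth:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._li_depth:
            self._li_depth -= 1
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url):
    """Download a page and return it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(url, fetch=fetch, max_depth=4, depth=1, seen=None):
    """Return (depth, name) pairs, following links depth first."""
    seen = set() if seen is None else seen
    if depth > max_depth or url in seen:
        return []
    seen.add(url)  # never load the same page twice
    parser = LinkCollector()
    parser.feed(fetch(url))
    result = []
    for href, name in parser.links:
        result.append((depth, name))
        # Relative hrefs are resolved against the page they came from.
        result.extend(crawl(urljoin(url, href), fetch, max_depth, depth + 1, seen))
    return result
```

Usage would be something like crawl("http://www.usdirectory.com/cat/g0"). Passing your own fetch function makes it possible to try the logic on saved pages first, and a real run should probably pause between requests to go easy on the server.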




--
https://mail.python.org/mailman/listinfo/python-list
