Roy Hinkelman wrote:
Your list is great. I've been lurking for the past two weeks while I
learned the basics. Thanks.
I am trying to loop through two files and scrape some data, but the loops
are not working.
The script is not getting past the first URL from state_list, as the
test print shows.
If someone could point me in the right direction, I'd appreciate it.
I would also like to know the difference between open() and
csv.reader(). I had similar issues with csv.reader() when opening
these files.
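On that last question: iterating a file object from open() yields the raw lines with the trailing newline still attached, while csv.reader() wraps that same file object and yields each row as a list of already-split fields. Both are one-shot iterators over the underlying file, which is why they show the same behaviour here. A minimal sketch, using an in-memory file in place of the real CSVs:

```python
import csv
import io

# In-memory stand-in for one of the CSV files (hypothetical data).
f = io.StringIO("a,b\nc,d\n")

lines = list(f)             # open()-style iteration: raw lines, "\n" kept
f.seek(0)                   # rewind -- the iterator is exhausted otherwise
rows = list(csv.reader(f))  # csv.reader: each row as a list of fields

# lines == ["a,b\n", "c,d\n"]
# rows  == [["a", "b"], ["c", "d"]]
```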
Any help greatly appreciated.
Roy
Code:
# DOWNLOAD USGS MISSING FILES
import mechanize
import BeautifulSoup as B_S
import re
# import urllib
import csv
# OPEN FILES
# LOOKING FOR THESE SKUs
_missing = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\missing_topo_list.csv', 'r')
# IN THESE STATES
_states = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\state_list.csv', 'r')
# IF NOT FOUND, LIST THEM HERE
_missing_files = []
# APPEND THIS FILE WITH META
_topo_meta = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\topo_meta.csv', 'a')
# OPEN PAGE
for each_state in _states:
    each_state = each_state.replace("\n", "")
    print each_state
    html = mechanize.urlopen(each_state)
    _soup = B_S.BeautifulSoup(html)
    # SEARCH THRU PAGE AND FIND ROW CONTAINING META MATCHING SKU
    _table = _soup.find("table", "tabledata")
    print _table  # test -- this is returning 'None'
If you take a look at the web page you are opening, you will notice there
are no tables in it. Are you certain you are using the correct URLs for this?
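A cheap way to make the script survive such pages is to test the result of find() before using it; BeautifulSoup's find() returns None when nothing matches. A minimal sketch, with a hypothetical find_table() standing in for the soup lookup and no real network access:

```python
def find_table(html):
    # Stand-in for _soup.find("table", "tabledata"): like BeautifulSoup,
    # it returns None when the page contains no matching table.
    return "<table>" if "tabledata" in html else None

skipped = []
pages = [("http://libremap.org/data/state/Colorado/drg/", "<p>no table</p>"),
         ("http://example.invalid/ok", '<table class="tabledata">')]
for url, page in pages:
    _table = find_table(page)
    if _table is None:        # guard before ever calling _table.find(...)
        skipped.append(url)
        continue
    # ... safe to search the table here ...
```

With the guard in place the bad URL just ends up in `skipped` instead of raising AttributeError later.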
    for each_sku in _missing:
The for loop `for each_sku in _missing:` will only make it through the file
once: a file object is a one-shot iterator, so after the first state it is
already at end-of-file and yields nothing on later passes. You can either
pre-read it into a list / dictionary / set (whichever you prefer), or
re-open the file on each pass:

_missing_filename = 'C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\missing_topo_list.csv'
for each_sku in open(_missing_filename):
    # carry on here
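The pre-read version could look like the sketch below, with an in-memory file standing in for missing_topo_list.csv and a hypothetical state list; the set is built once and can be scanned again for every state:

```python
import io

# Stand-in for open(...'missing_topo_list.csv'); the real script opens the file.
_missing = io.StringIO("33087c2\n34087b2\n33086b7\n34086c2\n")

# One pass over the file, stripping newlines, into a reusable set.
missing_skus = set(line.strip() for line in _missing)

# The set survives every iteration of the outer state loop.
for each_state in ["Colorado", "Connecticut"]:  # hypothetical state list
    for each_sku in missing_skus:
        pass  # ... look the SKU up in the page's table, as before ...
```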
        each_sku = each_sku.replace("\n", "")
        print each_sku  # test
        try:
            _row = _table.find('tr', text=re.compile(each_sku))
        except (IOError, AttributeError):
            _missing_files.append(each_sku)
            continue
        else:
            _row = _row.previous
            _row = _row.parent
            _fields = _row.findAll('td')
            _name = _fields[1].string
            _state = _fields[2].string
            _lat = _fields[4].string
            _long = _fields[5].string
            _sku = _fields[7].string
            _topo_meta.write(_name + "|" + _state + "|" + _lat +
                             "|" + _long + "|" + _sku + "||")
            print each_sku + ': ' + _name
print "Missing Files:"
print _missing_files
_topo_meta.close()
_missing.close()
_states.close()
The message I am getting is:
Code:
>>>
http://libremap.org/data/state/Colorado/drg/
None
33087c2
Traceback (most recent call last):
File "//Dc1/Data/SharedDocs/Roy/_Coding Vault/Python code samples/usgs_missing_file_META.py", line 34, in <module>
_row = _table.find('tr', text=re.compile(each_sku))
AttributeError: 'NoneType' object has no attribute 'find'
And the files look like:
Code:
state_list
http://libremap.org/data/state/Colorado/drg/
http://libremap.org/data/state/Connecticut/drg/
http://libremap.org/data/state/Pennsylvania/drg/
http://libremap.org/data/state/South_Dakota/drg/
missing_topo_list
33087c2
34087b2
33086b7
34086c2
------------------------------------------------------------------------
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
Hope the comments above help in your endeavours.
--
Kind Regards,
Christian Witts