Roy Hinkelman wrote:

Your list is great. I've been lurking for the past two weeks while I learned the basics. Thanks.

I am trying to loop through two files and scrape some data, and the loops are not working.

The script is not getting past the first URL from state_list, as the test print shows.

If someone could point me in the right direction, I'd appreciate it.

I would also like to know the difference between open() and csv.reader(). I had similar issues with csv.reader() when opening these files.

Any help greatly appreciated.

Roy

Code:
    # DOWNLOAD USGS MISSING FILES

    import mechanize
    import BeautifulSoup as B_S
    import re
    # import urllib
    import csv

    # OPEN FILES
    # LOOKING FOR THESE SKUs
    _missing = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\missing_topo_list.csv', 'r')
    # IN THESE STATES
    _states = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\state_list.csv', 'r')
    # IF NOT FOUND, LIST THEM HERE
    _missing_files = []
    # APPEND THIS FILE WITH META
    _topo_meta = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\topo_meta.csv', 'a')

    # OPEN PAGE
    for each_state in _states:
        each_state = each_state.replace("\n", "")
        print each_state
        html = mechanize.urlopen(each_state)
        _soup = B_S.BeautifulSoup(html)
        # SEARCH THRU PAGE AND FIND ROW CONTAINING META MATCHING SKU
        _table = _soup.find("table", "tabledata")
        print _table  # test -- this is returning 'None'

If you take a look at the webpage you are opening, you will notice there are no tables on it. Are you certain you are using the correct URLs for this?
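That `None` is also what triggers the AttributeError further down: `_soup.find()` returns None when nothing matches, and the script then calls `.find()` on that None. A minimal sketch of the guard, using a hypothetical `find_table()` as a plain-Python stand-in for the BeautifulSoup call so it runs on its own:

```python
# find_table() is a stand-in for _soup.find("table", "tabledata"),
# which returns None when the page has no matching table.
def find_table(page_has_table):
    return '<table>...</table>' if page_has_table else None

missing_files = []
for sku in ['33087c2', '34087b2']:
    table = find_table(False)      # e.g. the Colorado page had no table
    if table is None:              # guard before calling table.find(...)
        missing_files.append(sku)  # record the SKU and move on, no crash
        continue

print(missing_files)
```

With the guard in place a page without the expected table just records its SKUs as missing instead of raising.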
        for each_sku in _missing:
The for loop `for each_sku in _missing:` will only work on the first pass of your outer loop: a file object is an iterator, and once it has been read to the end it stays exhausted. You can either pre-read it into a list / dictionary / set (whichever you prefer), or re-open the file on each pass:

    _missing_filename = 'C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\missing_topo_list.csv'
    for each_sku in open(_missing_filename):
        # carry on here
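The pre-read variant might look like this (io.StringIO stands in for the real file here, purely so the sketch is self-contained; a real file object behaves the same way):

```python
import io

# Stand-in for open('...missing_topo_list.csv'); like a real file object,
# it can only be iterated through once.
fake_file = io.StringIO('33087c2\n34087b2\n33086b7\n')

# Read every SKU once, up front, stripping the trailing newlines.
missing_skus = set(line.strip() for line in fake_file)

hits = 0
for state in ['Colorado', 'Connecticut', 'Pennsylvania']:
    for sku in missing_skus:   # a set can be re-scanned on every outer pass
        hits += 1

print(hits)   # 3 states x 3 SKUs = 9
```

A set is a good fit here since you only ever test membership; a list would also work if you care about order.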
            each_sku = each_sku.replace("\n","")
            print each_sku #test
            try:
                _row = _table.find('tr', text=re.compile(each_sku))
            except (IOError, AttributeError):
                _missing_files.append(each_sku)
                continue
            else:
                _row = _row.previous
                _row = _row.parent
                _fields = _row.findAll('td')
                _name = _fields[1].string
                _state = _fields[2].string
                _lat = _fields[4].string
                _long = _fields[5].string
                _sku = _fields[7].string

                _topo_meta.write(_name + "|" + _state + "|" + _lat + "|" + _long + "|" + _sku + "||")
                print each_sku + ': ' + _name

    print "Missing Files:"
    print _missing_files
    _topo_meta.close()
    _missing.close()
    _states.close()


The message I am getting is:

Code:
    >>>
    http://libremap.org/data/state/Colorado/drg/
    None
    33087c2
    Traceback (most recent call last):
      File "//Dc1/Data/SharedDocs/Roy/_Coding Vault/Python code samples/usgs_missing_file_META.py", line 34, in <module>
        _row = _table.find('tr', text=re.compile(each_sku))
    AttributeError: 'NoneType' object has no attribute 'find'


And the files look like:

Code:
    state_list
    http://libremap.org/data/state/Colorado/drg/
    http://libremap.org/data/state/Connecticut/drg/
    http://libremap.org/data/state/Pennsylvania/drg/
    http://libremap.org/data/state/South_Dakota/drg/

    missing_topo_list
    33087c2
    34087b2
    33086b7
    34086c2


Hope the comments above help in your endeavours.

--
Kind Regards,
Christian Witts


_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
