Thanks MRAB as well. I've printed all of the replies to retain with my
pile of essential documentation.

To follow up with a complete response, I've pulled the essential
components of the solution I got working out of my mechanize module.

The main body of the code passes a URL to the scrape_records function.
The function attempts to open the URL five times.

If the URL is opened, a values dictionary is populated and returned to
the calling statement. If the URL cannot be opened after the maximum
number of attempts, a fatal error is printed and the function returns
None so the caller can stop. There's a little sleep call in the
function to leave time for any errant connection problem to resolve
itself.
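The retry-with-sleep idea is independent of mechanize, so it can be
factored into a small helper. This is just a sketch of the same
pattern; the names retry, func, and delay are mine, not part of any
library:

```python
import time

def retry(func, maxattempts=5, delay=1):
    """Call func() up to maxattempts times, sleeping between failures.

    Returns func()'s result on the first success, or None if every
    attempt raises. IOError covers urllib2.URLError, which subclasses
    it, so the same except clause works for the code below.
    """
    for count in range(maxattempts):
        try:
            return func()
        except IOError:
            time.sleep(delay)
    # Loop exhausted without a successful return
    return None
```

You could then write scrape_records as a plain function and wrap the
br.open call with retry(lambda: br.open(url)).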

Thanks to all for your replies. I hope this helps someone else:

import urllib2, time
from mechanize import Browser

def scrape_records(url):
    maxattempts = 5
    br = Browser()
    user_agent = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; '
                  'rv:1.9.0.16) Gecko/2009120208 Firefox/3.0.16 '
                  '(.NET CLR 3.5.30729)')
    br.addheaders = [('User-agent', user_agent)]
    for count in xrange(maxattempts):
        try:
            print url, count
            br.open(url)
            break
        except urllib2.URLError:
            print 'URL error', count
            # Pretend a failed connection was fixed
            if count == 2:
                url = 'http://www.google.com'
            time.sleep(1)
    else:
        # The else clause runs only if the loop finished without break,
        # i.e. every attempt failed
        print 'Fatal URL error. Process terminated.'
        return None
    # Scrape page and populate valuesDict
    valuesDict = {}
    return valuesDict

url = 'http://badurl'
valuesDict = scrape_records(url)
if valuesDict is None:
    print 'Failed to retrieve valuesDict'