Attempting a simple web crawler in python3, browse a url - collect/clean/produce valid url list from a url; random choice from valid url list; visit new url repeat process - collect/clean/produce valid url list from new url...
So there are many tools out there but mechanize and scrapy 3rd party modules seem to produce the best results; however nothing like these exist for Python3. I get close but cannot produce the clean simple url results in python3 my borrowed bastardized python2.7 code producing clean url list from "nytimes" example: ### #MechSnake2 import urllib import mechanize import cookielib from bs4 import BeautifulSoup import urlparse # Browser br = mechanize.Browser() # Cookie Jar cj = cookielib.LWPCookieJar() br.set_cookiejar(cj) # Browser options br.set_handle_equiv(True) br.set_handle_gzip(True) br.set_handle_redirect(True) br.set_handle_referer(True) br.set_handle_robots(False) # Follows refresh 0 but not hangs on refresh > 0 br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1) # debugging messages br.set_debug_http(True) br.set_debug_redirects(True) br.set_debug_responses(True) # User-Agent Not a Robot here spoof Mozilla coolness.... br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] url = ("http://www.nytimes.com") br.open(url) for link in br.links(): newurl = urlparse.urljoin(link.base_url, link.url) b1 = urlparse.urlparse(newurl).hostname b2 = urlparse.urlparse(newurl).path print "http://"+b1+b2 ### Works like a charm! But now go do the same in python3? I am stumped... Most of the other project code I am working on is in Python3 and works great! I can produce everything above except the last parse join cleaning process (urllib.request.Request(url, headers = Mozilla/5.0 etc...for spoofing) but then it fails to produce the same clean list because of the Mechanize mojo producing link info that easily allows the parse join cleaning process at the end. So can one call the python27 program and use results in python33? Thank you in advance! Another nightmare is stripping-cleaning google search results into clean url lists... but I will defer to another post for that mess. Thanks again in advance Paul Smith +++ Two Kinds of Intelligence There are two kinds of intelligence: one acquired, as a child in school memorizes facts and concepts from books and from what the teacher says, collecting information from the traditional sciences as well as from the new sciences. With such intelligence you rise in the world. You get ranked ahead or behind others in regard to your competence in retaining information. You stroll with this intelligence in and out of fields of knowledge, getting always more marks on your preserving tablets. There is another kind of tablet, one already completed and preserved inside you. A spring overflowing its springbox. A freshness in the center of the chest. This other intelligence does not turn yellow or stagnate. It's fluid, and it doesn't move from outside to inside through conduits of plumbing-learning. This second knowing is a fountainhead from within you, moving out. Mewlana Jalaluddin Rumi +++
_______________________________________________ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor