in python3.3?

Paul Smith Fri, 13 Sep 2013 08:01:17 -0700

Attempting a simple web crawler in python3, browse a url -
collect/clean/produce valid url list from a url; random choice from valid
url list; visit new url repeat process - collect/clean/produce valid url
list from new url...


So there are many tools out there but mechanize and scrapy 3rd party
modules seem to produce the best results; however nothing like these exist
for Python3. I get close but cannot produce the clean simple url results in
python3

my borrowed bastardized python2.7 code producing clean url list from
"nytimes" example:
###

#MechSnake2

import urllib
import mechanize
import cookielib
from bs4 import BeautifulSoup
import urlparse
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# debugging messages
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)
# User-Agent Not a Robot here spoof Mozilla coolness....
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US;
rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
url = ("http://www.nytimes.com";)
br.open(url)
for link in br.links():
newurl = urlparse.urljoin(link.base_url, link.url)
b1 = urlparse.urlparse(newurl).hostname
b2 = urlparse.urlparse(newurl).path
print "http://"+b1+b2

###
Works like a charm!

But now go do the same in python3? I am stumped...

Most of the other project code I am working on is in Python3 and works
great! I can produce everything above except the last parse join cleaning
process (urllib.request.Request(url, headers = Mozilla/5.0 etc...for
spoofing) but then it fails to produce the same clean list because of the
Mechanize mojo producing link info that easily allows the parse join
cleaning process at the end.

So can one call the python27 program and use results in python33?

Thank you in advance!

Another nightmare is stripping-cleaning google search results into clean
url lists... but I will defer to another post for that mess.

Thanks again in advance

Paul Smith

+++

Two Kinds of Intelligence

There are two kinds of intelligence: one acquired,
as a child in school memorizes facts and concepts
from books and from what the teacher says,
collecting information from the traditional sciences
as well as from the new sciences.


With such intelligence you rise in the world.
You get ranked ahead or behind others
in regard to your competence in retaining
information. You stroll with this intelligence
in and out of fields of knowledge, getting always more
marks on your preserving tablets.


There is another kind of tablet, one
already completed and preserved inside you.
A spring overflowing its springbox. A freshness
in the center of the chest. This other intelligence
does not turn yellow or stagnate. It's fluid,
and it doesn't move from outside to inside
through conduits of plumbing-learning.


This second knowing is a fountainhead
from within you, moving out.

Mewlana Jalaluddin Rumi

+++

_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

[Tutor] Newbie question. Is it possible to run/call a python2.7 program and reference results from/in python3.3?

Reply via email to