On 11/07/2012 10:44 AM, Seema V Srivastava wrote:
> Hi,
> I am new to Python, trying to learn it by carrying out specific tasks.  I
> want to start with trying to scrap the contents of a web page.  I have
> downloaded Python 3.3 and BeautifulSoup 4.
>
> If I call upon urlopen in any form, such as below, I get the error as shown
> below the syntax:  Does urlopen not apply to Python 3.3?  If not then
> what;s the syntax I should be using?  Thanks so much.
>
> import urllib
> from bs4 import BeautifulSoup
> soup = BeautifulSoup(urllib.urlopen("http://www.pinterest.com"))
>
> Traceback (most recent call last):
>   File "C:\Users\Seema\workspace\example\main.py", line 3, in <module>
>     soup = BeautifulSoup(urllib.urlopen("http://www.pinterest.com"))
> AttributeError: 'module' object has no attribute 'urlopen'
>

Since you're trying to learn, let me point out a few things that would
let you teach yourself, which is usually quicker and more effective than
asking on a mailing list.  (Go ahead and ask, but if you figure out the
simpler ones yourself, you'll learn faster.)

(BTW, I'm using 3.2, but it'll probably be very close.)

First, that error has nothing to do with BeautifulSoup.  If it had, I
wouldn't have responded, since I don't have any experience with BS.  The
way you could learn that for yourself is to factor the line giving the
error:

tmp = urllib.urlopen("http://www.pinterest.com")
soup = BeautifulSoup(tmp)

Now you'll get the error on the first line, before doing anything with
BeautifulSoup.

Now that you have narrowed it to urllib.urlopen, go find the docs for
that.  I used DuckDuckGo, with keywords  python urllib urlopen,  and the
first match was:

http://docs.python.org/2/library/urllib.html

and even though these are the 2.7.3 docs, the first paragraph tells you
something useful:

    Note:  The urllib module has been split into parts and renamed in
    Python 3 to urllib.request, urllib.parse, and urllib.error.  The
    2to3 tool will automatically adapt imports when converting your
    sources to Python 3.  Also note that the urllib.urlopen() function
    has been removed in Python 3 in favor of urllib2.urlopen().

Now, the next question I'd ask is whether you're working from a book (or
online tutorial), and whether that book is describing Python 2.x.  If so,
you might encounter this type of pain many times.

Anyway, another place you can learn is from the interactive interpreter.
Just run python3, and experiment.

>>> import urllib
>>> urllib.urlopen
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'urlopen'
>>> dir(urllib)
['__builtins__', '__cached__', '__doc__', '__file__', '__name__',
'__package__', '__path__']
>>>

Notice that dir shows us the attributes of urllib, and none of them look
directly useful.  That's because urllib is a package, not just a module.
A package is a container for other modules.  We can also look at __file__:

>>> urllib.__file__
'/usr/lib/python3.2/urllib/__init__.py'

That __init__.py is another clue;  that's the way packages are
initialized.

But when I try importing urllib2, I get
    ImportError: No module named urllib2

So back to the website.  Using the dropdown at the upper left, I can
change from 2.7 to 3.3:

http://docs.python.org/3.3/library/urllib.html

There it is quite explicit.

urllib is a package that collects several modules for working with URLs:

  * urllib.request for opening and reading URLs
  * urllib.error containing the exceptions raised by urllib.request
  * urllib.parse for parsing URLs
  * urllib.robotparser for parsing robots.txt files

So, if we continue to play with the interpreter, we can try:

>>> import urllib.request
>>> dir(urllib.request)
['AbstractBasicAuthHandler', 'AbstractDigestAuthHandler',
'AbstractHTTPHandler', 'BaseHandler', 'CacheFTPHandler',
'ContentTooShortError', 'FTPHandler', 'FancyURLopener', 'FileHandler',
'HTTPBasicAuthHandler', 'HTTPCookieProcessor', 'HTTPDefaultErrorHandler',
'HTTPDigestAuthHandler', 'HTTPError', 'HTTPErrorProcessor', ......
'urljoin', 'urlopen', 'urlparse', 'urlretrieve', 'urlsplit', 'urlunparse']

I chopped off part of the long list of things that were imported into
that module.  But one of them is urlopen, which is what you were looking
for before.

So back to your own sources, try:

>>> tmp = urllib.request.urlopen("http://www.pinterest.com")
>>> tmp
<http.client.HTTPResponse object at 0x1df1c10>

OK, the next thing you might wonder is what parameters urlopen might take:

>>> help(urllib.request.urlopen)
Help on function urlopen in module urllib.request:

urlopen(url, data=None, timeout=<object object>, *, cafile=None, capath=None)

Hopefully, this will get you started with BeautifulSoup.  As I said
before, I have no experience with that part.

Note that I normally use the docs.python.org documentation much more.
But a quick question to the interpreter can be very useful, especially
if you don't have internet access.

-- 
DaveA
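
P.S.  Putting the pieces above together, here is a minimal sketch of the
factored fetch step for Python 3.  The helper name fetch_html, the
10-second timeout, and the UTF-8 fallback are my own assumptions for
illustration, not anything urllib prescribes, and I have left
BeautifulSoup out since I can't vouch for that part.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the body of url as text (hypothetical helper, not part of urllib)."""
    # The fetch sits on its own line, so a failure here clearly points
    # at urllib rather than at whatever consumes the result later.
    with urllib.request.urlopen(url, timeout=timeout) as tmp:
        raw = tmp.read()  # urlopen hands back bytes in Python 3
        # Assumption: fall back to UTF-8 when the response names no charset.
        charset = tmp.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Example call (needs network access, so it is left commented out):
# html = fetch_html("http://www.pinterest.com")
```

Whatever fetch_html returns is what you would then hand to BeautifulSoup
in place of the original urllib.urlopen(...) expression.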