I wanted to strip the quotes from IMDB quote pages, just to start
learning Python. Quotes are not nested, so I grabbed the anchor links
that precede them. I thought I could walk down from each anchor until
I hit an HR tag, grabbing people and quotes along the way via hits on
<b> and <br>. But once I tried to walk down from my hit on the anchor
link and pull the tag name, I found I kept getting a NavigableString
instead of a Tag, so asking for the .name attribute gave an error.

Any idea why this might happen?


This is the relevant chunk of IMDB code:

<a name="qt0210620"></a>

<b><a href="/name/nm0629454/">Bill</a></b>:
You're supposed to wear the blue dress when I wear this.
<br>

<b><a href="/name/nm0707043/">Mary</a></b>:
I don't want to dress like twins anymore.
<br>

<b><a href="/name/nm0629454/">Bill</a></b>:
We're not twins. We're a trio.
<br>
<hr width="30%">
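A minimal sketch of what may be going on with the chunk above: the
blank lines between the anchor and the first <b> are themselves a text
node, so the anchor's next sibling would be a string rather than a tag.
To keep it runnable without extra installs, this uses the html.parser
module from the Python 3 standard library instead of BeautifulSoup
(that swap is an assumption of the sketch); NodeLister and SNIPPET are
made-up names for illustration.

```python
from html.parser import HTMLParser

# Shortened copy of the IMDB chunk, including the blank line after the anchor.
SNIPPET = """<a name="qt0210620"></a>

<b><a href="/name/nm0629454/">Bill</a></b>:
You're supposed to wear the blue dress when I wear this.
<br>
<hr width="30%">
"""


class NodeLister(HTMLParser):
    """Record every start tag and every run of text, in document order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append(("tag", tag))

    def handle_data(self, data):
        # Whitespace between tags arrives here too; it is ordinary text.
        self.nodes.append(("text", data))


lister = NodeLister()
lister.feed(SNIPPET)
lister.close()

for kind, value in lister.nodes:
    print(kind, repr(value))
```

The entry printed right after the anchor tag is a text node holding
only newlines, which would match the NavigableString seen before the
first <b>.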


---


And this is what I wrote (and if there are other awful things about
this, I would be happy to know):


#!/usr/bin/env python

import urllib2
from BeautifulSoup import BeautifulSoup
import re


# stubs --------------------------

movietitle_stub = "Nashville"  # later: search and pull first result (if movie?)
movieurl_stub = "http://imdb.com/title/tt0073440/"  # and get this



def soupifyPage(target):
        """
        grab html from a page
        probably need real method of checking for failure, huh
        """
        codeReq = urllib2.Request(target)
        response = urllib2.urlopen(codeReq)
        soupyhtml = BeautifulSoup(response)
        return soupyhtml


def pullQuote(curTag):
        # character is in bold
        print curTag.nextSibling.name
        '''
        if curTag.nextSibling.name == 'hr':
                #are done
                return quoteBlock
        print "seeing" + curTag.nextSibling.name
        quoteBlock = quoteBlock + " - " + curTag.nextSibling.name
        curTag = curTag.nextSibling
        '''




quotepage = movieurl_stub + "quotes"
print "Getting this:" + quotepage
print "---------------"
quotebag = soupifyPage(quotepage)


# each quote is preceded by an anchor link whose name begins with qt,
# for example <a name="qt0229419"></a>
# they end with an HR tag
# they are not nested

quotations = quotebag.findAll(attrs = {'name' : re.compile("^qt")})

for q in quotations:
        #pullQuote(q)
        # attribute error: "'NavigableString' object has no attribute 'name'"
        print q.nextSibling.name
        print "next!"
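
One possible way to do the walk described above (start at a qt anchor,
take the speaker from <b>, take the line up to <br>, stop at <hr>),
sketched so that whitespace text between tags cannot trip it up. This
is not the BeautifulSoup approach from the script; it swaps in the
event-based html.parser module from the Python 3 standard library, and
QuoteCollector, SAMPLE and quotes are hypothetical names for
illustration only.

```python
from html.parser import HTMLParser

SAMPLE = """<a name="qt0210620"></a>

<b><a href="/name/nm0629454/">Bill</a></b>:
You're supposed to wear the blue dress when I wear this.
<br>

<b><a href="/name/nm0707043/">Mary</a></b>:
I don't want to dress like twins anymore.
<br>

<b><a href="/name/nm0629454/">Bill</a></b>:
We're not twins. We're a trio.
<br>
<hr width="30%">
"""


class QuoteCollector(HTMLParser):
    """Collect (speaker, line) pairs, grouped per quote block."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.quotes = []      # finished quote blocks
        self.current = None   # lines of the block being read, or None
        self.in_bold = False  # True while inside <b>...</b>
        self.speaker = ""
        self.text = ""

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("name") or ""
        if tag == "a" and name.startswith("qt"):
            self.current = []          # a qt anchor opens a new block
        elif self.current is None:
            return                     # ignore anything outside a block
        elif tag == "b":
            self.in_bold = True        # speaker name follows
            self.speaker = ""
        elif tag == "br":
            self._finish_line()        # <br> closes one spoken line
        elif tag == "hr":
            self._finish_line()
            self.quotes.append(self.current)  # <hr> closes the block
            self.current = None

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False
            self.text = ""             # drop whitespace seen before the name

    def handle_data(self, data):
        if self.current is None:
            return
        if self.in_bold:
            self.speaker += data
        else:
            self.text += data

    def _finish_line(self):
        # Strip the colon that follows the speaker, plus stray whitespace.
        line = self.text.strip().lstrip(":").strip()
        if self.speaker and line:
            self.current.append((self.speaker.strip(), line))
        self.speaker = ""
        self.text = ""


collector = QuoteCollector()
collector.feed(SAMPLE)
collector.close()
quotes = collector.quotes

for block in quotes:
    for speaker, line in block:
        print(speaker + ": " + line)
    print("---")
```

Because the text is only accumulated and stripped at <br> or <hr>, the
blank lines between tags never need to be treated as tags at all.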
                



Thanks,
Clay

- - - - - - -

Clay S. Wiedemann