Hi
 
Can someone help with this, please?
 
I got to this point with help from the list.
 
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '<a href="http://www.google.co.uk"></a>',
       '<a href="http://www.bbc.co.uk"></a>',
       '<a href="http://www.amazon.co.uk"></a>',
       '<a href="http://www.redhat.co.uk"></a>',
       '</html>']
soup = BeautifulSoup(''.join(doc))
alist = soup.findAll('a')

import urlparse
for a in alist:
    href = a['href']
    print urlparse.urlparse(href)[1]
 
So BeautifulSoup is used to find the <a> tags, urlparse to extract the fully 
qualified domain name, and print to give a nice list of hosts, one per line. Here:
www.google.co.uk
www.bbc.co.uk
www.amazon.co.uk
www.redhat.co.uk
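
For reference, the host-extraction step on its own might look like the sketch below. It assumes the hrefs are already in a plain list (the name hrefs is made up for illustration) and it tries the Python 3 location of urlparse first, falling back to the Python 2 module used above.

```python
try:
    from urllib.parse import urlparse  # Python 3 location
except ImportError:
    from urlparse import urlparse      # Python 2 module, as used in the post

# Hypothetical list standing in for the href values pulled out of the <a> tags
hrefs = [
    "http://www.google.co.uk",
    "http://www.bbc.co.uk",
    "http://www.amazon.co.uk",
    "http://www.redhat.co.uk",
]

# Index [1] of the parse result is the network location (the host part)
hosts = [urlparse(href)[1] for href in hrefs]

for host in hosts:
    print(host)
```

Index [1] is the same field that is available as the netloc attribute of the result.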
 
Nice, so I think I will write them out to a file. I change the program to this, 
to write to disk and then read them back to see what's been done.
 
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '<a href="http://www.google.co.uk"></a>',
       '<a href="http://www.bbc.co.uk"></a>',
       '<a href="http://www.amazon.co.uk"></a>',
       '<a href="http://www.redhat.co.uk"></a>',
       '</html>']
soup = BeautifulSoup(''.join(doc))
alist = soup.findAll('a')

import urlparse
output = open("fqdns.txt","w")
for a in alist:
    href = a['href']
    output.write(urlparse.urlparse(href)[1])
output.close()
 
 
this writes out www.google.co.ukwww.bbc.co.ukwww.amazon.co.ukwww.redhat.co.uk
 
So I look in Alan's tutor PDF for the issue and read page 120, where it suggests 
doing this:
    outp.write(line + '\n') # \n is a newline
 
so i change my line from this
    output.write(urlparse.urlparse(href)[1])
to this
    output.write(urlparse.urlparse(href)[1] + "\n")
 
I look at the output file and I get this
 
www.google.co.uk
www.bbc.co.uk
www.amazon.co.uk
www.redhat.co.uk
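
To check exactly what ends up on disk after adding "\n", one option is to read the whole file back as a single string and look at its repr. This is only a sketch: the hosts list and the temporary path are stand-ins chosen so that it does not touch the real fqdns.txt.

```python
import os
import tempfile

# Stand-in for the hosts extracted from the <a> tags
hosts = ["www.google.co.uk", "www.bbc.co.uk", "www.amazon.co.uk", "www.redhat.co.uk"]

# Hypothetical scratch location, so a real fqdns.txt is not overwritten
path = os.path.join(tempfile.mkdtemp(), "fqdns.txt")

output = open(path, "w")
for host in hosts:
    output.write(host + "\n")  # write() adds nothing itself, so append the newline
output.close()

# Read everything back as one string to see what was stored
infile = open(path, "r")
raw = infile.read()
infile.close()

print(repr(raw))  # repr makes each "\n" visible
```

If each host is followed by exactly one "\n" in the repr, the writing side is doing what was intended.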
 
Hooray, I think, so then I open the file in the program to read each line and do 
something with it.
I pop this after the last output.close()
 
input = open("fqdns.txt","r")
for j in input:
    print j
input.close()
 
but this prints out 
 
www.google.co.uk
 
www.bbc.co.uk
 
www.amazon.co.uk
 
www.redhat.co.uk
 
 
Why do I get each record with an extra newline? Am I writing out the records 
incorrectly, or am I handling them incorrectly when I open the file and print? 
Do I have to take out the newlines as I process?
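
A likely explanation, as a sketch rather than a definitive answer: iterating over a file gives back each line with its trailing "\n" still attached, and print then adds a newline of its own, so every record is followed by a blank line. The example below uses a made-up scratch file and shows the raw lines next to the stripped ones.

```python
import os
import tempfile

# Hypothetical scratch file holding two records, one per line
path = os.path.join(tempfile.mkdtemp(), "fqdns.txt")
out = open(path, "w")
out.write("www.google.co.uk\nwww.bbc.co.uk\n")
out.close()

raw_lines = []
clean_lines = []
infile = open(path, "r")
for line in infile:
    raw_lines.append(line)                 # still ends with "\n"
    clean_lines.append(line.rstrip("\n"))  # trailing newline removed
infile.close()

# Printing the stripped lines gives one host per line with no blank lines between
for host in clean_lines:
    print(host)
```

If this is right, the records are written correctly and only the reading side needs the rstrip. In Python 2, ending the print statement with a comma (print j,) is another way to stop print from adding its own newline.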
 
any help would be great
 
s
 
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
