Hi
 
I was expanding my program to parse URLs from an HTML page and write them
to a file, so I chose www.icq.com as the page to extract the URLs from.
 
When I wrote these out to a file and then read the file back, I noticed a list
of URLs, then some blank lines, then some more URLs, then some more blank lines.
Does this mean that one of the functions called has for some reason added
whitespace to some of the list items before I wrote them out to disk?
 
I also noticed that duplicate hosts/URLs have been written to the file.
 
So my two questions are:
1. How and where do I stop the whitespace from being written out to disk?

2. How do I check for duplicate entries in a list before writing them out
to disk?
 
My code is below:
from BeautifulSoup import BeautifulSoup
import urllib2
import urlparse

file = urllib2.urlopen("http://www.icq.com")
soup = BeautifulSoup(''.join(file))
alist = soup.findAll('a')
output = open("fqdns.txt", "w")
for a in alist:
    href = a['href']
    output.write(urlparse.urlparse(href)[1] + "\n")
output.close()

input = open("fqdns.txt", "r")
for j in input:
    print j,
input.close()
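For what it's worth, the blank lines most likely come from relative hrefs (like "/help/index.html"): urlparse gives those an empty network-location field, so the loop writes just a newline. A set can then filter duplicates. Here is a minimal sketch of both fixes, written with Python 3's urllib.parse (the code above uses the Python 2 urlparse module, whose urlparse() behaves the same way); the sample hrefs are hypothetical stand-ins for what a page might contain:

```python
from urllib.parse import urlparse

# Hypothetical sample hrefs: a mix of absolute links (with a hostname)
# and a relative link (which yields an empty netloc from urlparse).
hrefs = [
    "http://chat.icq.com/chat",
    "/help/index.html",            # relative: netloc is ""
    "http://chat.icq.com/other",   # same host again: a duplicate
    "http://labs.icq.com/",
]

seen = set()    # hostnames already kept
hosts = []      # deduplicated, non-empty hostnames, in original order
for href in hrefs:
    host = urlparse(href).netloc
    if host and host not in seen:  # skip empty netlocs and duplicates
        seen.add(host)
        hosts.append(host)

print(hosts)  # ['chat.icq.com', 'labs.icq.com']
```

The same `if host and host not in seen:` guard could go straight into the loop above, just before the output.write() call, so neither blank lines nor repeats ever reach the file.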
The chopped output is here:
 
chat.icq.com
chat.icq.com
chat.icq.com
chat.icq.com
chat.icq.com


labs.icq.com
download.icq.com
greetings.icq.com
greetings.icq.com
greetings.icq.com
games.icq.com
games.icq.com
_______________________________________________
Tutor maillist  -  [email protected]
http://mail.python.org/mailman/listinfo/tutor