Hi
I was expanding my program to parse URLs from an HTML page and write them
to a file, so I chose www.icq.com to extract the URLs from.
When I wrote these out to a file and then read the file back, I noticed a list
of URLs, then some blank lines, then some more URLs, then some more blank lines.
Does this mean that one of the functions called has for some reason added
whitespace into some of the list items before I wrote them out to disk?
I also noticed that duplicate hosts/URLs have been written to the file.
So my two questions are:
1. How and where do I stop the whitespace from being written out to disk?
2. How do I check for duplicate entries in a list before writing them out
to disk?
My code is below:
from BeautifulSoup import BeautifulSoup
import urllib2
import urlparse
file = urllib2.urlopen("http://www.icq.com")
soup = BeautifulSoup(''.join(file))
alist = soup.findAll('a')
output = open("fqdns.txt","w")
for a in alist:
    href = a['href']
    output.write(urlparse.urlparse(href)[1] + "\n")
output.close()
input = open("fqdns.txt","r")
for j in input: print j,
input.close()
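To illustrate what I'm after, here is a minimal self-contained sketch of filtering and deduplicating before writing. The host list is made up; the empty strings stand in for relative links such as "/chat", for which urlparse.urlparse(href)[1] returns an empty network location (which may be where the blank lines come from):

```python
# Made-up sample of netlocs, as urlparse.urlparse(href)[1] might return them.
# Empty strings stand in for relative links, which have no network location.
hosts = ["chat.icq.com", "", "chat.icq.com", "labs.icq.com", "", "chat.icq.com"]

seen = set()
unique_hosts = []
for host in hosts:
    host = host.strip()             # drop any stray whitespace
    if host and host not in seen:   # skip empty entries and duplicates
        seen.add(host)
        unique_hosts.append(host)

print(unique_hosts)                 # ['chat.icq.com', 'labs.icq.com']
```

Is this roughly the right approach, or is there a more idiomatic way to do it?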
The chopped output is here:
chat.icq.com
chat.icq.com
chat.icq.com
chat.icq.com
chat.icq.com
labs.icq.com
download.icq.com
greetings.icq.com
greetings.icq.com
greetings.icq.com
games.icq.com
games.icq.com
Tutor maillist - [email protected]
http://mail.python.org/mailman/listinfo/tutor