Frank Stutzman <[EMAIL PROTECTED]> wrote:
> I've got a simple script that looks like (watch the wrap):
> ---------------------------------------------------
> import BeautifulSoup,urllib
>
> ifile = urllib.urlopen("http://www.naco.faa.gov/digital_tpp_search.asp?fldId
> ent=klax&fld_ident_type=ICAO&ver=0711&bnSubmit=Complete+Search").read()
>
> soup=BeautifulSoup.BeautifulSoup(ifile)
> print soup.prettify()
> ----------------------------------------------------
>
> and all I get out of it is garbage.

Same for me.

> I did some poking and prodding and it seems that there is something in the
> <head> clause that is causing the problem.  Heck if I can see what it is.

The problem is this line:

  <META http-equiv="Content-Type" content="text/html; charset=UTF-16">

That declaration is wrong: the content is not UTF-16 encoded. The line after it declares the charset as utf-8, which is correct, although ASCII would be OK too.

If I save the search result and remove this line, everything works. So, you could:

- ignore problematic pages

- save and edit them, then reparse them (not always practical)

- use the fromEncoding argument:
  soup=BeautifulSoup.BeautifulSoup(ifile, fromEncoding="utf-8")
  (or 'ascii'). Of course this only works if you guess/predict the encoding correctly ;) Which can be difficult. Since BeautifulSoup uses "an encoding discovered in the document itself" (quote from <http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful Soup Gives You Unicode, Dammit>) when the encoding you supply does not work, using fromEncoding="ascii" should not hurt too much. But this being Usenet, I'm sure someone will tell me that I'm wrong and there is some weird 7-bit encoding in use somewhere on the web...

> I'm new to BeautifulSoup (heck, I'm new to Python).  If I'm doing something
> dumb, you don't need to be gentle.

No, you did nothing dumb. The server sent you broken content.

Ciao
  Marc
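
For the "guess the encoding" part of the third option, here is a rough sketch of one way to avoid hard-coding the guess. It is only a heuristic, not anything BeautifulSoup does itself: the helper names (declared_charsets, looks_like_utf16, pick_encoding) are made up for illustration, and the assumption is that real UTF-16 markup either starts with a byte-order mark or has NUL bytes near the start. It uses only the standard library and works on the raw bytes, before any parsing.

```python
import re

# Matches charset=... inside a <meta ...> tag (hypothetical helper pattern).
META_CHARSET = re.compile(br'<meta[^>]+charset\s*=\s*["\']?([\w-]+)',
                          re.IGNORECASE)

def declared_charsets(raw):
    """Return every charset named in a <meta> tag, in document order."""
    return [name.decode("ascii").lower() for name in META_CHARSET.findall(raw)]

def looks_like_utf16(raw):
    """Heuristic (an assumption): UTF-16 markup has a BOM or early NUL bytes."""
    return raw[:2] in (b"\xff\xfe", b"\xfe\xff") or b"\x00" in raw[:200]

def pick_encoding(raw, fallback="utf-8"):
    """Pick an encoding to pass on, skipping implausible UTF-16 declarations."""
    if looks_like_utf16(raw):
        return "utf-16"
    for name in declared_charsets(raw):
        if not name.startswith("utf-16"):
            return name
    return fallback

# A cut-down stand-in for the broken page: two conflicting declarations.
page = (b'<html><head>'
        b'<META http-equiv="Content-Type" content="text/html; charset=UTF-16">'
        b'<META http-equiv="Content-Type" content="text/html; charset=utf-8">'
        b'</head><body>KLAX</body></html>')

print(pick_encoding(page))  # prints utf-8
```

With the script from the original post, the result could then be handed over as fromEncoding=pick_encoding(ifile) instead of a fixed "utf-8". If the heuristic guesses wrong, you are no worse off than with a wrong hand-picked value.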