Duncan Booth wrote:
> John Nagle <[EMAIL PROTECTED]> wrote:
>
>>And this came out, via prettify:
>>
>><addresssnippet siteurl="http%3A//apartmentsapart.com"
>>url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
>> <param name="movie"
>> value="/images/offersBanners/sw04.swf?binfot=We offer
>>fantastic rates for selected weeks or days!!&blinkt=Click here
>>>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
>>
>>>>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>>
>></param>
>>
>>BeautifulSoup seems to have become confused by the ">>>" within
>>a quoted attribute value. It first parsed it right, but then stuck
>>in an extra, totally bogus line. Note the entity "&linkurl;", which
>>appears nowhere in the original. It looks like code to handle a
>>missing quote mark did the wrong thing.
>
> I don't think I would quibble with what BeautifulSoup extracted from that
> mess. The input isn't valid HTML so any output has to be guessing at what
> was meant. A lot of code for parsing html would assume that there was a
> quote missing and the tag was terminated by the first '>'. IE and Firefox
> seem to assume that the '>' is allowed inside the attribute. BeautifulSoup
> seems to have given you the best of both worlds: the attribute is parsed to
> the closing quote, but the tag itself ends at the first '>'.
>
> As for inserting a semicolon after linkurl, I think you'll find it is just
> being nice and cleaning up an unterminated entity. Browsers (or at least
> IE) will often accept entities without the terminating semicolon, so that's
> a common problem in badly formed html that BeautifulSoup can fix.

It's worse than that. Look at the last line of BeautifulSoup output:

    &linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />

That "/>" doesn't match anything. We're outside a tag at that point.
And it was introduced by BeautifulSoup. That's both wrong and puzzling;
given that this was created from a parse tree, that type of error
shouldn't ever happen. This looks like the parser didn't delete a
string item after deciding it was actually part of a tag.

				John Nagle
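
For anyone who wants to check serialized output for this kind of problem, here is a rough sketch of a scanner that reports any "/>" sitting in text rather than inside a tag. The function name find_stray_tag_closers is made up for illustration and is not part of BeautifulSoup. It assumes the browser-style reading in which a '>' inside a quoted attribute value does not end the tag, and it ignores comments, CDATA and script content.

```python
def find_stray_tag_closers(markup):
    """Return offsets of "/>" sequences that occur outside any tag.

    Hypothetical helper, not part of BeautifulSoup. Assumes that '>'
    inside a quoted attribute value does not terminate the tag.
    """
    stray = []
    in_tag = False
    quote = None  # the quote character we are currently inside, if any
    for i, ch in enumerate(markup):
        if not in_tag:
            if ch == "<":
                in_tag = True
            elif markup.startswith("/>", i):
                # A tag closer in plain text: nothing opened it.
                stray.append(i)
        elif quote is not None:
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == ">":
            in_tag = False
    return stray


# The prettify output quoted above, joined into one string.
output = (
    '<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer '
    'fantastic rates for selected weeks or days!!&blinkt=Click here '
    '>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">\n'
    '>>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />\n'
    '</param>'
)
for offset in find_stray_tag_closers(output):
    print(offset, repr(output[offset - 6:offset + 2]))
```

Run against the output above, this flags the single "/>" at the end of the extra line, which is the one that has no opening "<" to match.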