Robert Kern wrote:
> Carl Banks wrote:
>>On Apr 4, 4:55 pm, Robert Kern <[EMAIL PROTECTED]> wrote:
>>>Carl Banks wrote:
>>>>On Apr 4, 2:43 pm, Robert Kern <[EMAIL PROTECTED]> wrote:
>>>>>Carl Banks wrote:
>>>>>>On Apr 4, 2:08 pm, John Nagle <[EMAIL PROTECTED]> wrote:
>>>>>>>BeautifulSoup can't parse this page usefully at all.
>>>>>>>It treats the entire page as a text chunk. It's actually
>>>>>>>HTMLParser that parses comments, so this is really an HTMLParser
>>>>>>>level problem.
>>I think the authors of BeautifulSoup have the right to decide what
>>their own mission is.
>
> Yes, and he's stated it pretty clearly:
>
> """You didn't write that awful page. You're just trying to get some data out of
> it. Right now, you don't really care what HTML is supposed to look like.
>
> Neither does this parser."""

That's a good summary of the issue. It's a real problem, because
BeautifulSoup's default behavior in the presence of a bad comment is to
silently suck up all remaining text, ignoring HTML markup. The problem
actually is in BeautifulSoup, in parse_declaration:

    def parse_declaration(self, i):
        """Treat a bogus SGML declaration as raw data. Treat a CDATA
        declaration as a CData object."""
        j = None
        if self.rawdata[i:i+9] == '<![CDATA[':
            k = self.rawdata.find(']]>', i)
            if k == -1:
                k = len(self.rawdata)
            data = self.rawdata[i+9:k]
            j = k + 3
            self._toStringSubclass(data, CData)
        else:
            try:
                j = SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
                toHandle = self.rawdata[i:]
                self.handle_data(toHandle)
                j = i + len(toHandle)
        return j

Note what happens when a bad declaration is found.
SGMLParser.parse_declaration raises SGMLParseError, and the exception
handler just sucks up the rest of the input (note that "rawdata[i:]"),
treats it as unparsed data, and advances the position to the end of
input. That's too brutal. One bad declaration and the whole parse is
messed up.

Something needs to be done at the BeautifulSoup level to get the parser
back on track. Maybe suck up input until the next ">", treat that as
data, then continue parsing from that point. That will do the right
thing most of the time, although bad declarations containing a ">" will
still be misparsed.

How about this patch?
            except SGMLParseError:            # bad decl, must recover
                k = self.rawdata.find('>', i) # find next ">"
                if k == -1:                   # if no ">" found,
                    k = len(self.rawdata)     #  use entire string
                else:
                    k += 1                    # include the ">" itself
                toHandle = self.rawdata[i:k]  # take through ">" as data
                self.handle_data(toHandle)    # treat as data
                j = i + len(toHandle)         # pick up parsing after ">"

This is untested, but this or something close to it should make
BeautifulSoup much more robust. It might make sense to catch
SGMLParseError at some other places, too, advance past the next ">",
and restart parsing.

					John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
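The recovery strategy can be sketched as a standalone helper, independent of sgmllib (which is Python 2 only): given the raw input and the index of a bad declaration, hand back the text to treat as data and the position at which to resume parsing. The function name here is hypothetical, not BeautifulSoup's API.

```python
def recover_bad_declaration(rawdata, i):
    """Skip a malformed declaration starting at rawdata[i].

    A minimal sketch of the recovery idea above: scan to the next ">",
    return (data, j) where data is the text to pass to handle_data()
    and j is the position at which normal parsing should resume.
    Declarations that themselves contain ">" will still be misparsed.
    """
    k = rawdata.find('>', i)   # find next ">"
    if k == -1:                # no ">" left: consume the rest
        k = len(rawdata)
    else:
        k += 1                 # include the ">" itself
    return rawdata[i:k], k

# A bad declaration followed by good markup: only the declaration
# is lost, and parsing can resume at the following tag.
html = '<!DOCTYPE junk junk><p>hello</p>'
data, j = recover_bad_declaration(html, 0)
print(repr(data))   # '<!DOCTYPE junk junk>'
print(html[j:])     # <p>hello</p>
```

With this shape, only the unparseable declaration is discarded, and the markup after it is still seen by the parser, instead of the whole tail of the document being swallowed as text.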