[EMAIL PROTECTED] wrote:
> Hi, I am looking for a HTML parser who can parse a given page into
> a DOM tree, and can reconstruct the exact original html sources.
> Strictly speaking, I should be allowed to retrieve the original
> sources at each internal nodes of the DOM tree.
> I have tried Beautiful Soup who is really nice when dealing with
> those god damned ill-formed documents, but it's a pity for me to find
> that this guy cannot retrieve original sources due to its great tidy
> job.
> Since Beautiful Soup, like most of the other HTML parsers in
> python, is a subclass of sgmllib.SGMLParser to some extent, I have
> investigated the source code of sgmllib.SGMLParser, see if there is
> anything I can do to tell Beautiful Soup where he can find every tag
> segment from HTML source, but this will be a time-consuming job.
> so... any ideas?
>

A while ago I had a similar need, but my solution may not solve your
problem. I wanted to rewrite the URLs in links, images, and so on
without modifying any of the rest of the HTML.

I created an HTML parser (based on sgmllib) that fires callbacks as it
encounters tags, attributes, etc. That makes it easy to process a
stream without 'damaging' the beautiful original structure of crap
HTML, but it doesn't provide a DOM.

http://www.voidspace.org.uk/python/recipebook.shtml#scraper

All the best,

Michael Foord
http://www.manning.com/foord

> cheers
> kai liu
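
For anyone who wants to try the callback idea described above, here is a minimal sketch. It is not the code from the scraper recipe: it assumes Python 3 and the standard library's html.parser (sgmllib is gone in Python 3), and the names URLRewriter, rewrite_urls and ATTR_RE are made up for illustration. The parser is only used to locate start tags; everything else is copied from the source string untouched, so ill-formed markup comes out exactly as it went in. Only quoted href/src values are handled, and the rewrite callback sees the raw attribute text (entities are not decoded).

```python
import re
from html.parser import HTMLParser

# Hypothetical helper: matches a quoted href="..." or src='...' inside one
# raw start tag. Unquoted values are deliberately left alone in this sketch.
ATTR_RE = re.compile(
    r'''(?<![\w-])(href|src)(\s*=\s*)(["'])(.*?)\3''',
    re.IGNORECASE | re.DOTALL,
)


class URLRewriter(HTMLParser):
    """Copy the source verbatim, rewriting URLs only inside start tags."""

    def __init__(self, source, rewrite):
        super().__init__()
        self.source = source
        self.rewrite = rewrite
        # getpos() reports (line, column) and counts only "\n" as a line
        # break, so record the absolute offset where each line begins.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.out = []
        self.copied = 0  # source[:copied] has already been emitted

    def _sub(self, match):
        name, eq, quote, value = match.groups()
        return name + eq + quote + self.rewrite(value) + quote

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()      # the tag exactly as written
        lineno, col = self.getpos()         # position where the tag starts
        start = self.line_starts[lineno - 1] + col
        self.out.append(self.source[self.copied:start])  # untouched text
        self.out.append(ATTR_RE.sub(self._sub, raw))     # edited tag
        self.copied = start + len(raw)


def rewrite_urls(source, rewrite):
    """Return source with rewrite() applied to each quoted href/src value."""
    parser = URLRewriter(source, rewrite)
    parser.feed(source)   # feed everything at once so offsets line up
    parser.close()
    parser.out.append(source[parser.copied:])  # whatever follows the last tag
    return "".join(parser.out)
```

Calling rewrite_urls(html, lambda u: u) should hand back the input unchanged, which is a quick way to check that nothing outside the URLs is being touched. Like the original approach, this gives you a stream with callbacks and not a DOM, although the same getpos() / get_starttag_text() offsets could in principle be recorded per tag if you need to map nodes back to their source text.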