Hi,

I am looking for an HTML parser that can parse a given page into a DOM tree and can also reconstruct the exact original HTML source. Strictly speaking, I need to be able to retrieve the original source text at each internal node of the DOM tree.

I have tried Beautiful Soup, which is really nice when dealing with those god-damned ill-formed documents, but unfortunately it cannot give back the original source, precisely because it does such a thorough job of tidying the markup.

Since Beautiful Soup, like most of the other HTML parsers in Python, is to some extent a subclass of sgmllib.SGMLParser, I have looked through the source code of sgmllib.SGMLParser to see whether I could make Beautiful Soup record where each tag sits in the HTML source. That looks like a time-consuming job, though.

So... any ideas?
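To make the requirement concrete, here is a rough sketch of the behaviour I am after. It is neither Beautiful Soup nor sgmllib: it assumes the standard-library `html.parser.HTMLParser` of current Python, relying on its `getpos()` and `get_starttag_text()` methods. The names `Node`, `SourceTrackingParser` and `parse_with_source` are made up for illustration. It only tracks elements (not text nodes), and it assumes every end tag finishes at the next `>`, so treat it as a starting point rather than a solution.

```python
from html.parser import HTMLParser

# Elements with no closing tag; they become leaf nodes immediately.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical element node that remembers its span in the original text."""

    def __init__(self, tag, start, html):
        self.tag = tag
        self.start = start   # offset of the '<' that opens the element
        self.end = None      # offset just past the element, set when it closes
        self.children = []
        self._html = html

    @property
    def source(self):
        """The untouched original text covered by this element."""
        return self._html[self.start:self.end]


class SourceTrackingParser(HTMLParser):
    def __init__(self, html):
        super().__init__(convert_charrefs=False)
        self.html = html
        # getpos() reports (line, column) and counts only '\n' as a line
        # break, so precompute the absolute offset where each line starts.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(html) if ch == "\n"]
        self.root = Node("#document", 0, html)
        self.stack = [self.root]

    def _offset(self):
        line, col = self.getpos()
        return self.line_starts[line - 1] + col

    def _open(self, tag):
        node = Node(tag, self._offset(), self.html)
        self.stack[-1].children.append(node)
        return node

    def handle_starttag(self, tag, attrs):
        node = self._open(tag)
        if tag in VOID_TAGS:
            node.end = node.start + len(self.get_starttag_text())
        else:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: a leaf spanning just the tag.
        node = self._open(tag)
        node.end = node.start + len(self.get_starttag_text())

    def handle_endtag(self, tag):
        start = self._offset()
        # Find the innermost open element with this name (never the root).
        depth = next((i for i in range(len(self.stack) - 1, 0, -1)
                      if self.stack[i].tag == tag), None)
        if depth is None:
            return  # stray end tag: ignore it
        # Elements left open inside it end where this end tag begins.
        while len(self.stack) - 1 > depth:
            self.stack.pop().end = start
        # Simplifying assumption: the end tag finishes at the next '>'.
        close = self.html.find(">", start)
        self.stack.pop().end = close + 1 if close != -1 else len(self.html)


def parse_with_source(html):
    parser = SourceTrackingParser(html)
    parser.feed(html)
    parser.close()
    # Anything still open at end of input runs to the end of the document.
    for node in parser.stack:
        node.end = len(html)
    return parser.root
```

With something like this, `parse_with_source(page).children[0].source` would give back the first top-level element exactly as it appeared in the page, whitespace and sloppy markup included. It does no repair of ill-formed input (for example, an unclosed `<li>` simply swallows the following siblings), which is exactly the part Beautiful Soup does well.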
Cheers,
Kai Liu