On Fri, Jun 5, 2015 at 12:10 PM, Wesley <nisp...@gmail.com> wrote:
> Hi Laura,
> Sure, I got special requirement that just parse html file into DOM tree, by
> only general basic modules, and based on my DOM tree structure, draft an
> bitmap.
>
> So, could you give me an direction how to get the DOM tree?
> Currently, I just think out to use something like stack, I mean, maybe read
> the file line by line, adding to a stack data structure(list for example),
> and, then, got the parent/child relation .etc
>
> I don't know if what I said is easy to achieve, I am just trying.
> Any better suggestions will be great appreciated.

If you want to recreate the same DOM structure that a browser would create, the standardized algorithm for doing so is very complicated, but you can find it at http://www.w3.org/TR/2011/WD-html5-20110113/parsing.html.

If you're not necessarily seeking perfect fidelity, I would encourage you to find some way to incorporate BeautifulSoup into your project. It likely won't produce the same structure that a real browser would, but it should do well enough to scrape data from even badly malformed HTML.

I recommend against using an XML parser, because HTML isn't XML, and such a parser may choke even on perfectly valid HTML such as this:

    <!DOCTYPE html>
    <html>
    <head><title>Document</title></head>
    <body>
    First line
    <br>
    Second line
    </body>
    </html>

Here the <br> element is never closed, which HTML allows and XML does not.
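
If you really are limited to the standard library, here is a rough sketch of the stack idea you described, built on html.parser from Python 3. To be clear, this is not the HTML5 algorithm linked above: it does no error recovery beyond ignoring stray end tags, and it does not handle implied end tags (an unclosed <p>, for example). The names Node, TreeBuilder, parse_html and xml_parser_accepts are made up for this example. It also checks that xml.etree.ElementTree rejects the document above.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SAMPLE = """<!DOCTYPE html>
<html>
<head><title>Document</title></head>
<body>
First line
<br>
Second line
</body>
</html>"""

# HTML "void" elements: they never take an end tag, so they must not
# be pushed onto the stack of open elements.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal tree node: a tag, attributes, children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # Node instances or plain text strings


class TreeBuilder(HTMLParser):
    """Builds a tree by keeping a stack of currently open elements."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element.
        # An end tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.stack[-1].children.append(text)


def parse_html(text):
    builder = TreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root


def xml_parser_accepts(text):
    """Return True if the XML parser can parse the text at all."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

Calling parse_html(SAMPLE) gives a root node whose only child is the html element, and xml_parser_accepts(SAMPLE) returns False because of the unclosed <br>. Note that feeding the file to the parser a line at a time works too, but the parent/child relation comes from the tags, not from the line structure.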