Re: Get html DOM tree by only basic builtin modules

Ian Kelly Fri, 05 Jun 2015 12:52:11 -0700

On Fri, Jun 5, 2015 at 12:10 PM, Wesley <nisp...@gmail.com> wrote:
> Hi Laura,
>   Sure. I have a special requirement: parse an HTML file into a DOM tree
> using only the basic built-in modules, and then draw a bitmap based on
> that DOM tree structure.
>
>   So, could you give me a direction on how to get the DOM tree?
> Currently, my only idea is to use something like a stack: read the file
> line by line, push onto a stack data structure (a list, for example),
> and derive the parent/child relations from that.
>
> I don't know if what I described is easy to achieve; I am just trying.
> Any better suggestions would be greatly appreciated.


If you want to recreate the same DOM structure that would be created
by a browser, the standardized algorithm to do so is very complicated,
but you can find it at
http://www.w3.org/TR/2011/WD-html5-20110113/parsing.html.

If you're not necessarily seeking perfect fidelity, I would encourage
you to try to find some way to incorporate BeautifulSoup into your
project. It likely won't produce the same structure that a real
browser would, but it should do well enough to scrape data from even
badly malformed HTML.
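
If a third-party package really is off the table, the stack idea from
the question can be sketched with the standard library's html.parser
module. This is only a rough illustration, not the HTML5 tree-building
algorithm: the Node class, the TreeBuilder class, the parse() helper and
the VOID_TAGS set are all made up for this example, and it does not try
to handle implied end tags, comments, or badly nested markup the way a
browser would.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (the "void" elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal DOM node: tag name, attributes, children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # child Node objects or plain text strings


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects, tracking open elements on a stack."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)  # stays open until its end tag

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this name.
        # Stray end tags that match nothing are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.stack[-1].children.append(text)


def parse(html):
    """Parse an HTML string and return the synthetic '#document' root."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Calling parse() on a document returns the root node, and walking
node.children recursively gives the parent/child relations. Treating
<br> and the other void elements as childless is what keeps the stack
from getting out of step on ordinary HTML.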

I recommend against using an XML parser, because HTML isn't XML, and
such a parser may choke even on perfectly valid HTML such as this:

<!DOCTYPE html>
<html>
  <head><title>Document</title></head>
  <body>
    First line
    <br>
    Second line
  </body>
</html>
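
To see the failure concretely, here is a small check that feeds the
document above to the standard library's xml.etree.ElementTree. The
xml_parse_error() helper and the DOC constant are just names invented
for this illustration.

```python
import xml.etree.ElementTree as ET

# The valid HTML document from the message above.
DOC = """<!DOCTYPE html>
<html>
  <head><title>Document</title></head>
  <body>
    First line
    <br>
    Second line
  </body>
</html>"""


def xml_parse_error(text):
    """Return the XML parser's error message, or None if text is well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# The unclosed <br> is fine in HTML but is a well-formedness error in XML.
print(xml_parse_error(DOC))
```

The call reports a parse error because, as far as XML is concerned, the
<br> element is never closed. Rewriting it as <br/> makes the same
document parse, which is exactly the kind of difference that makes an
XML parser a poor fit for real-world HTML.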
