Let me describe our situation.
We have a lot of code that works with lxml.html.HtmlElement. Now we want to
support HTML5. html5lib is too slow for our requirements. Other libraries do
the parsing in C for best performance.
E.g.:
- https://github.com/kovidgoyal/html5-parser/blob/master/src/as-libxml.c (gumbo)
- http://source.netsurf-browser.org/libhubbub.git/tree/examples/libxml.c (libhubbub)
- https://github.com/SimonSapin/html5ever-python/blob/master/html5ever/elementtree.py (html5ever)
Converting the Python structures
(https://github.com/whalebot-helmsman/html5-parser/blob/lxml-html/src/html5_parser/lxml_html.py)
by copying attributes, text and tail from lxml.etree._Element to
lxml.html.HtmlElement is also slower than our current HTML4 code (~20%).
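
To make the copying approach concrete, here is a minimal sketch of what such a conversion looks like. It is not the code from the link above; the helper name copy_to_html_element is made up for illustration, and the sketch skips comments and processing instructions for brevity.

```python
from lxml import etree, html


def copy_to_html_element(src):
    """Recursively copy a plain etree element into a new lxml.html element.

    Hypothetical helper for illustration only. Comments and processing
    instructions are skipped (a simplification), so their tail text is lost.
    """
    # html.Element() builds the node with lxml.html's own parser, so the
    # resulting Python proxy is an HtmlElement.
    dst = html.Element(src.tag, dict(src.attrib))
    dst.text = src.text
    dst.tail = src.tail
    for child in src:
        if not isinstance(child.tag, str):
            continue  # comment or processing instruction
        dst.append(copy_to_html_element(child))
    return dst


# A tree parsed with the default parser yields plain etree._Element objects.
plain_root = etree.fromstring("<div class='a'>hello <b>world</b>!</div>")
html_root = copy_to_html_element(plain_root)
print(type(plain_root).__name__, "->", type(html_root).__name__)
```

Every node is visited in Python and rebuilt a second time, which is where the extra cost comes from.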
As I understand it, there is no difference between HTML and XML trees at the C
level; lxml.html.HtmlElement is a Python structure.
Is it possible to have lxml.html.HtmlElement on top of lxml.etree._Element
without copying (and the performance drop that comes with it)?
Maybe I am missing another possibility?
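
One possibility that might apply here is lxml's element class lookup. As far as I understand, the Python class of an element proxy is chosen by the lookup attached to the parser that owns the document, not stored in the C tree. The sketch below only shows this with lxml's own XML parser. Whether it carries over to trees built by an external C library is an assumption I have not tested; lxml.etree.adopt_external_document() accepts a parser argument, which might be the hook for that.

```python
from lxml import etree, html

# An ordinary XML parser, but with lxml.html's class lookup attached.
# lxml.html's own parsers are configured with the same lookup.
lookup_parser = etree.XMLParser()
lookup_parser.set_element_class_lookup(html.HtmlElementClassLookup())

# No copying step: the tree is built once in C, and the Python proxies
# created for its nodes are HtmlElement instances.
root = etree.fromstring("<div class='a'>hello <b>world</b>!</div>", lookup_parser)
print(type(root).__name__, type(root[0]).__name__)

# HtmlElement-specific methods are available directly.
print(root.text_content())
```

If this works for externally built documents as well, the conversion pass would not be needed at all.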