Hello, authors of libxml2.

I'm using libxml2 to parse HTML, and it sometimes produces the wrong result.
Under some unusual circumstances, when the parser sees "</script>" it does
not close the script element. Instead, it appends the literal text
"</script>" to the text node and treats the rest of the input verbatim, as
if it were script content.

After a lot of debugging, I determined the problem is in libxml2 and not
the other libraries in my stack, and that it only seems to happen on
version 2.9.8. But I don't see any related changes in news.html for 2.9.9,
nor in the diff between them, so I am still worried: I don't know if the
bug is really fixed, or just dormant. I hope you can find the root cause,
and maybe add a regression test if you do.

## How to reproduce:

Test case:
https://gist.github.com/TomiBelan/c9949b6519d115500b742393da61b188

I've uploaded an example HTML file that exhibits this problem, and a Python
program (test.py) that demonstrates it. The program uses libxml2 via the
lxml library. The commands below download the manylinux binary build of
lxml 4.2.5, which is statically linked to libxml2 2.9.8, and run the test:

$ virtualenv -p python3 ./venv
$ ./venv/bin/pip install --upgrade pip
$ ./venv/bin/pip install lxml==4.2.5
$ ./venv/bin/python test.py
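
For readers who don't want to open the gist, here is a minimal sketch of the kind of check involved. It is not the test.py from the gist. The helper names `leaked_script_elements` and `check_file` are made up for illustration, and the check assumes that a correctly parsed script element should not contain a literal "</script>" in its own text. The first helper only needs the ElementTree-style `iter()` and `.text` API, so it works on lxml trees and on stdlib trees alike.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; lxml elements offer the same iter()/.text API


def leaked_script_elements(root):
    """Return <script> elements whose text contains a literal closing tag.

    Assumption: in a correctly parsed tree, "</script>" ends the element,
    so it should not show up inside the element's own text.
    """
    return [
        el for el in root.iter("script")
        if el.text and "</script>" in el.text.lower()
    ]


def check_file(path):
    """Parse an HTML file with lxml and return the suspicious script elements."""
    from lxml import html  # imported lazily so the helper above works without lxml
    root = html.parse(path).getroot()
    return leaked_script_elements(root)


# Hand-built tree imitating the bad result: the closing tag ended up as text.
bad = ET.Element("html")
bad_script = ET.SubElement(bad, "script")
bad_script.text = "var x = 1;</script><p>rest of page</p>"

# Hand-built tree imitating the expected result.
good = ET.Element("html")
good_script = ET.SubElement(good, "script")
good_script.text = "var x = 1;"

print(len(leaked_script_elements(bad)), len(leaked_script_elements(good)))
```

With lxml installed, `check_file("example.html")` (the file name is a placeholder) should return an empty list on an unaffected build.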

I couldn't shorten the file very much, because if I delete even a single
character, the bug stops triggering. (Could it be some buffer boundary
issue?) Instead, I replaced most unimportant tags and text nodes with dummy
text.
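
One way to probe the buffer-boundary guess would be to shift the script element by a few bytes at a time and record which offsets still trigger the bug. The sketch below is only an illustration of that idea, not something I have run against libxml2. `sweep_padding` and the `is_broken` callback are hypothetical names, and the callback is expected to parse the candidate bytes and return True when the result is wrong.

```python
def sweep_padding(document, marker, is_broken, max_pad=64):
    """Insert 0..max_pad spaces before `marker` and collect the pad lengths
    for which `is_broken(candidate)` returns True.

    `document` and `marker` are bytes; `is_broken` is supplied by the caller.
    """
    head, sep, tail = document.partition(marker)
    if not sep:
        raise ValueError("marker not found in document")
    failing = []
    for pad in range(max_pad + 1):
        # Extra whitespace shifts everything after it by `pad` bytes.
        candidate = head + b" " * pad + marker + tail
        if is_broken(candidate):
            failing.append(pad)
    return failing


# Stand-in for a real parser check: pretends the bug fires whenever
# "</script>" starts at a multiple of 16 bytes. Purely illustrative.
def fake_is_broken(candidate):
    return candidate.find(b"</script>") % 16 == 0


doc = b"<html><script>x</script></html>"
print(sweep_padding(doc, b"<script", fake_is_broken, max_pad=40))
```

If the failing pad lengths from a real parser recurred at a regular stride, that would be consistent with a chunk or buffer boundary being involved, though it would not prove it. `max_pad` may need to be much larger than 64 to see a second hit.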

## Affected versions:

- lxml 4.1.1, which contains libxml2 2.9.7, is not affected.
- lxml 4.2.5, which contains libxml2 2.9.8, IS affected.
- lxml 4.3.0, which contains libxml2 2.9.9, is not affected (at least for
  this particular html file).

I also built my own lxml 4.2.5 with libxml2 2.9.9 and it was not affected.
So I believe this is a bug in libxml2 2.9.8 specifically, and not in a
particular version of lxml. I used this command:

$ STATIC_DEPS=true LDFLAGS='-flto -fPIC' CFLAGS='-O3 -g1 -pipe -fPIC -flto' \
    LIBXML2_VERSION=2.9.9 ./venv/bin/pip install --no-binary :all: -vvv \
    lxml==4.2.5
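
To double-check which libxml2 a given lxml build is actually using, lxml exposes version tuples on `lxml.etree`. A small sketch follows; `format_version` and `libxml2_versions` are illustrative helper names, while `LXML_VERSION`, `LIBXML_VERSION` and `LIBXML_COMPILED_VERSION` are the attributes lxml provides.

```python
def format_version(version_tuple):
    """Turn a tuple such as (2, 9, 8) into the string "2.9.8"."""
    return ".".join(str(part) for part in version_tuple)


def libxml2_versions():
    """Report the lxml version and the libxml2 versions it runs with
    and was compiled against. Requires lxml to be installed."""
    from lxml import etree  # imported lazily so format_version works without lxml
    return {
        "lxml": format_version(etree.LXML_VERSION),
        "libxml2 (running)": format_version(etree.LIBXML_VERSION),
        "libxml2 (compiled against)": format_version(etree.LIBXML_COMPILED_VERSION),
    }
```

Running `print(libxml2_versions())` inside the virtualenv should show whether the custom build really picked up 2.9.9.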

My particular test case doesn't trigger the bug in 2.9.9, but I don't know
if that's because the bug is really fixed, or some constants/offsets have
changed and now it triggers on other html files.

I hope you can solve the mystery. Please let me know if I can be of any
help. And thanks for reading!

Tomi
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
