Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-22 Thread Tomi Belan via xml
Thanks, that's very useful!

With a dynamically linked build of lxml, I used "ltrace" to see the calls
to libxml2. It looks like you're correct: there is only one call to
htmlParseChunk with the whole content (followed by a zero-length call to
terminate the input). But even so I still wasn't able to reproduce it in
pure C. Could it be because xmllint reads ctxt->myDoc, and lxml uses SAX2
event handlers (according to parsertarget.pxi)? AFAICT xmllint's --push and
--sax options are incompatible.

I had more luck with git bisect. Using a dynamically linked build of lxml,
and pointing LD_LIBRARY_PATH to libxml2/.libs/, I successfully found out
that the bug was:
- introduced by
https://github.com/GNOME/libxml2/commit/6e6ae5daa6cd9640c9a83c1070896273e9b30d14
- fixed(?) by
https://github.com/GNOME/libxml2/commit/7a1bd7f6497ac33a9023d556f6f47a48f01deac0

I hope that's meaningful to you, because I have no idea what those
commits do or how they could be related to this bug. The commits sound
related to character encoding, but bad.html is plain ASCII.
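For anyone repeating the bisect, here is a rough sketch of a check script that could be handed to "git bisect run". It is not the test.py from the gist. It assumes the symptom shows up as a literal "</script" inside a script element's text, and that parsing through lxml's feed interface is enough to trigger it. The names has_swallowed_end_tag, bisect_exit_code and check_file are made up for illustration.

```python
from typing import Iterable

# Exit codes that `git bisect run` understands: 0 = good, 1 = bad, 125 = skip.
GOOD, BAD, SKIP = 0, 1, 125


def has_swallowed_end_tag(script_texts: Iterable[str]) -> bool:
    """Return True if any script text contains a literal end tag.

    A correct HTML parse ends the script element at the first "</script",
    so that string should never appear inside the element's text.
    """
    return any("</script" in text.lower() for text in script_texts)


def bisect_exit_code(script_texts: Iterable[str]) -> int:
    """Map the parse result to an exit code for `git bisect run`."""
    return BAD if has_swallowed_end_tag(script_texts) else GOOD


def check_file(path: str) -> int:
    """Parse `path` with lxml (assumed to be dynamically linked) and classify it."""
    try:
        from lxml import etree
    except ImportError:
        return SKIP  # cannot test this revision
    with open(path, "rb") as handle:
        data = handle.read()
    parser = etree.HTMLParser()
    parser.feed(data)  # one chunk, as the ltrace output suggests
    root = parser.close()
    return bisect_exit_code(el.text or "" for el in root.iter("script"))
```

A wrapper would call sys.exit(check_file("bad.html")) with LD_LIBRARY_PATH pointing at libxml2/.libs/ after each rebuild. To find the fixing commit instead of the breaking one, the meaning of GOOD and BAD has to be swapped.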

Tomi

On Tue, Jan 22, 2019 at 7:56 PM Nick Wellnhofer wrote:

> On 22/01/2019 19:11, Tomi Belan wrote:
> > I tried to reproduce it with only xmllint as you suggest, but I'm not
> > having much luck. It produces correct results with "--html --debug
> > bad.html", "--html --debug --stream bad.html", "--html --debug --push
> > bad.html", and "--html --debug --sax bad.html".
> >
> > Maybe I'm just not using the right flags - I don't know if lxml uses SAX
> > mode, or streaming, etc. But at this point I wouldn't be too surprised if
> > it depended on the size of some internal input buffer that's different in
> > lxml vs xmllint. I'd welcome any advice about what else I should try, or
> > how can I find out what calls are being made from lxml to libxml2.
>
> From a quick look at the lxml source, it seems that the `feed` method of
> HTMLParser calls htmlParseChunk, so you should pass `--html --push` to
> xmllint. But if it's a buffer boundary issue, you might have to recreate
> the exact chunk sizes to reproduce the problem. lxml seems to split into
> chunks of size INT_MAX, meaning a single chunk in most cases. xmllint
> first passes a chunk of 4 bytes, then splits the remaining data into
> chunks of 4096 bytes. But maybe I'm missing something. To be sure, you
> could run your Python code under a debugger like gdb and set a break
> point on htmlParseChunk. Also break on htmlCtxtUseOptions to see which
> parser options are used exactly.
>
> You could also start experimenting with feeding chunks of different sizes
> in your Python script or with a small C program that calls htmlParseChunk
> in the same way as lxml, presumably writing a single chunk. You could
> also try to add 4 bytes somewhere at the beginning of `bad.html` and see
> if it helps with reproducing the issue using xmllint.
>
> > Other than that: It's not ideal, but could you please check if you can
> > also reproduce the bug with the first set of commands I posted? Just to
> > verify it's not just me.
>
> Yes, I can try.
>
> Nick
>
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-22 Thread Nick Wellnhofer

On 22/01/2019 19:11, Tomi Belan wrote:
> I tried to reproduce it with only xmllint as you suggest, but I'm not having
> much luck. It produces correct results with "--html --debug bad.html", "--html
> --debug --stream bad.html", "--html --debug --push bad.html", and "--html
> --debug --sax bad.html".
>
> Maybe I'm just not using the right flags - I don't know if lxml uses SAX mode,
> or streaming, etc. But at this point I wouldn't be too surprised if it
> depended on the size of some internal input buffer that's different in lxml vs
> xmllint. I'd welcome any advice about what else I should try, or how can I
> find out what calls are being made from lxml to libxml2.


From a quick look at the lxml source, it seems that the `feed` method of 
HTMLParser calls htmlParseChunk, so you should pass `--html --push` to 
xmllint. But if it's a buffer boundary issue, you might have to recreate the 
exact chunk sizes to reproduce the problem. lxml seems to split into chunks of 
size INT_MAX, meaning a single chunk in most cases. xmllint first passes a 
chunk of 4 bytes, then splits the remaining data into chunks of 4096 bytes. 
But maybe I'm missing something. To be sure, you could run your Python code 
under a debugger like gdb and set a break point on htmlParseChunk. Also break 
on htmlCtxtUseOptions to see which parser options are used exactly.


You could also start experimenting with feeding chunks of different sizes in 
your Python script or with a small C program that calls htmlParseChunk in the 
same way as lxml, presumably writing a single chunk. You could also try to add 
4 bytes somewhere at the beginning of `bad.html` and see if it helps with 
reproducing the issue using xmllint.
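The two chunking patterns described above can be sketched in Python as follows. This is only an illustration of the sizes mentioned in this message (4 bytes, then 4096-byte chunks, versus one single chunk), not code taken from xmllint or lxml. The function names are made up, and `feed` stands for any callable that accepts bytes.

```python
from typing import Callable, Iterable, Iterator


def xmllint_style_chunks(data: bytes, first: int = 4, rest: int = 4096) -> Iterator[bytes]:
    """Split data into one small leading chunk followed by fixed-size chunks."""
    if not data:
        return
    yield data[:first]
    for start in range(first, len(data), rest):
        yield data[start:start + rest]


def single_chunk(data: bytes) -> Iterator[bytes]:
    """Yield the whole input at once, as lxml presumably does for small files."""
    if data:
        yield data


def feed_all(chunks: Iterable[bytes], feed: Callable[[bytes], object]) -> None:
    """Pass each chunk to the given feed callable, in order."""
    for chunk in chunks:
        feed(chunk)
```

With lxml, `feed` could be the `feed` method of an `lxml.etree.HTMLParser()` instance, followed by a call to its `close()` method to obtain the tree. Comparing the result for both chunkings on `bad.html` would show whether the chunk boundaries matter.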


> Other than that: It's not ideal, but could you please check if you can also
> reproduce the bug with the first set of commands I posted? Just to verify it's
> not just me.


Yes, I can try.

Nick


Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-22 Thread Tomi Belan via xml
I also built lxml 4.2.5 with pristine libxml2 2.9.8 (using a variation of
the above command), and got the same results. So I don't think it's a
distro specific problem.

I tried to reproduce it with only xmllint as you suggest, but I'm not
having much luck. It produces correct results with "--html --debug
bad.html", "--html --debug --stream bad.html", "--html --debug --push
bad.html", and "--html --debug --sax bad.html".

Maybe I'm just not using the right flags - I don't know if lxml uses SAX
mode, or streaming, etc. But at this point I wouldn't be too surprised if
it depended on the size of some internal input buffer that's different in
lxml vs xmllint. I'd welcome any advice about what else I should try, or
how can I find out what calls are being made from lxml to libxml2.

Other than that: It's not ideal, but could you please check if you can also
reproduce the bug with the first set of commands I posted? Just to verify
it's not just me.

Tomi

On Tue, Jan 22, 2019 at 5:11 PM Nick Wellnhofer wrote:

> On 22/01/2019 15:43, Tomi Belan via xml wrote:
> > After a lot of debugging, I determined the problem is in libxml2 and not
> > the other libraries in my stack, and that it only seems to happen on
> > version 2.9.8. But I don't see any related changes in news.html for
> > 2.9.9, nor in the diff between them, so I am still worried: I don't know
> > if the bug is really fixed, or just dormant. I hope you can find the root
> > cause, and maybe add a regression test if you do.
>
> I also don't see any directly related changes in either 2.9.8 or 2.9.9.
>
> > This will download the manylinux binary build of lxml 4.2.5, which is
> > statically linked to libxml2 2.9.8.
>
> Are you sure that a pristine 2.9.8 build was used? Maybe there are
> additional patches added by a distro?
>
> > I couldn't shorten the file very much, because if I delete even a single
> > character, the bug stops triggering. (Could it be some buffer boundary
> > issue?)
>
> Yes, a buffer boundary issue seems likely.
>
> > I also built my own lxml 4.2.5 with libxml2 2.9.9 and it was not
> > affected. So I believe this is a bug in libxml2 2.9.8 specifically, and
> > not in a particular version of lxml.
>
> Did you also try your own build with the official libxml2 2.9.8 sources?
>
> > I hope you can solve the mystery. Please let me know if I can be of any
> > help.
>
> It would help if you could reproduce the issue with xmllint and no Python
> code involved. git-bisect might also be useful.
>
> Nick
>


Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-22 Thread Nick Wellnhofer

On 22/01/2019 15:43, Tomi Belan via xml wrote:
> After a lot of debugging, I determined the problem is in libxml2 and not the
> other libraries in my stack, and that it only seems to happen on version
> 2.9.8. But I don't see any related changes in news.html for 2.9.9, nor in the
> diff between them, so I am still worried: I don't know if the bug is really
> fixed, or just dormant. I hope you can find the root cause, and maybe add a
> regression test if you do.

I also don't see any directly related changes in either 2.9.8 or 2.9.9.

> This will download the manylinux binary build of lxml 4.2.5, which is
> statically linked to libxml2 2.9.8.

Are you sure that a pristine 2.9.8 build was used? Maybe there are additional
patches added by a distro?

> I couldn't shorten the file very much, because if I delete even a single
> character, the bug stops triggering. (Could it be some buffer boundary issue?)

Yes, a buffer boundary issue seems likely.

> I also built my own lxml 4.2.5 with libxml2 2.9.9 and it was not affected. So
> I believe this is a bug in libxml2 2.9.8 specifically, and not in a particular
> version of lxml.

Did you also try your own build with the official libxml2 2.9.8 sources?

> I hope you can solve the mystery. Please let me know if I can be of any help.

It would help if you could reproduce the issue with xmllint and no Python code
involved. git-bisect might also be useful.


Nick


[xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-22 Thread Tomi Belan via xml
Hello, authors of libxml2.

I'm using libxml2 to parse HTML and it sometimes produces the wrong result.
In some weird circumstances, when the parser sees "</script>" it won't
close the script tag, but instead it will literally add "</script>" to the
text node and continue parsing the rest of the input verbatim as if it was
script content.

After a lot of debugging, I determined the problem is in libxml2 and not
the other libraries in my stack, and that it only seems to happen on
version 2.9.8. But I don't see any related changes in news.html for 2.9.9,
nor in the diff between them, so I am still worried: I don't know if the
bug is really fixed, or just dormant. I hope you can find the root cause,
and maybe add a regression test if you do.

## How to reproduce:

Test case:
https://gist.github.com/TomiBelan/c9949b6519d115500b742393da61b188

I've uploaded an example html file that exhibits this problem, and a Python
program to show it. It uses libxml2 via the lxml library. This will
download the manylinux binary build of lxml 4.2.5, which is statically
linked to libxml2 2.9.8.
$ virtualenv -p python3 ./venv
$ ./venv/bin/pip install --upgrade pip
$ ./venv/bin/pip install lxml==4.2.5
$ ./venv/bin/python test.py

I couldn't shorten the file very much, because if I delete even a single
character, the bug stops triggering. (Could it be some buffer boundary
issue?) Instead, I replaced most unimportant tags and text nodes with dummy
text.
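One way to replace content with dummy text without shifting any byte offsets is to mask characters in place. The sketch below is only a guess at that kind of reduction, not the procedure actually used for bad.html. It assumes ASCII input, so that one character equals one byte, and the function name is made up.

```python
def mask_preserving_length(text: str, filler: str = "x") -> str:
    """Replace every non-whitespace character with `filler`.

    Whitespace and overall length stay unchanged, so for ASCII input every
    later byte in the file keeps its original offset.
    """
    if len(filler) != 1:
        raise ValueError("filler must be a single character")
    return "".join(ch if ch.isspace() else filler for ch in text)
```

Applied to the text of an unimportant node, this keeps the file size and the position of every following tag the same, which matters if the bug depends on where a buffer boundary falls.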

## Affected versions:

- lxml 4.1.1, which contains libxml2 2.9.7, is not affected.
- lxml 4.2.5, which contains libxml2 2.9.8, IS affected.
- lxml 4.3.0, which contains libxml2 2.9.9, is not affected (at least for
this particular html file).
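
To check which of these cases applies to a given installation, the version list above can be encoded in a small lookup. This is just a sketch: the sets only record what is reported in this message for this one test file, and the function name is made up.

```python
from typing import Sequence

# libxml2 versions as reported in this message, for bad.html only.
REPORTED_AFFECTED = {(2, 9, 8)}
REPORTED_NOT_AFFECTED = {(2, 9, 7), (2, 9, 9)}


def classify_libxml2(version: Sequence[int]) -> str:
    """Classify a libxml2 version tuple according to the report above."""
    key = tuple(version[:3])
    if key in REPORTED_AFFECTED:
        return "affected"
    if key in REPORTED_NOT_AFFECTED:
        return "not affected"
    return "unknown"
```

With lxml installed, `lxml.etree.LIBXML_VERSION` gives the version tuple of the libxml2 in use at runtime and can be passed to this function directly.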

I also built my own lxml 4.2.5 with libxml2 2.9.9 and it was not affected.
So I believe this is a bug in libxml2 2.9.8 specifically, and not in a
particular version of lxml. I used this command:

$ STATIC_DEPS=true LDFLAGS='-flto -fPIC' CFLAGS='-O3 -g1 -pipe -fPIC -flto' \
    LIBXML2_VERSION=2.9.9 ./venv/bin/pip install --no-binary :all: -vvv \
    lxml==4.2.5

My particular test case doesn't trigger the bug in 2.9.9, but I don't know
if that's because the bug is really fixed, or some constants/offsets have
changed and now it triggers on other html files.

I hope you can solve the mystery. Please let me know if I can be of any
help. And thanks for reading!

Tomi