Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-22 Thread Tomi Belan via xml
Thanks, that's very useful! With a dynamically linked build of lxml, I used "ltrace" to see the calls to libxml2. Looks like you're correct: there is only one call to htmlParseChunk with the whole content (followed by a zero-length call to terminate the input). But even so, I still wasn't able to re

Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-22 Thread Nick Wellnhofer
On 22/01/2019 19:11, Tomi Belan wrote: I tried to reproduce it with only xmllint as you suggest, but I'm not having much luck. It produces correct results with "--html --debug bad.html", "--html --debug --stream bad.html", "--html --debug --push bad.html", and "--html --debug --sax bad.html".
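
For anyone who wants to repeat the four xmllint runs quoted above in one go, here is a rough sketch. It only uses the flags named in the message ("--html", "--debug", "--stream", "--push", "--sax") and the file name "bad.html" from the same message. The names MODES, build_commands and run_all are made up for this illustration and are not part of libxml2 or lxml. It assumes xmllint is on PATH and returns nothing if it is not.

```python
import shutil
import subprocess

# Extra flags for each of the four runs quoted in the message above.
# The first run uses no extra flag beyond "--html --debug".
MODES = [
    [],
    ["--stream"],
    ["--push"],
    ["--sax"],
]


def build_commands(path, xmllint="xmllint"):
    """Return one argv list per run for the given HTML file (hypothetical helper)."""
    return [[xmllint, "--html", "--debug", *mode, path] for mode in MODES]


def run_all(path):
    """Run every command and collect (argv, exit code, stdout) tuples.

    Returns an empty list when xmllint is not found on PATH.
    """
    exe = shutil.which("xmllint")
    if exe is None:
        return []
    results = []
    for argv in build_commands(path, exe):
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append((argv, proc.returncode, proc.stdout))
    return results


if __name__ == "__main__":
    for argv, code, out in run_all("bad.html"):
        print(" ".join(argv), "-> exit", code, "|", len(out), "chars of output")
```

Comparing the collected outputs side by side makes it easier to see whether any single mode treats the script element differently from the others.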

Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-22 Thread Tomi Belan via xml
I also built lxml 4.2.5 with pristine libxml2 2.9.8 (using a variation of the above command), and got the same results. So I don't think it's a distro specific problem. I tried to reproduce it with only xmllint as you suggest, but I'm not having much luck. It produces correct results with "--html

Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-22 Thread Nick Wellnhofer
On 22/01/2019 15:43, Tomi Belan via xml wrote: After a lot of debugging, I determined the problem is in libxml2 and not the other libraries in my stack, and that it only seems to happen on version 2.9.8. But I don't see any related changes in news.html for 2.9.9, nor in the diff between them, s

[xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-22 Thread Tomi Belan via xml
Hello, authors of libxml2. I'm using libxml2 to parse HTML and it sometimes produces the wrong result. In some weird circumstances, when the parser sees "</script>" it won't close the script tag, but instead it will literally add "</script>" to the text node and continue parsing the rest of the input verbatim as if