Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8
On Wed, Jan 23, 2019 at 12:55 PM Nick Wellnhofer wrote:

> The commit obviously also affected documents that didn't need encoding conversion. I didn't realize that.

Aha! I noticed that the chromium link you sent mentions a >32KB string which gets converted to a >64KB string, which sounded suspiciously similar. Looks like lxml's feed() function [1] is doing the same thing. I don't know too much about Python's C API, but [2] [3] suggests lxml is using a deprecated macro and giving libxml2 a multibyte buffer even though the input would fit into pure ASCII. This explains why it behaved differently than xmllint.

[1] https://github.com/lxml/lxml/blob/master/src/lxml/parser.pxi#L1242
[2] https://stackoverflow.com/questions/26079392/how-is-unicode-represented-internally-in-python
[3] https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_AS_DATA

I also noticed that feed() is doing something special with the first 4 bytes, giving them to _htmlCtxtResetPush() instead of htmlParseChunk(). So the discussion about buffer boundaries might be slightly incorrect.

> At least we know that the issue is isolated to 2.9.8. Thanks for your efforts!

Yes, thank you. Now it's clear that my immediate issue is solved and version 2.9.9 works. So I probably won't look into this much further.

I guess it's up to you to decide what to do next, and if any libxml2 changes are needed. It would be good to add some tests to decrease the likelihood that this issue or something similar happens again. For that, you might still need to isolate the root cause further, and create a pure C test case. (Maybe based on a test case from chromium instead of mine.) But of course it's up to you to determine the priority of that.

Thanks again for your help, and good luck if you decide to continue.

Tomi

___
xml mailing list, project page http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
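The size doubling mentioned in the message above (a >32KB string becoming a >64KB buffer) can be illustrated with a few lines of plain Python. This is only a sketch of the arithmetic: the helper name `encoded_sizes` is made up for illustration, and which fixed-width representation lxml actually passes to libxml2 is an assumption, not something the thread confirms.

```python
# Illustrative only: how an ASCII-only string grows when it is stored in a
# fixed-width multibyte encoding. Which width lxml really hands to libxml2
# is an assumption here, not something established in this thread.

def encoded_sizes(text):
    """Return the byte length of `text` in a few encodings (no BOM)."""
    return {
        "ascii": len(text.encode("ascii")),
        "utf-16-le": len(text.encode("utf-16-le")),
        "utf-32-le": len(text.encode("utf-32-le")),
    }

sample = "a" * (32 * 1024 + 1)  # just over 32 KB of pure ASCII
sizes = encoded_sizes(sample)
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

If the parser's behaviour depends on where internal buffer boundaries fall, the same ASCII document could hit different boundaries depending on which of these representations it arrives in, which would be consistent with xmllint and lxml disagreeing.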
Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8
Thanks, that's very useful!

With a dynamically linked build of lxml, I used "ltrace" to see the calls to libxml2. Looks like you're correct: there is only one call to htmlParseChunk with the whole content (followed by a zero-length call to terminate the input). But even so I still wasn't able to reproduce it in pure C. Could it be because xmllint reads ctxt->myDoc, and lxml uses SAX2 event handlers (according to parsertarget.pxi)? AFAICT xmllint's --push and --sax options are incompatible.

I had more luck with git bisect. Using a dynamically linked build of lxml, and pointing LD_LIBRARY_PATH to libxml2/.libs/, I successfully found out that the bug was:

- introduced by https://github.com/GNOME/libxml2/commit/6e6ae5daa6cd9640c9a83c1070896273e9b30d14
- fixed(?) by https://github.com/GNOME/libxml2/commit/7a1bd7f6497ac33a9023d556f6f47a48f01deac0

I hope that's meaningful to you, because I have no idea what those commits are doing or how they could be related to this bug... The commits sound related to character encoding, but bad.html is plain ASCII...

Tomi

On Tue, Jan 22, 2019 at 7:56 PM Nick Wellnhofer wrote:

> On 22/01/2019 19:11, Tomi Belan wrote:
> > I tried to reproduce it with only xmllint as you suggest, but I'm not having much luck. It produces correct results with "--html --debug bad.html", "--html --debug --stream bad.html", "--html --debug --push bad.html", and "--html --debug --sax bad.html".
> >
> > Maybe I'm just not using the right flags - I don't know if lxml uses SAX mode, or streaming, etc. But at this point I wouldn't be too surprised if it depended on the size of some internal input buffer that's different in lxml vs xmllint. I'd welcome any advice about what else I should try, or how can I find out what calls are being made from lxml to libxml2.
>
> From a quick look at the lxml source, it seems that the `feed` method of HTMLParser calls htmlParseChunk, so you should pass `--html --push` to xmllint. But if it's a buffer boundary issue, you might have to recreate the exact chunk sizes to reproduce the problem. lxml seems to split into chunks of size INT_MAX, meaning a single chunk in most cases. xmllint first passes a chunk of 4 bytes, then splits the remaining data into chunks of 4096 bytes. But maybe I'm missing something. To be sure, you could run your Python code under a debugger like gdb and set a break point on htmlParseChunk. Also break on htmlCtxtUseOptions to see which parser options are used exactly.
>
> You could also start experimenting with feeding chunks of different sizes in your Python script or with a small C program that calls htmlParseChunk in the same way as lxml, presumably writing a single chunk. You could also try to add 4 bytes somewhere at the beginning of `bad.html` and see if it helps with reproducing the issue using xmllint.
>
> > Other than that: It's not ideal, but could you please check if you can also reproduce the bug with the first set of commands I posted? Just to verify it's not just me.
>
> Yes, I can try.
>
> Nick
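The chunk-size experiment suggested in the quoted reply could be sketched in Python roughly as follows. The helper names (`xmllint_style_chunks`, `single_chunk`, `feed_in_chunks`) are hypothetical, and the 4-byte and 4096-byte sizes are taken only from the description in this thread, not checked against the xmllint source.

```python
# A minimal sketch for trying different chunkings against a feed-style parser.
# The chunk sizes mirror what the thread says xmllint --push does; treat them
# as assumptions rather than verified constants.

def xmllint_style_chunks(data, first=4, size=4096):
    """Yield the first `first` bytes, then the rest in `size`-byte pieces."""
    if data[:first]:
        yield data[:first]
    for start in range(first, len(data), size):
        yield data[start:start + size]

def single_chunk(data):
    """Yield the whole input at once, as lxml is described as doing."""
    if data:
        yield data

def feed_in_chunks(parser, chunks):
    """Feed chunks to any object that has feed() and close() methods."""
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()

# Show the chunk sizes produced for a 10000-byte input.
demo = b"x" * 10000
print([len(c) for c in xmllint_style_chunks(demo)])
print([len(c) for c in single_chunk(demo)])
```

With lxml installed, `feed_in_chunks(lxml.etree.HTMLParser(), xmllint_style_chunks(data))` and the same call with `single_chunk(data)` could then be compared on the bytes of `bad.html`. Whether that isolates the bug is not established here.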
Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8
I also built lxml 4.2.5 with pristine libxml2 2.9.8 (using a variation of the above command), and got the same results. So I don't think it's a distro specific problem.

I tried to reproduce it with only xmllint as you suggest, but I'm not having much luck. It produces correct results with "--html --debug bad.html", "--html --debug --stream bad.html", "--html --debug --push bad.html", and "--html --debug --sax bad.html".

Maybe I'm just not using the right flags - I don't know if lxml uses SAX mode, or streaming, etc. But at this point I wouldn't be too surprised if it depended on the size of some internal input buffer that's different in lxml vs xmllint. I'd welcome any advice about what else I should try, or how can I find out what calls are being made from lxml to libxml2.

Other than that: It's not ideal, but could you please check if you can also reproduce the bug with the first set of commands I posted? Just to verify it's not just me.

Tomi

On Tue, Jan 22, 2019 at 5:11 PM Nick Wellnhofer wrote:

> On 22/01/2019 15:43, Tomi Belan via xml wrote:
> > After a lot of debugging, I determined the problem is in libxml2 and not the other libraries in my stack, and that it only seems to happen on version 2.9.8. But I don't see any related changes in news.html for 2.9.9, nor in the diff between them, so I am still worried: I don't know if the bug is really fixed, or just dormant. I hope you can find the root cause, and maybe add a regression test if you do.
>
> I also don't see any directly related changes in either 2.9.8 or 2.9.9.
>
> > This will download the manylinux binary build of lxml 4.2.5, which is statically linked to libxml2 2.9.8.
>
> Are you sure that a pristine 2.9.8 build was used? Maybe there are additional patches added by a distro?
>
> > I couldn't shorten the file very much, because if I delete even a single character, the bug stops triggering. (Could it be some buffer boundary issue?)
>
> Yes, a buffer boundary issue seems likely.
>
> > I also built my own lxml 4.2.5 with libxml2 2.9.9 and it was not affected. So I believe this is a bug in libxml2 2.9.8 specifically, and not in a particular version of lxml.
>
> Did you also try your own build with the official libxml2 2.9.8 sources?
>
> > I hope you can solve the mystery. Please let me know if I can be of any help.
>
> It would help if you could reproduce the issue with xmllint and no Python code involved. git-bisect might also be useful.
>
> Nick
[xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8
Hello, authors of libxml2.

I'm using libxml2 to parse HTML and it sometimes produces the wrong result. In some weird circumstances, when the parser sees "</script>" it won't close the script tag, but instead it will literally add "</script>" to the text node and continue parsing the rest of the input verbatim as if it was script content.

After a lot of debugging, I determined the problem is in libxml2 and not the other libraries in my stack, and that it only seems to happen on version 2.9.8. But I don't see any related changes in news.html for 2.9.9, nor in the diff between them, so I am still worried: I don't know if the bug is really fixed, or just dormant. I hope you can find the root cause, and maybe add a regression test if you do.

## How to reproduce:

Test case: https://gist.github.com/TomiBelan/c9949b6519d115500b742393da61b188

I've uploaded an example html file that exhibits this problem, and a Python program to show it. It uses libxml2 via the lxml library. This will download the manylinux binary build of lxml 4.2.5, which is statically linked to libxml2 2.9.8.

    $ virtualenv -p python3 ./venv
    $ ./venv/bin/pip install --upgrade pip
    $ ./venv/bin/pip install lxml==4.2.5
    $ ./venv/bin/python test.py

I couldn't shorten the file very much, because if I delete even a single character, the bug stops triggering. (Could it be some buffer boundary issue?) Instead, I replaced most unimportant tags and text nodes with dummy text.

## Affected versions:

- lxml 4.1.1, which contains libxml2 2.9.7, is not affected.
- lxml 4.2.5, which contains libxml2 2.9.8, IS affected.
- lxml 4.3.0, which contains libxml2 2.9.9, is not affected (at least for this particular html file).

I also built my own lxml 4.2.5 with libxml2 2.9.9 and it was not affected. So I believe this is a bug in libxml2 2.9.8 specifically, and not in a particular version of lxml. I used this command:

    STATIC_DEPS=true LDFLAGS='-flto -fPIC' CFLAGS='-O3 -g1 -pipe -fPIC -flto' LIBXML2_VERSION=2.9.9 ./venv/bin/pip install --no-binary :all: -vvv lxml==4.2.5

My particular test case doesn't trigger the bug in 2.9.9, but I don't know if that's because the bug is really fixed, or some constants/offsets have changed and now it triggers on other html files.

I hope you can solve the mystery. Please let me know if I can be of any help. And thanks for reading!

Tomi
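One way to check a parsed document for the symptom described above is to look for a literal closing script tag inside the text of a script element. The sketch below is not the test.py from the gist, which may work differently; the function name `scripts_with_swallowed_close_tag` is hypothetical, and it works on plain strings so it can be tried without lxml.

```python
# A minimal symptom check, assuming the bug shows up as a literal "</script"
# inside the text content of a <script> element. This is an illustration,
# not the reporter's actual test program.

def scripts_with_swallowed_close_tag(script_texts):
    """Return the script texts that contain a literal closing script tag."""
    return [
        text
        for text in script_texts
        if text and "</script" in text.lower()
    ]

# Text of two script elements as a correct parse might produce them ...
good = ["var a = 1;", None, "console.log('ok');"]
# ... and one where the rest of the page was swallowed into the script.
bad = ["var a = 1;</script><p>rest of page</p>"]

print(len(scripts_with_swallowed_close_tag(good)))
print(len(scripts_with_swallowed_close_tag(bad)))
```

With lxml, the input list could be built as `[el.text for el in root.iter("script")]`, where `root` is the parsed document's root element.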