Re: [xml] Patch to fix ICU flush and pivot buffer
Yes, I will update chromium with this as per https://cs.chromium.org/chromium/src/third_party/libxml/chromium/roll.py On Thu, Nov 9, 2017 at 10:35 AM, Jungshik Shin (신정식, 申政湜) < js...@chromium.org> wrote: > Thank you, Joel and Nick ! > > Joel: I guess you're gonna roll libxml in the Chromium tree to a version > including these changes. > > Jungshik > > 2017-11-08 15:22 GMT-08:00 Joel Hockey : > >> Thanks Nick. Nice work with the test. >> >> >> >> On Sun, Nov 5, 2017 at 2:04 AM, Nick Wellnhofer >> wrote: >> >>> On 26/10/2017 03:17, Joel Hockey wrote: >>> I've updated the patch using git format-patch. >>> >>> Thanks for the updated patch. Applied here: >>> https://git.gnome.org/browse/libxml2/commit/?id=0b19f236a263 >>> a7b0acacd4ea84dc7237303ee3d9 >>> >>> The original bug found by fuzzer only relates to UTF8 decoding, so using Shift-JIS or anything else wont help. >>> >>> Why not? My reasoning was that ICU uses the same code path for all >>> variable-width encodings. I simply converted your test file to EUC-JP and >>> it turns out that this triggers the bug as well: >>> >>> https://git.gnome.org/browse/libxml2/commit/?id=72182550926d >>> 31ad17357bd3ed69e49d7e69df02 >>> >>> Nick >>> >> >> > ___ xml mailing list, project page http://xmlsoft.org/ xml@gnome.org https://mail.gnome.org/mailman/listinfo/xml
Re: [xml] Patch to fix ICU flush and pivot buffer
Thanks Nick. Nice work with the test. On Sun, Nov 5, 2017 at 2:04 AM, Nick Wellnhofer wrote: > On 26/10/2017 03:17, Joel Hockey wrote: > >> I've updated the patch using git format-patch. >> > > Thanks for the updated patch. Applied here: https://git.gnome.org/browse/l > ibxml2/commit/?id=0b19f236a263a7b0acacd4ea84dc7237303ee3d9 > > The original bug found by fuzzer only relates to UTF8 decoding, so using >> Shift-JIS or anything else wont help. >> > > Why not? My reasoning was that ICU uses the same code path for all > variable-width encodings. I simply converted your test file to EUC-JP and > it turns out that this triggers the bug as well: > > https://git.gnome.org/browse/libxml2/commit/?id=72182550926d > 31ad17357bd3ed69e49d7e69df02 > > Nick > ___ xml mailing list, project page http://xmlsoft.org/ xml@gnome.org https://mail.gnome.org/mailman/listinfo/xml
Re: [xml] Patch to fix ICU flush and pivot buffer
On 26/10/2017 03:17, Joel Hockey wrote: I've updated the patch using git format-patch. Thanks for the updated patch. Applied here: https://git.gnome.org/browse/libxml2/commit/?id=0b19f236a263a7b0acacd4ea84dc7237303ee3d9 The original bug found by fuzzer only relates to UTF8 decoding, so using Shift-JIS or anything else wont help. Why not? My reasoning was that ICU uses the same code path for all variable-width encodings. I simply converted your test file to EUC-JP and it turns out that this triggers the bug as well: https://git.gnome.org/browse/libxml2/commit/?id=72182550926d31ad17357bd3ed69e49d7e69df02 Nick ___ xml mailing list, project page http://xmlsoft.org/ xml@gnome.org https://mail.gnome.org/mailman/listinfo/xml
Re: [xml] Patch to fix ICU flush and pivot buffer
Nick, how does that updated patch look? Are you happy to take it? On Thu, Oct 26, 2017 at 10:03 PM, Joel Hockey wrote: > >> Does libxml treat 'UTF8' (without dash/hyphen) as UTF-8 ? If not, 'UTF8' >> can be used for both ICU and iconv. >> > > Yes. https://cs.chromium.org/chromium/src/third_party/ > libxml/src/parser.c?l=10329&rcl=b54509c3db126e5a3ed9b84fa70df1f821b1fd3e > > > ___ xml mailing list, project page http://xmlsoft.org/ xml@gnome.org https://mail.gnome.org/mailman/listinfo/xml
Re: [xml] Patch to fix ICU flush and pivot buffer
> > > Does libxml treat 'UTF8' (without dash/hyphen) as UTF-8 ? If not, 'UTF8' > can be used for both ICU and iconv. > Yes. https://cs.chromium.org/chromium/src/third_party/libxml/src/parser.c?l=10329&rcl=b54509c3db126e5a3ed9b84fa70df1f821b1fd3e ___ xml mailing list, project page http://xmlsoft.org/ xml@gnome.org https://mail.gnome.org/mailman/listinfo/xml
Re: [xml] Patch to fix ICU flush and pivot buffer
I've updated the patch using git format-patch. * reverted public interface xmlCharEncInFunc. It now calls xmlEncInputChunk with flush=1 on all calls as suggested. Always setting flush=TRUE here makes sense since this is a one-shot conversion of the entire buffer. * Moved the pivot buf reset to only happen on success (suggestion from Markus). * removed test/icu_parse_test.xml from patch. This file is attached separately to this email. The original bug found by fuzzer only relates to UTF8 decoding, so using Shift-JIS or anything else wont help. I can't think of any way that this test can simultaneously - be effective to test the bug when libxml is compiled with icu - still parse correctly and pass when libxml is compiled with iconv The trick is that we need an encoding which will: - not be handled natively by libxml - be recognized by icu as UTF8 - be recognized by iconv as UTF8 On Thu, Oct 26, 2017 at 3:21 AM, Nick Wellnhofer wrote: > On 25/10/2017 17:40, Markus Scherer wrote: > >> On Wed, Oct 25, 2017 at 4:02 AM, Nick Wellnhofer > The patch changes public function xmlCharEncInFunc but this function isn't >> used internally anymore (since commit a78d8036 from 2012). It might >> still >> be used in client code that wants to use libxml2's character >> conversion >> facilities, though. Maybe it's better to remove the `flush` parameter >> and >> always call xmlEncInputChunk with `flush` set to 1. This should at >> least >> allow one-shot character conversions without breaking the public API. >> >> The ICU conversion functions must be called with flush=FALSE before the >> end of the stream, and with flush=TRUE at the end of the stream. >> > > Yes, but I'm only talking about xmlCharEncInFunc which isn't used > internally. > > Nick > > From f495b5546927032fb5b3988d66949d3d1b735aa9 Mon Sep 17 00:00:00 2001 From: Joel Hockey Date: Wed, 25 Oct 2017 18:11:12 -0700 Subject: [PATCH] Fixed ICU to set flush correctly and provide pivot buffer. By always setting flush=TRUE when doing multiple reads, ICU will not correctly handle truncated utf8 chars across read boundaries. The fix is to set flush=TRUE only on final read, and to provide a pivot buffer which is maintained by libxml between calls to ucnv_convertEx. --- encoding.c| 46 +- include/libxml/encoding.h | 5 + testapi.c | 2 +- 3 files changed, 31 insertions(+), 22 deletions(-) diff --git a/encoding.c b/encoding.c index cd019c51..5dac8a14 100644 --- a/encoding.c +++ b/encoding.c @@ -110,6 +110,9 @@ openIcuConverter(const char* name, int toUnicode) if (conv == NULL) return NULL; + conv->pivot_source = conv->pivot_buf; + conv->pivot_target = conv->pivot_buf; + conv->uconv = ucnv_open(name, &status); if (U_FAILURE(status)) goto error; @@ -1850,6 +1853,7 @@ xmlIconvWrapper(iconv_t cd, unsigned char *out, int *outlen, * @outlen: the length of @out * @in: a pointer to an array of ISO Latin 1 chars * @inlen: the length of @in + * @flush: if true, indicates end of input * * Returns 0 if success, or * -1 by lack of space, or @@ -1863,7 +1867,7 @@ xmlIconvWrapper(iconv_t cd, unsigned char *out, int *outlen, */ static int xmlUconvWrapper(uconv_t *cd, int toUnicode, unsigned char *out, int *outlen, -const unsigned char *in, int *inlen) { +const unsigned char *in, int *inlen, int flush) { const char *ucv_in = (const char *) in; char *ucv_out = (char *) out; UErrorCode err = U_ZERO_ERROR; @@ -1873,33 +1877,31 @@ xmlUconvWrapper(uconv_t *cd, int toUnicode, unsigned char *out, int *outlen, return(-1); } -/* - * TODO(jungshik) - * 1. is ucnv_convert(To|From)Algorithmic better? - * 2. had we better use an explicit pivot buffer? - * 3. error returned comes from 'fromUnicode' only even - *when toUnicode is true ! - */ if (toUnicode) { /* encoding => UTF-16 => UTF-8 */ ucnv_convertEx(cd->utf8, cd->uconv, &ucv_out, ucv_out + *outlen, - &ucv_in, ucv_in + *inlen, NULL, NULL, NULL, NULL, - 0, TRUE, &err); + &ucv_in, ucv_in + *inlen, cd->pivot_buf, + &cd->pivot_source, &cd->pivot_target, + cd->pivot_buf + ICU_PIVOT_BUF_SIZE, 0, flush, &err); } else { /* UTF-8 => UTF-16 => encoding */ ucnv_convertEx(cd->uconv, cd->utf8, &ucv_out, ucv_out + *outlen, - &ucv_in, ucv_in + *inlen, NULL, NULL, NULL, NULL, - 0, TRUE, &err); + &ucv_in, ucv_in + *inlen, cd->pivot_buf, + &cd->pivot_source, &cd->pivot_target, + cd->pivot_buf + ICU_PIVOT_BUF_SIZE, 0, flush, &err); } *inlen = ucv_in - (const char*) in; *outlen = ucv_out - (char *) out; -if (U_SUCCESS(err)) +if (U_
Re: [xml] Patch to fix ICU flush and pivot buffer
On 25/10/2017 17:40, Markus Scherer wrote: On Wed, Oct 25, 2017 at 4:02 AM, Nick Wellnhofer The patch changes public function xmlCharEncInFunc but this function isn't used internally anymore (since commit a78d8036 from 2012). It might still be used in client code that wants to use libxml2's character conversion facilities, though. Maybe it's better to remove the `flush` parameter and always call xmlEncInputChunk with `flush` set to 1. This should at least allow one-shot character conversions without breaking the public API. The ICU conversion functions must be called with flush=FALSE before the end of the stream, and with flush=TRUE at the end of the stream. Yes, but I'm only talking about xmlCharEncInFunc which isn't used internally. Nick ___ xml mailing list, project page http://xmlsoft.org/ xml@gnome.org https://mail.gnome.org/mailman/listinfo/xml
Re: [xml] Patch to fix ICU flush and pivot buffer
On 25/10/2017 10:32, Joel Hockey wrote: This patch fixes those issues. Looks good. The patch changes public function xmlCharEncInFunc but this function isn't used internally anymore (since commit a78d8036 from 2012). It might still be used in client code that wants to use libxml2's character conversion facilities, though. Maybe it's better to remove the `flush` parameter and always call xmlEncInputChunk with `flush` set to 1. This should at least allow one-shot character conversions without breaking the public API. The test sets encoding to "UTF8-". This encoding is chosen since it is not recognized by libxml and forces the decoding to be done by ICU which recognizes this encoding as UTF8. Unfortunately, the test always fails when using iconv since iconv does not recognize this encoding. Since iconv is the default in libxml, I understand that it may not make sense to always run this testcase, or to include it at all. It should be possible to write a test that works with both iconv and ICU and triggers the bug with a different variable-width encoding like Shift-JIS. If it is any help, I can send this patch as a pull request to the libxml github repo, or I can also create a libxml bug. It looks like libxml preference is to take patches on the mailing list. I prefer patches created with `git format-patch` that include author information and a commit message, either via the mailing list or Bugzilla. Nick ___ xml mailing list, project page http://xmlsoft.org/ xml@gnome.org https://mail.gnome.org/mailman/listinfo/xml
[xml] Patch to fix ICU flush and pivot buffer
Hi, The chromium team have recently detected a fuzz-testing bug in libxml / ICU where UTF8 chars can be decoded incorrectly. See http://crbug.com/722420. The root cause of this problem is that libxml is calling ICU ucnv_convertEx with incorrect params. It is always setting flush to TRUE. This param should only be set to true for the last call when reading an input. Also, when calling ucnv_convertEx multiple times (with flush=FALSE), the caller must provide a pivot buffer which is maintained between calls. This patch fixes those issues. The patch includes test/icu_parse_test.xml to reproduce the error. ./configure --with-icu --with-iconv=no make runtest ./runtest The test sets encoding to "UTF8-". This encoding is chosen since it is not recognized by libxml and forces the decoding to be done by ICU which recognizes this encoding as UTF8. Unfortunately, the test always fails when using iconv since iconv does not recognize this encoding. Since iconv is the default in libxml, I understand that it may not make sense to always run this testcase, or to include it at all. This patch has been reviewed by Markus Scherer (maintainer of ICU), and Jungshik in the chromium team who worked on the original integration of libxml and ICU in 2007. http://crosreview.com/729616. The patch is ready to submit against the chromium local copy of libxml, but it is preferable to have this accepted into libxml if you are happy with it, and we can then take the patch from libxml and stay in sync. If it is any help, I can send this patch as a pull request to the libxml github repo, or I can also create a libxml bug. It looks like libxml preference is to take patches on the mailing list. Thanks, Joel diff --git a/encoding.c b/encoding.c index cd019c51..617e1ed7 100644 --- a/encoding.c +++ b/encoding.c @@ -110,6 +110,9 @@ openIcuConverter(const char* name, int toUnicode) if (conv == NULL) return NULL; + conv->pivot_source = conv->pivot_buf; + conv->pivot_target = conv->pivot_buf; + conv->uconv = ucnv_open(name, &status); if (U_FAILURE(status)) goto error; @@ -1850,6 +1853,7 @@ xmlIconvWrapper(iconv_t cd, unsigned char *out, int *outlen, * @outlen: the length of @out * @in: a pointer to an array of ISO Latin 1 chars * @inlen: the length of @in + * @flush: if true, indicates end of input * * Returns 0 if success, or * -1 by lack of space, or @@ -1863,7 +1867,7 @@ xmlIconvWrapper(iconv_t cd, unsigned char *out, int *outlen, */ static int xmlUconvWrapper(uconv_t *cd, int toUnicode, unsigned char *out, int *outlen, -const unsigned char *in, int *inlen) { +const unsigned char *in, int *inlen, int flush) { const char *ucv_in = (const char *) in; char *ucv_out = (char *) out; UErrorCode err = U_ZERO_ERROR; @@ -1873,33 +1877,30 @@ xmlUconvWrapper(uconv_t *cd, int toUnicode, unsigned char *out, int *outlen, return(-1); } -/* - * TODO(jungshik) - * 1. is ucnv_convert(To|From)Algorithmic better? - * 2. had we better use an explicit pivot buffer? - * 3. error returned comes from 'fromUnicode' only even - *when toUnicode is true ! - */ if (toUnicode) { /* encoding => UTF-16 => UTF-8 */ ucnv_convertEx(cd->utf8, cd->uconv, &ucv_out, ucv_out + *outlen, - &ucv_in, ucv_in + *inlen, NULL, NULL, NULL, NULL, - 0, TRUE, &err); + &ucv_in, ucv_in + *inlen, cd->pivot_buf, + &cd->pivot_source, &cd->pivot_target, + cd->pivot_buf + ICU_PIVOT_BUF_SIZE, 0, flush, &err); } else { /* UTF-8 => UTF-16 => encoding */ ucnv_convertEx(cd->uconv, cd->utf8, &ucv_out, ucv_out + *outlen, - &ucv_in, ucv_in + *inlen, NULL, NULL, NULL, NULL, - 0, TRUE, &err); + &ucv_in, ucv_in + *inlen, cd->pivot_buf, + &cd->pivot_source, &cd->pivot_target, + cd->pivot_buf + ICU_PIVOT_BUF_SIZE, 0, flush, &err); } *inlen = ucv_in - (const char*) in; *outlen = ucv_out - (char *) out; +/* reset pivot buffer if this is the last call for input (flush==TRUE) */ +if (flush) +cd->pivot_source = cd->pivot_target = cd->pivot_buf; if (U_SUCCESS(err)) return 0; if (err == U_BUFFER_OVERFLOW_ERROR) return -1; if (err == U_INVALID_CHAR_FOUND || err == U_ILLEGAL_CHAR_FOUND) return -2; -/* if (err == U_TRUNCATED_CHAR_FOUND) */ return -3; } #endif /* LIBXML_ICU_ENABLED */ @@ -1912,7 +1913,7 @@ xmlUconvWrapper(uconv_t *cd, int toUnicode, unsigned char *out, int *outlen, static int xmlEncInputChunk(xmlCharEncodingHandler *handler, unsigned char *out, - int *outlen, const unsigned char *in, int *inlen) { + int *outlen, const unsigned char *in, int *inlen, int flush) { int ret;