On Sat, 2006-08-05 at 14:12 -0400, Joey Hess wrote: > Would just calling Encode::decode_utf8 on the input string in Expat.pm > be the simplest fix?
I'm not sure, but I think not. First of all, in the case I reported, the parser reads directly from an input stream. The data is then not touched by Expat.pm, but handled internally in Expat.xs. It seems to me that the reported overflow can not be triggered in the case where a string (as opposed to a stream) is passed to XML::Parser. It also seems to me that XML parsing on a string will proceed correctly regardless of whether the string is logical Unicode or raw UTF8, since both kinds of strings are essentially the same at the level of Perl internals. Secondly, the cause of the reported stream parsing problem is not that Expat does not handle UTF8 data; it handles that fine. The problem is that it *expects* raw UTF8 bytes but, in my case, gets logical Unicode characters instead. It breaks on that because of an invalid assumption in the buffer management code. I dived into Expat.xs again and believe I have a simple fix that stops the buffer management from overflowing the heap. Due to Perl's identical internal treatment of utf8 and Unicode, this should be all that is necessary to enable correct parsing of Unicode streams. My patch is attached. Basic testing suggests that it works as intended. But I have very little experience with Perl XS coding, so I would recommend that somebody reviews this before it is applied anywhere. Thanks for pushing this forward a bit; we should get it fixed. Joris. PS. (and slightly off-topic) My personal opinion is that Perl has utterly messed up Unicode handling. The documentation uses the terms "Unicode" and "UTF8" as if they were interchangable. In fact, and as we see with this bug, there is a very important conceptual difference between "a string containing N raw utf8 bytes" and "a string containing M logical Unicode characters".
--- XML-Parser-2.34/Expat/Expat.xs.orig 2003-07-28 16:41:10.000000000 +0200 +++ XML-Parser-2.34/Expat/Expat.xs 2006-08-07 10:37:40.000000000 +0200 @@ -289,11 +289,10 @@ SV * tbuff; SV * tsiz; char * linebuff; STRLEN lblen; STRLEN br = 0; - int buffsize; int done = 0; int ret = 1; char * msg = NULL; CallbackVector * cbv; char *buff = (char *) 0; @@ -334,37 +333,31 @@ && strnEQ(++chk, cbv->delim + 1, cbv->delimlen - 1)) lblen -= cbv->delimlen + 1; } PUTBACK ; - buffsize = lblen; done = lblen == 0; } else { tbuff = newSV(0); tsiz = newSViv(BUFSIZE); - buffsize = BUFSIZE; } while (! done) { - char *buffer = XML_GetBuffer(parser, buffsize); - - if (! buffer) - croak("Ran out of memory for input buffer"); + char *buffer, *tb; SAVETMPS; if (cbv->delim) { - Copy(linebuff, buffer, lblen, char); + tb = linebuff; br = lblen; done = 1; } else { int cnt; SV * rdres; - char * tb; PUSHMARK(SP); EXTEND(SP, 3); PUSHs(ioref); PUSHs(tbuff); @@ -382,18 +375,26 @@ if (! SvOK(rdres)) croak("read error"); tb = SvPV(tbuff, br); - if (br > 0) - Copy(tb, buffer, br, char); - else + /* br == number of bytes read from stream + Note that it is possible that br > BUFSIZE if the input stream + is decoding a non-ASCII source. */ + if (br <= 0) done = 1; PUTBACK ; } + buffer = XML_GetBuffer(parser, br); + if (! buffer) + croak("Ran out of memory for input buffer"); + + if (br > 0) + Copy(tb, buffer, br, char); + ret = XML_ParseBuffer(parser, br, done); SPAGAIN; /* resync local SP in case callbacks changed global stack */ if (! ret)