On Sat, 2006-08-05 at 14:12 -0400, Joey Hess wrote:
> Would just calling Encode::decode_utf8 on the input string in Expat.pm
> be the simplest fix?

I'm not sure, but I think not.
First of all, in the case I reported, the parser reads directly from an
input stream. The data is then not touched by Expat.pm, but handled
internally in Expat.xs.

It seems to me that the reported overflow can not be triggered in the
case where a string (as opposed to a stream) is passed to XML::Parser.
It also seems to me that XML parsing on a string will proceed correctly
regardless of whether the string is logical Unicode or raw UTF8, since
both kinds of strings are essentially the same at the level of Perl
internals.

Secondly, the cause of the reported stream parsing problem is not that
Expat does not handle UTF8 data; it handles that fine. The problem is
that it *expects* raw UTF8 bytes but, in my case, gets logical Unicode
characters instead. It breaks on that because of an invalid assumption
in the buffer management code.

I dived into Expat.xs again and believe I have a simple fix that stops
the buffer management from overflowing the heap. Due to Perl's identical
internal treatment of utf8 and Unicode, this should be all that is
necessary to enable correct parsing of Unicode streams.

My patch is attached.
Basic testing suggests that it works as intended. But I have very little
experience with Perl XS coding, so I would recommend that somebody
reviews this before it is applied anywhere.

Thanks for pushing this forward a bit; we should get it fixed.

Joris.

PS. (and slightly off-topic) My personal opinion is that Perl has
utterly messed up Unicode handling. The documentation uses the terms
"Unicode" and "UTF8" as if they were interchangable. In fact, and as we
see with this bug, there is a very important conceptual difference
between "a string containing N raw utf8 bytes" and "a string containing
M logical Unicode characters".

--- XML-Parser-2.34/Expat/Expat.xs.orig	2003-07-28 16:41:10.000000000 +0200
+++ XML-Parser-2.34/Expat/Expat.xs	2006-08-07 10:37:40.000000000 +0200
@@ -289,11 +289,10 @@
   SV *		tbuff;
   SV *		tsiz;
   char *	linebuff;
   STRLEN	lblen;
   STRLEN	br = 0;
-  int		buffsize;
   int		done = 0;
   int		ret = 1;
   char *	msg = NULL;
   CallbackVector * cbv;
   char		*buff = (char *) 0;
@@ -334,37 +333,31 @@
 	  && strnEQ(++chk, cbv->delim + 1, cbv->delimlen - 1))
 	lblen -= cbv->delimlen + 1;
     }
 
     PUTBACK ;
-    buffsize = lblen;
     done = lblen == 0;
   }
   else {
     tbuff = newSV(0);
     tsiz = newSViv(BUFSIZE);
-    buffsize = BUFSIZE;
   }
 
   while (! done)
     {
-      char *buffer = XML_GetBuffer(parser, buffsize);
-
-      if (! buffer)
-	croak("Ran out of memory for input buffer");
+      char *buffer, *tb;
 
       SAVETMPS;
 
       if (cbv->delim) {
-	Copy(linebuff, buffer, lblen, char);
+	tb = linebuff;
 	br = lblen;
 	done = 1;
       }
       else {
 	int cnt;
 	SV * rdres;
-	char * tb;
 
 	PUSHMARK(SP);
 	EXTEND(SP, 3);
 	PUSHs(ioref);
 	PUSHs(tbuff);
@@ -382,18 +375,26 @@
 
 	if (! SvOK(rdres))
 	  croak("read error");
 
 	tb = SvPV(tbuff, br);
-	if (br > 0)
-	  Copy(tb, buffer, br, char);
-	else
+	/* br == number of bytes read from stream
+	   Note that it is possible that br > BUFSIZE if the input stream
+	   is decoding a non-ASCII source. */
+	if (br <= 0)
 	  done = 1;
 
 	PUTBACK ;
       }
 
+      buffer = XML_GetBuffer(parser, br);
+      if (! buffer)
+	croak("Ran out of memory for input buffer");
+
+      if (br > 0)
+        Copy(tb, buffer, br, char);
+
       ret = XML_ParseBuffer(parser, br, done);
 
       SPAGAIN; /* resync local SP in case callbacks changed global stack */
 
       if (! ret)

Reply via email to