RE: XMLStreamReader corrupting data

Michael Glavassevich Tue, 04 Nov 2014 06:52:50 -0800

You may have found a bug in OpenJDK, but OpenJDK != Xerces. What you're 
looking at isn't the Apache codebase. We have no influence over changes 
made to OpenJDK.


Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: [email protected]
E-mail: [email protected]

"Michel, Victor" <[email protected]> wrote on 11/03/2014 10:03:50 PM:

> Here are my findings, which are pretty much a dump of what my 
> debugger tells me when running the sample I posted.
> 
> The XML:
> <?xml version="1.0" encoding="UTF-8"?>
> <He likes="rugs" because="they really tie the room together"/>
> 
> For the start element "He", we first process the "likes" attribute, 
> and extract its value:
> Line 437:
> http://cr.openjdk.java.net/~joehw/jdk8/8029236/webrev/src/com/sun/
> org/apache/xerces/internal/impl/XMLNSDocumentScannerImpl.java.html
> tmpStr is declared. It is passed as first argument of 
> scanAttributeValue on the next line (438)
> 
> That invocation goes on line 835 of XMLScanner:
> http://cr.openjdk.java.net/~joehw/jdk8/8029236/webrev/src/com/sun/
> org/apache/xerces/internal/impl/XMLScanner.java.html
> We're in scanAttributeValue.
> "fEntityScanner.scanLiteral(quote, value)" is invoked
> where "value" is the tmpStr variable above
> 
> then, on line 1156:
> http://cr.openjdk.java.net/~joehw/jdk9/8027359/webrev/src/com/sun/
> org/apache/xerces/internal/impl/XMLEntityScanner.java.html
> "content.setValues(fCurrentEntity.ch, offset, length);"
> So, tmpStr is now referencing the internal buffer of the current 
> entity, with an offset and a length (offset=0 and length=4 in this 
> case, according to my debugger)
> 
> Line 560:
> http://cr.openjdk.java.net/~joehw/jdk8/8029236/webrev/src/com/sun/
> org/apache/xerces/internal/impl/XMLNSDocumentScannerImpl.java.html
> attributes now references tmpStr, which points to the internal entity 
buffer
> 
> Then, the flow continues to:
> Line 254 (next iteration of the loop) we parse the next attribute 
"because"
> http://cr.openjdk.java.net/~joehw/jdk8/8029236/webrev/src/com/sun/
> org/apache/xerces/internal/impl/XMLNSDocumentScannerImpl.java.html
> scanAttribute is invoked, which in turns invokes scanQName:
> 
> And then, the corruption actually happens here, on line 779:
> http://cr.openjdk.java.net/~joehw/jdk9/8027359/webrev/src/com/sun/
> org/apache/xerces/internal/impl/XMLEntityScanner.java.html
> The first character of the internal buffer is overwritten - but in 
> this example, this character is actually part of the XMLString which
> is supposed to contain the previous attribute value (because 
> offset=0). So, "rugs" is changed into "bugs"
> 
> >> So, every time an XMLString with offset=0 is created for an 
> attribute value, its first character is going to be overwritten by 
> the first character of the next attribute name, if any.
> 
> I hope this is convincing! I don't know where the actual regression 
> is, but it really seems to me that the bug is in Xerces, since this 
> execution flow does not leave Xerces code before the corruption happens.
> 
> 
> Thanks,
> 
> Victor Michel
> Amazon Web Services
> 
> 
> -----Original Message-----
> From: Michel, Victor 
> Sent: Monday, November 03, 2014 4:16 PM
> To: '[email protected]'
> Subject: RE: XMLStreamReader corrupting data
> 
> I should have named this email thread " XMLEntityScanner/
> XMLEntityManager corrupting data" - sorry for the confusion.
> My debugger points me to these two classes, even though I have yet 
> been unable to pinpoint the bug in the code
> 
> The repro case I've sent is fairly self-explanatory - the "b" from 
> "because" ends up overwriting the "r" of "rugs".
> 
> I used a custom implementation of InputStream in the repro case I 
> sent (to keep it small), but I have seen the bug happen with other 
> implementations of InputStream and much larger XML files.
> The corruption happens silently, which makes that bug pretty tricky to 
detect
> 
> Thanks,
> 
> Victor Michel
> Amazon Web Services
> 
> 
> -----Original Message-----
> From: Michel, Victor
> Sent: Monday, November 03, 2014 10:52 AM
> To: [email protected]
> Subject: RE: XMLStreamReader corrupting data
> 
> Hi,
> 
> Thanks for the answer
> XMLStreamReader is implemented by:
> > public class XMLStreamReaderImpl implements 
> > javax.xml.stream.XMLStreamReader  (in package 
> > com.sun.org.apache.xerces.internal.impl )
> It relies heavily on XMLEntityScanner, which is, I believe, the 
> cause of the bug.
> 
> XMLEntityScanner seems to be part of the Xerces library https://
> xerces.apache.org/xerces2-j/javadocs/xerces2/org/apache/xerces/impl/
> XMLEntityScanner.html
> 
> Maybe I misunderstood something? In any case, I have filed a bug on 
> the Oracle website
> 
> Thanks,
> 
> Victor
> 
> -----Original Message-----
> From: Michael Glavassevich [mailto:[email protected]]
> Sent: Monday, November 03, 2014 8:30 AM
> To: [email protected]
> Subject: Re: XMLStreamReader corrupting data
> 
> Hello,
> 
> Apache Xerces does not contain an implementation of the 
> XMLStreamReader interface. The component you're using would have 
> been developed by Oracle/Sun and has not been contributed to Apache.
> We wouldn't know anything about the problem you're experiencing with
> StAX. Probably better to ask your question on one of the JDK forums.
> 
> Thanks.
> 
> Michael Glavassevich
> XML Technologies and WAS Development
> IBM Toronto Lab
> E-mail: [email protected]
> E-mail: [email protected]
> 
> "Michel, Victor" <[email protected]> wrote on 11/03/2014 03:12:20 AM:
> 
> > Hi all,
> > 
> > I'd like to report something that looks like a bug in the version of 
> > Xerces included in JRE 7u71/7u72/8u20/8u25 The StAX API seems to 
> > produce corrupted data, depending on how many bytes the underlying 
> > InputStream is actually reading at each invocation of read(byte[], 
> > int, int)
> > 
> > The following repro case will lead to different results depending on 
> > the version of the JRE. Am I doing something wrong?
> > 
> > Thanks,
> > 
> > Victor
> > 
> > ------
> > 
> > import java.io.ByteArrayInputStream;
> > import java.io.FilterInputStream;
> > import java.io.IOException;
> > import java.io.InputStream;
> > import java.nio.charset.Charset;
> > 
> > import javax.xml.stream.XMLInputFactory; import 
> > javax.xml.stream.XMLStreamReader;
> > 
> > /*
> >  * Correct output (7u67,8u11)
> >  * rugs
> >  *
> >  * Incorrect output (7u71,7u72,8u20,8u25)
> >  * bugs
> >  */
> > public class XmlReaderBug {
> > 
> >     private static final int BYTES_PER_READ = 6;
> > 
> >     private static final String XML =
> >         "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
> >         "<He likes=\"rugs\" because=\"they really tie the room
> together\"/>";
> > 
> >     public static void main(String[] args) throws Exception {
> >         final InputStream xmlStream = new ByteArrayInputStream 
> > (XML.getBytes(Charset.forName("UTF-8")));
> >         final InputStream throttledXmlStream = new 
> > ThrottledInputStream(xmlStream, BYTES_PER_READ);
> > 
> >         final XMLInputFactory xmlFactory =
> XMLInputFactory.newInstance();
> >         final XMLStreamReader xmlStreamReader = 
> > xmlFactory.createXMLStreamReader(throttledXmlStream);
> >         xmlStreamReader.next();
> > 
> >         // bugs or rugs?
> >         System.out.println(xmlStreamReader.getAttributeValue(null,
> "likes"));
> >     }
> > 
> >     // An InputStream implementation that limits the number of bytes 
> > read by read(byte[], int, int)
> >     private static class ThrottledInputStream extends 
> > FilterInputStream
> {
> >         private final int bytesPerRead;
> > 
> >         public ThrottledInputStream(InputStream stream, int
> > bytesPerRead) throws Exception {
> >             super(stream);
> >             this.bytesPerRead = bytesPerRead;
> >         }
> > 
> >         @Override
> >         public int read(byte[] b, int off, int len) throws IOException 
{
> >             if (off < 0 || len < 0 || len > b.length - off) {
> >                 throw new IndexOutOfBoundsException();
> >             } else if (len == 0) {
> >                 return 0;
> >             }
> > 
> >             // Limit bytes read
> >             int bytesToRead = Math.min(bytesPerRead, len);
> > 
> >             // Ensure deterministic behavior (similar to
> > org.apache.commons.io.IOUtils.read)
> >             // Useless for this test case, but convenient for 
> > consistently reproducing
> >             // the bug with other stream implementations
> >             int totalBytesRead = 0;
> >             int bytesRead = 0;
> >             do {
> >                 bytesRead = Math.max(0, in.read(b, off + 
> > totalBytesRead, bytesToRead));
> >                 bytesToRead -= bytesRead;
> >                 totalBytesRead += bytesRead;
> >             } while (bytesRead > 0);
> > 
> >             // No more bytes
> >             if (totalBytesRead == 0) {
> >                 return -1;
> >             }
> > 
> >             return totalBytesRead;
> >         }
> >     }
> > }
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

RE: XMLStreamReader corrupting data

Reply via email to