You may have found a bug in OpenJDK, but OpenJDK != Xerces. What you're looking at isn't the Apache codebase. We have no influence over changes made to OpenJDK.
Michael Glavassevich XML Technologies and WAS Development IBM Toronto Lab E-mail: [email protected] E-mail: [email protected] "Michel, Victor" <[email protected]> wrote on 11/03/2014 10:03:50 PM: > Here are my findings, which are pretty much a dump of what my > debugger tells me when running the sample I posted. > > The XML: > <?xml version="1.0" encoding="UTF-8"?> > <He likes="rugs" because="they really tie the room together"/> > > For the start element "He", we first process the "likes" attribute, > and extract its value: > Line 437: > http://cr.openjdk.java.net/~joehw/jdk8/8029236/webrev/src/com/sun/ > org/apache/xerces/internal/impl/XMLNSDocumentScannerImpl.java.html > tmpStr is declared. It is passed as first argument of > scanAttributeValue on the next line (438) > > That invocation goes on line 835 of XMLScanner: > http://cr.openjdk.java.net/~joehw/jdk8/8029236/webrev/src/com/sun/ > org/apache/xerces/internal/impl/XMLScanner.java.html > We're in scanAttributeValue. > "fEntityScanner.scanLiteral(quote, value)" is invoked > where "value" is the tmpStr variable above > > then, on line 1156: > http://cr.openjdk.java.net/~joehw/jdk9/8027359/webrev/src/com/sun/ > org/apache/xerces/internal/impl/XMLEntityScanner.java.html > "content.setValues(fCurrentEntity.ch, offset, length);" > So, tmpStr is now referencing the internal buffer of the current > entity, with an offset and a length (offset=0 and length=4 in this > case, according to my debugger) > > Line 560: > http://cr.openjdk.java.net/~joehw/jdk8/8029236/webrev/src/com/sun/ > org/apache/xerces/internal/impl/XMLNSDocumentScannerImpl.java.html > attributes now references tmpStr, which points to the internal entity buffer > > Then, the flow continues to: > Line 254 (next iteration of the loop) we parse the next attribute "because" > http://cr.openjdk.java.net/~joehw/jdk8/8029236/webrev/src/com/sun/ > org/apache/xerces/internal/impl/XMLNSDocumentScannerImpl.java.html > scanAttribute is invoked, which in turns invokes scanQName: > > And then, the corruption actually happens here, on line 779: > http://cr.openjdk.java.net/~joehw/jdk9/8027359/webrev/src/com/sun/ > org/apache/xerces/internal/impl/XMLEntityScanner.java.html > The first character of the internal buffer is overwritten - but in > this example, this character is actually part of the XMLString which > is supposed to contain the previous attribute value (because > offset=0). So, "rugs" is changed into "bugs" > > >> So, every time an XMLString with offset=0 is created for an > attribute value, its first character is going to be overwritten by > the first character of the next attribute name, if any. > > I hope this is convincing! I don't know where the actual regression > is, but it really seems to me that the bug is in Xerces, since this > execution flow does not leave Xerces code before the corruption happens. > > > Thanks, > > Victor Michel > Amazon Web Services > > > -----Original Message----- > From: Michel, Victor > Sent: Monday, November 03, 2014 4:16 PM > To: '[email protected]' > Subject: RE: XMLStreamReader corrupting data > > I should have named this email thread " XMLEntityScanner/ > XMLEntityManager corrupting data" - sorry for the confusion. > My debugger points me to these two classes, even though I have yet > been unable to pinpoint the bug in the code > > The repro case I've sent is fairly self-explanatory - the "b" from > "because" ends up overwriting the "r" of "rugs". > > I used a custom implementation of InputStream in the repro case I > sent (to keep it small), but I have seen the bug happen with other > implementations of InputStream and much larger XML files. > The corruption happens silently, which makes that bug pretty tricky to detect > > Thanks, > > Victor Michel > Amazon Web Services > > > -----Original Message----- > From: Michel, Victor > Sent: Monday, November 03, 2014 10:52 AM > To: [email protected] > Subject: RE: XMLStreamReader corrupting data > > Hi, > > Thanks for the answer > XMLStreamReader is implemented by: > > public class XMLStreamReaderImpl implements > > javax.xml.stream.XMLStreamReader (in package > > com.sun.org.apache.xerces.internal.impl ) > It relies heavily on XMLEntityScanner, which is, I believe, the > cause of the bug. > > XMLEntityScanner seems to be part of the Xerces library https:// > xerces.apache.org/xerces2-j/javadocs/xerces2/org/apache/xerces/impl/ > XMLEntityScanner.html > > Maybe I misunderstood something? In any case, I have filed a bug on > the Oracle website > > Thanks, > > Victor > > -----Original Message----- > From: Michael Glavassevich [mailto:[email protected]] > Sent: Monday, November 03, 2014 8:30 AM > To: [email protected] > Subject: Re: XMLStreamReader corrupting data > > Hello, > > Apache Xerces does not contain an implementation of the > XMLStreamReader interface. The component you're using would have > been developed by Oracle/Sun and has not been contributed to Apache. > We wouldn't know anything about the problem you're experiencing with > StAX. Probably better to ask your question on one of the JDK forums. > > Thanks. > > Michael Glavassevich > XML Technologies and WAS Development > IBM Toronto Lab > E-mail: [email protected] > E-mail: [email protected] > > "Michel, Victor" <[email protected]> wrote on 11/03/2014 03:12:20 AM: > > > Hi all, > > > > I'd like to report something that looks like a bug in the version of > > Xerces included in JRE 7u71/7u72/8u20/8u25 The StAX API seems to > > produce corrupted data, depending on how many bytes the underlying > > InputStream is actually reading at each invocation of read(byte[], > > int, int) > > > > The following repro case will lead to different results depending on > > the version of the JRE. Am I doing something wrong? > > > > Thanks, > > > > Victor > > > > ------ > > > > import java.io.ByteArrayInputStream; > > import java.io.FilterInputStream; > > import java.io.IOException; > > import java.io.InputStream; > > import java.nio.charset.Charset; > > > > import javax.xml.stream.XMLInputFactory; import > > javax.xml.stream.XMLStreamReader; > > > > /* > > * Correct output (7u67,8u11) > > * rugs > > * > > * Incorrect output (7u71,7u72,8u20,8u25) > > * bugs > > */ > > public class XmlReaderBug { > > > > private static final int BYTES_PER_READ = 6; > > > > private static final String XML = > > "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + > > "<He likes=\"rugs\" because=\"they really tie the room > together\"/>"; > > > > public static void main(String[] args) throws Exception { > > final InputStream xmlStream = new ByteArrayInputStream > > (XML.getBytes(Charset.forName("UTF-8"))); > > final InputStream throttledXmlStream = new > > ThrottledInputStream(xmlStream, BYTES_PER_READ); > > > > final XMLInputFactory xmlFactory = > XMLInputFactory.newInstance(); > > final XMLStreamReader xmlStreamReader = > > xmlFactory.createXMLStreamReader(throttledXmlStream); > > xmlStreamReader.next(); > > > > // bugs or rugs? > > System.out.println(xmlStreamReader.getAttributeValue(null, > "likes")); > > } > > > > // An InputStream implementation that limits the number of bytes > > read by read(byte[], int, int) > > private static class ThrottledInputStream extends > > FilterInputStream > { > > private final int bytesPerRead; > > > > public ThrottledInputStream(InputStream stream, int > > bytesPerRead) throws Exception { > > super(stream); > > this.bytesPerRead = bytesPerRead; > > } > > > > @Override > > public int read(byte[] b, int off, int len) throws IOException { > > if (off < 0 || len < 0 || len > b.length - off) { > > throw new IndexOutOfBoundsException(); > > } else if (len == 0) { > > return 0; > > } > > > > // Limit bytes read > > int bytesToRead = Math.min(bytesPerRead, len); > > > > // Ensure deterministic behavior (similar to > > org.apache.commons.io.IOUtils.read) > > // Useless for this test case, but convenient for > > consistently reproducing > > // the bug with other stream implementations > > int totalBytesRead = 0; > > int bytesRead = 0; > > do { > > bytesRead = Math.max(0, in.read(b, off + > > totalBytesRead, bytesToRead)); > > bytesToRead -= bytesRead; > > totalBytesRead += bytesRead; > > } while (bytesRead > 0); > > > > // No more bytes > > if (totalBytesRead == 0) { > > return -1; > > } > > > > return totalBytesRead; > > } > > } > > } > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [email protected] > > For additional commands, e-mail: [email protected] > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
