Here are my findings, which are pretty much a dump of what my debugger tells me 
when running the sample I posted.

The XML:
<?xml version="1.0" encoding="UTF-8"?>
<He likes="rugs" because="they really tie the room together"/>

For the start element "He", we first process the "likes" attribute, and extract 
its value:
Line 437:
http://cr.openjdk.java.net/~joehw/jdk8/8029236/webrev/src/com/sun/org/apache/xerces/internal/impl/XMLNSDocumentScannerImpl.java.html
tmpStr is declared. It is passed as first argument of scanAttributeValue on the 
next line (438)

That invocation goes on line 835 of XMLScanner:
http://cr.openjdk.java.net/~joehw/jdk8/8029236/webrev/src/com/sun/org/apache/xerces/internal/impl/XMLScanner.java.html
We're in scanAttributeValue.
"fEntityScanner.scanLiteral(quote, value)" is invoked
where "value" is the tmpStr variable above

then, on line 1156:
http://cr.openjdk.java.net/~joehw/jdk9/8027359/webrev/src/com/sun/org/apache/xerces/internal/impl/XMLEntityScanner.java.html
"content.setValues(fCurrentEntity.ch, offset, length);"
So, tmpStr is now referencing the internal buffer of the current entity, with 
an offset and a length (offset=0 and length=4 in this case, according to my 
debugger)

Line 560:
http://cr.openjdk.java.net/~joehw/jdk8/8029236/webrev/src/com/sun/org/apache/xerces/internal/impl/XMLNSDocumentScannerImpl.java.html
attributes now references tmpStr, which points to the internal entity buffer

Then, the flow continues to:
Line 254 (next iteration of the loop) we parse the next attribute "because"
http://cr.openjdk.java.net/~joehw/jdk8/8029236/webrev/src/com/sun/org/apache/xerces/internal/impl/XMLNSDocumentScannerImpl.java.html
scanAttribute is invoked, which in turns invokes scanQName:

And then, the corruption actually happens here, on line 779:
http://cr.openjdk.java.net/~joehw/jdk9/8027359/webrev/src/com/sun/org/apache/xerces/internal/impl/XMLEntityScanner.java.html
The first character of the internal buffer is overwritten - but in this 
example, this character is actually part of the XMLString which is supposed to 
contain the previous attribute value (because offset=0). So, "rugs" is changed 
into "bugs"

>> So, every time an XMLString with offset=0 is created for an attribute value, 
>> its first character is going to be overwritten by the first character of the 
>> next attribute name, if any.

I hope this is convincing! I don't know where the actual regression is, but it 
really seems to me that the bug is in Xerces, since this execution flow does 
not leave Xerces code before the corruption happens.


Thanks,

Victor Michel
Amazon Web Services


-----Original Message-----
From: Michel, Victor 
Sent: Monday, November 03, 2014 4:16 PM
To: '[email protected]'
Subject: RE: XMLStreamReader corrupting data

I should have named this email thread " XMLEntityScanner/XMLEntityManager 
corrupting data" - sorry for the confusion.
My debugger points me to these two classes, even though I have yet been unable 
to pinpoint the bug in the code

The repro case I've sent is fairly self-explanatory - the "b" from "because" 
ends up overwriting the "r" of "rugs".

I used a custom implementation of InputStream in the repro case I sent (to keep 
it small), but I have seen the bug happen with other implementations of 
InputStream and much larger XML files.
The corruption happens silently, which makes that bug pretty tricky to detect

Thanks,

Victor Michel
Amazon Web Services


-----Original Message-----
From: Michel, Victor
Sent: Monday, November 03, 2014 10:52 AM
To: [email protected]
Subject: RE: XMLStreamReader corrupting data

Hi,

Thanks for the answer
XMLStreamReader is implemented by:
> public class XMLStreamReaderImpl implements 
> javax.xml.stream.XMLStreamReader  (in package 
> com.sun.org.apache.xerces.internal.impl )
It relies heavily on XMLEntityScanner, which is, I believe, the cause of the 
bug.

XMLEntityScanner seems to be part of the Xerces library 
https://xerces.apache.org/xerces2-j/javadocs/xerces2/org/apache/xerces/impl/XMLEntityScanner.html

Maybe I misunderstood something? In any case, I have filed a bug on the Oracle 
website

Thanks,

Victor

-----Original Message-----
From: Michael Glavassevich [mailto:[email protected]]
Sent: Monday, November 03, 2014 8:30 AM
To: [email protected]
Subject: Re: XMLStreamReader corrupting data

Hello,

Apache Xerces does not contain an implementation of the XMLStreamReader 
interface. The component you're using would have been developed by Oracle/Sun 
and has not been contributed to Apache. We wouldn't know anything about the 
problem you're experiencing with StAX. Probably better to ask your question on 
one of the JDK forums.

Thanks.

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: [email protected]
E-mail: [email protected]

"Michel, Victor" <[email protected]> wrote on 11/03/2014 03:12:20 AM:

> Hi all,
> 
> I'd like to report something that looks like a bug in the version of 
> Xerces included in JRE 7u71/7u72/8u20/8u25 The StAX API seems to 
> produce corrupted data, depending on how many bytes the underlying 
> InputStream is actually reading at each invocation of read(byte[], 
> int, int)
> 
> The following repro case will lead to different results depending on 
> the version of the JRE. Am I doing something wrong?
> 
> Thanks,
> 
> Victor
> 
> ------
> 
> import java.io.ByteArrayInputStream;
> import java.io.FilterInputStream;
> import java.io.IOException;
> import java.io.InputStream;
> import java.nio.charset.Charset;
> 
> import javax.xml.stream.XMLInputFactory; import 
> javax.xml.stream.XMLStreamReader;
> 
> /*
>  * Correct output (7u67,8u11)
>  * rugs
>  *
>  * Incorrect output (7u71,7u72,8u20,8u25)
>  * bugs
>  */
> public class XmlReaderBug {
> 
>     private static final int BYTES_PER_READ = 6;
> 
>     private static final String XML =
>         "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
>         "<He likes=\"rugs\" because=\"they really tie the room
together\"/>";
> 
>     public static void main(String[] args) throws Exception {
>         final InputStream xmlStream = new ByteArrayInputStream 
> (XML.getBytes(Charset.forName("UTF-8")));
>         final InputStream throttledXmlStream = new 
> ThrottledInputStream(xmlStream, BYTES_PER_READ);
> 
>         final XMLInputFactory xmlFactory =
XMLInputFactory.newInstance();
>         final XMLStreamReader xmlStreamReader = 
> xmlFactory.createXMLStreamReader(throttledXmlStream);
>         xmlStreamReader.next();
> 
>         // bugs or rugs?
>         System.out.println(xmlStreamReader.getAttributeValue(null,
"likes"));
>     }
> 
>     // An InputStream implementation that limits the number of bytes 
> read by read(byte[], int, int)
>     private static class ThrottledInputStream extends 
> FilterInputStream
{
>         private final int bytesPerRead;
> 
>         public ThrottledInputStream(InputStream stream, int
> bytesPerRead) throws Exception {
>             super(stream);
>             this.bytesPerRead = bytesPerRead;
>         }
> 
>         @Override
>         public int read(byte[] b, int off, int len) throws IOException {
>             if (off < 0 || len < 0 || len > b.length - off) {
>                 throw new IndexOutOfBoundsException();
>             } else if (len == 0) {
>                 return 0;
>             }
> 
>             // Limit bytes read
>             int bytesToRead = Math.min(bytesPerRead, len);
> 
>             // Ensure deterministic behavior (similar to
> org.apache.commons.io.IOUtils.read)
>             // Useless for this test case, but convenient for 
> consistently reproducing
>             // the bug with other stream implementations
>             int totalBytesRead = 0;
>             int bytesRead = 0;
>             do {
>                 bytesRead = Math.max(0, in.read(b, off + 
> totalBytesRead, bytesToRead));
>                 bytesToRead -= bytesRead;
>                 totalBytesRead += bytesRead;
>             } while (bytesRead > 0);
> 
>             // No more bytes
>             if (totalBytesRead == 0) {
>                 return -1;
>             }
> 
>             return totalBytesRead;
>         }
>     }
> }
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to