XML parsing problems
Hi,

We discovered over IRC that there is a major problem with XML parsing using the StAX driver, caused by a bug in BufferedInputStream. I'm therefore reverting the default XML parser to aelfred2 until this is resolved.

The bug is in both gnu.xml.stream.XMLInputStreamReader and java.io.BufferedInputStream - the former uses almost identical code to the latter in order to provide mark/reset functionality.

As I understand it, the problem can occur when the position in the buffer is near the end. If the mark is set at position 2047 in the buffer, and we then read 2 bytes and reset, refill() will have been called and the position is actually reset to position 2047 in the new buffer, 2K further along in the original stream.

As the StAX parser relies heavily on mark/reset behaviour to function correctly, it will not reliably parse entities greater than 2K in size (it depends on what structures fall at the 2K boundaries).

If anyone has a robust solution to this problem please apply it; I will try to address it but may not have much free time before the new year/release.

-- 
Chris Burdess
"They that can give up essential liberty to obtain a little safety deserve neither liberty nor safety." - Benjamin Franklin

___
Classpath mailing list
Classpath@gnu.org
http://lists.gnu.org/mailman/listinfo/classpath
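The scenario described in this message can be written down as a small check. The following is only a sketch: the class and method names are hypothetical, the 2048-byte buffer size is taken from the "2K" figure above, and it exercises whichever java.io.BufferedInputStream the runtime provides. A stream that honours the mark/reset contract returns the byte at offset 2047 after the reset; the failure described above would return the byte from 2K further along.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of GNU Classpath.
final class MarkResetBoundaryCheck {
    static final int BUFFER_SIZE = 2048; // assumption: the "2K" buffer from the report

    // Byte i holds i % 251, so offsets 2047 and 4095 carry different values.
    static byte[] testData(int length) {
        byte[] data = new byte[length];
        for (int i = 0; i < length; i++) {
            data[i] = (byte) (i % 251);
        }
        return data;
    }

    // Skips to markOffset, marks, reads two bytes, resets,
    // and returns the next byte the stream hands out.
    static int byteAfterReset(InputStream in, int markOffset) {
        try {
            for (int i = 0; i < markOffset; i++) {
                in.read();
            }
            in.mark(16);
            in.read(); // last byte of the first buffer
            in.read(); // forces the buffer to be refilled
            in.reset();
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(testData(4 * BUFFER_SIZE)), BUFFER_SIZE);
        int expected = (BUFFER_SIZE - 1) % 251;
        int got = byteAfterReset(in, BUFFER_SIZE - 1);
        System.out.println("expected " + expected + ", got " + got);
    }
}
```

The same helper can be pointed at any other InputStream that claims mark support, which makes it usable as a regression check for a custom reader.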
Re: XML parsing problems
Chris Burdess wrote:
> As I understand it, the problem can occur when the position in the buffer is near the end. If the mark is set at position 2047 in the buffer, then we read 2 bytes and reset, then refill() will have been called and the position is actually reset to position 2047 in the new buffer, 2K further along in the original stream.
> ...
> If anyone has a robust solution to this problem please apply it; I will try to address it but may not have much free time before the new year/release.

Three choices, assuming the mark is active and at position M in a buffer of size S:

(1) Move S-M bytes to the start of the buffer and read up to M bytes.
    Advantage: doesn't need to grow the buffer.
    Disadvantage: we lose buffer alignment, which probably doesn't matter except with a highly tuned (NIO-based?) implementation. Also, it can fail if the readAheadLimit > S.

(2) Grow the buffer to S+(S-M).
    Disadvantage: same alignment issue as (1).

(3) Use two buffers, an old one and a new one. The old buffer doesn't need to be full-size, as long as it can handle the readAheadLimit (in the general case) or at least S-M in this case.
    Advantage: preserves alignment.
    Disadvantage: more complex. After a reset, a read(byte[],int, int) will only get S-M bytes.

Assuming buffer alignment isn't an issue, I'd say use a combination of (1) if possible and (2) if needed.

-- 
	--Per Bothner
[EMAIL PROTECTED]   http://per.bothner.com/
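The combination recommended above, (1) where possible and (2) where needed, might look roughly like the sketch below. It is an illustration under stated assumptions, not the GNU Classpath code: the class name and its methods are hypothetical, readAheadLimit is ignored, and the buffer is doubled when it has to grow rather than sized to exactly S+(S-M).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical sketch of choices (1) and (2); not the actual library class.
final class CompactingMarkStream {
    private final InputStream in;
    private byte[] buf;
    private int pos;       // index of the next byte to hand out
    private int count;     // number of valid bytes in buf
    private int mark = -1; // index of the marked byte, or -1 if no mark is set

    CompactingMarkStream(InputStream in, int size) {
        this.in = in;
        this.buf = new byte[size];
    }

    void mark() {
        mark = pos;
    }

    void reset() {
        if (mark < 0) {
            throw new IllegalStateException("no mark set");
        }
        pos = mark;
    }

    int read() {
        if (pos == count && !refill()) {
            return -1;
        }
        return buf[pos++] & 0xff;
    }

    private boolean refill() {
        if (mark < 0) {
            pos = count = 0; // no mark: the old contents can be discarded
        } else if (mark > 0) {
            // Choice (1): move the bytes from the mark onwards to the start.
            int keep = count - mark;
            System.arraycopy(buf, mark, buf, 0, keep);
            pos -= mark;
            count = keep;
            mark = 0;
        } else if (count == buf.length) {
            // Choice (2): the mark is already at 0 and the buffer is full, so grow.
            // Assumption: doubling, with no readAheadLimit to cap the growth.
            buf = Arrays.copyOf(buf, buf.length * 2);
        }
        try {
            int n = in.read(buf, count, buf.length - count);
            if (n <= 0) {
                return false;
            }
            count += n;
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo: byte i of the input holds the value i, so the result should equal skip.
    static int byteAfterReset(int skip, int ahead) {
        byte[] data = new byte[64];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        CompactingMarkStream s = new CompactingMarkStream(new ByteArrayInputStream(data), 8);
        for (int i = 0; i < skip; i++) {
            s.read();
        }
        s.mark();
        for (int i = 0; i < ahead; i++) {
            s.read();
        }
        s.reset();
        return s.read();
    }
}
```

Because the mark index moves together with the data whenever the buffer is compacted, a reset always lands on the marked byte, even when the refill happened between mark and reset. A real implementation would also drop the mark once readAheadLimit is exceeded, so that the buffer cannot grow without bound.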
Re: XML parsing problems
Chris Burdess wrote:
> We discovered over IRC that there is a major problem with XML parsing using
> the StAX driver, caused by a bug in BufferedInputStream. I'm therefore
> reverting the default XML parser to aelfred2 until this is resolved.

Further investigation revealed that the problem had more to do with the fact that InputStreamReader buffers bytes (Sun's implementation does the same). I have rewritten the XML parser's streaming mechanism to deal with this.

Note that the behaviour of BufferedInputStream is correct - but that of BufferedReader is not. I don't yet know why, so I have added a new gnu.xml.stream.BufferedReader class which mimics the behaviour of BufferedInputStream.

Please test the new XML parser on as many weird and wonderful XML sources as you can, and report any problems to me either by mail or Bugzilla - I will try to deal with them before release, or we can revert to aelfred2 again if there are other showstoppers.

-- 
Chris Burdess
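The byte buffering mentioned above is easy to observe. In this sketch (hypothetical class and method names, ASCII input assumed), a single character is read through an InputStreamReader and the number of bytes taken from the underlying stream is counted. On implementations that buffer, the count is larger than one, which is why a position in the byte stream cannot be inferred from the number of characters read so far.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class ReaderReadAhead {
    // Reads one character and reports how many bytes left the underlying stream.
    static int bytesConsumedByOneChar(int totalBytes) {
        byte[] data = new byte[totalBytes];
        Arrays.fill(data, (byte) 'x');
        ByteArrayInputStream bytes = new ByteArrayInputStream(data);
        try {
            Reader reader = new InputStreamReader(bytes, StandardCharsets.US_ASCII);
            reader.read(); // ask for a single character
            return totalBytes - bytes.available();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes consumed for one char: " + bytesConsumedByOneChar(100));
    }
}
```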
Re: XML parsing problems
> "Chris" == Chris Burdess <[EMAIL PROTECTED]> writes:

Chris> Note that the behaviour of BufferedInputStream is correct - but
Chris> that of BufferedReader is not.

Please file a PR for this.

Tom
Re: XML parsing problems
Hi Chris,

On Tue, 2005-12-27 at 20:03 +, Chris Burdess wrote:
> Please test the new XML parser on as many weird and wonderful XML
> sources as you can, and report any problems to me either by mail or
> Bugzilla - I will try to deal with them before release, or we can
> revert to aelfred2 again if there are other showstoppers.

Nice work!

You run some test suites on the XML parsers from time to time. If these are free then I would like to add them to builder.classpath.org so that regressions (or better conformance results) are tracked automatically. How much space do they need, and are they difficult to set up?

Cheers,

Mark

-- 
Escape the Java Trap with GNU Classpath!
http://www.gnu.org/philosophy/java-trap.html

Join the community at http://planet.classpath.org/
Re: XML parsing problems
Mark Wielaard wrote:
> You are running some test-suites from time to time on the xml
> parsers. If these are free then I would like to add them to
> builder.classpath.org so regressions (or better conformance results) are
> automatically tracked. How much space do they need and is it difficult
> to setup?

The DOM conformance tests take up about 15MB, the SAX tests around 640MB (less if you want to test fewer parsers). They shouldn't be too difficult to set up, as I already have harness scripts.

There are also some XSLT/XPath conformance tests from OASIS; these take up about 126MB.

-- 
Chris Burdess
Re: XML parsing problems
On Sat, Dec 31, 2005 at 04:04:47PM +, Chris Burdess wrote:
> Mark Wielaard wrote:
> > You are running some test-suites from time to time on the xml
> > parsers. If these are free then I would like to add them to
> > builder.classpath.org so regressions (or better conformance results) are
> > automatically tracked. How much space do they need and is it difficult
> > to setup?
>
> The DOM conformance tests take up about 15MB, the SAX tests around 640MB
> (less if you want to test fewer parsers). It shouldn't be too difficult to
> set up as I have harness scripts already.
>
> There are also some XSLT/XPath conformance tests from OASIS, these take up
> about 126MB.

Could we put it into Mauve?

cheers,
dalibor topic