XML parsing problems

2005-12-25 Thread Chris Burdess
Hi

We discovered over IRC that there is a major problem with XML parsing using
the StAX driver, caused by a bug in BufferedInputStream. I'm therefore
reverting the default XML parser to aelfred2 until this is resolved.

The bug is in both gnu.xml.stream.XMLInputStreamReader and
java.io.BufferedInputStream - the former uses almost identical code to the
latter in order to provide mark/reset functionality.

As I understand it, the problem can occur when the position in the buffer is
near the end. If the mark is set at position 2047 in the buffer, then we read
2 bytes and reset, then refill() will have been called and the position is
actually reset to position 2047 in the new buffer, 2K further along in the
original stream.
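
For reference, the mark/reset contract at a buffer boundary can be checked
with a few lines of Java. This is only a sketch: the class name
MarkResetBoundary and its helper methods are made up for illustration, and
the 2048-byte buffer and the mark at offset 2047 are taken from the
description above. A stream that honours the contract continues from the
marked offset after reset(), not from a point 2K further along.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical test helper, not part of GNU Classpath.
class MarkResetBoundary {
    /** Builds test data in which nearby 2K-apart offsets hold different values. */
    static byte[] sampleData(int length) {
        byte[] data = new byte[length];
        for (int i = 0; i < length; i++) {
            data[i] = (byte) (i % 251);
        }
        return data;
    }

    /**
     * Reads up to markPos, marks, reads 2 bytes, resets, and then checks
     * that the next 4 bytes are the ones that start at markPos in data.
     */
    static boolean resetRestoresPosition(InputStream in, byte[] data, int markPos)
            throws IOException {
        for (int i = 0; i < markPos; i++) {
            in.read();
        }
        in.mark(16);   // readAheadLimit comfortably larger than the 2 bytes read
        in.read();
        in.read();     // this read crosses the 2K boundary and forces a refill
        in.reset();
        for (int i = 0; i < 4; i++) {
            if (in.read() != (data[markPos + i] & 0xff)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = sampleData(3 * 2048);
        InputStream in =
            new BufferedInputStream(new ByteArrayInputStream(data), 2048);
        System.out.println("reset restored position: "
            + resetRestoresPosition(in, data, 2047));
    }
}
```

The same helper can be pointed at any InputStream that claims to support
mark/reset, which makes it easy to compare implementations.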

As the StAX parser relies heavily on mark/reset behaviour to function
correctly, it will not reliably parse entities greater than 2K in size
(whether it fails depends on which structures fall on the 2K boundaries).

If anyone has a robust solution to this problem please apply it; I will try
to address it but may not have much free time before the new year/release.
-- 
Chris Burdess
  "They that can give up essential liberty to obtain a little safety
  deserve neither liberty nor safety." - Benjamin Franklin


___
Classpath mailing list
Classpath@gnu.org
http://lists.gnu.org/mailman/listinfo/classpath


Re: XML parsing problems

2005-12-25 Thread Per Bothner

Chris Burdess wrote:

> As I understand it, the problem can occur when the position in the buffer is
> near the end. If the mark is set at position 2047 in the buffer, then we read
> 2 bytes and reset, then refill() will have been called and the position is
> actually reset to position 2047 in the new buffer, 2K further along in the
> original stream.
> ...
> If anyone has a robust solution to this problem please apply it; I will try
> to address it but may not have much free time before the new year/release.


Three choices, assuming the mark is active and at position M
in a buffer of size S:
(1) Move S-M bytes to the start of the buffer and read up to M bytes.
    Advantage: doesn't need to grow the buffer.
    Disadvantage: we lose buffer alignment, which probably doesn't matter
    except with a highly tuned (an NIO-based?) implementation.
    Also, can fail if the readAheadLimit > S.
(2) Grow the buffer to S+(S-M).
    Disadvantage: Same alignment issue as (1).
(3) Use two buffers, an old one and a new one.  The old buffer doesn't
    need to be full-size, as long as it can handle the readAheadLimit
    (in the general case) or at least S-M in this case.
    Advantage: Preserves alignment.
    Disadvantage: More complex.  After a reset, a read(byte[], int, int)
    will only get S-M bytes.

Assuming buffer alignment isn't an issue, then I'd say use (1) where
possible and fall back to (2) when needed.
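
A minimal sketch of that combination, for illustration only. The class
MarkPreservingStream and its field names are hypothetical and this is not
the code that went into Classpath. It slides the marked bytes to the start
of the buffer where it can (choice 1) and grows the buffer only when the
mark is already at the start and the buffer is full (choice 2). As a
simplification, it grows by doubling, capped at the readAheadLimit, rather
than to exactly S+(S-M).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical illustration of "slide if possible, grow if needed".
class MarkPreservingStream extends InputStream {
    private final InputStream in;
    private byte[] buf;
    private int pos;            // index of the next byte to hand out
    private int count;          // number of valid bytes in buf
    private int markPos = -1;   // index of the mark in buf, or -1 if none
    private int markLimit;      // readAheadLimit passed to mark()

    MarkPreservingStream(InputStream in, int size) {
        this.in = in;
        this.buf = new byte[size];
    }

    @Override public boolean markSupported() { return true; }

    @Override public void mark(int readAheadLimit) {
        markLimit = readAheadLimit;
        markPos = pos;
    }

    @Override public void reset() throws IOException {
        if (markPos < 0) {
            throw new IOException("mark not set or invalidated");
        }
        pos = markPos;
    }

    @Override public int read() throws IOException {
        if (pos >= count && !fill()) {
            return -1;
        }
        return buf[pos++] & 0xff;
    }

    /** Refills the buffer without losing bytes between the mark and pos. */
    private boolean fill() throws IOException {
        if (markPos < 0) {
            pos = 0;                        // no mark: reuse the whole buffer
            count = 0;
        } else if (pos - markPos >= markLimit) {
            markPos = -1;                   // read past the limit: drop the mark
            pos = 0;
            count = 0;
        } else if (markPos > 0) {
            int kept = pos - markPos;       // choice (1): slide marked bytes down
            System.arraycopy(buf, markPos, buf, 0, kept);
            pos = kept;
            count = kept;
            markPos = 0;
        } else if (pos >= buf.length) {
            // choice (2): mark is at 0 and the buffer is full, so grow it
            buf = Arrays.copyOf(buf, Math.min(buf.length * 2, markLimit));
        }
        int n = in.read(buf, pos, buf.length - pos);
        if (n <= 0) {
            return false;
        }
        count = pos + n;
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[3 * 2048];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i % 251);
        InputStream s = new MarkPreservingStream(new ByteArrayInputStream(data), 2048);
        for (int i = 0; i < 2047; i++) s.read();
        s.mark(16);
        s.read();
        s.read();
        s.reset();
        System.out.println(s.read() == (data[2047] & 0xff));
    }
}
```

Only the single-byte read() is overridden here; the inherited
read(byte[], int, int) falls back to it, which keeps the sketch short at
the cost of speed.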
--
--Per Bothner
[EMAIL PROTECTED]   http://per.bothner.com/




Re: XML parsing problems

2005-12-27 Thread Chris Burdess
Chris Burdess wrote:
> We discovered over IRC that there is a major problem with XML parsing using
> the StAX driver, caused by a bug in BufferedInputStream. I'm therefore
> reverting the default XML parser to aelfred2 until this is resolved.

Further investigation revealed that the problem was more to do with the fact
that InputStreamReader buffers bytes (Sun's implementation does the same).
I have rewritten the XML parser's streaming mechanism to deal with this.
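
The read-ahead is easy to observe. The sketch below is an illustration
only (the class ReaderReadAhead is a made-up name): it decodes a single
character through java.io.InputStreamReader and then asks the underlying
stream how many bytes are left. The exact number depends on the
implementation, but if the reader buffers bytes it will be well below
totalBytes - 1, which means a parser cannot hand the raw stream to another
decoder afterwards and expect it to continue from the next byte.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demonstration class, not part of GNU Classpath.
class ReaderReadAhead {
    /** Returns the bytes still unread in the raw stream after decoding one char. */
    static int bytesLeftAfterOneChar(int totalBytes) throws IOException {
        byte[] raw = new byte[totalBytes];
        Arrays.fill(raw, (byte) 'x');   // plain ASCII, one byte per character
        ByteArrayInputStream bytes = new ByteArrayInputStream(raw);
        Reader reader = new InputStreamReader(bytes, StandardCharsets.US_ASCII);
        reader.read();                  // ask for exactly one character
        return bytes.available();       // what the reader left behind
    }

    public static void main(String[] args) throws IOException {
        System.out.println("bytes left after one char: "
            + bytesLeftAfterOneChar(100));
    }
}
```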

Note that the behaviour of BufferedInputStream is correct - but that of
BufferedReader is not. I don't yet know why, so I have added a new
gnu.xml.stream.BufferedReader class which mimics the behaviour of
BufferedInputStream.

Please test the new XML parser on as many weird and wonderful XML sources
as you can, and report any problems to me either by mail or Bugzilla - I will
try to deal with them before release, or we can revert to aelfred2 again if
there are other showstoppers.
-- 
Chris Burdess
  "They that can give up essential liberty to obtain a little safety
  deserve neither liberty nor safety." - Benjamin Franklin




Re: XML parsing problems

2005-12-27 Thread Tom Tromey
> "Chris" == Chris Burdess <[EMAIL PROTECTED]> writes:

Chris> Note that the behaviour of BufferedInputStream is correct - but
Chris> that of BufferedReader is not.

Please file a PR for this.

Tom




Re: XML parsing problems

2005-12-31 Thread Mark Wielaard
Hi Chris,

On Tue, 2005-12-27 at 20:03 +, Chris Burdess wrote:
> Please test the new XML parser on as many weird and wonderful XML
> sources as you can, and report any problems to me either by mail or
> Bugzilla - I will try to deal with them before release, or we can
> revert to aelfred2 again if there are other showstoppers.

Nice work! You are running some test suites from time to time on the XML
parsers. If these are free then I would like to add them to
builder.classpath.org so regressions (or better conformance results) are
automatically tracked. How much space do they need and is it difficult
to set up?

Cheers,

Mark

-- 
Escape the Java Trap with GNU Classpath!
http://www.gnu.org/philosophy/java-trap.html

Join the community at http://planet.classpath.org/




Re: XML parsing problems

2005-12-31 Thread Chris Burdess
Mark Wielaard wrote:
> You are running some test-suites from time to time on the xml
> parsers. If these are free then I would like to add them to
> builder.classpath.org so regressions (or better conformance results) are
> automatically tracked. How much space do they need and is it difficult
> to setup?

The DOM conformance tests take up about 15MB, the SAX tests around 640MB
(less if you want to test fewer parsers). It shouldn't be too difficult to
set up as I have harness scripts already.

There are also some XSLT/XPath conformance tests from OASIS; these take up
about 126MB.
-- 
Chris Burdess
  "They that can give up essential liberty to obtain a little safety
  deserve neither liberty nor safety." - Benjamin Franklin




Re: XML parsing problems

2006-01-01 Thread Dalibor Topic
On Sat, Dec 31, 2005 at 04:04:47PM +, Chris Burdess wrote:
> Mark Wielaard wrote:
> > You are running some test-suites from time to time on the xml
> > parsers. If these are free then I would like to add them to
> > builder.classpath.org so regressions (or better conformance results) are
> > automatically tracked. How much space do they need and is it difficult
> > to setup?
> 
> The DOM conformance tests take up about 15MB, the SAX tests around 640MB
> (less if you want to test fewer parsers). It shouldn't be too difficult to
> set up as I have harness scripts already.
> 
> There are also some XSLT/XPath conformance tests from OASIS, these take up
> about 126MB.

Could we put it into Mauve?

cheers,
dalibor topic
