Thomas,

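Also check that you're not making one of the typical SAX usage errors.
I see at least one issue in your test program: the s.startsWith("abc")
call in characters(). The parser may deliver an element's text in several
characters() calls, so a check on a single chunk is not reliable. Take a
look at this FAQ [1].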
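
A minimal sketch of the usual pattern, assuming a namespace-aware JAXP
parser: append every characters() chunk to a buffer and only look at the
text in endElement(). The class name ItemTextCollector and the helper
collectItems() are made up for illustration; they are not part of Xerces
or of your test program.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the complete text of every <item> element.
public class ItemTextCollector extends DefaultHandler {
    private final List<String> items = new ArrayList<String>();
    private final StringBuilder text = new StringBuilder();
    private boolean inItem = false;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if ("item".equals(localName)) {
            inItem = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element; never inspect a
        // single chunk, just accumulate it.
        if (inItem) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("item".equals(localName)) {
            // Only here is the element's text known to be complete.
            items.add(text.toString());
            inItem = false;
        }
    }

    // Parses the given XML string and returns the text of each <item>.
    public static List<String> collectItems(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed for a usable localName
        SAXParser parser = factory.newSAXParser();
        ItemTextCollector handler = new ItemTextCollector();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.items;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.1\"?>"
                + "<ns:X xmlns:ns=\"http://www.mycompany.com/ns/X/1.0\">"
                + "<ns:item>abcdefghijklmnopqrstuvwxyz&#x19;</ns:item>"
                + "</ns:X>";
        for (String item : collectItems(xml)) {
            System.out.println(item.startsWith("abc") ? "OK" : "ERROR");
        }
    }
}
```

With this shape the startsWith("abc") test runs once per element on the
full text, so it no longer depends on where the parser happens to split
the character data.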

Thanks.

[1] http://xerces.apache.org/xerces2-j/faq-sax.html#faq-2

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org

Gary Gregory <ggreg...@seagullsoftware.com> wrote on 01/22/2010 01:57:08 PM:

> For Xerces 2.9.1, did you add Xerces to your runtime through the
> Java endorsed mechanism [1]?
>
> Gary
>
> [1] http://java.sun.com/j2se/1.4.2/docs/guide/standards/
>
>
> > -----Original Message-----
> > From: Thomas Schleu [mailto:tsch...@canto.com]
> > Sent: Friday, January 22, 2010 05:29
> > To: j-users@xerces.apache.org
> > Subject: Parser passes garbage to characters() callback for XML
> > containing character entities
> >
> > I can reproduce a problem parsing certain XML 1.1 files that contain
> > lots of character references (escaped control chars like "&#x19;").
> > At some point in the file the parser calls my characters() method with
> > garbage text.
> >
> > Here is the source code that generates such an XML file:
> >
> >     FileOutputStream fos = new FileOutputStream (new File ("C:/test.xml"));
> >     fos.write (("<?xml version=\"1.1\" encoding=\"UTF-8\"?>\n<!DOCTYPE X>\n"
> >         + "<ns:X xmlns:ns=\"http://www.mycompany.com/ns/X/1.0\">\n").getBytes ("UTF-8"));
> >     final byte[] bytes =
> >         ("<ns:item>abcdefghijklmnopqrstuvwxyz&#x19;</ns:item>\n").getBytes ("UTF-8");
> >     for (int i = 0; i < 314; i++)
> >     {
> >         fos.write(bytes);
> >     }
> >     fos.write ("</ns:X>".getBytes ("UTF-8"));
> >     fos.close ();
> >
> > The XML is very simple: it just contains lots of identical elements
> > with "&#x19;" in the body text.
> > The parsing code looks like the following:
> >
> >     FileInputStream fis = new FileInputStream (new File ("C:/test.xml"));
> >     final SAXParserFactory saxParserFactory = SAXParserFactory.newInstance ();
> >     saxParserFactory.setFeature ("http://xml.org/sax/features/namespaces", Boolean.TRUE);
> >     saxParserFactory.setFeature ("http://xml.org/sax/features/namespace-prefixes", Boolean.TRUE);
> >     final SAXParser parser = saxParserFactory.newSAXParser ();
> >     try
> >     {
> >         parser.parse (fis, new DefaultHandler()
> >         {
> >             StringBuilder sb = new StringBuilder ();
> >             String currentElement = null;
> >
> >             public void startElement (String uri, String localName,
> >                     String qName, Attributes attributes) throws SAXException
> >             {
> >                 currentElement = localName;
> >             }
> >             public void characters (char ch[], int start, int length) throws SAXException
> >             {
> >                 if ("item".equals (currentElement))
> >                 {
> >                     String s = new String (ch, start, length);
> >                     if (sb.length () == 0 && !s.startsWith ("abc"))
> >                     {
> >                         // THE PARSER CALLS ME WITH GARBAGE!
> >                         System.out.println ("ERROR");
> >                     }
> >                     sb.append (s);
> >                 }
> >             }
> >             public void endElement (String uri, String localName, String qName) throws SAXException
> >             {
> >                 if ("item".equals (localName))
> >                 {
> >                     sb.delete (0, sb.length ());
> >                     currentElement = null;
> >                 }
> >             }
> >         });
> >     }
> >     catch (Exception e)
> >     {
> >         e.printStackTrace ();
> >         System.out.println ("e = " + e);
> >     }
> >
> > My characters() method checks whether the body text is the expected
> > text starting with "abc".
> > After 156 elements with the correct body text, my method gets called
> > with the text "x19;<fghijklmnopqrstuvwxyz" as the starting body text
> > of the element.
> > The XML code has to exceed 16kB to show this problem. It may be related
> > to the 8kB internal buffer of the parser.
> >
> > I tested with the parser shipping with jdk1.5.0_19 and jdk1.6.0_14 as
> > well as with the separate xerces-2_9_1. All show the same behavior.
> > I cannot work around this as I don't have control over the XML input.
> >
> > Can anyone help me here?
> >
> > Thanks in advance,
> > Thomas Schleu
> > Chief Technology Officer
> >
> > Mail: mailto:tsch...@canto.com
> > Phone: +49-30-390 485 0
> > Fax:  +49-30-390 485 55
> >
> > Canto GmbH
> > Alt-Moabit 73
> > D-10555 Berlin
> > Germany
> > http://www.canto.com
> > Amtsgericht Berlin-Charlottenburg HRB 88566
> > Managing Director: Hans-Dieter Schädel
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: j-users-unsubscr...@xerces.apache.org
> > For additional commands, e-mail: j-users-h...@xerces.apache.org
