Thomas,

Also check that you're not falling into the typical usage errors with SAX. I see at least one issue in your test program (i.e. the s.startsWith("abc") call in characters()). Take a look at this FAQ [1].
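As an illustration of the kind of usage issue meant here: SAX allows a parser to deliver the text of one element in several characters() calls, so a handler generally should only accumulate text there and inspect the complete value in endElement(). The following is a minimal sketch under that assumption, not the test program from this thread. The names ItemTextCollector and parse() are hypothetical, and it turns on namespace awareness through setNamespaceAware(true) so that localName is populated.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the complete text of every <item> element.
class ItemTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<String> items = new ArrayList<String>();
    private boolean inItem = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("item".equals(localName)) {
            inItem = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per element: only accumulate here,
        // never assume a single call carries the whole text.
        if (inItem) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("item".equals(localName)) {
            items.add(text.toString()); // the text is complete only at this point
            inItem = false;
        }
    }

    List<String> items() {
        return items;
    }

    // Convenience helper (hypothetical): parse an XML string and return all item texts.
    static List<String> parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        ItemTextCollector handler = new ItemTextCollector();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.items();
    }
}
```

Any check such as startsWith("abc") would then be applied to the string collected in endElement() rather than to an individual chunk.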
Thanks.

[1] http://xerces.apache.org/xerces2-j/faq-sax.html#faq-2

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org

Gary Gregory <ggreg...@seagullsoftware.com> wrote on 01/22/2010 01:57:08 PM:

> For Xerces 2.9.1, did you add Xerces to your runtime through the
> Java endorsed mechanism [1]?
>
> Gary
>
> [1] http://java.sun.com/j2se/1.4.2/docs/guide/standards/
>
> > -----Original Message-----
> > From: Thomas Schleu [mailto:tsch...@canto.com]
> > Sent: Friday, January 22, 2010 05:29
> > To: j-users@xerces.apache.org
> > Subject: Parser passes garbage to characters() callback for XML
> > containing character entities
> >
> > I can reproduce a problem parsing certain XML 1.1 files that contain lots of
> > character entities (escaped control chars like "").
> > At some point in the file the parser calls my characters() method with
> > garbage text.
> >
> > Here is the source code that generates such an XML file:
> >
> > FileOutputStream fos = new FileOutputStream (new File ("C:/test.xml"));
> > fos.write ("<?xml version=\"1.1\" encoding=\"UTF-8\"?>\n<!DOCTYPE X>\n<ns:X xmlns:ns=\"http://www.mycompany.com/ns/X/1.0\">\n".getBytes ("UTF-8"));
> > final byte[] bytes =
> >     ("<ns:item>abcdefghijklmnopqrstuvwxyz</ns:item>\n").getBytes ("UTF-8");
> > for (int i = 0; i < 314; i++)
> > {
> >     fos.write(bytes);
> > }
> > fos.write ("</ns:X>".getBytes ("UTF-8"));
> > fos.close ();
> >
> > The XML is very simple, it just contains lots of identical elements with
> > "" in the body text.
> > The parsing code looks like the following:
> >
> > FileInputStream fis = new FileInputStream (new File ("C:/test.xml"));
> > final SAXParserFactory saxParserFactory = SAXParserFactory.newInstance ();
> > saxParserFactory.setFeature ("http://xml.org/sax/features/namespaces", Boolean.TRUE);
> > saxParserFactory.setFeature ("http://xml.org/sax/features/namespace-prefixes", Boolean.TRUE);
> > final SAXParser parser = saxParserFactory.newSAXParser ();
> > try
> > {
> >     parser.parse (fis, new DefaultHandler()
> >     {
> >         StringBuilder sb = new StringBuilder ();
> >         String currentElement = null;
> >
> >         public void startElement (String uri, String localName, String qName, Attributes attributes) throws SAXException
> >         {
> >             currentElement = localName;
> >         }
> >         public void characters (char ch[], int start, int length) throws SAXException
> >         {
> >             if ("item".equals (currentElement))
> >             {
> >                 String s = new String (ch, start, length);
> >                 if (sb.length () == 0 && !s.startsWith ("abc"))
> >                 {
> >                     // THE PARSER CALLS ME WITH GARBAGE!
> >                     System.out.println ("ERROR");
> >                 }
> >                 sb.append (s);
> >             }
> >         }
> >         public void endElement (String uri, String localName, String qName) throws SAXException
> >         {
> >             if ("item".equals (localName))
> >             {
> >                 sb.delete (0, sb.length ());
> >                 currentElement = null;
> >             }
> >         }
> >     });
> > }
> > catch (Exception e)
> > {
> >     e.printStackTrace ();
> >     System.out.println ("e = " + e);
> > }
> >
> > My characters() method checks whether the body text is the expected text
> > starting with "abc".
> > After 156 elements with the correct body text my method gets called with the
> > text "x19;<fghijklmnopqrstuvwxyz" as the starting body text of the element.
> > The XML code has to exceed 16kB to show this problem. It may be related to
> > the 8kB internal buffer of the parser.
> >
> > I tested with the parser shipping with jdk1.5.0_19, and jdk1.6.0_14 as well
> > as with the separate xerces-2_9_1. All show the same behavior.
> > I cannot work around this as I don't have control over the XML input.
> >
> > Anyone who can help me here?
> >
> > Thanks in Advance
> > Thomas Schleu
> > Chief Technology Officer
> >
> > Mail: mailto:tsch...@canto.com
> > Fon: +49-30-390 485 0
> > Fax: +49-30-390 485 55
> >
> > Canto GmbH
> > Alt-Moabit 73
> > D-10555 Berlin
> > Germany
> > http://www.canto.com
> > Amtsgericht Berlin-Charlottenburg HRB 88566
> > Geschäftsführer: Hans-Dieter Schädel
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: j-users-unsubscr...@xerces.apache.org
> > For additional commands, e-mail: j-users-h...@xerces.apache.org
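
On the endorsed-mechanism question quoted above, it can help to confirm which SAX implementation the JVM actually loads before comparing results between the JDK parser and a separate Xerces release. The following is a small sketch for that check; ParserInfo and describe() are hypothetical names. It reports the implementation class of the parser created through JAXP and, where available, the location it was loaded from. The parser bundled with the JDK normally reports a class under com.sun.org.apache.xerces.internal, while a standalone Xerces jar reports one under org.apache.xerces.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper: reports which SAXParser implementation JAXP selects.
class ParserInfo {
    static String describe() throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Class<?> impl = parser.getClass();

        // The code source can be null for classes loaded by the bootstrap loader.
        CodeSource source = impl.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "(no code source, likely bundled with the JDK)"
                : source.getLocation().toString();

        return impl.getName() + " from " + location;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe());
    }
}
```

If the class name does not change after adding the Xerces jars, the runtime is still using the bundled parser.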