RE: Parser passes garbage to characters() callback for XML containing character entities
Thomas,

"Thomas Schleu" wrote on 01/27/2010 07:47:09 AM:

> Michael,
>
> I know that the body text comes in pieces. That's why I check that the
> accumulated text buffer (sb) is empty when looking at the start of the
> characters.

The code you posted is assuming that the beginning of the first chunk will start with "abc". There is no such guarantee. The text can be split anywhere and when I ran your program I observed that for one of the elements "abc" crosses a buffer boundary so on the first callback you only get the first two characters: "ab". Your code needs to account for this. I see no issue with Xerces.

> I also only check when I am inside the "item" element.
> The XML is very simple. It just repeats the same element over and over
> again.
> As I mentioned before the error comes when the XML total size exceeds 16kB
> and occurs when parsing the XML element that is behind the first 8kB.
> I looked at the parser source shortly and noticed that it uses an internal
> buffer of 8kB. That's why I assume the problem occurs when re-filling the
> buffer while in the middle of or after processing a character entity
> "".

I'm not sure what source you're looking at. Xerces' default buffer size is 2 KB. It's been that size for a long time. Are you sure you're actually using Apache Xerces and not some derivative like what Sun ships in their JDK?

> Once I removed all those character entities the parser worked as expected.
>
> Any help you can give?
> Thomas Schleu
> Chief Technology Officer
>
> Mail: mailto:tsch...@canto.com
> Fon: +49-30-390 485 0
> Fax: +49-30-390 485 55
>
> Canto GmbH
> Alt-Moabit 73
> D-10555 Berlin
> Germany
> http://www.canto.com
> Amtsgericht Berlin-Charlottenburg HRB 88566
> Geschäftsführer: Hans-Dieter Schädel

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org
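The point above, that characters() may deliver an element's text in arbitrary pieces, is usually handled by only appending in characters() and inspecting the text in endElement(). The following is a minimal sketch of that pattern, not the code from the original test program. The class name ItemTextCollector and the method collectItems are made up for illustration, and it assumes a default (non-namespace-aware) JAXP factory, so it matches on qName rather than localName.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the complete text of every <item> element.
class ItemTextCollector extends DefaultHandler {
    private final StringBuilder sb = new StringBuilder();
    private final List<String> items = new ArrayList<>();
    private boolean inItem = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("item".equals(qName)) {
            inItem = true;
            sb.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only accumulate here. A single chunk may be any fragment of the text,
        // so no assumption is made about how a chunk begins or ends.
        if (inItem) {
            sb.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("item".equals(qName)) {
            // The text is complete only at this point, so checks belong here.
            items.add(sb.toString());
            inItem = false;
        }
    }

    // Parses the given XML string and returns the full text of each <item>.
    static List<String> collectItems(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        ItemTextCollector handler = new ItemTextCollector();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.items;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<X><item>abc&#x41;def</item><item>abcxyz</item></X>";
        for (String item : collectItems(xml)) {
            System.out.println((item.startsWith("abc") ? "OK " : "ERROR ") + item);
        }
    }
}
```

With this structure, the startsWith("abc") check runs against the whole element text, so it no longer depends on where the parser happens to split the character data.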
RE: Parser passes garbage to characters() callback for XML containing character entities
Michael,

I know that the body text comes in pieces. That's why I check that the accumulated text buffer (sb) is empty when looking at the start of the characters.
I also only check when I am inside the "item" element.
The XML is very simple. It just repeats the same element over and over again.
As I mentioned before the error comes when the XML total size exceeds 16kB and occurs when parsing the XML element that is behind the first 8kB.
I looked at the parser source shortly and noticed that it uses an internal buffer of 8kB. That's why I assume the problem occurs when re-filling the buffer while in the middle of or after processing a character entity "".
Once I removed all those character entities the parser worked as expected.

Any help you can give?

Thomas Schleu
Chief Technology Officer

Mail: mailto:tsch...@canto.com
Fon: +49-30-390 485 0
Fax: +49-30-390 485 55

Canto GmbH
Alt-Moabit 73
D-10555 Berlin
Germany
http://www.canto.com
Amtsgericht Berlin-Charlottenburg HRB 88566
Geschäftsführer: Hans-Dieter Schädel

> -----Original Message-----
> From: Gary Gregory [mailto:ggreg...@seagullsoftware.com]
> Sent: Friday, 22 January 2010 19:57
> To: j-users@xerces.apache.org; tsch...@canto.com
> Subject: RE: Parser passes garbage to characters() callback for XML
> containing character entities
>
> For Xerces 2.9.1, did you add Xerces to your runtime through the Java
> endorsed mechanism [1]?
>
> Gary
>
> [1] http://java.sun.com/j2se/1.4.2/docs/guide/standards/
>
> > -----Original Message-----
> > From: Thomas Schleu [mailto:tsch...@canto.com]
> > Sent: Friday, January 22, 2010 05:29
> > To: j-users@xerces.apache.org
> > Subject: Parser passes garbage to characters() callback for XML
> > containing character entities
> >
> > I can reproduce a problem parsing certain XML 1.1 files that contain
> > lots of character entities (escaped control chars like "").
> > At some point in the file the parser calls my characters() method with
> > garbage text.
RE: Parser passes garbage to characters() callback for XML containing character entities
Thomas,

Also check that you're not falling into the typical usage errors with SAX. I see at least one issue in your test program (i.e. the s.startsWith("abc") call in characters()). Take a look at this FAQ [1].

Thanks.

[1] http://xerces.apache.org/xerces2-j/faq-sax.html#faq-2

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org

Gary Gregory wrote on 01/22/2010 01:57:08 PM:

> For Xerces 2.9.1, did you add Xerces to your runtime through the
> Java endorsed mechanism [1]?
>
> Gary
>
> [1] http://java.sun.com/j2se/1.4.2/docs/guide/standards/
>
> > -----Original Message-----
> > From: Thomas Schleu [mailto:tsch...@canto.com]
> > Sent: Friday, January 22, 2010 05:29
> > To: j-users@xerces.apache.org
> > Subject: Parser passes garbage to characters() callback for XML
> > containing character entities
> >
> > I can reproduce a problem parsing certain XML 1.1 files that contain
> > lots of character entities (escaped control chars like "").
> > At some point in the file the parser calls my characters() method with
> > garbage text.
> >
> > Here is the source code that generates such an XML file:
> >
> >     FileOutputStream fos = new FileOutputStream (new File ("C:/test.xml"));
> >     fos.write ("\n X>\nhttp://www.mycompany.com/ns/X/1.0\">\n".getBytes ("UTF-8"));
> >     final byte[] bytes = ("abcdefghijklmnopqrstuvwxyz\n").getBytes ("UTF-8");
> >     for (int i = 0; i < 314; i++)
> >     {
> >         fos.write(bytes);
> >     }
> >     fos.write ("".getBytes ("UTF-8"));
> >     fos.close ();
> >
> > The XML is very simple, it just contains lots of identical elements
> > with "" in the body text.
RE: Parser passes garbage to characters() callback for XML containing character entities
For Xerces 2.9.1, did you add Xerces to your runtime through the Java endorsed mechanism [1]?

Gary

[1] http://java.sun.com/j2se/1.4.2/docs/guide/standards/

> -----Original Message-----
> From: Thomas Schleu [mailto:tsch...@canto.com]
> Sent: Friday, January 22, 2010 05:29
> To: j-users@xerces.apache.org
> Subject: Parser passes garbage to characters() callback for XML
> containing character entities
>
> I can reproduce a problem parsing certain XML 1.1 files that contain
> lots of character entities (escaped control chars like "").
> At some point in the file the parser calls my characters() method with
> garbage text.
>
> Here is the source code that generates such an XML file:
>
>     FileOutputStream fos = new FileOutputStream (new File ("C:/test.xml"));
>     fos.write ("\n X>\nhttp://www.mycompany.com/ns/X/1.0\">\n".getBytes ("UTF-8"));
>     final byte[] bytes = ("abcdefghijklmnopqrstuvwxyz\n").getBytes ("UTF-8");
>     for (int i = 0; i < 314; i++)
>     {
>         fos.write(bytes);
>     }
>     fos.write ("".getBytes ("UTF-8"));
>     fos.close ();
>
> The XML is very simple, it just contains lots of identical elements
> with "" in the body text.
> The parsing code looks like the following:
>
>     FileInputStream fis = new FileInputStream (new File ("C:/test.xml"));
>     final SAXParserFactory saxParserFactory = SAXParserFactory.newInstance ();
>     saxParserFactory.setFeature ("http://xml.org/sax/features/namespaces", Boolean.TRUE);
>     saxParserFactory.setFeature ("http://xml.org/sax/features/namespace-prefixes", Boolean.TRUE);
>     final SAXParser parser = saxParserFactory.newSAXParser ();
>     try
>     {
>         parser.parse (fis, new DefaultHandler()
>         {
>             StringBuilder sb = new StringBuilder ();
>             String currentElement = null;
>
>             public void startElement (String uri, String localName, String qName, Attributes attributes) throws SAXException
>             {
>                 currentElement = localName;
>             }
>             public void characters (char ch[], int start, int length) throws SAXException
>             {
>                 if ("item".equals (currentElement))
>                 {
>                     String s = new String (ch, start, length);
>                     if (sb.length () == 0 && !s.startsWith ("abc"))
>                     {
>                         // THE PARSER CALLS ME WITH GARBAGE!
>                         System.out.println ("ERROR");
>                     }
>                     sb.append (s);
>                 }
>             }
>             public void endElement (String uri, String localName, String qName) throws SAXException
>             {
>                 if ("item".equals (localName))
>                 {
>                     sb.delete (0, sb.length ());
>                     currentElement = null;
>                 }
>             }
>         });
>     }
>     catch (Exception e)
>     {
>         e.printStackTrace ();
>         System.out.println ("e = " + e);
>     }
>
> My characters() method checks whether the body text is the expected
> text starting with "abc".
> After 156 elements with the correct body text my method gets called
> with the text "x19; element.
> The XML code has to exceed 16kB to show this problem. It may be related
> to the 8kB internal buffer of the parser.
>
> I tested with the parser shipping with jdk1.5.0_19, and jdk1.6.0_14 as
> well as with the separate xerces-2_9_1. All show the same behavior.
> I cannot work around this as I don't have control over the XML input.
>
> Anyone who can help me here?
>
> Thanks in Advance
> Thomas Schleu
> Chief Technology Officer
>
> Mail: mailto:tsch...@canto.com
> Fon: +49-30-390 485 0
> Fax: +49-30-390 485 55
>
> Canto GmbH
> Alt-Moabit 73
> D-10555 Berlin
> Germany
> http://www.canto.com
> Amtsgericht Berlin-Charlottenburg HRB 88566
> Geschäftsführer: Hans-Dieter Schädel

---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscr...@xerces.apache.org
For additional commands, e-mail: j-users-h...@xerces.apache.org
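One way to answer the question of which parser is actually on the runtime path is to ask JAXP which implementation class it hands back and where that class was loaded from. The sketch below uses only standard JAXP and java.security calls. The class name ParserIdentity and its helper methods are made up for illustration, and the package-prefix check assumes the JDK-bundled Xerces derivative lives under com.sun.org.apache.xerces.internal, while Apache Xerces itself uses org.apache.xerces.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper: reports which SAX parser implementation JAXP selects.
class ParserIdentity {

    // True if the class name belongs to the Xerces derivative bundled with the JDK.
    static boolean isJdkBundled(String className) {
        return className.startsWith("com.sun.org.apache.xerces.internal.");
    }

    // Returns where a class was loaded from, or a note if no code source is recorded
    // (which is typical for classes loaded by the bootstrap class loader).
    static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(no code source recorded)";
        }
        return src.getLocation().toString();
    }

    // Builds a one-line description of the parser JAXP returns by default.
    static String describe() throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Class<?> c = parser.getClass();
        String kind = isJdkBundled(c.getName()) ? "JDK-bundled derivative" : "not the JDK-bundled derivative";
        return c.getName() + " [" + kind + "] loaded from " + locationOf(c);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe());
    }
}
```

If the printed class name starts with org.apache.xerces, the stand-alone Xerces jar is the one in use; otherwise the JDK's own copy is still being picked up.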
RE: Parser passes garbage to characters() callback for XML containing character entities
FWIW, the page http://xerces.apache.org/#xerces2-j claims "XML 1.1 (2nd Edition)"

Gary

From: kesh...@us.ibm.com [mailto:kesh...@us.ibm.com]
Sent: Friday, January 22, 2010 10:46
To: j-users@xerces.apache.org
Cc: j-users@xerces.apache.org
Subject: Re: Parser passes garbage to characters() callback for XML containing character entities

> I can reproduce a problem parsing certain XML 1.1 files that contain lots of
> character entities (escaped control chars like "").
> At some point in the file the parser calls my characters() method with
> garbage text.

In XML 1.0, most control characters were simply illegal. Did we ever update the Apache code to handle XML 1.1?
Re: Parser passes garbage to characters() callback for XML containing character entities
kesh...@us.ibm.com wrote on 01/22/2010 01:46:10 PM:

> > I can reproduce a problem parsing certain XML 1.1 files that contain lots of
> > character entities (escaped control chars like "").
> > At some point in the file the parser calls my characters() method with
> > garbage text.
>
> In XML 1.0, most control characters were simply illegal. Did we ever
> update the Apache code to handle XML 1.1?

Yes, since 2003 and perhaps even earlier than that.

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org
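The XML 1.0 versus 1.1 difference discussed here can be checked directly: a character reference to a control character such as U+0019 is a fatal error in an XML 1.0 document but is accepted when the document declares version 1.1. The following is a small sketch under the assumption that the default JAXP parser supports XML 1.1 (the JDK-bundled parser and Xerces-J both do). The class name ControlCharCheck and the method parses are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical check: does a document parse without a SAX error?
class ControlCharCheck {

    // Returns true if the XML is well-formed for the parser, false on any SAX error.
    static boolean parses(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String body = "<a>abc&#x19;def</a>";
        // The same element content, once declared as XML 1.0 and once as XML 1.1.
        System.out.println("XML 1.0: " + parses("<?xml version=\"1.0\"?>" + body));
        System.out.println("XML 1.1: " + parses("<?xml version=\"1.1\"?>" + body));
    }
}
```

This only shows that the version declaration decides whether the reference is legal; it says nothing about how the character data is chunked in characters().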
Re: Parser passes garbage to characters() callback for XML containing character entities
> I can reproduce a problem parsing certain XML 1.1 files that contain lots of
> character entities (escaped control chars like "").
> At some point in the file the parser calls my characters() method with
> garbage text.

In XML 1.0, most control characters were simply illegal. Did we ever update the Apache code to handle XML 1.1?