RE: Parser passes garbage to characters() callback for XML containing character entities

2010-01-27 Thread Michael Glavassevich

Thomas,

"Thomas Schleu"  wrote on 01/27/2010 07:47:09 AM:

> Michael,
>
> I know that the body text comes in pieces. That's why I check that the
> accumulated text buffer (sb) is empty when looking at the start of the
> characters.

The code you posted is assuming that the beginning of the first chunk will
start with "abc". There is no such guarantee. The text can be split
anywhere and when I ran your program I observed that for one of the
elements "abc" crosses a buffer boundary so on the first callback you only
get the first two characters: "ab". Your code needs to account for this. I
see no issue with Xerces.
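
A minimal sketch of a handler that tolerates arbitrary chunking: it buffers every characters() call and inspects the text only in endElement(). The root/item element names, the position of the &#x19; reference and the class name ChunkSafeItems are assumptions made for illustration, since the markup of the posted test file did not survive in the archive.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class ChunkSafeItems {

    /** Buffers every chunk and only looks at the text once the element ends. */
    static class ItemHandler extends DefaultHandler {
        final List<String> items = new ArrayList<>();
        int callbacks = 0;                       // characters() calls seen inside items
        private final StringBuilder sb = new StringBuilder();
        private boolean inItem = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("item".equals(localName)) {
                inItem = true;
                sb.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inItem) {
                callbacks++;
                sb.append(ch, start, length);    // never inspect a single chunk
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("item".equals(localName)) {
                items.add(sb.toString());        // the text is complete only here
                inItem = false;
            }
        }
    }

    /** Hypothetical stand-in for the test file: `count` identical items. */
    static byte[] sampleDocument(int count) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.1\" encoding=\"UTF-8\"?>\n<root>\n");
        for (int i = 0; i < count; i++) {
            xml.append("<item>abcdefghijklmnopqrstuvwxyz&#x19;0123456789</item>\n");
        }
        return xml.append("</root>").toString().getBytes(StandardCharsets.UTF_8);
    }

    static ItemHandler parse(byte[] xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);     // so that localName is populated
            ItemHandler handler = new ItemHandler();
            factory.newSAXParser().parse(new ByteArrayInputStream(xml), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ItemHandler h = parse(sampleDocument(314));
        long bad = h.items.stream().filter(s -> !s.startsWith("abc")).count();
        System.out.println(h.items.size() + " items, " + h.callbacks
                + " characters() calls, " + bad + " not starting with abc");
    }
}
```

The startsWith("abc") check is safe here because it runs on the accumulated text of a finished element, not on whichever chunk happens to arrive first.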

> I also only check when I am inside the "item" element.
> The XML is very simple. It just repeats the same element over and over
> again.
> As I mentioned before, the error appears when the total XML size exceeds
> 16 kB and occurs while parsing the element that lies beyond the first 8 kB.
> I looked briefly at the parser source and noticed that it uses an internal
> buffer of 8 kB. That's why I assume the problem occurs when re-filling the
> buffer while in the middle of or after processing a character entity
> "".

I'm not sure what source you're looking at. Xerces' default buffer size is
2 KB. It's been that size for a long time. Are you sure you're actually
using Apache Xerces and not some derivative like what Sun ships in their
JDK?
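
One way to answer that question is to print the class names that JAXP actually hands back. This is only a sketch; the class name WhichParser and the helper isApacheXerces are made up for illustration. The package prefixes it relies on are org.apache.xerces for the Apache distribution and com.sun.org.apache.xerces.internal for the copy bundled with the JDK.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

class WhichParser {

    /** Reports the concrete JAXP classes selected at runtime. */
    static String describe() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            XMLReader reader = parser.getXMLReader();
            return "factory=" + factory.getClass().getName()
                    + " parser=" + parser.getClass().getName()
                    + " reader=" + reader.getClass().getName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** True only for classes from the Apache distribution, not the JDK's internal copy. */
    static boolean isApacheXerces(String className) {
        return className.startsWith("org.apache.xerces.");
    }

    public static void main(String[] args) {
        System.out.println(describe());
        System.out.println("Apache Xerces factory: "
                + isApacheXerces(SAXParserFactory.newInstance().getClass().getName()));
    }
}
```

If the factory name starts with com.sun.org.apache.xerces.internal, the JDK's bundled parser is in use even when a Xerces jar sits somewhere on the class path.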

> Once I removed all those character entities the parser worked as expected.
>
> Any help you can give?
> Thomas Schleu
> Chief Technology Officer
>
> Mail: mailto:tsch...@canto.com
> Fon:  +49-30-390 485 0
> Fax:  +49-30-390 485 55
>
> Canto GmbH
> Alt-Moabit 73
> D-10555 Berlin
> Germany
> http://www.canto.com
> Amtsgericht Berlin-Charlottenburg HRB 88566
> Geschäftsführer: Hans-Dieter Schädel

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org

RE: Parser passes garbage to characters() callback for XML containing character entities

2010-01-27 Thread Thomas Schleu
Michael,

I know that the body text comes in pieces. That's why I check that the
accumulated text buffer (sb) is empty when looking at the start of the
characters.
I also only check when I am inside the "item" element.
The XML is very simple. It just repeats the same element over and over
again.
As I mentioned before, the error appears when the total XML size exceeds
16 kB and occurs while parsing the element that lies beyond the first 8 kB.
I looked briefly at the parser source and noticed that it uses an internal
buffer of 8 kB. That's why I assume the problem occurs when re-filling the
buffer while in the middle of or after processing a character entity
"".
Once I removed all those character entities the parser worked as expected.

Any help you can give?
Thomas Schleu
Chief Technology Officer

Mail: mailto:tsch...@canto.com
Fon:  +49-30-390 485 0
Fax:  +49-30-390 485 55

Canto GmbH
Alt-Moabit 73
D-10555 Berlin
Germany
http://www.canto.com
Amtsgericht Berlin-Charlottenburg HRB 88566
Geschäftsführer: Hans-Dieter Schädel



RE: Parser passes garbage to characters() callback for XML containing character entities

2010-01-22 Thread Michael Glavassevich

Thomas,

Also check that you're not falling into the typical usage errors with SAX.
I see at least one issue in your test program (i.e. the s.startsWith("abc")
call in characters()). Take a look at this FAQ [1].

Thanks.

[1] http://xerces.apache.org/xerces2-j/faq-sax.html#faq-2

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org


RE: Parser passes garbage to characters() callback for XML containing character entities

2010-01-22 Thread Gary Gregory
For Xerces 2.9.1, did you add Xerces to your runtime through the Java endorsed 
mechanism [1]?

Gary

[1] http://java.sun.com/j2se/1.4.2/docs/guide/standards/
 

> -Original Message-
> From: Thomas Schleu [mailto:tsch...@canto.com]
> Sent: Friday, January 22, 2010 05:29
> To: j-users@xerces.apache.org
> Subject: Parser passes garbage to characters() callback for XML
> containing character entities
> 
> I can reproduce a problem parsing certain XML 1.1 files that contain lots
> of character entities (escaped control chars like "").
> At some point in the file the parser calls my characters() method with
> garbage text.
> 
> Here is the source code that generates such an XML file:
> 
> FileOutputStream fos = new FileOutputStream (new File ("C:/test.xml"));
> fos.write ("\n X>\nhttp://www.mycompany.com/ns/X/1.0\">\n".getBytes ("UTF-8"));
> final byte[] bytes = ("abcdefghijklmnopqrstuvwxyz\n").getBytes ("UTF-8");
> for (int i = 0; i < 314; i++)
> {
>     fos.write(bytes);
> }
> fos.write ("".getBytes ("UTF-8"));
> fos.close ();
> 
> The XML is very simple, it just contains lots of identical elements with
> "" in the body text.
> The parsing code looks like the following:
> 
> FileInputStream fis = new FileInputStream (new File ("C:/test.xml"));
> final SAXParserFactory saxParserFactory = SAXParserFactory.newInstance ();
> saxParserFactory.setFeature ("http://xml.org/sax/features/namespaces",
>         Boolean.TRUE);
> saxParserFactory.setFeature ("http://xml.org/sax/features/namespace-prefixes",
>         Boolean.TRUE);
> final SAXParser parser = saxParserFactory.newSAXParser ();
> try
> {
>     parser.parse (fis, new DefaultHandler()
>     {
>         StringBuilder sb = new StringBuilder ();
>         String currentElement = null;
>
>         public void startElement (String uri, String localName, String qName,
>                 Attributes attributes) throws SAXException
>         {
>             currentElement = localName;
>         }
>         public void characters (char ch[], int start, int length)
>                 throws SAXException
>         {
>             if ("item".equals (currentElement))
>             {
>                 String s = new String (ch, start, length);
>                 if (sb.length () == 0 && !s.startsWith ("abc"))
>                 {
>                     // THE PARSER CALLS ME WITH GARBAGE!
>                     System.out.println ("ERROR");
>                 }
>                 sb.append (s);
>             }
>         }
>         public void endElement (String uri, String localName, String qName)
>                 throws SAXException
>         {
>             if ("item".equals (localName))
>             {
>                 sb.delete (0, sb.length ());
>                 currentElement = null;
>             }
>         }
>     });
> }
> catch (Exception e)
> {
>     e.printStackTrace ();
>     System.out.println ("e = " + e);
> }
> 
> My characters() method checks whether the body text is the expected text
> starting with "abc".
> After 156 elements with the correct body text my method gets called with
> the text "x19; element.
> The XML code has to exceed 16kB to show this problem. It may be related to
> the 8kB internal buffer of the parser.
>
> I tested with the parser shipping with jdk1.5.0_19, and jdk1.6.0_14 as well
> as with the separate xerces-2_9_1. All show the same behavior.
> I cannot work around this as I don't have control over the XML input.
> 
> Anyone who can help me here?
> 
> Thanks in Advance
> Thomas Schleu


-
To unsubscribe, e-mail: j-users-unsubscr...@xerces.apache.org
For additional commands, e-mail: j-users-h...@xerces.apache.org



RE: Parser passes garbage to characters() callback for XML containing character entities

2010-01-22 Thread Gary Gregory
FWIW, the page http://xerces.apache.org/#xerces2-j claims "XML 1.1 (2nd 
Edition)"

Gary

From: kesh...@us.ibm.com [mailto:kesh...@us.ibm.com]
Sent: Friday, January 22, 2010 10:46
To: j-users@xerces.apache.org
Cc: j-users@xerces.apache.org
Subject: Re: Parser passes garbage to characters() callback for XML containing 
character entities

> I can reproduce a problem parsing certain XML 1.1 files that contain lots of
> character entities (escaped control chars like "").
> At some point in the file the parser calls my characters() method with
> garbage text.

In XML 1.0, most control characters were simply illegal. Did we ever update the 
Apache code to handle XML 1.1?


Re: Parser passes garbage to characters() callback for XML containing character entities

2010-01-22 Thread Michael Glavassevich
kesh...@us.ibm.com wrote on 01/22/2010 01:46:10 PM:

> > I can reproduce a problem parsing certain XML 1.1 files that contain
> > lots of character entities (escaped control chars like "").
> > At some point in the file the parser calls my characters() method with
> > garbage text.
>
> In XML 1.0, most control characters were simply illegal. Did we ever
> update the Apache code to handle XML 1.1?

Yes, since 2003 and perhaps even earlier than that.
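
A quick way to see the difference between the two XML versions is to parse the same content under each declaration. The sketch below assumes a character reference of the form &#x19; (the exact entity from the original post was stripped by the archive), and the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class Xml11ControlChars {

    /** Same content, different declared XML version. */
    static String document(String version) {
        return "<?xml version=\"" + version + "\"?><item>abc&#x19;def</item>";
    }

    /** Returns the element text, or null if the parser rejects the document. */
    static String textOrNull(String xml) {
        final StringBuilder sb = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void characters(char[] ch, int start, int length) {
                            sb.append(ch, start, length);   // accumulate all chunks
                        }
                    });
            return sb.toString();
        } catch (SAXException e) {
            return null;    // not well-formed for the declared XML version
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("1.1 accepted: " + (textOrNull(document("1.1")) != null));
        System.out.println("1.0 accepted: " + (textOrNull(document("1.0")) != null));
    }
}
```

Under the 1.1 declaration the reference resolves to the control character U+0019; under 1.0 the same reference is a well-formedness error, so a conforming parser rejects the document.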

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org

Re: Parser passes garbage to characters() callback for XML containing character entities

2010-01-22 Thread keshlam
> I can reproduce a problem parsing certain XML 1.1 files that contain
> lots of character entities (escaped control chars like "").
> At some point in the file the parser calls my characters() method with
> garbage text.

In XML 1.0, most control characters were simply illegal. Did we ever 
update the Apache code to handle XML 1.1?