Re: Exception thrown during CAS serialization for Remote UIMA-AS Service

2012-06-14 Thread Jörn Kottmann

You are writing a string to the CAS that contains a non-XML character.
This character cannot be serialized into XMI, and that's what this
exception is about.


Have a look at our documentation explaining the issue:
http://uima.apache.org/d/uimaj-2.4.0/tutorials_and_users_guides.html#ugr.tug.xmi_emf.xml_character_issues

Hope that helps,
Jörn

On 06/14/2012 11:39 PM, Thomas Ginter wrote:

We are getting an odd error while trying to process large datasets using 
UIMA-AS 2.3.1.  There is an exception thrown by the XmiCasSerializer in the 
Client when it is in the process of serializing a CAS to be sent to a remote 
service.  The exception is as follows:

org.apache.uima.resource.ResourceProcessException
   at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.sendCAS(BaseUIMAAsynchronousEngineCommon_impl.java:854)
   at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.sendCAS(BaseUIMAAsynchronousEngineCommon_impl.java:885)
   at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.process(BaseUIMAAsynchronousEngineCommon_impl.java:734)
   at gov.va.vinci.flap.Client.run(Client.java:181)
   at gov.va.vinci.density.DensityClient.main(DensityClient.java:137)
Caused by: org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0 character: _, 0x1a
   at org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.checkForInvalidXmlChars(XMLSerializer.java:254)
   at org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.startElement(XMLSerializer.java:174)
   at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.startElement(XmiCasSerializer.java:1003)
   at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFS(XmiCasSerializer.java:755)
   at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeIndexed(XmiCasSerializer.java:700)
   at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.serialize(XmiCasSerializer.java:268)
   at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.access$700(XmiCasSerializer.java:108)
   at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1539)
   at org.apache.uima.aae.UimaSerializer.serializeCasToXmi(UimaSerializer.java:136)
   at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.serializeCAS(BaseUIMAAsynchronousEngineCommon_impl.java:260)
   at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.sendCAS(BaseUIMAAsynchronousEngineCommon_impl.java:779)
   ... 4 more

It happens at apparently random points when processing the corpus and is never actually 
thrown but is simply written to StdErr.  Also, the serializer never seems to return, 
which means the UimaAsynchronousEngine.process() method never returns and the client simply 
hangs until it is manually terminated.  To resolve this issue I have implemented text 
filters for the incoming CAS data to prevent anything outside the ASCII-8 range.  I have also tried 
switching the server and client to binary serialization strategies, but that causes the 
XmiCasSerializer in my UimaAsBaseListener object to return errors when attempting to serialize CAS 
objects received in the entityProcessingComplete event.

Any suggestions from the UIMA masters?  How can I debug further so that I can 
find out A: Where is this illegal character coming from and B: How can I 
prevent it from happening?

Thanks,

Thomas Ginter
801-448-7676
thomas.gin...@utah.edu









Re: Exception thrown during CAS serialization for Remote UIMA-AS Service

2012-06-14 Thread Thomas Ginter
Jorn,

Thanks for the link to that section of documentation.  The mention of the 
XMLUtils class was just what I needed.  I wrote an XmlFilter class that uses 
XMLUtils to detect invalid XML characters and replace them with spaces so that 
our annotation offsets will still match the original text.  I was thinking 
about the issue all wrong.  I was assuming that all ASCII-8 characters are also 
valid XML-1.0 characters.
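
For anyone who finds this thread later, here is a rough sketch of the kind of filter described above. It is not the actual XmlFilter class, and it does not call UIMA's XMLUtils; instead it checks each character against the XML 1.0 "Char" production directly so it can be run without any dependencies. The class and method names (XmlCharFilter, firstInvalidIndex, replaceInvalid) are made up for illustration. Each invalid UTF-16 unit is replaced with a single space, so the string length and therefore the annotation offsets stay the same.

```java
// Hypothetical helper, not part of UIMA. Assumes CAS offsets are counted
// in UTF-16 code units, as Java String indices are.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    public static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Index of the first invalid character, or -1 if the text is clean. */
    public static int firstInvalidIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isValidXml10(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /** Replaces every invalid character with a space, preserving length. */
    public static String replaceInvalid(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isValidXml10(cp)) {
                // Valid surrogate pairs are copied through as two units.
                out.appendCodePoint(cp);
            } else {
                // Invalid code points (control chars, lone surrogates,
                // U+FFFE, U+FFFF) always occupy exactly one UTF-16 unit.
                out.append(' ');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Calling firstInvalidIndex on the document text before it goes into the CAS is one way to find out which input contains the offending character (0x1a in the trace above), and replaceInvalid can then be applied before setDocumentText.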

Thanks,

Thomas Ginter
801-448-7676
thomas.gin...@utah.edu



