Jorn,
Thanks for the link to that section of documentation. The mention of the
XMLUtils class was just what I needed. I wrote an XmlFilter class that uses
XMLUtils to detect invalid XML characters and replace them with spaces so that
our annotation offsets will still match the original text. I had been thinking
about the issue all wrong: I assumed that every ASCII-8 character is also a
valid XML 1.0 character.
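
For anyone who finds this thread later, here is a minimal sketch of that kind
of filter. It is not the XmlFilter class mentioned above and it does not call
XMLUtils; the class and method names are hypothetical, and the validity check
is written by hand from the XML 1.0 "Char" production so the example has no
UIMA dependency. Each invalid character is replaced with exactly one space, so
the string length and all annotation offsets are unchanged.

```java
/** Hypothetical helper for illustration; not part of UIMA. */
final class XmlCharFilter {

    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns a copy of text in which every char that is not valid in
     * XML 1.0 is replaced by a space. The result has the same length as
     * the input, so character offsets computed on either string agree.
     */
    static String replaceInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);          // lone surrogates come back as-is
            int width = Character.charCount(cp);   // 1, or 2 for a surrogate pair
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            } else {
                for (int k = 0; k < width; k++) {
                    out.append(' ');               // one space per replaced char
                }
            }
            i += width;
        }
        return out.toString();
    }
}
```

The filter would be applied to the document text before it is set on the CAS,
for example replaceInvalidXmlChars(rawText), and to any string feature values
that come from the same raw source.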
Thanks,
Thomas Ginter
801-448-7676
thomas.gin...@utah.edu
On Jun 14, 2012, at 3:52 PM, Jörn Kottmann wrote:
You are writing a string to the CAS that contains a non-XML character.
This character cannot be serialized into XMI, and that's what this exception
is about.
Have a look at our documentation explaining the issue:
http://uima.apache.org/d/uimaj-2.4.0/tutorials_and_users_guides.html#ugr.tug.xmi_emf.xml_character_issues
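
To find out where such a character comes from, one option is to scan each
document before it goes into the CAS and log the offset and code of the first
offender. The sketch below is only an illustration: the class and method names
are hypothetical, and the check is hand-written from the XML 1.0 "Char"
production rather than taken from the UIMA utilities.

```java
/** Hypothetical debugging helper for illustration; not part of UIMA. */
final class XmlCharLocator {

    private XmlCharLocator() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Index of the first char that is invalid in XML 1.0, or -1 if none. */
    static int firstInvalidXml10Index(String text) {
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (!isValidXml10(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /** Short description of the first invalid char, or null if text is clean. */
    static String describe(String text) {
        int idx = firstInvalidXml10Index(text);
        if (idx < 0) {
            return null;
        }
        return "offset " + idx + ": 0x" + Integer.toHexString(text.codePointAt(idx));
    }
}
```

Logging describe(documentText) together with the source document's identifier
before calling process() should show which input files carry the character
(0x1a in the trace below) and at what position.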
Hope that helps,
Jörn
On 06/14/2012 11:39 PM, Thomas Ginter wrote:
We are getting an odd error while trying to process large datasets using
UIMA-AS 2.3.1. There is an exception thrown by the XmiCasSerializer in the
Client when it is in the process of serializing a CAS to be sent to a remote
service. The exception is as follows:
org.apache.uima.resource.ResourceProcessException
    at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.sendCAS(BaseUIMAAsynchronousEngineCommon_impl.java:854)
    at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.sendCAS(BaseUIMAAsynchronousEngineCommon_impl.java:885)
    at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.process(BaseUIMAAsynchronousEngineCommon_impl.java:734)
    at gov.va.vinci.flap.Client.run(Client.java:181)
    at gov.va.vinci.density.DensityClient.main(DensityClient.java:137)
Caused by: org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0 character: _, 0x1a
    at org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.checkForInvalidXmlChars(XMLSerializer.java:254)
    at org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.startElement(XMLSerializer.java:174)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.startElement(XmiCasSerializer.java:1003)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFS(XmiCasSerializer.java:755)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeIndexed(XmiCasSerializer.java:700)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.serialize(XmiCasSerializer.java:268)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.access$700(XmiCasSerializer.java:108)
    at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1539)
    at org.apache.uima.aae.UimaSerializer.serializeCasToXmi(UimaSerializer.java:136)
    at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.serializeCAS(BaseUIMAAsynchronousEngineCommon_impl.java:260)
    at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.sendCAS(BaseUIMAAsynchronousEngineCommon_impl.java:779)
    ... 4 more
It happens at apparently random points when processing the corpus and is
never actually thrown but is simply written to stderr. The serializer also
never seems to return, which means the UimaAsynchronousEngine.process()
method never returns and the client simply hangs until it is manually
terminated. To work around this issue I have implemented text filters for
the incoming CAS data to block anything outside the ASCII-8 range. I have
also tried switching the server and client to binary serialization, but that
causes the XmiCasSerializer in my UimaAsBaseListener object to report errors
when it attempts to serialize the CAS objects received in the
entityProcessingComplete event.
Any suggestions from the UIMA masters? How can I debug further so that I
can find out (A) where this illegal character is coming from and (B) how I
can prevent it from happening?
Thanks,
Thomas Ginter
801-448-7676
thomas.gin...@utah.edu