Jorn,

Thanks for the link to that section of documentation.  The mention of the 
XMLUtils class was just what I needed.  I wrote an XmlFilter class that uses 
XMLUtils to detect invalid XML characters and replace them with spaces so that 
our annotation offsets will still match the original text.  I was thinking 
about the issue all wrong.  I was assuming that all ASCII-8 characters are also 
valid XML-1.0 characters.
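
For anyone who hits the same problem, here is a minimal sketch of that kind of filter. It is not the XmlFilter class itself: to stay self-contained it checks the XML 1.0 Char production directly with the JDK rather than calling XMLUtils, and the names XmlCharFilter, isValidXml10 and replaceInvalidXmlChars are hypothetical. Each invalid UTF-16 code unit becomes exactly one space, so the string length and the annotation offsets stay unchanged.

```java
/** Hypothetical helper for illustration; not the XmlFilter class described above. */
final class XmlCharFilter {

    private XmlCharFilter() {}

    /** Returns true if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Replaces every UTF-16 code unit that is not part of a valid XML 1.0
     * character with a space. The result always has the same length as the
     * input, so offsets computed on the original text remain valid.
     */
    static String replaceInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // codePointAt returns the lone surrogate itself when it is unpaired,
            // which the validity check below then rejects.
            int cp = text.codePointAt(i);
            int units = Character.charCount(cp);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            } else {
                // One space per code unit keeps annotation offsets aligned.
                for (int k = 0; k < units; k++) {
                    out.append(' ');
                }
            }
            i += units;
        }
        return out.toString();
    }
}
```

Running the document text through replaceInvalidXmlChars before setting it on the CAS would turn the 0x1a character from the stack trace into a space, while tabs, newlines and characters outside the Basic Multilingual Plane pass through untouched.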

Thanks,

Thomas Ginter
801-448-7676
thomas.gin...@utah.edu




On Jun 14, 2012, at 3:52 PM, Jörn Kottmann wrote:

> You write a string to the CAS which contains a non-XML character.
> This character cannot be serialized into XMI, and that's what this exception 
> is about.
> 
> Have a look at our documentation explaining the issue:
> http://uima.apache.org/d/uimaj-2.4.0/tutorials_and_users_guides.html#ugr.tug.xmi_emf.xml_character_issues
> 
> Hope that helps,
> Jörn
> 
> On 06/14/2012 11:39 PM, Thomas Ginter wrote:
>> We are getting an odd error while trying to process large datasets using 
>> UIMA-AS 2.3.1.  There is an exception thrown by the XmiCasSerializer in the 
>> Client when it is in the process of serializing a CAS to be sent to a remote 
>> service.  The exception is as follows:
>> 
>> org.apache.uima.resource.ResourceProcessException
>>       at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.sendCAS(BaseUIMAAsynchronousEngineCommon_impl.java:854)
>>       at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.sendCAS(BaseUIMAAsynchronousEngineCommon_impl.java:885)
>>       at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.process(BaseUIMAAsynchronousEngineCommon_impl.java:734)
>>       at gov.va.vinci.flap.Client.run(Client.java:181)
>>       at gov.va.vinci.density.DensityClient.main(DensityClient.java:137)
>> Caused by: org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0 character: _, 0x1a
>>       at org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.checkForInvalidXmlChars(XMLSerializer.java:254)
>>       at org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.startElement(XMLSerializer.java:174)
>>       at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.startElement(XmiCasSerializer.java:1003)
>>       at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFS(XmiCasSerializer.java:755)
>>       at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeIndexed(XmiCasSerializer.java:700)
>>       at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.serialize(XmiCasSerializer.java:268)
>>       at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.access$700(XmiCasSerializer.java:108)
>>       at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1539)
>>       at org.apache.uima.aae.UimaSerializer.serializeCasToXmi(UimaSerializer.java:136)
>>       at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.serializeCAS(BaseUIMAAsynchronousEngineCommon_impl.java:260)
>>       at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.sendCAS(BaseUIMAAsynchronousEngineCommon_impl.java:779)
>>       ... 4 more
>> 
>> It happens at apparently random points when processing the corpus and is 
>> never actually "thrown" but is simply written to StdErr.  Also the 
>> serializer never seems to return, which means the 
>> UimaAsynchronousEngine.process() method never returns and the client simply 
>> "hangs" until it is manually terminated.  To resolve this issue I have 
>> implemented text filters for the incoming CAS data to prevent anything out 
>> of the ASCII-8 range.  I have also tried switching the server and client to 
>> binary serialization strategies but that causes the XmiCasSerializer in my 
>> UimaAsBaseListener object to return errors attempting to serialize CAS 
>> objects received in the entityProcessingComplete event.
>> 
>> Any suggestions from the UIMA masters?  How can I debug further so that I 
>> can find out A: Where is this illegal character coming from and B: How can I 
>> prevent it from happening?
>> 
>> Thanks,
>> 
>> Thomas Ginter
>> 801-448-7676
>> thomas.gin...@utah.edu
>> 
>> 
>> 
>> 
>> 
> 
