We also ran that test with the UIMA framework and the RunAE tool, with the characters in a file as you did, and indeed there is no problem. The problem appears with UIMA-AS: I call sendCAS() on the UimaAsynchronousEngine, and when the remote UIMA-AS service tries to deserialize the CAS in deserializeCasFromXmi(), I get the mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571; Character reference "&#"
In my case I don't read any file and don't use FileSystemCollectionReader. The user enters the text, the text is stored in a Java string (UTF-16) and set on the CAS that will be processed, using setDocumentLanguage, and then I send the CAS.

2016-12-13 15:10 GMT-05:00, Burn Lewis <burnle...@gmail.com>:
> I put these 3 characters as UTF-8 in a file in examples/data and ran the
> MeetingDetector annotator as described in section 3.4 of the README, adding
> the option "-o out". In that folder I found the returned results in xmi
> format with the characters in the sofaString element. The relevant part of
> this file in hex is:
>
> 000002e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef* tring=".........
> 000002f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56 ..&#10;"/><cas:V
>
> Note that the FileSystemCollectionReader by default uses the system
> encoding but you could add a ConfigurationParameterSetting of UTF-8 for the
> Encoding parameter in its descriptor.
>
> With the client & server on different (Linux) machines I see no problem
> with sending UTF-8 characters.
>
>
> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor <m...@schor.com> wrote:
>
>> another question: I assume there are perhaps 2 machines involved, here
>> (it's a UIMA-AS setup).
>>
>> From the exception, it appears that the error happen when the client
>> sends the CAS to the remote.
>>
>> Can you print out the Linux (assuming that's the OS) default locale for
>> both machines? (e.g. type into a command line "locale" and see what each
>> machines has as its default character encoding).
>>
>> Please let us know what these are.
>>
>> Thanks.
>> -Marshall
>>
>>
>>
>> On 12/12/2016 1:58 PM, nelson rivera wrote:
>> > Yes these are the values of the troublesome characters, using
>> > Integer.toHexString() to print out each byte, shows
>> >
>> > fffffff0 ffffff96 ffffffa6 ffffff80
>> >
>> > fffffff0 ffffff96 ffffffa6 ffffff90
>> >
>> > ffffffef ffffffbf ffffffbd
>> >
>> > ffffffef ffffffbf ffffffbd
>> >
>> > 2016-12-12 11:35 GMT-05:00, Marshall Schor <m...@schor.com>:
>> >> Hi Nelson,
>> >>
>> >> Looking into this... Can you please confirm that the UTF-8 coding of
>> >> the troublesome characters, in hexadecimal, is:
>> >>
>> >> F0 96 A6 80
>> >>
>> >> F0 96 A6 90
>> >>
>> >> EF BF BD
>> >>
>> >> EF BF BD
>> >>
>> >> If you have the string in Java, please try converting it to a UTF-8
>> >> string using something like:
>> >> byte[] theBytes = myTestString.getBytes("UTF-8");
>> >>
>> >> and then print out theBytes in hex; they should look like the above.
>> >> If not, please let us know what the values is instead.
>> >>
>> >>
>> >> Thanks. -Marshall
>> >>
>> >>
>> >> On 12/9/2016 9:02 AM, nelson rivera wrote:
>> >>> Hi i was read your explication and saw the link, but in my case, i
>> >>> don't read any xml file. Just i copy the text, get a new input cas
>> >>> from UimaAsynchronousEngine with getCAS(), set the text in the cas
>> >>> and send the request whit sendCAS(). I use uima-as API 2.9.0 in the
>> >>> client side. Apparently the characters are changed for its entities
>> >>> corresponding when serialize the cas to send it, but i get the
>> >>> mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
>> >>> columnNumber: 571; Character reference "&#"
>> >>> in uima-as framework installed when trying to deserialize the cas
>> >>> deserializeCasFromXmi(),to be processed for the service.
>> >>>
>> >>> 2016-12-08 16:48 GMT-05:00, Marshall Schor <m...@schor.com>:
>> >>>> Hi Nelson,
>> >>>>
>> >>>> I can't see the characters (sorry).
>> >>>>
>> >>>> This might be an issue caused by a discrepancy between the coding of
>> >>>> the file being read, and the coding indicated on the xml header. Can
>> >>>> you check that those two things are the same?
>> >>>>
>> >>>> See
>> >>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
>> >>>> for example.
>> >>>>
>> >>>> -Marshall
>> >>>>
>> >>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
>> >>>>> i tried to proccess the following text in a service deploy in
>> >>>>> uima-as, because is input of my application. This is the text :
>> >>>>> 𖦀 𖦐 � �.
>> >>>>> These characters correspond to the bamun language, and apparently
>> >>>>> are not invalid xml characters because tools such as browsers
>> >>>>> interpret it and show it. After get a new input cas to proccesing,
>> >>>>> set the text and send the request, i get the exception that i show
>> >>>>> below in uima-as, the framework uima-as work and recovers
>> >>>>> correctly, just not process this characters.
>> >>>>> Could you tell me what happens with these characters, one of these
>> >>>>> is invalid characters for framework uima-as?
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> 04:00:31.606 - 14:
>> >>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
>> >>>>> WARNING:
>> >>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
>> >>>>> Character reference "&#
>> >>>>> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
>> >>>>> at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
>> >>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
>> >>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
>> >>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
>> >>>>> at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
>> >>>>> at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
>> >>>>>
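
P.S. For anyone repeating the byte check discussed in the quoted thread: the ffffff.. prefixes in the Integer.toHexString() output come from Java bytes being signed, so each byte needs masking with 0xFF before formatting. Below is a minimal sketch of that check. The class name Utf8HexDump and both helper methods are made up for illustration and are not part of UIMA; the sample string is built from code point U+16980 (whose UTF-8 encoding is F0 96 A6 80) plus U+FFFD (EF BF BD), matching two of the byte sequences listed in the thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for inspecting a Java string before it is sent;
// not part of the UIMA or UIMA-AS API.
class Utf8HexDump {

    // Returns the UTF-8 bytes of the text as two-digit uppercase hex values.
    public static String toHex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // A Java byte is signed: without the 0xFF mask, 0xF0 is
            // sign-extended and prints as fffffff0.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Lists the code points of the text, marking those outside the BMP
    // (these are stored as a surrogate pair in a UTF-16 Java string).
    public static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append("(supplementary)");
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample input assumed for illustration: U+16980 followed by U+FFFD.
        String sample = new String(Character.toChars(0x16980)) + "\uFFFD";
        System.out.println(toHex(sample));
        System.out.println(codePoints(sample));
    }
}
```

Running the same two methods on the string just before it is set on the CAS shows whether it contains supplementary characters or U+FFFD replacement characters at that point, independent of any file encoding.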