Sorry, I missed the supplement set, so the tests I did with x16980 & x16990
are valid. runRemoteAsyncAE uses the same FileSystemCollectionReader as runAE
does. Did you use a different collection reader? If you used a custom one,
perhaps you could serialize the CAS to a file as XMI and verify that the XMI
is legal.
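As a rough aid for the "is it legal" check, here is a minimal sketch in plain Java (no UIMA calls) that tests code points against the Char production of the XML 1.0 spec. The class name XmlCharCheck and its methods are made up for illustration and are not part of UIMA; it only inspects a Java String, so you would still need to read the text out of your CAS or XMI file yourself.

```java
import java.nio.charset.StandardCharsets;

public class XmlCharCheck {

    /** True if the code point matches the XML 1.0 "Char" production. */
    public static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the index (in UTF-16 code units) of the first code point that
     * is not a legal XML 1.0 character, or -1 if every code point is legal.
     * An unpaired surrogate is reported as illegal.
     */
    public static int firstIllegalIndex(String s) {
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);   // combines a valid surrogate pair
            if (!isLegalXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return -1;
    }

    public static void main(String[] args) {
        // The UTF-8 bytes discussed in this thread.
        byte[] utf8 = {
            (byte) 0xF0, (byte) 0x96, (byte) 0xA6, (byte) 0x80,
            (byte) 0xF0, (byte) 0x96, (byte) 0xA6, (byte) 0x90,
            (byte) 0xEF, (byte) 0xBF, (byte) 0xBD
        };
        String text = new String(utf8, StandardCharsets.UTF_8);

        // Print each decoded code point and whether XML 1.0 allows it.
        text.codePoints().forEach(cp ->
            System.out.printf("U+%04X legal=%b%n", cp, isLegalXmlChar(cp)));

        System.out.println("first illegal index: " + firstIllegalIndex(text));
    }
}
```

By this check U+16980, U+16990 and U+FFFD are all legal XML characters when taken as whole code points, while half of a surrogate pair on its own (for example 0xD81A, the high surrogate of U+16980) is not. So if the XMI turns out to contain character references to individual surrogates, that would be one thing to look at.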
On Fri, Dec 16, 2016 at 8:37 AM, nelson rivera <nelsonriver...@gmail.com> wrote:

> According to Wikipedia, the Bamum script
> (https://en.wikipedia.org/wiki/Bamum_script) has another valid range,
> U+16800–U+16A3F. Any of these characters generates the same log trace.
> I will continue to test Marshall Schor's suggestion.
>
> 2016-12-14 18:07 GMT-05:00, Burn Lewis <burnle...@gmail.com>:
> > I think there's another problem ... the characters we have tested with are
> > not in the Bamum unicode set. The first 2 that Marshall listed in utf-8
> > (F0 96 A6 80 & F0 96 A6 90) are in hex x16980 & x16990 and the 3rd (EF BF
> > BD) is xFFFD. This last one is the "replacement character" used when an
> > illegal character is encountered. According to Wikipedia the 88 Bamum
> > characters are in the range xA6A0 - xA6F7.
> >
> > In order to reproduce your problem we need to use the same codepoints. Can
> > you tell us what the hex values of the failing characters are, in UTF-8 or
> > UTF-16?
> >
> > By the way, the test I ran was using UIMA-AS's runRemoteAsyncAE, not runAE,
> > following the quick test described in the UIMA-AS README.
> >
> > On Wed, Dec 14, 2016 at 4:15 PM, Marshall Schor <m...@schor.com> wrote:
> >
> >> Maybe we've been on the wrong line of thinking.
> >>
> >> Perhaps the translation between UTF-8 (during transportation) and the
> >> string characters is fine, but the XML parsing is restricting the
> >> character set it uses.
> >>
> >> See https://en.wikipedia.org/wiki/Valid_characters_in_XML
> >>
> >> where it says valid xml characters exclude the "surrogates", which your
> >> characters I think are.
> >>
> >> So, perhaps it's XML parsing which is complaining (and it appears this is
> >> so, from the stack trace).
> >>
> >> We should point out that UIMA's character offsets (like begin and end)
> >> were designed with Java String character offsets, and will perhaps not
> >> work correctly when surrogates are being used.
> >>
> >> A possible workaround for this particular issue may be to switch to binary
> >> serialization, instead of xmi serialization. This has a restriction in
> >> that the type systems must be identical (between the client and server).
> >>
> >> We could possibly get more confirmation of this hypothesis if you could
> >> say what the stack trace was, beyond the first bit which you stated in
> >> your original note. There should be more stack trace information, further
> >> down, starting with "caused by ..." which may provide more helpful
> >> information.
> >>
> >> -Marshall
> >>
> >>
> >> On 12/14/2016 9:38 AM, nelson rivera wrote:
> >> > We also ran that test with the uima framework and the RunAE tool, with
> >> > the characters in a file as you did, and indeed there was no problem.
> >> > The problem occurs with uima-as, using sendCAS() with
> >> > UimaAsynchronousEngine: when the remote uima-as service tries to
> >> > deserialize the cas with deserializeCasFromXmi(), I get the mentioned
> >> > exception
> >> > "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> >> > Character reference "&#"
> >> >
> >> > In my case I don't read any file and don't use
> >> > FileSystemCollectionReader. The user enters the text, the text is stored
> >> > in a Java string (utf-16) and set on the cas that will be processed,
> >> > using setDocumentLanguage, and then I send the cas.
> >> >
> >> > 2016-12-13 15:10 GMT-05:00, Burn Lewis <burnle...@gmail.com>:
> >> >> I put these 3 characters as UTF-8 in a file in examples/data and ran
> >> >> the MeetingDetector annotator as described in section 3.4 of the
> >> >> README, adding the option "-o out". In that folder I found the returned
> >> >> results in xmi format with the characters in the sofaString element.
> >> >> The relevant part of this file in hex is:
> >> >>
> >> >> 000002e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef* tring=".........
> >> >> 000002f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56 ..&#10;"/><cas:V
> >> >>
> >> >> Note that the FileSystemCollectionReader by default uses the system
> >> >> encoding but you could add a ConfigurationParameterSetting of UTF-8 for
> >> >> the Encoding parameter in its descriptor.
> >> >>
> >> >> With the client & server on different (Linux) machines I see no problem
> >> >> with sending UTF-8 characters.
> >> >>
> >> >>
> >> >> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor <m...@schor.com> wrote:
> >> >>
> >> >>> another question: I assume there are perhaps 2 machines involved, here
> >> >>> (it's a UIMA-AS setup).
> >> >>>
> >> >>> From the exception, it appears that the error happens when the client
> >> >>> sends the CAS to the remote.
> >> >>>
> >> >>> Can you print out the Linux (assuming that's the OS) default locale for
> >> >>> both machines? (e.g. type into a command line "locale" and see what
> >> >>> each machine has as its default character encoding).
> >> >>>
> >> >>> Please let us know what these are.
> >> >>>
> >> >>> Thanks. -Marshall
> >> >>>
> >> >>>
> >> >>>
> >> >>> On 12/12/2016 1:58 PM, nelson rivera wrote:
> >> >>>> Yes, these are the values of the troublesome characters. Using
> >> >>>> Integer.toHexString() to print out each byte shows
> >> >>>>
> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff80
> >> >>>>
> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff90
> >> >>>>
> >> >>>> ffffffef ffffffbf ffffffbd
> >> >>>>
> >> >>>> ffffffef ffffffbf ffffffbd
> >> >>>>
> >> >>>> 2016-12-12 11:35 GMT-05:00, Marshall Schor <m...@schor.com>:
> >> >>>>> Hi Nelson,
> >> >>>>>
> >> >>>>> Looking into this...
> >> >>>>> Can you please confirm that the UTF-8 coding of the troublesome
> >> >>>>> characters, in hexadecimal, is:
> >> >>>>>
> >> >>>>> F0 96 A6 80
> >> >>>>>
> >> >>>>> F0 96 A6 90
> >> >>>>>
> >> >>>>> EF BF BD
> >> >>>>>
> >> >>>>> EF BF BD
> >> >>>>>
> >> >>>>> If you have the string in Java, please try converting it to a UTF-8
> >> >>>>> string using something like:
> >> >>>>> byte[] theBytes = myTestString.getBytes("UTF-8");
> >> >>>>>
> >> >>>>> and then print out theBytes in hex; they should look like the above.
> >> >>>>> If not, please let us know what the values are instead.
> >> >>>>>
> >> >>>>>
> >> >>>>> Thanks. -Marshall
> >> >>>>>
> >> >>>>>
> >> >>>>> On 12/9/2016 9:02 AM, nelson rivera wrote:
> >> >>>>>> Hi, I read your explanation and saw the link, but in my case I
> >> >>>>>> don't read any xml file. I just copy the text, get a new input cas
> >> >>>>>> from UimaAsynchronousEngine with getCAS(), set the text in the cas
> >> >>>>>> and send the request with sendCAS(). I use the uima-as API 2.9.0 on
> >> >>>>>> the client side. Apparently the characters are replaced by their
> >> >>>>>> corresponding entities when the cas is serialized for sending, but
> >> >>>>>> I get the mentioned exception
> >> >>>>>> "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> >> >>>>>> Character reference "&#"
> >> >>>>>> in the installed uima-as framework when it tries to deserialize the
> >> >>>>>> cas with deserializeCasFromXmi() so that the service can process it.
> >> >>>>>>
> >> >>>>>> 2016-12-08 16:48 GMT-05:00, Marshall Schor <m...@schor.com>:
> >> >>>>>>> Hi Nelson,
> >> >>>>>>>
> >> >>>>>>> I can't see the characters (sorry).
> >> >>>>>>>
> >> >>>>>>> This might be an issue caused by a discrepancy between the coding
> >> >>>>>>> of the file being read, and the coding indicated on the xml
> >> >>>>>>> header.
> >> >>>>>>> Can you check that those two things are the same?
> >> >>>>>>>
> >> >>>>>>> See
> >> >>>>>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
> >> >>>>>>> for example.
> >> >>>>>>>
> >> >>>>>>> -Marshall
> >> >>>>>>>
> >> >>>>>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
> >> >>>>>>>> I tried to process the following text in a service deployed in
> >> >>>>>>>> uima-as, because it is input of my application. This is the
> >> >>>>>>>> text: 𖦀 𖦐 � �.
> >> >>>>>>>> These characters belong to the Bamum language, and apparently
> >> >>>>>>>> they are not invalid xml characters, because tools such as
> >> >>>>>>>> browsers interpret and display them. After getting a new input
> >> >>>>>>>> cas for processing, setting the text and sending the request, I
> >> >>>>>>>> get the exception shown below in uima-as. The uima-as framework
> >> >>>>>>>> keeps working and recovers correctly; it just does not process
> >> >>>>>>>> these characters.
> >> >>>>>>>> Could you tell me what happens with these characters? Is one of
> >> >>>>>>>> them an invalid character for the uima-as framework?
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> 04:00:31.606 - 14:
> >> >>>>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
> >> >>>>>>>> WARNING:
> >> >>>>>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> >> >>>>>>>> Character reference "&#
> >> >>>>>>>> at
> >> >>>>>>>> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
> >> >>>>>>>> at
> >> >>>>>>>> org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
> >> >>>>>>>> at
> >> >>>>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
> >> >>>>>>>> at
> >> >>>>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
> >> >>>>>>>> at
> >> >>>>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
> >> >>>>>>>> at
> >> >>>>>>>> org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
> >> >>>>>>>> at
> >> >>>>>>>> org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)