I think there's another problem ... the characters we have tested with are
not in the Bamum Unicode set.  The first 2 that Marshall listed in UTF-8
(F0 96 A6 80 & F0 96 A6 90) decode to code points x16980 & x16990, and the
3rd (EF BF BD) is xFFFD.  This last one is the "replacement character" used
when an illegal character is encountered.  According to Wikipedia the 88
Bamum characters are in the range xA6A0 - xA6F7.

In order to reproduce your problem we need to use the same codepoints.  Can
you tell us what the hex values of the failing characters are, in UTF-8 or
UTF-16?
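
In case it helps, here is a minimal sketch of one way to print those hex
values from a Java string.  The class and method names (CodePointDump,
codePointsHex, utf16Hex, utf8Hex) are just names I made up for illustration;
only standard JDK calls are used.  The sample input in main() is the byte
sequence quoted earlier in this thread, so substitute your own string.

```java
import java.nio.charset.StandardCharsets;
import java.util.StringJoiner;

// Hypothetical helper (not part of UIMA): dumps a string as hex in three forms.
public class CodePointDump {

    // Unicode code points, e.g. "U+16980 U+FFFD".
    public static String codePointsHex(String s) {
        StringJoiner out = new StringJoiner(" ");
        s.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out.toString();
    }

    // UTF-16 code units as stored in a Java String (surrogate pairs show as two units).
    public static String utf16Hex(String s) {
        StringJoiner out = new StringJoiner(" ");
        for (char c : s.toCharArray()) {
            out.add(String.format("%04X", (int) c));
        }
        return out.toString();
    }

    // UTF-8 bytes; masking with 0xFF avoids sign-extended output such as ffffff96.
    public static String utf8Hex(String s) {
        StringJoiner out = new StringJoiner(" ");
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            out.add(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input: the UTF-8 bytes listed earlier in this thread.
        byte[] sample = {
            (byte) 0xF0, (byte) 0x96, (byte) 0xA6, (byte) 0x80,
            (byte) 0xF0, (byte) 0x96, (byte) 0xA6, (byte) 0x90,
            (byte) 0xEF, (byte) 0xBF, (byte) 0xBD
        };
        String text = new String(sample, StandardCharsets.UTF_8);

        System.out.println(codePointsHex(text)); // U+16980 U+16990 U+FFFD
        System.out.println(utf16Hex(text));      // D81A DD80 D81A DD90 FFFD
        System.out.println(utf8Hex(text));       // F0 96 A6 80 F0 96 A6 90 EF BF BD
    }
}
```

Running this on the string just before it is set on the CAS should show
whether the replacement character (xFFFD) is already present at that point.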

By the way, the test I ran was using UIMA-AS's runRemoteAsyncAE, not runAE,
following the quick test described in the UIMA-AS README.

On Wed, Dec 14, 2016 at 4:15 PM, Marshall Schor <m...@schor.com> wrote:

> Maybe we've been on the wrong line of thinking.
>
> Perhaps the translation between UTF-8 (during transportation) and the
> string characters is fine, but the XML parsing is restricting the
> character set it uses.
>
> See https://en.wikipedia.org/wiki/Valid_characters_in_XML
>
> where it says valid xml characters exclude the "surrogates", which your
> characters I think are.
>
> So, perhaps it's XML parsing which is complaining (and it appears this is
> so, from the stack trace).
>
> We should point out that UIMA's character offsets (like begin and end) were
> designed with Java String character offsets, and will perhaps not work
> correctly when surrogates are being used.
>
> A possible workaround for this particular issue may be to switch to binary
> serialization, instead of xmi serialization. This has a restriction in
> that the type systems must be identical (between the client and server).
>
> We could possibly get more confirmation of this hypothesis if you could
> say what the stack trace was, beyond the first bit which you stated in
> your original note.  There should be more stack trace information, further
> down, starting with "caused by ..." which may provide more helpful
> information.
>
> -Marshall
>
>
> On 12/14/2016 9:38 AM, nelson rivera wrote:
> > We also ran that test with the UIMA framework and the RunAE tool, with
> > the characters in a file as you did, and indeed there is no problem. The
> > problem occurs with uima-as, using sendCAS() with UimaAsynchronousEngine:
> > when the remote uima-as service tries to deserialize the CAS with
> > deserializeCasFromXmi(), I get the mentioned exception
> > "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> > Character reference "&#"
> >
> > In my case I don't read any file and don't use FileSystemCollectionReader.
> > The user enters the text, the text is stored in a Java string
> > (UTF-16) and is set on the CAS to be processed, using
> > setDocumentLanguage, then I send the CAS.
> >
> > 2016-12-13 15:10 GMT-05:00, Burn Lewis <burnle...@gmail.com>:
> >> I put these 3 characters as UTF-8 in a file in examples/data and ran the
> >> MeetingDetector annotator as described in section 3.4 of the README,
> >> adding the option "-o out".  In that folder I found the returned results
> >> in xmi format with the characters in the sofaString element.  The
> >> relevant part of this file in hex is:
> >>
> >> 000002e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".........
> >> 000002f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V
> >>
> >> Note that the FileSystemCollectionReader by default uses the system
> >> encoding but you could add a ConfigurationParameterSetting of UTF-8 for
> >> the Encoding parameter in its descriptor.
> >>
> >> With the client & server on different (Linux) machines I see no problem
> >> with sending UTF-8 characters.
> >>
> >>
> >> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor <m...@schor.com> wrote:
> >>
> >>> another question:  I assume there are perhaps 2 machines involved here
> >>> (it's a UIMA-AS setup).
> >>>
> >>> From the exception, it appears that the error happens when the client
> >>> sends the CAS to the remote.
> >>>
> >>> Can you print out the Linux (assuming that's the OS) default locale for
> >>> both machines?  (e.g. type "locale" into a command line and see what
> >>> each machine has as its default character encoding).
> >>>
> >>> Please let us know what these are.
> >>>
> >>> Thanks. -Marshall
> >>>
> >>>
> >>>
> >>> On 12/12/2016 1:58 PM, nelson rivera wrote:
> >>>> Yes these are the values of the troublesome characters, using
> >>>> Integer.toHexString() to print out each byte, shows
> >>>>
> >>>> fffffff0 ffffff96 ffffffa6 ffffff80
> >>>>
> >>>> fffffff0 ffffff96 ffffffa6 ffffff90
> >>>>
> >>>> ffffffef ffffffbf ffffffbd
> >>>>
> >>>> ffffffef ffffffbf ffffffbd
> >>>>
> >>>> 2016-12-12 11:35 GMT-05:00, Marshall Schor <m...@schor.com>:
> >>>>> Hi Nelson,
> >>>>>
> >>>>> Looking into this... Can you please confirm that the UTF-8 coding of
> >>>>> the troublesome characters, in hexadecimal, is:
> >>>>>
> >>>>> F0 96 A6 80
> >>>>>
> >>>>> F0 96 A6 90
> >>>>>
> >>>>> EF BF BD
> >>>>>
> >>>>> EF BF BD
> >>>>>
> >>>>> If you have the string in Java, please try converting it to a UTF-8
> >>>>> string using something like:
> >>>>>   byte[] theBytes = myTestString.getBytes("UTF-8");
> >>>>>
> >>>>>   and then print out theBytes in hex; they should look like the above.
> >>>>> If not, please let us know what the values are instead.
> >>>>>
> >>>>>
> >>>>> Thanks. -Marshall
> >>>>>
> >>>>>
> >>>>> On 12/9/2016 9:02 AM, nelson rivera wrote:
> >>>>>> Hi, I read your explanation and saw the link, but in my case I
> >>>>>> don't read any xml file. I just copy the text, get a new input CAS
> >>>>>> from UimaAsynchronousEngine with getCAS(), set the text in the CAS
> >>>>>> and send the request with sendCAS(). I use uima-as API 2.9.0 on the
> >>>>>> client side. Apparently the characters are replaced by their
> >>>>>> corresponding entities when the CAS is serialized for sending, but I
> >>>>>> get the mentioned exception "org.xml.sax.SAXParseException;
> >>>>>> lineNumber: 1; columnNumber: 571; Character reference "&#"
> >>>>>> in the installed uima-as framework when it tries to deserialize the
> >>>>>> CAS with deserializeCasFromXmi(), to be processed by the service.
> >>>>>>
> >>>>>> 2016-12-08 16:48 GMT-05:00, Marshall Schor <m...@schor.com>:
> >>>>>>> Hi Nelson,
> >>>>>>>
> >>>>>>> I can't see the characters (sorry).
> >>>>>>>
> >>>>>>> This might be an issue caused by a discrepancy between the coding of
> >>>>>>> the file being read, and the coding indicated on the xml header.  Can
> >>>>>>> you check that those two things are the same?
> >>>>>>>
> >>>>>>> See
> >>>>>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
> >>>>>>> for example.
> >>>>>>>
> >>>>>>> -Marshall
> >>>>>>>
> >>>>>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
> >>>>>>>> I tried to process the following text in a service deployed in
> >>>>>>>> uima-as, because it is input to my application. This is the text:
> >>>>>>>> 𖦀  𖦐  � �.
> >>>>>>>> These characters correspond to the Bamun language, and apparently
> >>>>>>>> are not invalid XML characters because tools such as browsers
> >>>>>>>> interpret and display them. After getting a new input CAS for
> >>>>>>>> processing, setting the text and sending the request, I get the
> >>>>>>>> exception shown below in uima-as. The uima-as framework keeps
> >>>>>>>> working and recovers correctly, it just does not process these
> >>>>>>>> characters.
> >>>>>>>> Could you tell me what happens with these characters? Is one of
> >>>>>>>> them an invalid character for the uima-as framework?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 04:00:31.606 - 14:
> >>>>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
> >>>>>>>> WARNING:
> >>>>>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> >>>>>>>> Character reference "&#
> >>>>>>>>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
> >>>>>>>>         at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
> >>>>>>>>         at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
> >>>>>>>>         at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
> >>>
>
>
