Re: Proccesing Bamun characters

2016-12-27 Thread nelson rivera
After some investigation, I found the root cause. I had the Xalan
library on my classpath and, following the ordered lookup procedure of
javax.xml.transform.TransformerFactory.newInstance(), the Xalan
implementation of TransformerFactory was being used. The
XmiCasSerializer mechanism also uses SAXTransformerFactory, which
extends TransformerFactory.
Switching to the system-default implementation, by setting the
"javax.xml.transform.TransformerFactory" system property, gives the
expected result: when the input CAS is serialized, each Unicode
supplementary character is written as one complete entity, instead of
as the entities of the two surrogate code units that represent it, so
no deserialization problem occurs in the uima-as service.
In the end it seems to be a bug in the XML transform engine that was
used.
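
A minimal sketch of the workaround described above, for anyone hitting
the same problem. The class name ForceDefaultTransformer is made up for
illustration, and the implementation class name in JDK_DEFAULT is an
assumption: it is the bundled factory on the Oracle/OpenJDK builds I
know of, so please verify it on your own JDK. The property has to be
set before the first TransformerFactory.newInstance() call.

```java
import javax.xml.transform.TransformerFactory;

public class ForceDefaultTransformer {
    // Assumption: class name of the JDK's bundled XSLT factory (Oracle/OpenJDK).
    static final String JDK_DEFAULT =
        "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";

    /** Returns the class name of the factory that newInstance() resolves to. */
    static String resolvedFactory() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // The system property is consulted first in the lookup order, so it
        // takes precedence over a Xalan jar found on the classpath.
        System.setProperty("javax.xml.transform.TransformerFactory", JDK_DEFAULT);
        System.out.println("TransformerFactory in use: " + resolvedFactory());
    }
}
```

The same setting can be passed on the command line with
-Djavax.xml.transform.TransformerFactory=<class name>, which avoids
touching application code.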




2016-12-19 10:03 GMT-05:00, nelson rivera :
> I understand, and yes, these characters should not appear in the
> serialized CAS, but they do appear when using
> XmiCasSerializer.serialize(cas.getCas(), outStream):
>
> ... mimeType="text" sofaString="��  ��  �
> �"/>...
>
> My application does not use FileSystemCollectionReader.
> The user enters the text, the text is stored in a Java String
> (UTF-16) and set on the CAS that will be processed, using
> setDocumentLanguage, and then I send the CAS.

Re: Proccesing Bamun characters

2016-12-19 Thread nelson rivera
I understand, and yes, these characters should not appear in the
serialized CAS, but they do appear when using
XmiCasSerializer.serialize(cas.getCas(), outStream):

..

My application does not use FileSystemCollectionReader.
The user enters the text, the text is stored in a Java String
(UTF-16) and set on the CAS that will be processed, using
setDocumentLanguage, and then I send the CAS.
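
To illustrate what "stored in a Java String (UTF-16)" means for one of
these characters, here is a small sketch (the class name Utf16Storage
is only illustrative). It builds a String from the single code point
U+16980 and shows that Java holds it as two char values, which are the
two numbers that turn up as separate entities in the XMI.

```java
public class Utf16Storage {
    /** Builds a String holding exactly one Unicode code point. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x16980);
        System.out.println(s.length());                      // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 Unicode character
        // The two code units are the high and low surrogate:
        System.out.println((int) s.charAt(0) + " " + (int) s.charAt(1)); // 55322 56704
    }
}
```

Since String offsets count code units, each such character occupies two
positions in any begin/end offsets computed on the Java side.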

2016-12-18 23:06 GMT-05:00, Burn Lewis :
> Since these characters are above the basic UTF-16 limit they are
> represented as 2 UTF-16 characters with high & low surrogate prefixes.  So
> 55322 + 56704 are xD81A + xDD80 and after removing the 6-bit surrogate
> prefixes of D8 & DC we have 2 10-bit numbers 1A + 180 which combine as
> 6980, and after adding 2^16 (since only characters above this need
> surrogate pairs) we have the expected x16980.
> So one mystery is: their appearance in the CAS with the &# notation.  When
> I dump the CAS in the FileSystemCollectionReader I see the UTF-8 character,
> e.g. in hex  f096 a680 f096 a690.
> What collection reader are you using?

Re: Proccesing Bamun characters

2016-12-18 Thread Burn Lewis
Since these characters are above the basic UTF-16 limit they are
represented as 2 UTF-16 characters with high & low surrogate prefixes.  So
55322 + 56704 are xD81A + xDD80 and after removing the 6-bit surrogate
prefixes of D8 & DC we have 2 10-bit numbers 1A + 180 which combine as
6980, and after adding 2^16 (since only characters above this need
surrogate pairs) we have the expected x16980.
So one mystery is: their appearance in the CAS with the &# notation.  When
I dump the CAS in the FileSystemCollectionReader I see the UTF-8 character,
e.g. in hex  f096 a680 f096 a690.
What collection reader are you using?
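
The arithmetic above can be checked with a short sketch. The class and
method names (SurrogateMath, combine) are made up for illustration; the
manual calculation is compared against the JDK's own
Character.toCodePoint().

```java
public class SurrogateMath {
    /** Combines a high and a low surrogate code unit into one code point by hand. */
    static int combine(int high, int low) {
        int high10 = high & 0x3FF;  // drop the 6-bit prefix of xD81A -> x1A
        int low10  = low  & 0x3FF;  // drop the 6-bit prefix of xDD80 -> x180
        return ((high10 << 10) | low10) + 0x10000;  // x6980 + 2^16
    }

    public static void main(String[] args) {
        int manual  = combine(55322, 56704);
        int library = Character.toCodePoint((char) 55322, (char) 56704);
        System.out.println(Integer.toHexString(manual)); // 16980
        System.out.println(manual == library);           // true
    }
}
```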

On Fri, Dec 16, 2016 at 5:45 PM, nelson rivera 
wrote:

> This is the CAS serialized to XMI before it is sent to the uima-as
> service, using XmiCasSerializer.serialize(cas.getCas(), outStream).
> The representation of the characters in this serialization does not
> match the representation of the problem characters: what is being
> serialized are code point escape sequences for the Bamum characters,
> two per character.
> Why can this happen? Any suggestions?
>
>  xmlns:cas="http:///uima/cas.ecore"; xmlns:xmi="http://www.omg.org/XMI";
> xmlns:pln="http:///cu/datys/xinetica/uima/api/pln.ecore";
> xmlns:tcas="http:///uima/tcas.ecore";
> xmlns:api="http:///cu/datys/xinetica/uima/api.ecore";
> xmi:version="2.0"> xmi:id="8" sofa="1" begin="0" end="12"
> language="x-unspecified"/> sofaID="_InitialView" mimeType="text" sofaString="��
> ��  �  �"/>

Re: Proccesing Bamun characters

2016-12-16 Thread nelson rivera
This is the CAS serialized to XMI before it is sent to the uima-as
service, using XmiCasSerializer.serialize(cas.getCas(), outStream).
The representation of the characters in this serialization does not
match the representation of the problem characters: what is being
serialized are code point escape sequences for the Bamum characters,
two per character.
Why can this happen? Any suggestions?

http:///uima/cas.ecore"; xmlns:xmi="http://www.omg.org/XMI";
xmlns:pln="http:///cu/datys/xinetica/uima/api/pln.ecore";
xmlns:tcas="http:///uima/tcas.ecore";
xmlns:api="http:///cu/datys/xinetica/uima/api.ecore";
xmi:version="2.0">
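
One way to confirm this symptom before sending is to scan the
serialized XMI for decimal character references that fall in the
surrogate range (xD800 to xDFFF), which XML does not allow as
characters. This is only a rough sketch: the class name
SurrogateRefCheck is invented, it handles only the decimal &#NNN; form
seen here (not &#xHHH;), and it is not part of UIMA.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SurrogateRefCheck {
    // Decimal character references only; at most 9 digits so parseInt cannot overflow.
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,9});");

    /** Counts decimal character references whose value is a surrogate code unit. */
    static int countSurrogateRefs(String xml) {
        Matcher m = DECIMAL_REF.matcher(xml);
        int count = 0;
        while (m.find()) {
            int value = Integer.parseInt(m.group(1));
            if (value >= 0xD800 && value <= 0xDFFF) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Two references, one per surrogate: the broken form.
        System.out.println(countSurrogateRefs("sofaString=\"&#55322;&#56704;\"")); // 2
        // One reference for the whole code point U+16980 (decimal 92544): the legal form.
        System.out.println(countSurrogateRefs("sofaString=\"&#92544;\""));         // 0
    }
}
```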


2016-12-16 14:06 GMT-05:00, Burn Lewis :
> Sorry, I missed the supplement set.  So the tests I did with x16980 &
> x16990 are valid.  runRemoteAsyncAE uses the same
> FileSystemCollectionReader as runAE does ... did you use a different
> collection reader?  If a custom one perhaps you could serialize the cas to
> a file as XMI and verify that the XMI is legal.

Re: Proccesing Bamun characters

2016-12-16 Thread Burn Lewis
Sorry, I missed the supplement set.  So the tests I did with x16980 &
x16990 are valid.  runRemoteAsyncAE uses the same
FileSystemCollectionReader as runAE does ... did you use a different
collection reader?  If a custom one perhaps you could serialize the cas to
a file as XMI and verify that the XMI is legal.

On Fri, Dec 16, 2016 at 8:37 AM, nelson rivera 
wrote:

> According to Wikipedia, the Bamum script
> (https://en.wikipedia.org/wiki/Bamum_script) has another valid
> range, U+16800–U+16A3F, and any of these characters generates the
> same log trace. I will continue testing Marshall Schor's
> suggestion.

Re: Proccesing Bamun characters

2016-12-16 Thread nelson rivera
According to Wikipedia, the Bamum script
(https://en.wikipedia.org/wiki/Bamum_script) has another valid
range, U+16800–U+16A3F, and any of these characters generates the
same log trace. I will continue testing Marshall Schor's
suggestion.
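
For reference, the two ranges can be told apart from Java itself. The
sketch below (class name BamumBlocks is illustrative) asks
Character.UnicodeBlock which block a code point belongs to, assuming
Java 7 or later, where the Bamum blocks are defined.

```java
public class BamumBlocks {
    /** Returns the Unicode block that contains the given code point. */
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        // xA6A0 is in the basic Bamum range and fits in a single char.
        System.out.println(blockOf(0xA6A0) + " supplementary="
            + Character.isSupplementaryCodePoint(0xA6A0));
        // x16980 is in the supplement range and needs a surrogate pair.
        System.out.println(blockOf(0x16980) + " supplementary="
            + Character.isSupplementaryCodePoint(0x16980));
    }
}
```

Only the characters from the second range need surrogate pairs in a
Java String, which matches the observation that they are the ones that
fail.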

2016-12-14 18:07 GMT-05:00, Burn Lewis :
> I think there's another problem ... the characters we have tested with are
> not in the Bamum unicode set.  The first 2 that Marshall listed in utf-8
> (F0 96 A6 80 & F0 96 A6 90) are in hex x16980 & x16990 and the 3rd (EF BF
> BD) is xFFFD.  This last one is the "replacement character" used when an
> illegal character is encountered.  According to Wikipedia the 88 Bamum
> characters are in the range xA6A0 - xA6F7.
>
> In order to reproduce your problem we need to use the same codepoints.  Can
> you tell us what the hex values of the failing characters are, in UTF-8 or
> UTF-16?
>
> By the way, the test I ran was using UIMA-AS's runRemoteAsyncAE, not runAE,
> following the quick test described in the UIMA-AS README.
>> >>
>> >> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor  wrote:
>> >>
>> >> >>> another question:  I assume there are perhaps 2 machines involved
>> >> >>> here (it's a UIMA-AS setup).
>> >> >>>
>> >> >>> From the exception, it appears that the error happens when the
>> >> >>> client sends the CAS to the remote.
>> >> >>>
>> >> >>> Can you print out the Linux (assuming that's the OS) default locale
>> >> >>> for both machines?  (e.g. type into a command line "locale" and see
>> >> >>> what each machine has as its default character encoding).
>> >> >>>
>> >> >>> Please let us know what these are.
>> >> >>>
>> >> >>> Thanks. -Marshall
>> >> >>>
>> >> >>> On 12/12/2016 1:58 PM, nelson rivera wrote:
>> >> >>>> Yes these are the values of the troublesome characters, using
>> >> >>>> Integer.toHexString() to print out each byte, shows
>> >> >>>>
>> >> >>>> fff0 ff96 ffa6 ff80
>> >> >>>>
>> >> >>>> fff0 ff96 ffa6 ff90

Re: Proccesing Bamun characters

2016-12-14 Thread Burn Lewis
I think there's another problem ... the characters we have tested with are
not in the Bamum unicode set.  The first 2 that Marshall listed in utf-8
(F0 96 A6 80 & F0 96 A6 90) are in hex x16980 & x16990 and the 3rd (EF BF
BD) is xFFFD.  This last one is the "replacement character" used when an
illegal character is encountered.  According to Wikipedia the 88 Bamum
characters are in the range xA6A0 - xA6F7.
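
The byte-to-code-point mapping quoted above can be reproduced with a
few lines of Java. The class and method names are illustrative only;
the decoding itself uses the standard String constructor with
StandardCharsets.UTF_8.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ToCodePoint {
    /** Decodes UTF-8 bytes and returns the first code point as upper-case hex. */
    static String firstCodePointHex(int... bytes) {
        byte[] raw = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            raw[i] = (byte) bytes[i];
        }
        String s = new String(raw, StandardCharsets.UTF_8);
        return String.format("%X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(firstCodePointHex(0xF0, 0x96, 0xA6, 0x80)); // 16980
        System.out.println(firstCodePointHex(0xF0, 0x96, 0xA6, 0x90)); // 16990
        System.out.println(firstCodePointHex(0xEF, 0xBF, 0xBD));       // FFFD
    }
}
```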

In order to reproduce your problem we need to use the same codepoints.  Can
you tell us what the hex values of the failing characters are, in UTF-8 or
UTF-16?

By the way, the test I ran was using UIMA-AS's runRemoteAsyncAE, not runAE,
following the quick test described in the UIMA-AS README.

On Wed, Dec 14, 2016 at 4:15 PM, Marshall Schor  wrote:

> Maybe we've been on the wrong line of thinking.
>
> Perhaps the translation between UTF-8 (during transportation) and the
> string characters is fine, but the XML parsing is restricting the
> character set it uses.
>
> See https://en.wikipedia.org/wiki/Valid_characters_in_XML
>
> where it says valid xml characters exclude the "surrogates", which your
> characters I think are.
>
> So, perhaps it's XML parsing which is complaining (and it appears this is
> so, from the stack trace).
>
> We should point out that UIMA's character offsets (like begin and end)
> were designed with Java String character offsets, and will perhaps not
> work correctly when surrogates are being used.
>
> A possible workaround for this particular issue may be to switch to binary
> serialization, instead of xmi serialization. This has a restriction in
> that the type systems must be identical (between the client and server).
>
> We could possibly get more confirmation of this hypothesis if you could
> say what the stack trace was, beyond the first bit which you stated in
> your original note.  There should be more stack trace information,
> further down, starting with "caused by ..." which may provide more
> helpful information.
>
> -Marshall
>
>
> On 12/14/2016 9:38 AM, nelson rivera wrote:
> > We also did that test with uima framework and RunAE tool and
> > thecharacters in a file as you, and effectively not exist problem. The
> > problem is use uima-as,  sendCAS() with UimaAsynchronousEngine and
> > when trying to deserialize the cas deserializeCasFromXmi() in remote
> > uima-as service, that  i get the mentioned exception
> > "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> > Character reference "&#"
> >
> > In my case i don't read any file, not use FileSystemCollectionReader.
> > The user introduces the text, the text is stored in string java
> > (utf-16) and it set to the cas that will be processing, using
> > setDocumentLanguage, then i send the cas.
> >
> > 2016-12-13 15:10 GMT-05:00, Burn Lewis :
> >> I put these 3 characters as UTF-8 in a file in examples/data and ran the
> >> MeetingDetector annotator as described in section 3.4 of the README,
> adding
> >> the option "-o out".  In that folder I found the returned results in xmi
> >> format with the characters in the sofaString element.  The relevant
> part of
> >> this file in hex is:
> >>
> >> 02e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".
> >> 02f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V
> >>
> >> Note that the FileSystemCollectionReader by default uses the system
> >> encoding but you could add a ConfigurationParameterSetting of UTF-8 for
> the
> >> Encoding parameter in its descriptor.
> >>
> >> With the client & server on different (Linux) machines I see no problem
> >> with sending UTF-8 characters.
> >>
> >>
> >> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor  wrote:
> >>
> >>> another question:  I assume there are perhaps 2 machines involved, here
> >>> (it's a
> >>> UIMA-AS setup).
> >>>
> >>> From the exception, it appears that the error happen when the client
> >>> sends
> >>> the
> >>> CAS to the remote.
> >>>
> >>> Can you print out the Linux (assuming that's the OS) default locale for
> >>> both
> >>> machines?  (e.g. type into a command line "locale" and see what each
> >>> machines
> >>> has as its default character encoding).
> >>>
> >>> Please let us know what these are.
> >>>
> >>> Thanks. -Marshall
> >>>
> >>>
> >>>
> >>> On 12/12/2016 1:58 PM, nelson rivera wrote:
>  Yes these are the values of the troublesome characters, using
>  Integer.toHexString() to print out each byte, shows
> 
>  fff0 ff96 ffa6 ff80
> 
>  fff0 ff96 ffa6 ff90
> 
>  ffef ffbf ffbd
> 
>  ffef ffbf ffbd
> 
>  2016-12-12 11:35 GMT-05:00, Marshall Schor :
> > Hi Nelson,
> >
> > Looking into this... Can you please confirm that the UTF-8 coding of
> > the
> > troublesome characters, in hexadecimal, is:
> >
> > F0 96 A6 80
> >
> > F0 96 A6 90
> >
> > EF BF BD
> >
> > EF BF BD
> >
> > If you have the string in Java

Re: Proccesing Bamun characters

2016-12-14 Thread Marshall Schor
Maybe we've been on the wrong line of thinking. 

Perhaps the translation between UTF-8 (during transportation) and the string
characters is fine, but the XML parsing is restricting the character set it 
uses.

See https://en.wikipedia.org/wiki/Valid_characters_in_XML

where it says valid xml characters exclude the "surrogates", which your
characters I think are.

So, perhaps it's XML parsing which is complaining (and it appears this is so,
from the stack trace).

We should point out that UIMA's character offsets (like begin and end) were
designed with Java String character offsets, and will perhaps not work correctly
when surrogates are being used.
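
To make the offset concern concrete, here is a small sketch using only the JDK (the class and method names are illustrative, not UIMA API). It shows that one supplementary character such as x16980 occupies two UTF-16 code units in a Java String, so String-based offsets count it twice.

```java
public class SurrogateDemo {
    // Return the UTF-16 code units of one code point as space-separated hex.
    public static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x16980));
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(utf16Units(0x16980));             // D81A DD80
    }
}
```

An annotation covering just this character would therefore have end - begin equal to 2, not 1, if offsets follow Java String indexing.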

A possible workaround for this particular issue may be to switch to binary
serialization, instead of xmi serialization. This has a restriction in that the
type systems must be identical (between the client and server).

We could possibly get more confirmation of this hypothesis if you could say what
the stack trace was, beyond the first bit which you stated in your original
note.  There should be more stack trace information, further down, starting with
"caused by ..." which may provide more helpful information.

-Marshall


On 12/14/2016 9:38 AM, nelson rivera wrote:
> We also did that test with uima framework and RunAE tool and
> thecharacters in a file as you, and effectively not exist problem. The
> problem is use uima-as,  sendCAS() with UimaAsynchronousEngine and
> when trying to deserialize the cas deserializeCasFromXmi() in remote
> uima-as service, that  i get the mentioned exception
> "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> Character reference "&#"
>
> In my case i don't read any file, not use FileSystemCollectionReader.
> The user introduces the text, the text is stored in string java
> (utf-16) and it set to the cas that will be processing, using
> setDocumentLanguage, then i send the cas.
>
> 2016-12-13 15:10 GMT-05:00, Burn Lewis :
>> I put these 3 characters as UTF-8 in a file in examples/data and ran the
>> MeetingDetector annotator as described in section 3.4 of the README, adding
>> the option "-o out".  In that folder I found the returned results in xmi
>> format with the characters in the sofaString element.  The relevant part of
>> this file in hex is:
>>
>> 02e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".
>> 02f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V
>>
>> Note that the FileSystemCollectionReader by default uses the system
>> encoding but you could add a ConfigurationParameterSetting of UTF-8 for the
>> Encoding parameter in its descriptor.
>>
>> With the client & server on different (Linux) machines I see no problem
>> with sending UTF-8 characters.
>>
>>
>> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor  wrote:
>>
>>> another question:  I assume there are perhaps 2 machines involved, here
>>> (it's a
>>> UIMA-AS setup).
>>>
>>> From the exception, it appears that the error happen when the client
>>> sends
>>> the
>>> CAS to the remote.
>>>
>>> Can you print out the Linux (assuming that's the OS) default locale for
>>> both
>>> machines?  (e.g. type into a command line "locale" and see what each
>>> machines
>>> has as its default character encoding).
>>>
>>> Please let us know what these are.
>>>
>>> Thanks. -Marshall
>>>
>>>
>>>
>>> On 12/12/2016 1:58 PM, nelson rivera wrote:
 Yes these are the values of the troublesome characters, using
 Integer.toHexString() to print out each byte, shows

 fff0 ff96 ffa6 ff80

 fff0 ff96 ffa6 ff90

 ffef ffbf ffbd

 ffef ffbf ffbd

 2016-12-12 11:35 GMT-05:00, Marshall Schor :
> Hi Nelson,
>
> Looking into this... Can you please confirm that the UTF-8 coding of
> the
> troublesome characters, in hexadecimal, is:
>
> F0 96 A6 80
>
> F0 96 A6 90
>
> EF BF BD
>
> EF BF BD
>
> If you have the string in Java, please try converting it to a UTF-8
>>> string
> using
> something like:
>   byte[] theBytes = myTestString.getBytes("UTF-8");
>
>   and then print out theBytes in hex; they should look like the above.
>>> If
> not,
> please let us know what the values is instead.
>
>
> Thanks. -Marshall
>
>
> On 12/9/2016 9:02 AM, nelson rivera wrote:
>> Hi i was read your explication and saw the link, but in my case, i
>> don't read any xml file. Just i copy the text, get a new input cas
>> from UimaAsynchronousEngine with getCAS(), set the text in the cas
>> and
>> send the request whit sendCAS(). I use uima-as API 2.9.0 in the
>> client
>> side. Apparently the characters are changed for its entities
>> corresponding when serialize the cas to send it, but i get the
>> mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
>> columnNumber: 571; Character reference "&#"
>> in uima-as fr

Re: Proccesing Bamun characters

2016-12-14 Thread Marshall Schor
Hi Nelson, thanks for clarifying.

Can you say what the default locales were on the two machines?

Are they both UTF-8?  If not, could you try setting both to that?

Also, can you capture the first part of the XMI cas being serialized to the
remote, in byte format, and confirm it is encoded (I think as UTF-8, not sure),
and also see if the xml header for the xmi serialized cas starts with a string:



Thanks -Marshall

On 12/14/2016 9:38 AM, nelson rivera wrote:
> We also did that test with uima framework and RunAE tool and
> thecharacters in a file as you, and effectively not exist problem. The
> problem is use uima-as,  sendCAS() with UimaAsynchronousEngine and
> when trying to deserialize the cas deserializeCasFromXmi() in remote
> uima-as service, that  i get the mentioned exception
> "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> Character reference "&#"
>
> In my case i don't read any file, not use FileSystemCollectionReader.
> The user introduces the text, the text is stored in string java
> (utf-16) and it set to the cas that will be processing, using
> setDocumentLanguage, then i send the cas.
>
> 2016-12-13 15:10 GMT-05:00, Burn Lewis :
>> I put these 3 characters as UTF-8 in a file in examples/data and ran the
>> MeetingDetector annotator as described in section 3.4 of the README, adding
>> the option "-o out".  In that folder I found the returned results in xmi
>> format with the characters in the sofaString element.  The relevant part of
>> this file in hex is:
>>
>> 02e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".
>> 02f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V
>>
>> Note that the FileSystemCollectionReader by default uses the system
>> encoding but you could add a ConfigurationParameterSetting of UTF-8 for the
>> Encoding parameter in its descriptor.
>>
>> With the client & server on different (Linux) machines I see no problem
>> with sending UTF-8 characters.
>>
>>
>> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor  wrote:
>>
>>> another question:  I assume there are perhaps 2 machines involved, here
>>> (it's a
>>> UIMA-AS setup).
>>>
>>> From the exception, it appears that the error happen when the client
>>> sends
>>> the
>>> CAS to the remote.
>>>
>>> Can you print out the Linux (assuming that's the OS) default locale for
>>> both
>>> machines?  (e.g. type into a command line "locale" and see what each
>>> machines
>>> has as its default character encoding).
>>>
>>> Please let us know what these are.
>>>
>>> Thanks. -Marshall
>>>
>>>
>>>
>>> On 12/12/2016 1:58 PM, nelson rivera wrote:
 Yes these are the values of the troublesome characters, using
 Integer.toHexString() to print out each byte, shows

 fff0 ff96 ffa6 ff80

 fff0 ff96 ffa6 ff90

 ffef ffbf ffbd

 ffef ffbf ffbd

 2016-12-12 11:35 GMT-05:00, Marshall Schor :
> Hi Nelson,
>
> Looking into this... Can you please confirm that the UTF-8 coding of
> the
> troublesome characters, in hexadecimal, is:
>
> F0 96 A6 80
>
> F0 96 A6 90
>
> EF BF BD
>
> EF BF BD
>
> If you have the string in Java, please try converting it to a UTF-8
>>> string
> using
> something like:
>   byte[] theBytes = myTestString.getBytes("UTF-8");
>
>   and then print out theBytes in hex; they should look like the above.
>>> If
> not,
> please let us know what the values is instead.
>
>
> Thanks. -Marshall
>
>
> On 12/9/2016 9:02 AM, nelson rivera wrote:
>> Hi i was read your explication and saw the link, but in my case, i
>> don't read any xml file. Just i copy the text, get a new input cas
>> from UimaAsynchronousEngine with getCAS(), set the text in the cas
>> and
>> send the request whit sendCAS(). I use uima-as API 2.9.0 in the
>> client
>> side. Apparently the characters are changed for its entities
>> corresponding when serialize the cas to send it, but i get the
>> mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
>> columnNumber: 571; Character reference "&#"
>> in uima-as framework installed when trying to deserialize the cas
>> deserializeCasFromXmi(),to be processed for the service.
>>
>> 2016-12-08 16:48 GMT-05:00, Marshall Schor :
>>> Hi Nelson,
>>>
>>> I can't see the characters (sorry).
>>>
>>> This might be an issue caused by a discrepancy between the coding of
>>> the
>>> file
>>> being read, and the coding indicated on the xml header.  Can you
>>> check
>>> that
>>> those two things are the same?
>>>
>>> See
>>> http://stackoverflow.com/questions/5165347/what-use-is-
>>> the-encoding-in-the-xml-header
>>> for example.
>>>
>>> -Marshall
>>>
>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
 i tried to proccess the following text

Re: Proccesing Bamun characters

2016-12-14 Thread nelson rivera
We also ran that test with the UIMA framework and the RunAE tool, with the
characters in a file as you did, and indeed there was no problem. The
problem occurs with UIMA-AS: I call sendCAS() on the UimaAsynchronousEngine,
and when the remote UIMA-AS service tries to deserialize the CAS in
deserializeCasFromXmi(), I get the mentioned exception
"org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
Character reference "&#"

In my case I don't read any file and don't use FileSystemCollectionReader.
The user enters the text, the text is stored in a Java String
(UTF-16) and set on the CAS that will be processed, using
setDocumentLanguage, and then I send the CAS.
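
A possible way to reproduce this kind of parse error outside UIMA-AS is sketched below. It assumes that the serialized XMI contained separate numeric character references for the two surrogate code units of a supplementary character; the exception text above is cut off after "&#", so this is a guess at the cause and not a confirmed trace. The class name, element name and attribute name are invented for the example, and only the JDK's own SAX parser is used.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SurrogateReferenceCheck {
    // Returns true if the default SAX parser accepts the document,
    // false if it rejects it with a SAXException.
    public static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two references, one per surrogate half (55322 = xD81A, 56704 = xDD80).
        System.out.println(parses("<a s=\"&#55322;&#56704;\"/>"));
        // One reference for the whole code point x16980.
        System.out.println(parses("<a s=\"&#x16980;\"/>"));
    }
}
```

With the parser bundled in the JDK, the first document should be rejected, because a character reference must name a valid XML character and surrogate code units are excluded, while the second should parse. A different parser on the classpath may behave differently.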

2016-12-13 15:10 GMT-05:00, Burn Lewis :
> I put these 3 characters as UTF-8 in a file in examples/data and ran the
> MeetingDetector annotator as described in section 3.4 of the README, adding
> the option "-o out".  In that folder I found the returned results in xmi
> format with the characters in the sofaString element.  The relevant part of
> this file in hex is:
>
> 02e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".
> 02f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V
>
> Note that the FileSystemCollectionReader by default uses the system
> encoding but you could add a ConfigurationParameterSetting of UTF-8 for the
> Encoding parameter in its descriptor.
>
> With the client & server on different (Linux) machines I see no problem
> with sending UTF-8 characters.
>
>
> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor  wrote:
>
>> another question:  I assume there are perhaps 2 machines involved, here
>> (it's a
>> UIMA-AS setup).
>>
>> From the exception, it appears that the error happen when the client
>> sends
>> the
>> CAS to the remote.
>>
>> Can you print out the Linux (assuming that's the OS) default locale for
>> both
>> machines?  (e.g. type into a command line "locale" and see what each
>> machines
>> has as its default character encoding).
>>
>> Please let us know what these are.
>>
>> Thanks. -Marshall
>>
>>
>>
>> On 12/12/2016 1:58 PM, nelson rivera wrote:
>> > Yes these are the values of the troublesome characters, using
>> > Integer.toHexString() to print out each byte, shows
>> >
>> > fff0 ff96 ffa6 ff80
>> >
>> > fff0 ff96 ffa6 ff90
>> >
>> > ffef ffbf ffbd
>> >
>> > ffef ffbf ffbd
>> >
>> > 2016-12-12 11:35 GMT-05:00, Marshall Schor :
>> >> Hi Nelson,
>> >>
>> >> Looking into this... Can you please confirm that the UTF-8 coding of
>> >> the
>> >> troublesome characters, in hexadecimal, is:
>> >>
>> >> F0 96 A6 80
>> >>
>> >> F0 96 A6 90
>> >>
>> >> EF BF BD
>> >>
>> >> EF BF BD
>> >>
>> >> If you have the string in Java, please try converting it to a UTF-8
>> string
>> >> using
>> >> something like:
>> >>   byte[] theBytes = myTestString.getBytes("UTF-8");
>> >>
>> >>   and then print out theBytes in hex; they should look like the above.
>> If
>> >> not,
>> >> please let us know what the values is instead.
>> >>
>> >>
>> >> Thanks. -Marshall
>> >>
>> >>
>> >> On 12/9/2016 9:02 AM, nelson rivera wrote:
>> >>> Hi i was read your explication and saw the link, but in my case, i
>> >>> don't read any xml file. Just i copy the text, get a new input cas
>> >>> from UimaAsynchronousEngine with getCAS(), set the text in the cas
>> >>> and
>> >>> send the request whit sendCAS(). I use uima-as API 2.9.0 in the
>> >>> client
>> >>> side. Apparently the characters are changed for its entities
>> >>> corresponding when serialize the cas to send it, but i get the
>> >>> mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
>> >>> columnNumber: 571; Character reference "&#"
>> >>> in uima-as framework installed when trying to deserialize the cas
>> >>> deserializeCasFromXmi(),to be processed for the service.
>> >>>
>> >>> 2016-12-08 16:48 GMT-05:00, Marshall Schor :
>>  Hi Nelson,
>> 
>>  I can't see the characters (sorry).
>> 
>>  This might be an issue caused by a discrepancy between the coding of
>> the
>>  file
>>  being read, and the coding indicated on the xml header.  Can you
>>  check
>>  that
>>  those two things are the same?
>> 
>>  See
>>  http://stackoverflow.com/questions/5165347/what-use-is-
>> the-encoding-in-the-xml-header
>>  for example.
>> 
>>  -Marshall
>> 
>>  On 12/8/2016 4:20 PM, nelson rivera wrote:
>> > i tried to proccess the following text in a service deploy in
>> uima-as,
>> > because is input of my application. This is the text : 𖦀  𖦐  �
>> > �.
>> > These characters correspond to the bamun language, and apparently
>> > are
>> > not  invalid xml characters because tools such as browsers
>> > interpret
>> > it and show it. After get a new input cas to proccesing, set the
>> > text
>> > and send the request, i get  the exception that i show below in
>> > uima-as, the framework uima-as work and recovers correctly, just
>> > n

Re: Proccesing Bamun characters

2016-12-13 Thread Burn Lewis
I put these 3 characters as UTF-8 in a file in examples/data and ran the
MeetingDetector annotator as described in section 3.4 of the README, adding
the option "-o out".  In that folder I found the returned results in xmi
format with the characters in the sofaString element.  The relevant part of
this file in hex is:

02e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".
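02f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V

Note that the FileSystemCollectionReader by default uses the system
encoding but you could add a ConfigurationParameterSetting of UTF-8 for the
Encoding parameter in its descriptor.

With the client & server on different (Linux) machines I see no problem
with sending UTF-8 characters.


On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor  wrote: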

> another question:  I assume there are perhaps 2 machines involved, here
> (it's a
> UIMA-AS setup).
>
> From the exception, it appears that the error happen when the client sends
> the
> CAS to the remote.
>
> Can you print out the Linux (assuming that's the OS) default locale for
> both
> machines?  (e.g. type into a command line "locale" and see what each
> machines
> has as its default character encoding).
>
> Please let us know what these are.
>
> Thanks. -Marshall
>
>
>
> On 12/12/2016 1:58 PM, nelson rivera wrote:
> > Yes these are the values of the troublesome characters, using
> > Integer.toHexString() to print out each byte, shows
> >
> > fff0 ff96 ffa6 ff80
> >
> > fff0 ff96 ffa6 ff90
> >
> > ffef ffbf ffbd
> >
> > ffef ffbf ffbd
> >
> > 2016-12-12 11:35 GMT-05:00, Marshall Schor :
> >> Hi Nelson,
> >>
> >> Looking into this... Can you please confirm that the UTF-8 coding of the
> >> troublesome characters, in hexadecimal, is:
> >>
> >> F0 96 A6 80
> >>
> >> F0 96 A6 90
> >>
> >> EF BF BD
> >>
> >> EF BF BD
> >>
> >> If you have the string in Java, please try converting it to a UTF-8
> string
> >> using
> >> something like:
> >>   byte[] theBytes = myTestString.getBytes("UTF-8");
> >>
> >>   and then print out theBytes in hex; they should look like the above.
> If
> >> not,
> >> please let us know what the values is instead.
> >>
> >>
> >> Thanks. -Marshall
> >>
> >>
> >> On 12/9/2016 9:02 AM, nelson rivera wrote:
> >>> Hi i was read your explication and saw the link, but in my case, i
> >>> don't read any xml file. Just i copy the text, get a new input cas
> >>> from UimaAsynchronousEngine with getCAS(), set the text in the cas and
> >>> send the request whit sendCAS(). I use uima-as API 2.9.0 in the client
> >>> side. Apparently the characters are changed for its entities
> >>> corresponding when serialize the cas to send it, but i get the
> >>> mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
> >>> columnNumber: 571; Character reference "&#"
> >>> in uima-as framework installed when trying to deserialize the cas
> >>> deserializeCasFromXmi(),to be processed for the service.
> >>>
> >>> 2016-12-08 16:48 GMT-05:00, Marshall Schor :
>  Hi Nelson,
> 
>  I can't see the characters (sorry).
> 
>  This might be an issue caused by a discrepancy between the coding of
> the
>  file
>  being read, and the coding indicated on the xml header.  Can you check
>  that
>  those two things are the same?
> 
>  See
>  http://stackoverflow.com/questions/5165347/what-use-is-
> the-encoding-in-the-xml-header
>  for example.
> 
>  -Marshall
> 
>  On 12/8/2016 4:20 PM, nelson rivera wrote:
> > i tried to proccess the following text in a service deploy in
> uima-as,
> > because is input of my application. This is the text : 𖦀  𖦐  �  �.
> > These characters correspond to the bamun language, and apparently are
> > not  invalid xml characters because tools such as browsers interpret
> > it and show it. After get a new input cas to proccesing, set the text
> > and send the request, i get  the exception that i show below in
> > uima-as, the framework uima-as work and recovers correctly, just not
> > process this characters.
> > Could you tell me what happens with these characters, one of these is
> > invalid characters for framework uima-as?
> >
> >
> >
> > 04:00:31.606 - 14:
> > org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.
> handleProcessRequestFromRemoteClient:
> > WARNING:
> > org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> > Character reference "&#
> > at
> > com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(
> AbstractSAXParser.java:1239)
> > at
> > org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(
> UimaSerializer.java:187)
> > at
> > org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.
> deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
> > at
> > org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.
> handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
> > at
> > org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(
> ProcessRequestHandler_impl.java:1090)
> > at
> > org.apache.uima.aae.handler.input.MetadataRequestHandler_
> i

Re: Proccesing Bamun characters

2016-12-12 Thread Marshall Schor
Another question:  I assume there are perhaps 2 machines involved here (it's a
UIMA-AS setup).

From the exception, it appears that the error happens when the client sends the
CAS to the remote.

Can you print out the Linux (assuming that's the OS) default locale for both
machines?  (e.g. type "locale" into a command line and see what each machine
has as its default character encoding).

Please let us know what these are.

Thanks. -Marshall



On 12/12/2016 1:58 PM, nelson rivera wrote:
> Yes these are the values of the troublesome characters, using
> Integer.toHexString() to print out each byte, shows
>
> fff0 ff96 ffa6 ff80
>
> fff0 ff96 ffa6 ff90
>
> ffef ffbf ffbd
>
> ffef ffbf ffbd
>
> 2016-12-12 11:35 GMT-05:00, Marshall Schor :
>> Hi Nelson,
>>
>> Looking into this... Can you please confirm that the UTF-8 coding of the
>> troublesome characters, in hexadecimal, is:
>>
>> F0 96 A6 80
>>
>> F0 96 A6 90
>>
>> EF BF BD
>>
>> EF BF BD
>>
>> If you have the string in Java, please try converting it to a UTF-8 string
>> using
>> something like:
>>   byte[] theBytes = myTestString.getBytes("UTF-8");
>>
>>   and then print out theBytes in hex; they should look like the above.  If
>> not,
>> please let us know what the values is instead.
>>
>>
>> Thanks. -Marshall
>>
>>
>> On 12/9/2016 9:02 AM, nelson rivera wrote:
>>> Hi i was read your explication and saw the link, but in my case, i
>>> don't read any xml file. Just i copy the text, get a new input cas
>>> from UimaAsynchronousEngine with getCAS(), set the text in the cas and
>>> send the request whit sendCAS(). I use uima-as API 2.9.0 in the client
>>> side. Apparently the characters are changed for its entities
>>> corresponding when serialize the cas to send it, but i get the
>>> mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
>>> columnNumber: 571; Character reference "&#"
>>> in uima-as framework installed when trying to deserialize the cas
>>> deserializeCasFromXmi(),to be processed for the service.
>>>
>>> 2016-12-08 16:48 GMT-05:00, Marshall Schor :
 Hi Nelson,

 I can't see the characters (sorry).

 This might be an issue caused by a discrepancy between the coding of the
 file
 being read, and the coding indicated on the xml header.  Can you check
 that
 those two things are the same?

 See
 http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
 for example.

 -Marshall

 On 12/8/2016 4:20 PM, nelson rivera wrote:
> i tried to proccess the following text in a service deploy in uima-as,
> because is input of my application. This is the text : 𖦀  𖦐  �  �.
> These characters correspond to the bamun language, and apparently are
> not  invalid xml characters because tools such as browsers interpret
> it and show it. After get a new input cas to proccesing, set the text
> and send the request, i get  the exception that i show below in
> uima-as, the framework uima-as work and recovers correctly, just not
> process this characters.
> Could you tell me what happens with these characters, one of these is
> invalid characters for framework uima-as?
>
>
>
> 04:00:31.606 - 14:
> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
> WARNING:
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> Character reference "&#
> at
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
> at
> org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
> at
> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
> at
> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
> at
> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
> at
> org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
> at
> org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
>
>>



Re: Proccesing Bamun characters

2016-12-12 Thread nelson rivera
Yes, these are the values of the troublesome characters. Using
Integer.toHexString() to print out each byte shows:

fff0 ff96 ffa6 ff80

fff0 ff96 ffa6 ff90

ffef ffbf ffbd

ffef ffbf ffbd
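
The leading "ff" digits in the values above look like sign extension: a Java byte is signed, so passing it directly to Integer.toHexString() widens it to a negative int first. A minimal sketch of printing the bytes without that effect follows; the class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

public class HexBytes {
    // Format each byte as two uppercase hex digits, masking off sign extension.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x16980));
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8))); // F0 96 A6 80
        // Without the mask the byte is widened to a negative int first.
        System.out.println(Integer.toHexString((byte) 0xF0));          // fffffff0
    }
}
```

Stripping the sign-extension digits from the printed values gives the same bytes Marshall listed.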

2016-12-12 11:35 GMT-05:00, Marshall Schor :
> Hi Nelson,
>
> Looking into this... Can you please confirm that the UTF-8 coding of the
> troublesome characters, in hexadecimal, is:
>
> F0 96 A6 80
>
> F0 96 A6 90
>
> EF BF BD
>
> EF BF BD
>
> If you have the string in Java, please try converting it to a UTF-8 string
> using
> something like:
>   byte[] theBytes = myTestString.getBytes("UTF-8");
>
>   and then print out theBytes in hex; they should look like the above.  If
> not,
> please let us know what the values is instead.
>
>
> Thanks. -Marshall
>
>
> On 12/9/2016 9:02 AM, nelson rivera wrote:
>> Hi i was read your explication and saw the link, but in my case, i
>> don't read any xml file. Just i copy the text, get a new input cas
>> from UimaAsynchronousEngine with getCAS(), set the text in the cas and
>> send the request whit sendCAS(). I use uima-as API 2.9.0 in the client
>> side. Apparently the characters are changed for its entities
>> corresponding when serialize the cas to send it, but i get the
>> mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
>> columnNumber: 571; Character reference "&#"
>> in uima-as framework installed when trying to deserialize the cas
>> deserializeCasFromXmi(),to be processed for the service.
>>
>> 2016-12-08 16:48 GMT-05:00, Marshall Schor :
>>> Hi Nelson,
>>>
>>> I can't see the characters (sorry).
>>>
>>> This might be an issue caused by a discrepancy between the coding of the
>>> file
>>> being read, and the coding indicated on the xml header.  Can you check
>>> that
>>> those two things are the same?
>>>
>>> See
>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
>>> for example.
>>>
>>> -Marshall
>>>
>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
 i tried to proccess the following text in a service deploy in uima-as,
 because is input of my application. This is the text : 𖦀  𖦐  �  �.
 These characters correspond to the bamun language, and apparently are
 not  invalid xml characters because tools such as browsers interpret
 it and show it. After get a new input cas to proccesing, set the text
 and send the request, i get  the exception that i show below in
 uima-as, the framework uima-as work and recovers correctly, just not
 process this characters.
 Could you tell me what happens with these characters, one of these is
 invalid characters for framework uima-as?



 04:00:31.606 - 14:
 org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
 WARNING:
 org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
 Character reference "&#
 at
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
 at
 org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
 at
 org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
 at
 org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
 at
 org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
 at
 org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
 at
 org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)

>>>
>
>


Re: Proccesing Bamun characters

2016-12-12 Thread Marshall Schor
Hi Nelson,

Looking into this... Can you please confirm that the UTF-8 coding of the
troublesome characters, in hexadecimal, is:

F0 96 A6 80

F0 96 A6 90

EF BF BD

EF BF BD

If you have the string in Java, please try converting it to a UTF-8 string using
something like:
  byte[] theBytes = myTestString.getBytes("UTF-8");

  and then print out theBytes in hex; they should look like the above.  If not,
please let us know what the values are instead.


Thanks. -Marshall


On 12/9/2016 9:02 AM, nelson rivera wrote:
> Hi i was read your explication and saw the link, but in my case, i
> don't read any xml file. Just i copy the text, get a new input cas
> from UimaAsynchronousEngine with getCAS(), set the text in the cas and
> send the request whit sendCAS(). I use uima-as API 2.9.0 in the client
> side. Apparently the characters are changed for its entities
> corresponding when serialize the cas to send it, but i get the
> mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
> columnNumber: 571; Character reference "&#"
> in uima-as framework installed when trying to deserialize the cas
> deserializeCasFromXmi(),to be processed for the service.
>
> 2016-12-08 16:48 GMT-05:00, Marshall Schor :
>> Hi Nelson,
>>
>> I can't see the characters (sorry).
>>
>> This might be an issue caused by a discrepancy between the coding of the
>> file
>> being read, and the coding indicated on the xml header.  Can you check that
>> those two things are the same?
>>
>> See
>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
>> for example.
>>
>> -Marshall
>>
>> On 12/8/2016 4:20 PM, nelson rivera wrote:
>>> I tried to process the following text in a service deployed in UIMA-AS,
>>> because it is input to my application. This is the text: 𖦀  𖦐  �  �.
>>> These characters belong to the Bamun language, and apparently they are
>>> not invalid XML characters, because tools such as browsers interpret
>>> and display them. After getting a new input CAS for processing, setting
>>> the text, and sending the request, I get the exception shown below in
>>> UIMA-AS. The UIMA-AS framework keeps working and recovers correctly; it
>>> just does not process these characters.
>>> Could you tell me what happens with these characters? Is one of them
>>> an invalid character for the UIMA-AS framework?
>>>
>>>
>>>
>>> 04:00:31.606 - 14: org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient: WARNING:
>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571; Character reference "&#
>>>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
>>>     at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
>>>     at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
>>>     at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
>>>     at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
>>>     at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
>>>     at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
>>>
>>



Re: Proccesing Bamun characters

2016-12-09 Thread nelson rivera
Hi, I read your explanation and looked at the link, but in my case I
do not read any XML file. I just copy the text, get a new input CAS
from UimaAsynchronousEngine with getCAS(), set the text in the CAS, and
send the request with sendCAS(). I use the UIMA-AS API 2.9.0 on the
client side. Apparently the characters are replaced by their
corresponding entities when the CAS is serialized for sending, but I get
the mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
columnNumber: 571; Character reference "&#"
in the installed UIMA-AS framework when it tries to deserialize the CAS
with deserializeCasFromXmi() so that the service can process it.
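
The kind of parse failure described above can be reproduced without UIMA-AS.
The sketch below is only an illustration under stated assumptions: it uses the
JDK's default SAX parser, a made-up element name, and the decimal values 55322
and 56704, which are the UTF-16 surrogate code units of U+16980 (decimal
92544). A character reference to a lone surrogate is not a legal XML 1.0
character, while a reference to the full code point is.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class CharRefCheck {
    /** Returns true if the XML text parses, false if any SAXException is raised. */
    public static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Surrogate code units written as two separate character references.
        String asSurrogates = "<sofa sofaString=\"&#55322;&#56704;\"/>";
        // The same character written as one reference to the full code point.
        String asCodePoint = "<sofa sofaString=\"&#92544;\"/>";
        System.out.println("surrogate references parse:  " + parses(asSurrogates));
        System.out.println("code point reference parses: " + parses(asCodePoint));
    }
}
```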

2016-12-08 16:48 GMT-05:00, Marshall Schor :
> Hi Nelson,
>
> I can't see the characters (sorry).
>
> This might be an issue caused by a discrepancy between the encoding of the
> file being read and the encoding declared in the XML header.  Can you check
> that those two things are the same?
>
> See
> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
> for example.
>
> -Marshall
>
> On 12/8/2016 4:20 PM, nelson rivera wrote:
>> I tried to process the following text in a service deployed in UIMA-AS,
>> because it is input to my application. This is the text: 𖦀  𖦐  �  �.
>> These characters belong to the Bamun language, and apparently they are
>> not invalid XML characters, because tools such as browsers interpret
>> and display them. After getting a new input CAS for processing, setting
>> the text, and sending the request, I get the exception shown below in
>> UIMA-AS. The UIMA-AS framework keeps working and recovers correctly; it
>> just does not process these characters.
>> Could you tell me what happens with these characters? Is one of them
>> an invalid character for the UIMA-AS framework?
>>
>>
>>
>> 04:00:31.606 - 14: org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient: WARNING:
>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571; Character reference "&#
>>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
>>     at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
>>     at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
>>     at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
>>     at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
>>     at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
>>     at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
>>
>
>


Re: Proccesing Bamun characters

2016-12-08 Thread Marshall Schor
Hi Nelson,

I can't see the characters (sorry).

This might be an issue caused by a discrepancy between the encoding of the file
being read and the encoding declared in the XML header.  Can you check that
those two things are the same?

See
http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
for example.

-Marshall
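
To illustrate the kind of discrepancy meant here, a minimal sketch: text is
written as UTF-8 bytes but read back as ISO-8859-1. The sample character and
the class name are assumptions chosen for illustration, not taken from the
thread.

```java
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {
    /** Encodes s as UTF-8, then decodes the bytes with the wrong charset. */
    public static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        String garbled = misdecode(original);
        // One character becomes two, because its UTF-8 form is the bytes C3 A9.
        System.out.println(original.length() + " char -> " + garbled.length() + " chars");
    }
}
```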

On 12/8/2016 4:20 PM, nelson rivera wrote:
> I tried to process the following text in a service deployed in UIMA-AS,
> because it is input to my application. This is the text: 𖦀  𖦐  �  �.
> These characters belong to the Bamun language, and apparently they are
> not invalid XML characters, because tools such as browsers interpret
> and display them. After getting a new input CAS for processing, setting
> the text, and sending the request, I get the exception shown below in
> UIMA-AS. The UIMA-AS framework keeps working and recovers correctly; it
> just does not process these characters.
> Could you tell me what happens with these characters? Is one of them
> an invalid character for the UIMA-AS framework?
>
>
>
> 04:00:31.606 - 14: org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient: WARNING:
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571; Character reference "&#
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
>     at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
>     at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
>     at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
>     at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
>     at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
>     at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
>



Proccesing Bamun characters

2016-12-08 Thread nelson rivera
I tried to process the following text in a service deployed in UIMA-AS,
because it is input to my application. This is the text: 𖦀  𖦐  �  �.
These characters belong to the Bamun language, and apparently they are
not invalid XML characters, because tools such as browsers interpret
and display them. After getting a new input CAS for processing, setting
the text, and sending the request, I get the exception shown below in
UIMA-AS. The UIMA-AS framework keeps working and recovers correctly; it
just does not process these characters.
Could you tell me what happens with these characters? Is one of them
an invalid character for the UIMA-AS framework?



04:00:31.606 - 14: org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient: WARNING:
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571; Character reference "&#
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
    at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
    at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
    at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
    at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
    at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
    at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
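
On the question of whether these are invalid XML characters, the check below
is a minimal sketch of the XML 1.0 "Char" rule applied to one Bamun code
point. It assumes the first character of the text is U+16980; the class and
method names are illustrative. The code point itself is allowed, but each of
the two UTF-16 code units Java uses to store it is a surrogate, which is not
allowed on its own.

```java
public class XmlCharCheck {
    /** XML 1.0 Char production: code points that may appear in a document. */
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // Assumed sample: the single supplementary code point U+16980.
        String text = new String(Character.toChars(0x16980));
        System.out.println("UTF-16 code units: " + text.length());
        System.out.println("code points:       " + text.codePointCount(0, text.length()));
        System.out.println("code point valid:  " + isXml10Char(text.codePointAt(0)));
        System.out.println("high unit valid:   " + isXml10Char(text.charAt(0)));
        System.out.println("low unit valid:    " + isXml10Char(text.charAt(1)));
    }
}
```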