Re: Proccesing Bamun characters

2016-12-16 Thread nelson rivera
This is the CAS serialized to XMI before it is sent to the UIMA-AS service,
serialized with XmiCasSerializer.serialize(cas.getCas(), outStream).
The representation of the characters in this serialization does not
match the representation of the problem characters. What gets
serialized are the code point escape sequences corresponding to the Bamum
characters, two code points for each character.
Why can this happen? Any suggestions?

<xmi:XMI xmlns:cas="http:///uima/cas.ecore" xmlns:xmi="http://www.omg.org/XMI"
xmlns:pln="http:///cu/datys/xinetica/uima/api/pln.ecore"
xmlns:tcas="http:///uima/tcas.ecore"
xmlns:api="http:///cu/datys/xinetica/uima/api.ecore"
xmi:version="2.0">
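
A minimal sketch of why "two code points for each character" can break XML parsing, assuming the escape sequences in question are numeric character references written one per UTF-16 code unit. It does not use XmiCasSerializer; the class and method names (CharRefCheck, parses, asCodePointRef, asCodeUnitRefs) are hypothetical, and only the JDK's built-in SAX parser is used. A supplementary character such as U+16980 is valid XML when written as a single reference, but each half of its surrogate pair is not a legal XML character on its own.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class CharRefCheck {

    /** Returns true if the given XML text parses without a SAX error. */
    static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Includes SAXParseException, e.g. for an illegal character reference.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** One hexadecimal character reference for the whole code point. */
    static String asCodePointRef(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
    }

    /** One decimal character reference per UTF-16 code unit (the problematic form). */
    static String asCodeUnitRefs(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append("&#").append((int) unit).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cp = 0x16980; // one of the code points tested in this thread
        String good = "<a>" + asCodePointRef(cp) + "</a>";
        String bad = "<a>" + asCodeUnitRefs(cp) + "</a>";
        System.out.println(good + " parses: " + parses(good));
        System.out.println(bad + " parses: " + parses(bad));
    }
}
```

If the serialized XMI contains references of the second form, the receiving parser would be expected to reject it, whichever XML library produced it.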



Re: Proccesing Bamun characters

2016-12-16 Thread Burn Lewis
Sorry, I missed the supplement set.  So the tests I did with x16980 &
x16990 are valid.  runRemoteAsyncAE uses the same
FileSystemCollectionReader as runAE does ... did you use a different
collection reader?  If it is a custom one, perhaps you could serialize the CAS
to a file as XMI and verify that the XMI is legal.
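
The byte values mentioned in this thread can be checked with plain JDK calls. The sketch below decodes the UTF-8 sequences F0 96 A6 80, F0 96 A6 90 and EF BF BD and reports the Unicode block of each resulting code point; the class and method names (BamumBytes, firstCodePoint) are hypothetical, and Character.UnicodeBlock.BAMUM_SUPPLEMENT assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class BamumBytes {

    /** Decodes the given UTF-8 byte values and returns the first code point. */
    static int firstCodePoint(int... utf8) {
        byte[] bytes = new byte[utf8.length];
        for (int i = 0; i < utf8.length; i++) {
            bytes[i] = (byte) utf8[i];
        }
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    /** Formats a code point together with the Unicode block the JDK assigns to it. */
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.UnicodeBlock.of(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(firstCodePoint(0xF0, 0x96, 0xA6, 0x80)));
        System.out.println(describe(firstCodePoint(0xF0, 0x96, 0xA6, 0x90)));
        System.out.println(describe(firstCodePoint(0xEF, 0xBF, 0xBD)));
        System.out.println(describe(0xA6A0)); // start of the basic Bamum range
    }
}
```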


[ANNOUNCE] Apache UIMA AS 2.9.0 released

2016-12-16 Thread Jaroslaw Cwiklik
The Apache UIMA team is pleased to announce the release of the Apache
UIMA-AS version 2.9.0, which includes asynchronous scaleout capabilities
for the UIMA annotators.

UIMA-AS includes the base UIMA SDK and augments it with scaleout
capability; it is a next-generation replacement for the original CPM
(Collection Processing Management) scaleout that is part of the core UIMA
Framework. For more information, please visit:

http://uima.apache.org/doc-uimaas-what.html

This release contains a number of improvements and bug fixes. Notable
updates in this release include:

  - Updated to use ActiveMQ 5.14.0
  - Added dependency on UIMA SDK 2.9.0
  - Fixed HTTP-based service connectivity
  - Added support for automatic recovery of temp queues after broker
    restart.
  - Fixed per CAS Performance Metrics breakdown
  - Fixed per process CPU & RSS reporting
  - Fixed example runtime configurations
  - Fixed error recovery on exception while deserializing a CAS
  - "Pinned" JMX MBeans to a specific deployment to enable orderly cleanup
  - Fixed support of AMQ white listing of packages
  - Added support to disable JMX via a new argument
    -Duima.as.enable.jmx=false
  - Fixed dd2spring issues
  - Updated version checker to test compatibility with UIMA SDK.

For a complete list of bugs and improvements included in this release
please see
https://uima.apache.org/d/uima-as-2.9.0/issuesFixed/jira-report.html.

-- Jerry Cwiklik, for the Apache UIMA development team


Re: Proccesing Bamun characters

2016-12-16 Thread nelson rivera
According to Wikipedia, the Bamum script
(https://en.wikipedia.org/wiki/Bamum_script) has another valid range,
U+16800–U+16A3F; any of these characters generates the
same log trace. I will continue to test Marshall Schor's
suggestion.

2016-12-14 18:07 GMT-05:00, Burn Lewis :
> I think there's another problem ... the characters we have tested with are
> not in the Bamum Unicode set.  The first 2 that Marshall listed in UTF-8
> (F0 96 A6 80 & F0 96 A6 90) are in hex x16980 & x16990 and the 3rd (EF BF
> BD) is xFFFD.  This last one is the "replacement character" used when an
> illegal character is encountered.  According to Wikipedia the 88 Bamum
> characters are in the range xA6A0 - xA6F7.
>
> In order to reproduce your problem we need to use the same codepoints.  Can
> you tell us what the hex values of the failing characters are, in UTF-8 or
> UTF-16?
>
> By the way, the test I ran was using UIMA-AS's runRemoteAsyncAE, not runAE,
> following the quick test described in the UIMA-AS README.
>
> On Wed, Dec 14, 2016 at 4:15 PM, Marshall Schor  wrote:
>
>> Maybe we've been on the wrong line of thinking.
>>
>> Perhaps the translation between UTF-8 (during transportation) and the
>> string characters is fine, but the XML parsing is restricting the
>> character set it uses.
>>
>> See https://en.wikipedia.org/wiki/Valid_characters_in_XML
>>
>> where it says valid XML characters exclude the "surrogates", which your
>> characters I think are.
>>
>> So, perhaps it's XML parsing which is complaining (and it appears this is
>> so, from the stack trace).
>>
>> We should point out that UIMA's character offsets (like begin and end)
>> were designed with Java String character offsets, and will perhaps not
>> work correctly when surrogates are being used.
>>
>> A possible workaround for this particular issue may be to switch to
>> binary serialization, instead of XMI serialization. This has a
>> restriction in that the type systems must be identical (between the
>> client and server).
>>
>> We could possibly get more confirmation of this hypothesis if you could
>> say what the stack trace was, beyond the first bit which you stated in
>> your original note.  There should be more stack trace information,
>> further down, starting with "caused by ..." which may provide more
>> helpful information.
>>
>> -Marshall
>>
>>
>> On 12/14/2016 9:38 AM, nelson rivera wrote:
>> > We also did that test with the UIMA framework and the RunAE tool, with
>> > the characters in a file as you did, and indeed there was no problem. The
>> > problem arises when using UIMA-AS: sendCAS() with UimaAsynchronousEngine,
>> > and then, when the remote UIMA-AS service tries to deserialize the CAS
>> > with deserializeCasFromXmi(), I get the mentioned exception
>> > "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
>> > Character reference "
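
The remark in the thread about character offsets and surrogates can be illustrated with plain Java strings. The sketch below is not UIMA code; the class and method names (SurrogateOffsets, charLength, codePointLength, charOffset) are hypothetical. It shows that a supplementary character such as U+16980 occupies two Java char positions, so an offset counted in code points and an offset counted in Java chars diverge after the first such character.

```java
class SurrogateOffsets {

    /** Length in Java chars (UTF-16 code units), the unit used by String offsets. */
    static int charLength(String s) {
        return s.length();
    }

    /** Length in Unicode code points. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Converts an index counted in code points into a Java char offset. */
    static int charOffset(String s, int codePointIndex) {
        return s.offsetByCodePoints(0, codePointIndex);
    }

    public static void main(String[] args) {
        // "a", then U+16980 (stored as a surrogate pair), then "b"
        String text = "a" + new String(Character.toChars(0x16980)) + "b";
        System.out.println("chars:       " + charLength(text));
        System.out.println("code points: " + codePointLength(text));
        System.out.println("char offset of the 3rd code point: " + charOffset(text, 2));
    }
}
```

Any component that computes begin and end values in code points, or in another encoding's units, would need a conversion like charOffset before those values line up with Java String positions.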