Re: Problem in running DUCC Job for Arabic Language

2018-11-06 Thread Jaroslaw Cwiklik
Forgot to mention that if you have a shared file system, the best practice
is not to serialize your content (SOFA) from the JD to the service.
Instead, in the CR add the path of the file containing the Subject of
Analysis to the CAS, and have the CM in the pipeline read the content from
the shared file system.
-jerry
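A rough sketch of that pattern (plain JDK only; the CAS plumbing for storing and retrieving the path is omitted, and the file and variable names here are made up for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SharedFsSofa {
    public static void main(String[] args) throws Exception {
        // Stand-in for a document on the shared file system; in the real
        // setup the CR would put only this path into the CAS.
        Path doc = Files.createTempFile("sofa-", ".txt");
        String text = "استعرض المتحدث باسم قوات التحالف العربي";
        Files.write(doc, text.getBytes(StandardCharsets.UTF_8));

        // CM side: resolve the path from the CAS and read the content,
        // naming the charset explicitly rather than relying on the
        // file.encoding / LANG of the DUCC process.
        String content = new String(Files.readAllBytes(doc), StandardCharsets.UTF_8);
        System.out.println(content.equals(text)); // prints: true
        Files.delete(doc);
    }
}
```

Because the bytes never travel through the JD's HTTP serialization, the default charset of the JD process stops mattering for the document text.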


On Tue, Nov 6, 2018 at 9:37 AM Jaroslaw Cwiklik wrote:

> Can you try setting -Dfile.encoding=ISO-8859-1 for the service (job)
> process, and -Djavax.servlet.request.encoding=ISO-8859-1
> -Dfile.encoding=ISO-8859-1 for the JD process?
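If the job is submitted with a DUCC job specification file, these would be passed as JVM arguments for the driver and job processes; the property names below follow the DUCC job documentation, so verify them against your installation's version:

```properties
# JD (job driver) process
driver_jvm_args  = -Dfile.encoding=ISO-8859-1 -Djavax.servlet.request.encoding=ISO-8859-1
# job (service) processes
process_jvm_args = -Dfile.encoding=ISO-8859-1
```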
>
> The JD actually uses the Jetty web server to serve service requests over
> HTTP. I went as far as extracting the Jetty server code from the JD into a
> simple HTTP server process, and also extracted the HttpClient-related code
> from the service into a simple client process, to be able to test.
>
> So on the server side I have:
> String text = new String("استعرض المتحدث باسم قوات «التحالف العربي
> لدعم".getBytes("UTF-8"),"ISO-8859-1");
> response.setHeader("content-type", "text/xml");
> String body = marshall(text);   // XStream serialization
> response.getWriter().write(body);
>
> On the client side:
>   System.out.println("Default Locale:   " + Locale.getDefault());
>   System.out.println("Default Charset:  " + Charset.defaultCharset());
>   System.out.println("file.encoding;" + System.getProperty("file.encoding"));
>
>   HttpResponse response = httpClient.execute(postMethod);
>   HttpEntity entity = response.getEntity();
>   String content = EntityUtils.toString(entity);
>   String result = (String) unmarshall(content); // XStream unmarshall
>   String o = new String(result.getBytes());
>   System.out.println(o);
>
> When I run with the above -D settings the client console shows:
> Default Locale:   en_US
> Default Charset:  ISO-8859-1
> file.encoding;ISO-8859-1
>
> استعرض المتحدث باسم قوات «التحالف العربي لدعم
>
> Without the -D's I don't see Arabic text; instead I see garbage on the
> console.
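Why the ISO-8859-1 trick is lossless can be checked in isolation (a self-contained sketch, no Jetty or DUCC involved): ISO-8859-1 maps every byte 0x00–0xFF to exactly one char, so re-labelling UTF-8 bytes as ISO-8859-1 and later reversing the re-labelling recovers the original text.

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        String original = "استعرض المتحدث باسم قوات التحالف";
        // "Server" side: take the UTF-8 bytes but label them ISO-8859-1.
        // No bytes are lost, only mislabelled.
        String wire = new String(original.getBytes(StandardCharsets.UTF_8),
                                 StandardCharsets.ISO_8859_1);
        // "Client" side: reverse the labelling to get the real text back.
        String recovered = new String(wire.getBytes(StandardCharsets.ISO_8859_1),
                                      StandardCharsets.UTF_8);
        System.out.println(recovered.equals(original)); // prints: true
    }
}
```

Any charset that is not a 1:1 byte-to-char mapping (e.g. US-ASCII, which replaces high bytes with `?`) would destroy the data in the first step, which matches the `???` symptom reported below.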
>
> On Fri, Jul 6, 2018 at 3:00 AM rohit14csu...@ncuindia.edu <
> rohit14csu...@ncuindia.edu> wrote:
>
>> Yes, if I run the AE as a DUCC UIMA-AS service and send it CASes from a
>> UIMA-AS client, it works fine.
>> In fact the environment, i.e. the LANG argument, is the same for the
>> UIMA-AS service and the DUCC job:
>>
>> Environ[3] = LANG=en_IN
>>
>> And if I change LANG=ar, then by the time the data arrives in the JD the
>> Arabic text has already been replaced with ??? (question marks), and the
>> encoding of the data coming into the JD or CR shows as ASCII.
>> I don't understand why this is happening.
>>
>> Best
>> Rohit
>>
>>
>> On 2018/07/05 13:35:11, Eddie Epstein wrote:
>> > So if you run the AE as a DUCC UIMA-AS service and send it CASes from
>> > some UIMA-AS client, it works OK? The full environment for all
>> > processes that DUCC launches is available via ducc-mon under the
>> > Specification or Registry tab for that job, managed reservation, or
>> > service. Please see if the LANG setting for the service is different
>> > from the LANG setting for the job.
>> >
>> > One can also see the LANG setting for a Linux process id by doing:
>> >
>> > cat /proc/<pid>/environ
>> >
>> > The LANG to be used for a DUCC process can be set by adding
>> > "LANG=xxx" to the --environment argument as needed.
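For a running process, the same check is easier to read if the NUL separators in /proc/<pid>/environ are translated to newlines first (the current shell's own pid is used here as a stand-in):

```shell
pid=$$   # substitute the pid of the DUCC process of interest
tr '\0' '\n' < /proc/"$pid"/environ | grep '^LANG=' || echo 'LANG not set for this process'
```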
>> >
>> > Thanks,
>> > Eddie
>> >
>> >
>> >
>> > On Thu, Jul 5, 2018 at 6:47 AM, rohit14csu...@ncuindia.edu <
>> > rohit14csu...@ncuindia.edu> wrote:
>> >
>> > > Hey,
>> > > Yeah, you got it right: the first snippet runs in the CR before the
>> > > data goes into the CAS.
>> > > And the second snippet is in the first annotator or analysis engine
>> > > (AE) of my aggregate descriptor.
>> > > I am pretty sure this is an issue with the CAS used by DUCC, because
>> > > if I use a DUCC service, where we are supposed to send a CAS and
>> > > receive the same CAS back with added features, I get accurate results.
>> > >
>> > > The only problem comes when submitting a job, where the CAS is
>> > > generated by DUCC.
>> > > This could also be an issue with the environment (language) of DUCC,
>> > > because the default language is English.
>> > >
>> > > Best regards,
>> > > Rohit
>> > >
>> > > On 2018/07/03 13:11:50, Eddie Epstein wrote:
>> > > > Rohit,
>> > > >
>> > > > Before sending the data into the jcas, if I force encode it:
>> > > > >
>> > > > > String content2 = null;
>> > > > > content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
>> > > > > jcas.setDocumentText(content2);
>> > > > >
>> > > >
>> > > > Where is this code, in the job CR?
>> > > >
>> > > >
>> > > >
>> > > > >
>> > > > > And when I go into my first annotator, I force decode it:
>> > > > >
>> > > > > String content = null;
>> > > > > content = new String(jcas.getDocumentText().getBytes("ISO-8859-1"),
>> > > > > "UTF-8");
>> > > > >
>> > > >
>> > > > And is this in the first annotator of the job process, i.e. the CM?
>> > > >
>> > > > Please be as specific as possible.
>> > > >
>> > > > Thanks,
>> > > > Eddie
