[jira] [Updated] (OPENNLP-1512) Fix incorrect encoding used in Conll02NameSampleStream

Martin Wiesner (Jira) Fri, 01 Sep 2023 08:54:05 -0700


     [ https://issues.apache.org/jira/browse/OPENNLP-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Martin Wiesner updated OPENNLP-1512:
------------------------------------
    Description: 
While working on OPENNLP-1190, I tested the example from the OpenNLP 
documentation to convert the Esp.train example to the OpenNLP format, see: 
[https://opennlp.apache.org/docs/2.3.0/manual/opennlp.html#tools.corpora.conll.2002]

I ran

{{opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt}}

When I checked the output corpus (txt) file, I noticed incorrect symbols being 
written there.

A quick debugging session revealed that the original files were ISO_8859_1 
encoded. However, line 94 of Conll02NameSampleStream assumed UTF-8 encoding. 
As a result, accented and other special characters of the Spanish alphabet 
were converted to garbage in the resulting UTF-8 encoded file, because the 
input bytes were decoded with the wrong character set.

Therefore, _Conll02NameSampleStream_ needs a fix to read the original files in 
ISO_8859_1.

With this measure in place, the accents á, é, ... are correctly written to the 
resulting converted training corpus file. 

  was:
While working on OPENNLP-1190, I tested the example from the OpenNLP 
documentation to convert the Esp.train example to the OpenNLP format, see: 
[https://opennlp.apache.org/docs/2.3.0/manual/opennlp.html#tools.corpora.conll.2002]

I ran 
opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > 
es_corpus_train_persons.txt
When I checked the output corpus (txt) file, I noticed incorrect symbols being 
written there. 

A quick debugging session revealed that the original files were ISO_8859_1 
encoded. However, line 94 of Conll02NameSampleStream assumed UTF-8 encoding. 
As a result, accented and other special characters of the Spanish alphabet 
were converted to garbage in the resulting UTF-8 encoded file, because the 
input bytes were decoded with the wrong character set.

Therefore, _Conll02NameSampleStream_ needs a fix to read the original files in 
ISO_8859_1.

With this measure in place, the accents á, é, ... are correctly written to the 
resulting converted training corpus file. 


> Fix incorrect encoding used in Conll02NameSampleStream
> ------------------------------------------------------
>
>                 Key: OPENNLP-1512
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1512
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Formats, Name Finder
>    Affects Versions: 2.3.0
>            Reporter: Martin Wiesner
>            Assignee: Martin Wiesner
>            Priority: Minor
>             Fix For: 2.3.1
>
>
> While working on OPENNLP-1190, I tested the example from the OpenNLP 
> documentation to convert the Esp.train example to the OpenNLP format, see: 
> [https://opennlp.apache.org/docs/2.3.0/manual/opennlp.html#tools.corpora.conll.2002]
> I ran
> {{opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt}}
> When I checked the output corpus (txt) file, I noticed incorrect symbols 
> being written there.
> A quick debugging session revealed that the original files were ISO_8859_1 
> encoded. However, line 94 of Conll02NameSampleStream assumed UTF-8 encoding. 
> As a result, accented and other special characters of the Spanish alphabet 
> were converted to garbage in the resulting UTF-8 encoded file, because the 
> input bytes were decoded with the wrong character set.
> Therefore, _Conll02NameSampleStream_ needs a fix to read the original files 
> in ISO_8859_1.
> With this measure in place, the accents á, é, ... are correctly written to 
> the resulting converted training corpus file. 
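
The encoding mismatch described above can be reproduced in isolation with the 
following minimal sketch. It is not the code of Conll02NameSampleStream; the 
class name, the decode helper and the sample token line are made up for 
illustration. It only assumes that decoding ISO_8859_1 bytes as UTF-8 yields 
replacement characters, while decoding them as ISO_8859_1 restores the 
original text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: not part of OpenNLP.
class EncodingMismatchDemo {

    // Hypothetical helper: decodes raw bytes with the given charset.
    // Malformed input is replaced with U+FFFD rather than raising an error.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Made-up token line with an accented character, written with a
        // Unicode escape so the source file's own encoding does not matter.
        String original = "Jos\u00e9 B-PER";

        // Simulate a corpus file stored as ISO_8859_1.
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);

        // Wrong: the bytes are interpreted as UTF-8, so the accent is lost.
        String wrong = decode(latin1Bytes, StandardCharsets.UTF_8);

        // Right: the bytes are interpreted with the charset they were written in.
        String right = decode(latin1Bytes, StandardCharsets.ISO_8859_1);

        System.out.println("decoded as UTF-8 matches original:      " + wrong.equals(original));
        System.out.println("decoded as ISO_8859_1 matches original: " + right.equals(original));
    }
}
```

The same reasoning applies when the bytes come from a file: the reader has to 
be given the charset the file was actually written in, otherwise the accented 
characters cannot be recovered later.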



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
