[ https://issues.apache.org/jira/browse/OPENNLP-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Martin Wiesner updated OPENNLP-1512:
------------------------------------
    Description:
While working on OPENNLP-1190, I tested the example from the OpenNLP documentation to convert the Esp.train example to the OpenNLP format, see:
[https://opennlp.apache.org/docs/2.3.0/manual/opennlp.html#tools.corpora.conll.2002]

I ran
{{opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt}}

When I checked the output corpus (txt) file, I noticed incorrect symbols being written there.
A quick debugging session revealed that the original files were ISO_8859_1 encoded. However, in line 94 of Conll02NameSampleStream, UTF-8 encoding was assumed. This results in accents or other special symbols of the Spanish alphabet being converted to garbage in the resulting UTF-8 encoded file (reason: the input bytes are decoded with the wrong character set).
Therefore, _Conll02NameSampleStream_ needs a fix to read the original files in ISO_8859_1.
With this measure in place, the accents á, é, ... are correctly written to the resulting converted training corpus file.

  was:
While working on OPENNLP-1190, I tested the example from the OpenNLP documentation to convert the Esp.train example to the OpenNLP format, see:
[https://opennlp.apache.org/docs/2.3.0/manual/opennlp.html#tools.corpora.conll.2002]

I ran
opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt

When I checked the output corpus (txt) file, I noticed incorrect symbols being written there.
A quick debugging session revealed that the original files were ISO_8859_1 encoded. However, in line 94 of Conll02NameSampleStream, UTF-8 encoding was assumed. This results in accents or other special symbols of the Spanish alphabet being converted to garbage in the resulting UTF-8 encoded file (reason: the input bytes are decoded with the wrong character set).
Therefore, _Conll02NameSampleStream_ needs a fix to read the original files in ISO_8859_1.
With this measure in place, the accents á, é, ... are correctly written to the resulting converted training corpus file.


> Fix incorrect encoding used in Conll02NameSampleStream
> ------------------------------------------------------
>
>                 Key: OPENNLP-1512
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1512
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Formats, Name Finder
>    Affects Versions: 2.3.0
>            Reporter: Martin Wiesner
>            Assignee: Martin Wiesner
>            Priority: Minor
>             Fix For: 2.3.1
>
>
> While working on OPENNLP-1190, I tested the example from the OpenNLP
> documentation to convert the Esp.train example to the OpenNLP format, see:
> [https://opennlp.apache.org/docs/2.3.0/manual/opennlp.html#tools.corpora.conll.2002]
> I ran
> {{opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt}}
> When I checked the output corpus (txt) file, I noticed incorrect symbols
> being written there.
> A quick debugging session revealed that the original files were ISO_8859_1
> encoded. However, in line 94 of Conll02NameSampleStream, UTF-8 encoding was
> assumed. This results in accents or other special symbols of the Spanish
> alphabet being converted to garbage in the resulting UTF-8 encoded file
> (reason: the input bytes are decoded with the wrong character set).
> Therefore, _Conll02NameSampleStream_ needs a fix to read the original files
> in ISO_8859_1.
> With this measure in place, the accents á, é, ... are correctly written to
> the resulting converted training corpus file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
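
For illustration, the encoding mismatch described in this issue can be reproduced with plain JDK classes. The following is a minimal sketch, not the actual Conll02NameSampleStream code or its fix: the class name CharsetMismatchDemo, the decode helper, and the sample surname are hypothetical and chosen only for this example. It simulates a corpus token stored as ISO-8859-1 bytes and decodes it once as UTF-8 (the wrong assumption) and once as ISO-8859-1 (the assumption proposed above).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    /** Decodes raw bytes into a String using the given charset. */
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "Martínez", written with a Unicode escape so the demo does not
        // depend on the encoding of this source file.
        String original = "Mart\u00EDnez";

        // Simulate a corpus file stored as ISO-8859-1: 'í' is the single byte 0xED.
        byte[] raw = original.getBytes(StandardCharsets.ISO_8859_1);

        // Wrong assumption: 0xED followed by 'n' is not a valid UTF-8 sequence,
        // so the decoder substitutes the replacement character U+FFFD.
        String wrong = decode(raw, StandardCharsets.UTF_8);

        // Matching assumption: every byte maps back to the intended character.
        String right = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println("decoded as UTF-8 equals original:      " + wrong.equals(original));
        System.out.println("decoded as ISO-8859-1 equals original: " + right.equals(original));

        // Once decoded correctly, the text can be written out as UTF-8 safely.
        byte[] utf8 = right.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 output size in bytes: " + utf8.length);
    }
}
```

The point of the sketch is that the charset used for reading must match the encoding of the source file, while the charset used for writing the converted corpus can be chosen independently.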