Re: CoNLL02 format issue

William Colen Wed, 12 Mar 2014 06:18:13 -0700

If it helps, there is another Spanish corpus on the CoNLL-2002 page that has 3
fields:
   "Xavier Carreras provides the Spanish data sets with part of speech
   tags <http://www.lsi.upc.es/~nlp/tools/nerc/nerc.html> (20030803)"


William


2014-03-12 9:43 GMT-03:00 Roque Vera <roqu...@gmail.com>:

> I found an issue in the TokenNameFinderConverter tool. Specifically, I am
> trying to convert a file in CoNLL 2002 format into the OpenNLP format. When
> I run "opennlp TokenNameFinderConverter conll02 -data esp.testa -lang
> es -types per > corpus_testa.txt" on the command line, I get this error:
>
> IO error while converting data : Expected three fields per line in training data, got 2 for line 'Sao B-LOC'!
> Expected three fields per line in training data, got 2 for line 'Sao B-LOC'!
> java.io.IOException: Expected three fields per line in training data, got 2 for line 'Sao B-LOC'!
>         at opennlp.tools.formats.Conll02NameSampleStream.read(Conll02NameSampleStream.java:140)
>         at opennlp.tools.formats.Conll02NameSampleStream.read(Conll02NameSampleStream.java:49)
>         at opennlp.tools.cmdline.AbstractConverterTool.run(AbstractConverterTool.java:110)
>         at opennlp.tools.cmdline.CLI.main(CLI.java:222)
>
>
>
> The cause is clear: the converter expects three fields per line, but my file
> "esp.testa" has only two. The curious thing is that this file comes from the
> CoNLL test data set.
>
>
> I propose two solutions for this problem. The first is to insert a third field
> between the two existing ones. For example, the original file may contain a
> line in IOB2 format such as "Sao B-LOC", which must be changed to
> "Sao VP B-LOC", where "VP" is a POS tag whose meaning does not matter as far
> as the implementation is concerned. I created a modified version of the test
> data set accordingly.
>
>
> The other possible solution is to change the code in
> "apache-opennlp-1.5.3-src\opennlp-tools\src\main\java\opennlp\tools\formats\Conll02NameSampleStream.java",
> beginning at line 133. The original code is shown first, followed by the
> proposed replacement.
>
> Original code:
>
>       String fields[] = line.split(" ");
>
>       if (fields.length == 3) {
>         sentence.add(fields[0]);
>         tags.add(fields[2]);
>       }
>       else {
>         throw new IOException("Expected three fields per line in training data, got " +
>             fields.length + " for line '" + line + "'!");
>       }
>
> Proposed code:
>
>       String fields[] = line.split(" ");
>
>       if (fields.length == 3) {
>         // three fields: the tag is taken from the third one
>         sentence.add(fields[0]);
>         tags.add(fields[2]);
>       }
>       else if (fields.length == 2) {
>         // two fields: the tag is taken from the second one
>         sentence.add(fields[0]);
>         tags.add(fields[1]);
>       }
>       else {
>         throw new IOException("Expected three or two fields per line in training data, got " +
>             fields.length + " for line '" + line + "'!");
>       }
>
>
> The first branch is necessary because the CoNLL training data set has three
> fields. Note that the second branch only serves the test data set (which is
> the case in which I have the problem).
>
>
> I hope this suggestion helps to solve the problem.
>
> Sincerely,
>
> Roque Vera.
> Facultad Politécnica, Universidad Nacional de Asunción.
> Paraguay.
>
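
The first workaround in the quoted message (inserting a placeholder POS column
into two-field lines) can be sketched in Java as below. This is only an
illustration and not part of OpenNLP: the class name Conll02PosPadder and the
method padLine are hypothetical, and the placeholder "VP" is simply the tag
used in the example above. It assumes fields are separated by single spaces,
as in the quoted split(" ") call, and leaves blank lines and lines that
already have three fields untouched.

```java
import java.util.Arrays;
import java.util.List;

public class Conll02PosPadder {

    // Placeholder POS tag taken from the example in the message above;
    // the quoted converter code never reads this column.
    static final String PLACEHOLDER_POS = "VP";

    // Returns the line with a placeholder POS column inserted when it has
    // exactly two space-separated fields; any other line is returned as-is.
    static String padLine(String line) {
        if (line.trim().isEmpty()) {
            return line; // blank lines (sentence boundaries) stay unchanged
        }
        String[] fields = line.split(" ");
        if (fields.length == 2) {
            return fields[0] + " " + PLACEHOLDER_POS + " " + fields[1];
        }
        return line;
    }

    public static void main(String[] args) {
        // Small in-memory sample; a real run would apply padLine to each
        // line of the data file.
        List<String> sample = Arrays.asList("Sao B-LOC", "Paulo I-LOC", "", "( O");
        for (String line : sample) {
            System.out.println(padLine(line));
        }
    }
}
```

Applying padLine to every line of a two-field file and writing the result to a
new file should give input in the three-field layout that the converter
expects, without modifying the OpenNLP sources.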
