If it helps, there is another Spanish corpus on the CoNLL-2002 page that has three fields: "Xavier Carreras provides the Spanish data sets with part of speech tags<http://www.lsi.upc.es/~nlp/tools/nerc/nerc.html> (20030803)"

William

2014-03-12 9:43 GMT-03:00 Roque Vera <roqu...@gmail.com>:

> I found an issue in the TokenNameFinderConverter module. Specifically, I
> tried to convert a file in CoNLL 2002 format into the OpenNLP format. The
> error occurs when I execute the following on the command line:
>
> opennlp TokenNameFinderConverter conll02 -data esp.testa -lang es -types per > corpus_testa.txt
>
> The output is:
>
> IO error while converting data : Expected three fields per line in training data, got 2 for line 'Sao B-LOC'!
> Expected three fields per line in training data, got 2 for line 'Sao B-LOC'!
> java.io.IOException: Expected three fields per line in training data, got 2 for line 'Sao B-LOC'!
>     at opennlp.tools.formats.Conll02NameSampleStream.read(Conll02NameSampleStream.java:140)
>     at opennlp.tools.formats.Conll02NameSampleStream.read(Conll02NameSampleStream.java:49)
>     at opennlp.tools.cmdline.AbstractConverterTool.run(AbstractConverterTool.java:110)
>     at opennlp.tools.cmdline.CLI.main(CLI.java:222)
>
> The reason is clear: three fields are expected, and my file "esp.testa"
> has only two. The curious thing is that the file comes from CoNLL's own
> test data set.
>
> I propose two solutions for this problem. The first is to add a third
> field between the two existing ones. For example, the file may originally
> contain a line in IOB2 format like "Sao B-LOC", which we would have to
> change to "Sao VP B-LOC", where "VP" is a POS tag whose meaning does not
> matter to the implementation. I created a modified version of the test
> data set accordingly.
>
> The other possible solution is to change the code in
> "apache-opennlp-1.5.3-src\opennlp-tools\src\main\java\opennlp\tools\formats\Conll02NameSampleStream.java",
> beginning at line 133. The solution is given in the following table, where
> the first column contains the original code and the second the proposed
> solution.
>
> Original code:
>
>   String fields[] = line.split(" ");
>
>   if (fields.length == 3) {
>     sentence.add(fields[0]);
>     tags.add(fields[2]);
>   }
>   else {
>     throw new IOException("Expected three fields per line in training data, got " +
>         fields.length + " for line '" + line + "'!");
>   }
>
> Proposed solution:
>
>   String fields[] = line.split(" ");
>
>   if (fields.length == 3) {
>     sentence.add(fields[0]);
>     tags.add(fields[2]);
>   }
>   else if (fields.length == 2) {
>     sentence.add(fields[0]);
>     tags.add(fields[1]);
>   }
>   else {
>     throw new IOException("Expected three or two fields per line in training data, got " +
>         fields.length + " for line '" + line + "'!");
>   }
>
> The first "if" branch is necessary because the CoNLL training data set
> has three fields. Note that the second branch only serves the test data
> set (which is the case where I have the problem).
>
> I hope this suggestion helps to solve the problem.
>
> Sincerely,
>
> Roque Vera.
> Facultad Politécnica, Universidad Nacional de Asunción.
> Paraguay.
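
For reference, the two ideas in the quoted message can be sketched as a small standalone class. This is only an illustration under stated assumptions: `Conll02LineSketch`, `parseLine` and `padWithPlaceholderPos` are hypothetical names, not part of OpenNLP, and the real `Conll02NameSampleStream` deals with blank lines and document markers before it ever splits a line. The branches are chained with `else if` so that a three-field line is not added and then passed on to the final `else`.

```java
import java.io.IOException;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class Conll02LineSketch {

  // Returns {token, tag} for a line that has either two fields
  // ("token tag") or three fields ("token POS tag").
  static String[] parseLine(String line) throws IOException {
    String[] fields = line.split(" ");

    if (fields.length == 3) {
      // Three fields: the POS tag in the middle is ignored.
      return new String[] {fields[0], fields[2]};
    } else if (fields.length == 2) {
      // Two fields: the tag directly follows the token.
      return new String[] {fields[0], fields[1]};
    } else {
      throw new IOException("Expected three or two fields per line in training data, got "
          + fields.length + " for line '" + line + "'!");
    }
  }

  // The data-side workaround: insert a placeholder POS tag into a
  // two-field line and leave every other line unchanged.
  static String padWithPlaceholderPos(String line, String placeholder) {
    String[] fields = line.split(" ");
    if (fields.length == 2) {
      return fields[0] + " " + placeholder + " " + fields[1];
    }
    return line;
  }

  public static void main(String[] args) throws IOException {
    String[] parsed = parseLine("Sao B-LOC");
    System.out.println(parsed[0] + " -> " + parsed[1]);
    System.out.println(padWithPlaceholderPos("Sao B-LOC", "VP"));
  }
}
```

Either approach gets past the exception for "Sao B-LOC"; the code-side change has the advantage that the original data files can stay untouched.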