OpenNLP is designed to support many training formats, but we had to
decide on one default, and we chose the format that had always been
supported.
We can support the proposed TCF format. Would you be interested in
contributing parsing code for it?
Jörn
On 10/14/2013 09:59 PM, Thomas Zastrow wrote:
Hello,
In any case, I think it's a little bit old-school to identify tokens and
additional annotations just with spaces between them ... what about a
nice XML format (no, not that ISO crap .. what about TCF [1])? Or maybe
NEGRA?
Best,
Tom
[1] http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/The_TCF_Format
On 14.10.2013 21:53, Charles Martin wrote:
What happens if all the entity tokens are at the beginning of every line?
I find that OpenNLP then thinks that any string near the beginning of a
line is an entity, regardless of the content or word context.
On Mon, Oct 14, 2013 at 12:48 PM, Thomas Zastrow <[email protected]> wrote:
Thanks. That explains a lot ... :-)
Does it matter whether it is one or two blanks?
On 14.10.2013 21:44, William Colen wrote:
Yes, it does. Include a blank between all elements, including punctuation
and annotations. The corpus must be tokenized.
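To illustrate the spacing rule, here is a minimal Python sketch that pads the annotation tags in an existing training line with blanks. The function name pad_annotation_tags and the regular expression are hypothetical helpers, not part of OpenNLP. It also collapses repeated whitespace into single blanks, which is a normalization choice of this sketch rather than a documented requirement, and it does not tokenize punctuation, so that step still has to be done separately.

```python
import re

# Matches name-finder annotation tags such as <START:loc>, <START> or <END>.
# (Hypothetical helper pattern, written for this sketch only.)
TAG = re.compile(r"<START(?::[^>\s]+)?>|<END>")


def pad_annotation_tags(line: str) -> str:
    """Put a blank on each side of every annotation tag.

    Assumption of this sketch: runs of whitespace are collapsed to a
    single blank and leading/trailing whitespace is dropped.
    """
    padded = TAG.sub(lambda m: f" {m.group(0)} ", line)
    return " ".join(padded.split())


if __name__ == "__main__":
    print(pad_annotation_tags("<START:loc>Hamburg<END> is a city ."))
```

A line that already has blanks around its tags passes through unchanged, so the function can be run over a whole corpus file line by line.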
2013/10/14 Thomas Zastrow <[email protected]>
Hello,
I have a question: when creating training material, does it make a
difference if there are " " (blanks) around the NE? In other words, is
it the same to have:
<START:loc>Hamburg<END>
or:
<START:loc> Hamburg <END>
The example in the documentation is shown with the " " ... ?
Best,
Tom
P.S.: ca. 1300 sentences for a free German NE model are done :-)