What is the legal status of data that is tagged using an encumbered model? For example, if I tag documents with OpenNLP's non-free models on SourceForge, the tagged output is a "derived work". Is this tagged output considered free? Does this depend on the license of the original data?

On Wed, Jul 18, 2012 at 1:28 AM, Jörn Kottmann <[email protected]> wrote:
> On 07/18/2012 04:30 AM, Lance Norskog wrote:
>>
>> Please use unencumbered training data for all future OpenNLP projects.
>
> We of course would like to do that, but it is not that easy.
> For coreference there is no good data set which is available
> under some kind of Open Source license.
>
> The only way to *fix* that is to produce your own training
> data based on a text source which can be shared under an
> OS license.
>
> We started to work on making tooling to crowd source such annotations,
> but we still need to do a lot to finish this. So any help in this area is
> very welcome.
>
>> What exactly does a coref training dataset have to include? What kind
>> of tagging or cross-referencing?
>
> - Full or shallow parse
> - Named Entities
> - Linked mentions
>
> Have a look at this thread:
> http://mail-archives.apache.org/mod_mbox/opennlp-dev/201203.mbox/%[email protected]%3E
>
> I proposed the new format there and then implemented it.
>
> For OntoNotes we need to do some adaption to get it into something
> you can use for training, e.g. filtering verb mentions, doing the parsing,
> etc.
> If we get it trained nicely on this dataset it would be a good step forward.
>
> Jörn
>

--
Lance Norskog
[email protected]
