What is the legal status of data that has been tagged using an encumbered
model? I mean, if I tag documents with OpenNLP's non-free models hosted on
SourceForge, the tagged output is a "derived work". Is this tagged
output considered free? Does that depend on the license of the
original data?

On Wed, Jul 18, 2012 at 1:28 AM, Jörn Kottmann <[email protected]> wrote:
> On 07/18/2012 04:30 AM, Lance Norskog wrote:
>>
>> Please use unencumbered training data for all future OpenNLP projects.
>
>
> We of course would like to do that, but it is not that easy.
> For coreference there is no good data set which is available
> under some kind of Open Source license.
>
> The only way to *fix* that is to produce your own training
> data based on a text source which can be shared under an
> Open Source license.
>
> We started to work on making tooling to crowd source such annotations,
> but we still need to do a lot to finish this. So any help in this area is
> very welcome.
>
>
>> What exactly does a coref training dataset have to include? What kind
>> of tagging or cross-referencing?
>
>
> - Full or shallow parse
> - Named Entities
> - Linked mentions
>
> Have a look at this thread:
> http://mail-archives.apache.org/mod_mbox/opennlp-dev/201203.mbox/%[email protected]%3E
>
> I proposed the new format there and then implemented it.
>
> For OntoNotes we need to do some adaptation to get it into something
> you can use for training, e.g. filtering verb mentions, doing the parsing,
> etc.
> If we get it trained nicely on this dataset it would be a good step forward.
>
> Jörn
>



-- 
Lance Norskog
[email protected]
