Please use unencumbered training data for all future OpenNLP projects.

What exactly does a coref training dataset have to include? What kind
of tagging or cross-referencing does it need?
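
For instance, is the tagging something like MUC-style SGML, where every
mention is wrapped in a COREF element with an ID, and anaphors carry a
REF back to their antecedent? A made-up example (attribute names as in
MUC-7, from memory, so possibly off):

  <COREF ID="1">John Smith</COREF> said <COREF ID="2" TYPE="IDENT"
  REF="1">he</COREF> would send the parses.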

On Tue, Jul 17, 2012 at 10:59 AM, John Stewart <[email protected]> wrote:
> Ah good, I was going to ask about parses too -- so this is done.  I'll
> start reading the code tonight.
>
> OntoNotes is smallish, yes?  Is the English bit larger than the CoNLL
> data set?  In terms of cost, isn't it free?
>
> Thanks,
>
> jds
>
> On Tue, Jul 17, 2012 at 11:09 AM, Jörn Kottmann <[email protected]> wrote:
>> On 07/17/2012 05:03 PM, John Stewart wrote:
>>>
>>> OK, so per this: https://issues.apache.org/jira/browse/OPENNLP-54
>>>
>> you're saying that results may improve with the CoNLL training set,
>>> yes?  That definitely seems worth trying to me.  Now, what, if any,
>>> policies are there about dependencies between OpenNLP modules?  I ask
>>> because the coref task might benefit from the NE output -- perhaps
>>> they are already linked!
>>
>>
>> The input for coref is this:
>> - Full or shallow parse (depends on how the model was trained)
>> - NER output
>>
>> All this information is encoded into Parse objects and therefore no
>> direct link between the components is necessary.
>> You can see this nicely when you run the command line demo.
>>
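
If I'm reading the code right, the flow is roughly the sketch below --
untested, with class names from the old opennlp.tools.coref package as
I remember them, and "coref" standing in for the model directory:

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  import opennlp.tools.coref.DiscourseEntity;
  import opennlp.tools.coref.Linker;
  import opennlp.tools.coref.LinkerMode;
  import opennlp.tools.coref.TreebankLinker;
  import opennlp.tools.coref.mention.DefaultParse;
  import opennlp.tools.coref.mention.Mention;
  import opennlp.tools.parser.Parse;

  public class CorefSketch {
    // parses: one Parse per sentence of the document, with the NER
    // spans already folded in as constituents, as described above.
    static DiscourseEntity[] resolve(Parse[] parses) throws Exception {
      Linker linker = new TreebankLinker("coref", LinkerMode.TEST);
      List<Mention> mentions = new ArrayList<Mention>();
      for (int i = 0; i < parses.length; i++) {
        // Mention detection runs per sentence on a Parse wrapper.
        Mention[] found = linker.getMentionFinder()
            .getMentions(new DefaultParse(parses[i], i));
        mentions.addAll(Arrays.asList(found));
      }
      // One call per document clusters the mentions into entities.
      return linker.getEntities(mentions.toArray(new Mention[0]));
    }
  }

So the components stay decoupled: everything travels inside the Parse
objects, which matches what the command line demo shows.
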
>> Yes, we need a corpus to train it on. Maybe OntoNotes would be a good
>> candidate; it's affordable for everyone.
>>
>> What do you think?
>>
>> Jörn
>>



-- 
Lance Norskog
[email protected]
