Hi community,

Just to flag active interest in the coreference module.

It plays an important role in my team's pipeline - we are interested in
relation extraction.  The module, in my view, is a strong advantage of the
excellent OpenNLP project.  I agree that it feels a little neglected compared
to the rest of the project, likely due to its complexity.  To discard it as
abandonware would be a sad loss.  I've commented on this list previously about
my experience getting the module working (I don't claim to be an expert!), so
it seems there is other active interest too.
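
For anyone else trying to get it running, this is roughly the shape of the
usage I ended up with.  Class names are from the opennlp.tools.coref package
as I remember the 1.5-era API, and "coref-models" and parsedSentences are
placeholders for your own model directory and parser output, so please treat
this as a sketch and double-check against the sources:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import opennlp.tools.coref.DiscourseEntity;
    import opennlp.tools.coref.Linker;
    import opennlp.tools.coref.LinkerMode;
    import opennlp.tools.coref.TreebankLinker;
    import opennlp.tools.coref.mention.DefaultParse;
    import opennlp.tools.coref.mention.Mention;
    import opennlp.tools.parser.Parse;

    // parsedSentences: full syntactic parses, one per sentence, from the
    // OpenNLP parser; "coref-models" is the unpacked coref model directory.
    Linker linker = new TreebankLinker("coref-models", LinkerMode.TEST);

    List<Mention> document = new ArrayList<Mention>();
    int sentenceNumber = 0;
    for (Parse parse : parsedSentences) {
        // Pull candidate mentions out of each parsed sentence.
        Mention[] extents = linker.getMentionFinder()
            .getMentions(new DefaultParse(parse, sentenceNumber));
        document.addAll(Arrays.asList(extents));
        sentenceNumber++;
    }
    // Resolve the collected mentions into cross-sentence discourse entities.
    DiscourseEntity[] entities =
        linker.getEntities(document.toArray(new Mention[0]));

(If I recall correctly, the old demo code also constructs stub parses for
mentions that come back without constituents before calling getEntities;
I've left that out here for brevity.)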

A cross-document entity disambiguation tool would indeed be an awesome addition!

James - thanks for your guidance on efforts to navigate copyright issues and 
build up-to-date models!

Thanks,

Ant


On Oct 1, 2013, at 9:00 PM, James Kosin <[email protected]> wrote:

> Mark & Michael & Others,
> 
> The current models were trained on old annotated news articles and really 
> only serve as useful examples.  They were never meant to be complete or 
> fully trained.  The copyright issues are complicated, but in a nutshell, 
> the owners of the corpora that were used allow us, in most cases, to use 
> the generated data for educational and research purposes only.  This means 
> that commercial use is strictly forbidden by the copyright holders, never 
> mind the fact that you can't regenerate the original material from the 
> models.  I know it sounds like an odd copyright, and some models may be a 
> bit more lenient on the details.
> 
> The corpora were generated over the years by people doing research and 
> other tasks, via CoNLL and other projects, to train models for POS tagging, 
> NER, and other kinds of pre-processing of textual data.  Most of these have 
> ongoing yearly or biyearly shared tasks to do additional work in these 
> areas.  OpenNLP isn't directly involved in these (to my knowledge... I'm 
> sure to get some bad press on this).  But the goal of those projects is to 
> assemble training and test data for experimenting with different model 
> approaches, to see whether a best model can be found for a given kind of 
> parsing, processing, or understanding of the textual data.
> 
> With an Apache license, we have to be able to distribute the sources for 
> the models in order to align with the license... as such, we have side 
> projects set up to research and develop an easier method to generate and 
> tag the data for the various types of corpus data we need to train against. 
> But the catch is that the data we gather needs to be FREE of any copyright 
> restrictions... we have found several avenues that seem promising in this 
> area.
> https://cwiki.apache.org/confluence/display/OPENNLP/OpenNLP+Annotations
> 
> We have sources in the sandbox for this and other in-progress work in the 
> OpenNLP project as well:
>    http://svn.apache.org/viewvc/opennlp/sandbox/    [via ViewVC]
>    https://svn.apache.org/repos/asf/opennlp/sandbox/    [via subversion]
> 
> By all means, please get involved!
> We need people who can read and annotate various languages.  We need people 
> who can test models.  We need people who can come up with new ideas.  We 
> have other projects on the wiki for adding support for model types other 
> than just maxent.  There is also another for using SORA as the language.
> 
> Thanks for listening to me,
> James Kosin
> 
