> The current models were trained on old annotated news articles and are
> really just meant as useful examples.  They were never meant to be
> complete or exhaustive in their training.  The copyright issues are
> complicated, but in a nutshell the owners of the corpora that were used
> allow us, in most cases, to use the generated data for educational and
> research purposes only.  This means that commercial use is strictly
> forbidden by the copyright holders, never mind the fact that you can't
> regenerate or reproduce the original material from the models.  I know it
> sounds like an odd copyright, and some models may be a bit more lenient on
> the details of the copyright.

Do you know how many sentences/tokens were annotated for the OpenNLP
POS and CHUNK models?  Do you have an idea of the "sweet spot" for
number of annotations vs performance?
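
One rough way to measure that sweet spot empirically is to train on
growing slices of a word_tag corpus and watch where per-token accuracy
flattens out.  Here is a minimal sketch against a recent opennlp-tools
Java API (1.6-style streams); train.pos, test.pos, the 1000-sentence
step, and the class name are placeholders of mine, not project
conventions:

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    import opennlp.tools.postag.POSEvaluator;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerFactory;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.postag.WordTagSampleStream;
    import opennlp.tools.util.CollectionObjectStream;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class PosLearningCurve {

        // Load word_tag sentences (one sentence per line) into memory so
        // we can train on slices of increasing size.
        static List<POSSample> load(String path) throws Exception {
            List<POSSample> sentences = new ArrayList<>();
            try (ObjectStream<POSSample> in = new WordTagSampleStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(new File(path)),
                            StandardCharsets.UTF_8))) {
                for (POSSample s = in.read(); s != null; s = in.read()) {
                    sentences.add(s);
                }
            }
            return sentences;
        }

        public static void main(String[] args) throws Exception {
            List<POSSample> train = load("train.pos");  // placeholder paths
            List<POSSample> test = load("test.pos");

            // Train on growing slices and report per-token accuracy on the
            // held-out test sentences; the curve shows where more
            // annotation stops paying off.
            for (int n = 1000; n <= train.size(); n += 1000) {
                POSModel model = POSTaggerME.train("en",
                        new CollectionObjectStream<>(train.subList(0, n)),
                        TrainingParameters.defaultParams(),
                        new POSTaggerFactory());

                POSEvaluator eval = new POSEvaluator(new POSTaggerME(model));
                eval.evaluate(new CollectionObjectStream<>(test));
                System.out.printf("%d sentences -> %.4f accuracy%n",
                        n, eval.getWordAccuracy());
            }
        }
    }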

Peace.  Michael

On Tue, Oct 1, 2013 at 8:00 PM, James Kosin <[email protected]> wrote:
> Mark & Michael & Others,
>
> The current models were trained on old annotated news articles and are
> really just meant as useful examples.  They were never meant to be
> complete or exhaustive in their training.  The copyright issues are
> complicated, but in a nutshell the owners of the corpora that were used
> allow us, in most cases, to use the generated data for educational and
> research purposes only.  This means that commercial use is strictly
> forbidden by the copyright holders, never mind the fact that you can't
> regenerate or reproduce the original material from the models.  I know it
> sounds like an odd copyright, and some models may be a bit more lenient on
> the details of the copyright.
>
> The corpora were generated over the years by people doing research and
> other tasks, via CoNLL and other projects, to train models for POS, NER,
> and other types of pre-processing of textual data.  Most of these run
> ongoing yearly or biyearly shared tasks to do additional work in these
> areas.  OpenNLP isn't directly involved in these (to my knowledge... I'm
> sure to get some bad press on this).  But the goal of those projects is to
> get a set of training and test data for experimenting and researching
> different model approaches, to see if a best model can be found for the
> kind of parsing, processing, understanding, etc. of the textual data that
> the situation calls for.
>
> Under the Apache license, we have to be able to distribute the sources for
> the models in order to comply with the license... as such, we have other
> side projects set up to research and develop an easier method to generate
> and tag the data for the various types of corpus data we need to train
> against.  But the catch is that the data we gather needs to be FREE of any
> copyright restrictions... we have found several avenues that seem
> promising in this area.
> https://cwiki.apache.org/confluence/display/OPENNLP/OpenNLP+Annotations
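>
> As a rough idea of the kind of tagging this involves: the POS trainer,
> for example, reads one sentence per line as word_tag pairs, something
> like (made-up sample sentences)
>
>     About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.
>     That_DT sounds_VBZ good_JJ ._.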
>
> We have sources for this and other work in progress in the OpenNLP
> sandbox as well:
>     http://svn.apache.org/viewvc/opennlp/sandbox/    [via ViewVC]
>     https://svn.apache.org/repos/asf/opennlp/sandbox/    [via subversion]
>
> By all means please get involved!
> We need people who can read and annotate various languages.  We need people
> who can test models.  We need people who can come up with new ideas.  We
> have other projects on the wiki for adding support for model types other
> than just maxent.  There is also another one for using SORA as the
> language.
>
> Thanks for listening to me,
> James Kosin
>
