Re: Releasing a Language Detection Model

2017-07-10 Thread Jeff Zemerick
+1 to an opennlp-models jar on Maven Central that contains the models.
+1 to having the models available for download separately (if easily
possible) for users who know what they want.
+1 to having the training data shared somewhere with scripts to generate
the models. It will help protect against losing data as William mentioned.
I don't think we should depend on others to reliably host the data. I'll
volunteer to help script the model generation to run on a fleet of EC2
instances if it helps.

If the user does not provide a model to use on the CLI, can the CLI tools
look on the classpath for a model whose name fits the needed model (like
en-ner-person.bin) and if found use it automatically?
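
To make the classpath idea concrete, here is a minimal sketch of what such a
lookup could look like. It uses only the JDK, not any existing OpenNLP CLI
code. The class name ClasspathModelLookup, both helper methods, and the
<lang>-<component>-<type>.bin naming pattern are assumptions made for
illustration, extrapolated from the en-ner-person.bin example above.

```java
import java.io.InputStream;
import java.util.Optional;

// Hypothetical helper, not part of OpenNLP: shows how a CLI tool might fall
// back to a model on the classpath when the user passes no model file.
public class ClasspathModelLookup {

    // Builds a resource name such as "en-ner-person.bin".
    // The naming pattern is an assumption based on the example in this mail.
    static String modelResourceName(String lang, String component, String type) {
        String base = lang + "-" + component;
        if (type != null && !type.isEmpty()) {
            base += "-" + type;
        }
        return base + ".bin";
    }

    // Returns a stream for the named resource if it is on the classpath,
    // or an empty Optional if no such resource exists.
    static Optional<InputStream> findModel(String resourceName) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClasspathModelLookup.class.getClassLoader();
        }
        return Optional.ofNullable(cl.getResourceAsStream(resourceName));
    }

    public static void main(String[] args) throws Exception {
        String name = modelResourceName("en", "ner", "person");
        Optional<InputStream> model = findModel(name);
        if (model.isPresent()) {
            // A real tool would hand this stream to the model loader.
            try (InputStream in = model.get()) {
                System.out.println("Found " + name + " on the classpath");
            }
        } else {
            System.out.println("No " + name + " on the classpath; a model "
                + "file would have to be given explicitly");
        }
    }
}
```

With a models jar on the classpath the lookup would succeed without any
extra flag, and an explicitly supplied model file could still take
precedence.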

Jeff



On Mon, Jul 10, 2017 at 5:06 PM, Chris Mattmann wrote:

> +1. In terms of releasing models, maybe an opennlp-models package, and then
> use the Maven structure src/main/resources//*.bin to hold the models.
>
> Then use an assembly descriptor to package the above into a *-bin.jar?
>
> Cheers,
> Chris
>
>
>
>
> On 7/10/17, 4:09 PM, "Joern Kottmann"  wrote:
>
> My opinion about this is that we should offer the model as a Maven
> dependency for users who just want to use it in their projects, and
> also offer models for download so people can quickly try out OpenNLP.
> If the models can be downloaded, a new user could very quickly test
> them via the command line.
>
> I don't really have any thoughts yet on how we should organize it. It
> would probably be nice to have some place where we can share all the
> training data, and then have the scripts to produce the models checked
> in. It should be easy to retrain all the models in case we do a major
> release.
>
> If a corpus vanishes, we should drop support for it; it must be
> obsolete by then.
>
> Jörn
>
> On Mon, Jul 10, 2017 at 8:50 PM, William Colen wrote:
> > We need to address things such as sharing the evaluation results and
> > how to reproduce the training.
> >
> > There are several possibilities for that, but there are points to
> > consider:
> >
> > Will we store the model itself in an SCM repository or only the code
> > that can build it?
> > Will we deploy the models to Maven Central? That is good for people
> > using the Java API but not for the command line interface. Should we
> > change the CLI to handle models on the classpath?
> > Should we keep a copy of the training data or always download it from
> > the original provider? We can't guarantee that the corpus will be
> > there forever, not only because its license may change, but simply
> > because the provider may stop keeping the server up.
> >
> > William
> >
> >
> >
> > 2017-07-10 14:52 GMT-03:00 Joern Kottmann :
> >
> >> Hello all,
> >>
> >> Since Apache OpenNLP 1.8.1 we have a new language detection
> >> component which, like all our components, has to be trained. I think
> >> we should release a pre-built model for it, trained on the Leipzig
> >> corpus. This will allow the majority of our users to get started very
> >> quickly with language detection without needing to figure out how to
> >> train it.
> >>
> >> How should this project release models?
> >>
> >> Jörn
> >>
>
>
>
>


Re: Releasing a Language Detection Model

2017-07-10 Thread Joern Kottmann
My opinion about this is that we should offer the model as a Maven
dependency for users who just want to use it in their projects, and
also offer models for download so people can quickly try out OpenNLP.
If the models can be downloaded, a new user could very quickly test
them via the command line.

I don't really have any thoughts yet on how we should organize it. It
would probably be nice to have some place where we can share all the
training data, and then have the scripts to produce the models checked
in. It should be easy to retrain all the models in case we do a major
release.

If a corpus vanishes, we should drop support for it; it must be
obsolete by then.

Jörn

On Mon, Jul 10, 2017 at 8:50 PM, William Colen wrote:
> We need to address things such as sharing the evaluation results and how to
> reproduce the training.
>
> There are several possibilities for that, but there are points to consider:
>
> Will we store the model itself in an SCM repository or only the code that
> can build it?
> Will we deploy the models to Maven Central? That is good for people using
> the Java API but not for the command line interface. Should we change the
> CLI to handle models on the classpath?
> Should we keep a copy of the training data or always download it from the
> original provider? We can't guarantee that the corpus will be there
> forever, not only because its license may change, but simply because the
> provider may stop keeping the server up.
>
> William
>
>
>
> 2017-07-10 14:52 GMT-03:00 Joern Kottmann :
>
>> Hello all,
>>
>> Since Apache OpenNLP 1.8.1 we have a new language detection component
>> which, like all our components, has to be trained. I think we should
>> release a pre-built model for it, trained on the Leipzig corpus. This
>> will allow the majority of our users to get started very quickly with
>> language detection without needing to figure out how to train it.
>>
>> How should this project release models?
>>
>> Jörn
>>


Re: Releasing a Language Detection Model

2017-07-10 Thread William Colen
We need to address things such as sharing the evaluation results and how to
reproduce the training.

There are several possibilities for that, but there are points to consider:

Will we store the model itself in an SCM repository or only the code that
can build it?
Will we deploy the models to Maven Central? That is good for people using
the Java API but not for the command line interface. Should we change the
CLI to handle models on the classpath?
Should we keep a copy of the training data or always download it from the
original provider? We can't guarantee that the corpus will be there
forever, not only because its license may change, but simply because the
provider may stop keeping the server up.

William



2017-07-10 14:52 GMT-03:00 Joern Kottmann :

> Hello all,
>
> Since Apache OpenNLP 1.8.1 we have a new language detection component
> which, like all our components, has to be trained. I think we should
> release a pre-built model for it, trained on the Leipzig corpus. This
> will allow the majority of our users to get started very quickly with
> language detection without needing to figure out how to train it.
>
> How should this project release models?
>
> Jörn
>


Releasing a Language Detection Model

2017-07-10 Thread Joern Kottmann
Hello all,

Since Apache OpenNLP 1.8.1 we have a new language detection component
which, like all our components, has to be trained. I think we should
release a pre-built model for it, trained on the Leipzig corpus. This
will allow the majority of our users to get started very quickly with
language detection without needing to figure out how to train it.

How should this project release models?

Jörn