Great idea!
+1 for releasing models.
+1 to publishing models in jars on Maven Central. This is the fastest way to
get somebody started. Moreover, having an extensible mechanism for others
to do it on their own is really helpful. I did this with extJWNL for
packaging the WordNet data files. It is also convenient for packaging one's
own custom dictionaries and providing them via repositories. It reuses the
existing infrastructure for things like versioning and distribution. Model
metadata has to be thought through, though. Oh, what a mouthful...
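For the Maven Central route, a consumer-side sketch of what depending on a packaged model could look like. The coordinates below are hypothetical - nothing like this is published yet:

```xml
<!-- Hypothetical coordinates for a packaged model artifact -->
<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-models-langdetect</artifactId>
  <version>1.8.1</version>
</dependency>
```

Versioning would then come for free: a model retrained for a new release just bumps the artifact version.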
+1 for separate download ("no dependency manager" cases)
+1 to publishing data/scripts/provenance. The more reproducible it is, the
better.
+1 for some mechanism of loading models from classpath.
~ +1 to maybe exploring the classpath for a "default" model in API (code)
use cases, perhaps similarly to Dictionary.getDefaultResourceInstance()
from extJWNL. But this has to be thought through well, as design mistakes
here might release some demons from jar hell. I didn't run into them
myself, but I'm not sure the extJWNL design is the best one, as I didn't do
much research on alternatives. And I'd think twice before adding model jars
to the main binary distribution.
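To make the classpath-lookup idea concrete, here is a minimal sketch of one possible resolution order, using only the JDK. The ModelLocator class and the classpath-then-file-system order are my assumptions for illustration, not an existing OpenNLP API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public final class ModelLocator {

    private ModelLocator() {
    }

    // Hypothetical resolution order: try the classpath first (a model
    // shipped inside a jar), then fall back to the plain file system,
    // so both API users and CLI users are covered by one code path.
    public static InputStream openModel(String name) throws IOException {
        InputStream fromClasspath =
                ModelLocator.class.getClassLoader().getResourceAsStream(name);
        if (fromClasspath != null) {
            return fromClasspath;
        }
        Path onDisk = Paths.get(name);
        if (Files.exists(onDisk)) {
            return Files.newInputStream(onDisk);
        }
        throw new IOException(
                "Model not found on classpath or file system: " + name);
    }
}
```

With something like this, the CLI could keep accepting plain file paths while API users get a classpath fallback - which is exactly where the "demons from jar hell" caveat above would need careful design (e.g. what happens when two jars ship a model with the same name).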
+1 to storing only the model-building code in the SCM repository. I would
not bloat the SCM with binaries. Maven repositories, though not ideal, are
better for this than SCM (and there are specialized tools like JFrog
Artifactory).
~ -1 on changing the CLI to use models from the classpath. There was no
concrete proposal, but my understanding is that it would be some sort of
classpath:// URL - please correct or clarify. I'd like to see the proposal
and the use cases where that is more convenient than the current way of
just pointing to a file.
Perhaps it depends. Our models are already zips with manifests, and jars
are zips too. So perhaps we could change the model packaging layout to make
it more "jar-like", or augment it with metadata for discovering default
models on the classpath - covering the above cases of distributing through
Maven repositories and loading from code - while leaving the CLI as it is:
even if your model is technically on the classpath, in most cases you can
still point to the jar in the file system. Dealing with the classpath seems
more suitable (convenient, safer, customary, ...) for developers fiddling
with code than for users fiddling with the command line.
+1 for mirroring source corpora. The more reproducible things are, the
better. But costs (infrastructure) and licenses (this looks like
redistribution, which is not always allowed) might be an issue.
I'd also propose augmenting the model metadata with (optional) information
about the source corpora, provenance, and as much reproduction information
as possible - mostly for easier reproduction and provenance tracking. In my
experience I had trouble recalling what y-d-u-en.bin was trained on: which
revision of which corpus, which part or subset, which language, and whether
the corpus also had other annotations (and respective models) for
connecting all the possible models derived from it (e.g.
sent-tok-pos-chunk-...).
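To make the provenance idea concrete, here is a sketch of what such (optional) metadata entries could look like. Every key name and value below is purely illustrative - this is not an existing OpenNLP manifest convention, just the kind of information I mean:

```java
import java.util.Properties;

public final class ModelProvenance {

    private ModelProvenance() {
    }

    // Hypothetical provenance section for a model's manifest. The keys are
    // illustrations of what we could record, not an agreed-upon schema.
    public static Properties example() {
        Properties meta = new Properties();
        meta.setProperty("corpus.name", "Leipzig Corpora Collection");
        meta.setProperty("corpus.revision", "2017-07");      // which snapshot was used
        meta.setProperty("corpus.subset", "news");           // which part of the corpus
        meta.setProperty("corpus.language", "en");
        meta.setProperty("training.tool.version", "1.8.1");  // OpenNLP version used to train
        meta.setProperty("related.models", "sent,tok,pos,chunk"); // siblings from the same corpus
        return meta;
    }
}
```

Since our models are already zips with manifests, entries like these could live alongside the existing manifest data without changing the file format.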
Aliaksandr
On 10 July 2017 at 17:41, Jeff Zemerick <[email protected]> wrote:
> +1 to an opennlp-models jar on Maven Central that contains the models.
> +1 to having the models available for download separately (if easily
> possible) for users who know what they want.
> +1 to having the training data shared somewhere with scripts to generate
> the models. It will help protect against losing data as William mentioned.
> I don't think we should depend on others to reliably host the data. I'll
> volunteer to help script the model generation to run on a fleet of EC2
> instances if it helps.
>
> If the user does not provide a model to use on the CLI, can the CLI tools
> look on the classpath for a model whose name fits the needed model (like
> en-ner-person.bin) and if found use it automatically?
>
> Jeff
>
>
>
> On Mon, Jul 10, 2017 at 5:06 PM, Chris Mattmann <[email protected]>
> wrote:
>
> > +1. In terms of releasing models, maybe an opennlp-models package, and
> then
> > using Maven structure of src/main/resources/<package prefix dirs>/*.bin
> for
> > putting the models.
> >
> > Then using an assembly descriptor to compile the above into a *-bin.jar?
> >
> > Cheers,
> > Chris
> >
> >
> >
> >
> > On 7/10/17, 4:09 PM, "Joern Kottmann" <[email protected]> wrote:
> >
> > My opinion about this is that we should offer the model as maven
> > dependency for users who just want to use it in their projects, and
> > also offer models for download for people to quickly try out OpenNLP.
> > If the models can be downloaded, new users could very quickly test
> > it via the command line.
> >
> > I don't really have any thoughts yet on how we should organize it, it
> > would probably be nice to have some place where we can share all the
> > training data, and then have the scripts to produce the models
> checked
> > in. It should be easy to retrain all the models in case we do a major
> > release.
> >
> > In case a corpus vanishes we should drop support for it; it must be
> > obsolete then.
> >
> > Jörn
> >
> > On Mon, Jul 10, 2017 at 8:50 PM, William Colen <[email protected]>
> > wrote:
> > > We need to address things such as sharing the evaluation results
> and
> > how to
> > > reproduce the training.
> > >
> > > There are several possibilities for that, but there are points to
> > consider:
> > >
> > > Will we store the model itself in a SCM repository or only the code
> > that
> > > can build it?
> > > Will we deploy the models to a Maven Central repository? It is good
> > for
> > > people using the Java API but not for command line interface,
> should
> > we
> > > change the CLI to handle models in the classpath?
> > > Should we keep a copy of the training model or always download from
> > the
> > > original provider? We can't guarantee that the corpus will be there
> > > forever, not only because it changed license, but simply because
> the
> > > provider is not keeping the server up anymore.
> > >
> > > William
> > >
> > >
> > >
> > > 2017-07-10 14:52 GMT-03:00 Joern Kottmann <[email protected]>:
> > >
> > >> Hello all,
> > >>
> > >> since Apache OpenNLP 1.8.1 we have a new language detection
> > component
> > >> which like all our components has to be trained. I think we should
> > >> release a pre-built model for it trained on the Leipzig corpus.
> This
> > >> will allow the majority of our users to get started very quickly
> > with
> > >> language detection without the need to figure out on how to train
> > it.
> > >>
> > >> How should this project release models?
> > >>
> > >> Jörn
> > >>
> >
> >
> >
> >
>