Hello,
right, very good point. I also think it is very important to be able to
load a model in one line from the classpath.
I propose we have the following setup:
- One jar contains one or multiple model packages (that's the zip container)
- A model name itself should be reasonably unique, e.g. eng-ud-token.bin
- A user loads the model via new
SentenceModel(getClass().getResource("eng-ud-sent.bin")), and the stream
is then closed properly (see the sketch below)
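
For illustration, a minimal self-contained version of that one-liner,
using the InputStream constructor with try-with-resources. It assumes the
model file eng-ud-sent.bin sits at the root of a jar on the classpath;
the class name and sample text are just examples:

import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class ClasspathModelDemo {

    public static void main(String[] args) throws IOException {
        // try-with-resources makes sure the model stream is closed
        // properly after loading (error handling omitted for brevity)
        try (InputStream in = ClasspathModelDemo.class
                .getResourceAsStream("/eng-ud-sent.bin")) {
            SentenceModel model = new SentenceModel(in);
            SentenceDetectorME detector = new SentenceDetectorME(model);
            for (String s : detector.sentDetect("First sentence. Second one.")) {
                System.out.println(s);
            }
        }
    }
}
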
Let's take away three things from this discussion:
1) Store the data in a place where the community can access it
2) Offer models on our download page, similar to what is done today on
the SourceForge page
3) Release models packed inside a jar file via Maven Central (example
dependency below)
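
For (3), consuming a released model could then look like this in a Maven
build; the coordinates are hypothetical, just to illustrate the idea:

<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-model-eng-ud-sent</artifactId>
  <version>1.0.0</version>
</dependency>
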
Jörn
On Tue, Jul 11, 2017 at 3:00 PM, Aliaksandr Autayeu
<[email protected]> wrote:
> To clarify on models and jars.
>
> Putting a model inside a jar might not be a good idea. I mean things like
> bla-bla.jar/en-sent.bin here. Our models are already zipped, so in a sense
> they are already "jars". We're good there. However, the current packaging
> and metadata might not be very classpath friendly.
>
> The use case I have in mind is being able to add the needed models as
> dependencies and load them by writing a single line of code. For this case,
> having all models in the root with the same name might not be very
> convenient. The same goes for the manifest. The name "manifest.properties"
> is quite generic, and it's not too far-fetched to see clashes because some
> other lib also manifests something. It might be better to allow for some
> flexibility and to adhere to classpath conventions. For example, having
> manifests in something like org/apache/opennlp/models/manifest.properties,
> or opennlp/tools/manifest.properties. And perhaps even allowing the
> manifest to reference a model, so the model can be put elsewhere. Just in
> case there are several custom models of the same kind for different
> pipelines in the same app. For example, processing queries with one
> pipeline (one set of models) and processing documents with another
> pipeline (another set of models). In this case, allowing for different
> classpath locations is needed.
>
> Perhaps to illustrate my thinking, something like this (which still keeps
> a lot of possibilities open):
>
> en-sent.bin/opennlp/tools/sentdetect/manifest.properties
>     (perhaps contains a line with something like
>     model = /opennlp/tools/sentdetect/model/sent.model)
> en-sent.bin/opennlp/tools/sentdetect/model/sent.model
>
> This allows including en-sent.bin as a dependency and then doing something
> like:
>
> SentenceModel sdm = SentenceModel.getDefaultResourceModel();
>
> That's if we want default models in this way, and if we want any defaults
> at all; it seems verbose enough to allow for some safety through
> explicitness. Or something like:
>
> SentenceModel sdm = SentenceModel.getResourceModel(
>     "/opennlp/tools/sentdetect/manifest.properties");
>
> Or:
>
> SentenceModel sdm = SentenceModel.getResourceModel(
>     "/opennlp/tools/sentdetect/model/sent.model");
>
> Or, more in line with the current style:
>
> SentenceModel sdm = new SentenceModel(
>     "/opennlp/tools/sentdetect/model/sent.model");
>
> Though here we commit to interpreting the String as a classpath reference;
> that's why I'd prefer more explicit method names. Or we leave dealing with
> resources to the users, leave the current code intact, and provide only
> packaging and distribution:
>
> SentenceModel sdm = new SentenceModel(
>     this.getClass().getResourceAsStream("/.../.../manifest or model"));
>
>
> And we could add to the model metadata F1/accuracy figures (at least
> cross-validation based, for example 10-fold) for quick reference, or for a
> quick understanding of what that model is capable of. This could be
> helpful for those with a bunch of models around, and for others as well,
> to have better insight into the model in question.
>
>
>
> On 11 July 2017 at 06:37, Chris Mattmann <[email protected]> wrote:
>
>> Hi,
>>
>> FWIW, I've seen CLI tools (lots in my day) that can load a model from
>> the CLI to override an internal classpath dependency. This is for people
>> in environments who want a sensible, delivered internal classpath
>> default plus the ability to override it at run time without zipping up
>> or messing with the jar file. Think about people who are using OpenNLP
>> in both Java and Python environments as an example.
>>
>> Cheers,
>> Chris
>>
>>
>>
>>
>> On 7/11/17, 3:25 AM, "Joern Kottmann" <[email protected]> wrote:
>>
>> I would not change the CLI to load models from jar files. I have never
>> used or seen a command line tool that expects a file as input and would
>> then also load it from inside a jar file. It would be hard to
>> communicate precisely how that works in the CLI usage texts, and it is
>> not a feature anyone would expect to be there. The intention of the CLI
>> is to give users the ability to quickly test OpenNLP before they
>> integrate it into their software, and to train and evaluate models.
>>
>> Users who for some reason have a jar file with a model inside can just
>> write "unzip model.jar".
>>
>> All in all, I think this is quite a bit of complexity we would need to
>> add, and it would have very limited use.
>>
>> The use case for publishing jar files is to make the models easily
>> available to people who have a build system with dependency management:
>> they won't have to download models manually, and when they update
>> OpenNLP they can also update the models with a version string change.
>>
>> For the command line "quick start" use case we should offer the models
>> on a download page as we do today. This page could list both the
>> download link and the Maven dependency.
>>
>> Jörn
>>
>> On Mon, Jul 10, 2017 at 8:50 PM, William Colen <[email protected]>
>> wrote:
>> > We need to address things such as sharing the evaluation results and
>> > how to reproduce the training.
>> >
>> > There are several possibilities for that, but there are points to
>> > consider:
>> >
>> > Will we store the model itself in an SCM repository, or only the code
>> > that can build it?
>> >
>> > Will we deploy the models to a Maven Central repository? That is good
>> > for people using the Java API but not for the command line interface;
>> > should we change the CLI to handle models in the classpath?
>> >
>> > Should we keep a copy of the training corpus or always download it
>> > from the original provider? We can't guarantee that the corpus will be
>> > there forever, not only because its license may change, but simply
>> > because the provider may stop keeping the server up.
>> >
>> > William
>> >
>> >
>> >
>> > 2017-07-10 14:52 GMT-03:00 Joern Kottmann <[email protected]>:
>> >
>> >> Hello all,
>> >>
>> >> since Apache OpenNLP 1.8.1 we have a new language detection component
>> >> which, like all our components, has to be trained. I think we should
>> >> release a pre-built model for it trained on the Leipzig corpus. This
>> >> will allow the majority of our users to get started very quickly with
>> >> language detection without needing to figure out how to train it;
>> >> something like the sketch below.
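>> >>
>> >> A minimal quick-start sketch, assuming a released model named
>> >> langdetect-leipzig.bin is on the classpath (the file name is
>> >> hypothetical):
>> >>
>> >> import java.io.IOException;
>> >> import java.io.InputStream;
>> >>
>> >> import opennlp.tools.langdetect.Language;
>> >> import opennlp.tools.langdetect.LanguageDetectorME;
>> >> import opennlp.tools.langdetect.LanguageDetectorModel;
>> >>
>> >> public class LangDetectQuickStart {
>> >>     public static void main(String[] args) throws IOException {
>> >>         // load the pre-built model from the classpath and make sure
>> >>         // the stream is closed after loading
>> >>         try (InputStream in = LangDetectQuickStart.class
>> >>                 .getResourceAsStream("/langdetect-leipzig.bin")) {
>> >>             LanguageDetectorModel model = new LanguageDetectorModel(in);
>> >>             LanguageDetectorME detector = new LanguageDetectorME(model);
>> >>             Language best = detector.predictLanguage("Guten Morgen!");
>> >>             System.out.println(best.getLang() + " " + best.getConfidence());
>> >>         }
>> >>     }
>> >> }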
>> >>
>> >> How should this project release models?
>> >>
>> >> Jörn
>> >>