Sounds good to me…
On 7/11/17, 9:30 AM, "Joern Kottmann" <kottm...@gmail.com> wrote:

Hello,

right, very good point. I also think that it is very important to be able
to load a model in one line from the classpath.

I propose we have the following setup:
- One jar contains one or multiple model packages (that's the zip container)
- A model name itself should be reasonably unique, e.g. eng-ud-token.bin
- A user loads the model via:
  new SentenceModel(getClass().getResource("eng-ud-sent.bin")) <- the stream then gets closed properly

Let's take away three things from this discussion:
1) Store the data in a place where the community can access it
2) Offer models on our download page, similar to what is done today on the SourceForge page
3) Release models packed inside a jar file via Maven Central

Jörn

On Tue, Jul 11, 2017 at 3:00 PM, Aliaksandr Autayeu <aliaksa...@autayeu.com> wrote:

> To clarify on models and jars.
>
> Putting a model inside a jar might not be a good idea. I mean here things like
> bla-bla.jar/en-sent.bin. Our models are already zipped, so they are "jars"
> already in a sense. We're good. However, the current packaging and metadata
> might not be very classpath friendly.
>
> The use case I have in mind is being able to add the needed models as
> dependencies and load them by writing a line of code. For this case, having
> all models in the root with the same name might not be very convenient. The same
> goes for the manifest. The name "manifest.properties" is quite generic, and it's
> not too far-fetched to see some clashes because some other lib also
> manifests something. It might be better to allow for some flexibility and
> to adhere to classpath conventions. For example, having manifests in
> something like org/apache/opennlp/models/manifest.properties. Or
> opennlp/tools/manifest.properties. And perhaps even allowing the manifest to
> reference a model, so the model can be put elsewhere. Just in case
> there are several custom models of the same kind for different pipelines in
> the same app.
> For example, processing queries with one pipeline - one set
> of models - and processing documents with another pipeline - another set of
> models. In this case allowing for different classpaths is needed.
>
> Perhaps to illustrate my thinking, something like this (which still keeps a
> lot of possibilities open):
> en-sent.bin/opennlp/tools/sentdetect/manifest.properties (perhaps contains
> a line with something like model =
> /opennlp/tools/sentdetect/model/sent.model)
> en-sent.bin/opennlp/tools/sentdetect/model/sent.model
>
> This allows including en-sent.bin as a dependency. And then doing something
> like
> SentenceModel sdm = SentenceModel.getDefaultResourceModel(); // if we want
> default models in this way. Seems verbose enough to allow for some safety
> through explicitness. That's if we want any defaults at all.
> Or something like:
> SentenceModel sdm = SentenceModel.getResourceModel("/opennlp/tools/sentdetect/manifest.properties");
> Or
> SentenceModel sdm = SentenceModel.getResourceModel("/opennlp/tools/sentdetect/model/sent.model");
> Or more in line with the current style:
> SentenceModel sdm = new SentenceModel("/opennlp/tools/sentdetect/model/sent.model"); // though here
> we commit to interpreting a String as a classpath reference. That's why I'd
> prefer more explicit method names.
> Or leave dealing with resources to the users, leave the current code intact and
> provide only packaging and distribution:
> SentenceModel sdm = new SentenceModel(this.getClass().getResourceAsStream("/.../.../manifest or model"));
>
>
> And also add F1/accuracy to the model metadata (at least CV-based, for
> example 10-fold) for quick reference or a quick understanding of what that
> model is capable of. It could be helpful for those with a bunch of models
> around, and for others as well, to get better insight into the model in
> question.
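
As a rough sketch of the manifest idea quoted above: a manifest on the classpath carries a "model = ..." line, and a small helper resolves that entry before opening the model itself. The class name ManifestModelLocator, its method names, and the "model" key are hypothetical and only illustrate the proposal; none of them are part of OpenNLP.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class ManifestModelLocator {

    // Assumed key name, following the "model = ..." line suggested in the thread.
    public static final String MODEL_KEY = "model";

    /** Reads a manifest in .properties format and returns the model location it names. */
    public static String resolveModelPath(InputStream manifest) throws IOException {
        Properties props = new Properties();
        props.load(manifest);
        String path = props.getProperty(MODEL_KEY);
        if (path == null || path.trim().isEmpty()) {
            throw new IOException("manifest has no '" + MODEL_KEY + "' entry");
        }
        return path.trim();
    }

    /**
     * Looks up the manifest on the classpath, then opens the model it references.
     * The caller is responsible for closing the returned stream.
     */
    public static InputStream openModel(String manifestResource) throws IOException {
        String modelPath;
        try (InputStream manifest =
                 ManifestModelLocator.class.getResourceAsStream(manifestResource)) {
            if (manifest == null) {
                throw new IOException("manifest not found on classpath: " + manifestResource);
            }
            modelPath = resolveModelPath(manifest);
        }
        InputStream model = ManifestModelLocator.class.getResourceAsStream(modelPath);
        if (model == null) {
            throw new IOException("model not found on classpath: " + modelPath);
        }
        return model;
    }

    public static void main(String[] args) throws IOException {
        // In-memory manifest so the sketch runs without any model jar on the classpath.
        String manifest = "model = /opennlp/tools/sentdetect/model/sent.model\n";
        try (InputStream in =
                 new ByteArrayInputStream(manifest.getBytes(StandardCharsets.ISO_8859_1))) {
            System.out.println(resolveModelPath(in));
        }
    }
}
```

With a model jar on the classpath, the stream returned by openModel("/opennlp/tools/sentdetect/manifest.properties") could be handed to the stream-based SentenceModel constructor shown in the last variant above, which would leave the current code intact. Keeping the manifest under a package-like path is what avoids the name clashes mentioned earlier.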
>
>
>
> On 11 July 2017 at 06:37, Chris Mattmann <mattm...@apache.org> wrote:
>
>> Hi,
>>
>> FWIW, I’ve seen CLI tools – lots in my day – that can load from the CLI to
>> override an internal classpath dependency. This is for people in
>> environments who want a sensible / delivered internal classpath default
>> and the ability for run-time, non zipped up/messing with JAR file
>> override. Think about people who are using OpenNLP in both Java/Python
>> environments as an example.
>>
>> Cheers,
>> Chris
>>
>>
>> On 7/11/17, 3:25 AM, "Joern Kottmann" <kottm...@gmail.com> wrote:
>>
>> I would not change the CLI to load models from jar files. I have never used
>> or seen a command line tool that expects a file as an input and would
>> then also load it from inside a jar file. It will be hard to
>> communicate how that works precisely in the CLI usage texts, and this
>> is not a feature anyone would expect to be there. The intention of the
>> CLI is to give users the ability to quickly test OpenNLP before they
>> integrate it into their software and to train and evaluate models.
>>
>> Users who for some reason have a jar file with a model inside can just
>> write "unzip model.jar".
>>
>> After all, I think this is quite a bit of complexity we would need to
>> add for it, and it will have very limited use.
>>
>> The use case of publishing jar files is to make the models easily
>> available to people who have a build system with dependency
>> management. They won't have to download models manually, and when they
>> update OpenNLP they can also update the models with a version string
>> change.
>>
>> For the command line "quick start" use case we should offer the models
>> on a download page as we do today. This page could list both the
>> download link and the Maven dependency.
>>
>> Jörn
>>
>> On Mon, Jul 10, 2017 at 8:50 PM, William Colen <co...@apache.org> wrote:
>> > We need to address things such as sharing the evaluation results and
>> > how to reproduce the training.
>> >
>> > There are several possibilities for that, but there are points to
>> > consider:
>> >
>> > Will we store the model itself in an SCM repository or only the code
>> > that can build it?
>> > Will we deploy the models to a Maven Central repository? It is good
>> > for people using the Java API but not for the command line interface.
>> > Should we change the CLI to handle models in the classpath?
>> > Should we keep a copy of the training data or always download it from
>> > the original provider? We can't guarantee that the corpus will be there
>> > forever, not only because its license may change, but simply because the
>> > provider may not keep the server up anymore.
>> >
>> > William
>> >
>> >
>> > 2017-07-10 14:52 GMT-03:00 Joern Kottmann <kottm...@gmail.com>:
>> >
>> >> Hello all,
>> >>
>> >> since Apache OpenNLP 1.8.1 we have a new language detection component
>> >> which, like all our components, has to be trained. I think we should
>> >> release a pre-built model for it trained on the Leipzig corpus. This
>> >> will allow the majority of our users to get started very quickly with
>> >> language detection without the need to figure out how to train it.
>> >>
>> >> How should this project release models?
>> >>
>> >> Jörn