Re: Releasing a Language Detection Model
+1

2017-07-11 10:30 GMT-03:00 Joern Kottmann:
> Let's take away three things from this discussion:
> 1) Store the data in a place where the community can access it
> 2) Offer models on our download page, similar to how it is done today
> on the SourceForge page
> 3) Release models packed inside a jar file via maven central
>
> Jörn
Re: Releasing a Language Detection Model
Sounds good to me…

On 7/11/17, 9:30 AM, "Joern Kottmann" wrote:
> Let's take away three things from this discussion:
> 1) Store the data in a place where the community can access it
> 2) Offer models on our download page, similar to how it is done today
> on the SourceForge page
> 3) Release models packed inside a jar file via maven central
Re: Releasing a Language Detection Model
1) This is already included today by default in the model. It is possible
to place more data in it as well, e.g. a file which contains eval results,
a LICENSE and NOTICE file, etc.

2) I would take a "best effort" approach and only publish one model per
task and data set, unless there are really good reasons to publish
multiple. In the case of langdetect, the perceptron and maxent models
perform almost identically, so there is no need to publish both. We should
probably pick the perceptron model because it is slightly faster. And if a
user disagrees with us, that is totally fine: they can always train a model
themselves with their personal preferences. All the knowledge on how to
train a model should be accessible via git, and then it is just a matter of
running the right command to start it.

Jörn

On Tue, Jul 11, 2017 at 3:35 PM, Suneel Marthi wrote:
> ...one last point before wrapping up this discussion. Is it possible
> that you could have more than one lang detect model, trained with
> different algorithms, like say 'MaxEnt', 'Naive Bayes', 'Perceptron'?
Re: Releasing a Language Detection Model
...one last point before wrapping up this discussion. Is it possible that
you could have more than one lang detect model, trained with different
algorithms, like say 'MaxEnt', 'Naive Bayes', 'Perceptron'?

Questions:

1. Do we just publish one model trained on a specific algorithm? If so,
would the metadata have the algorithm information?

2. Do we publish multiple models for the same task, each trained on a
different algorithm?

On Tue, Jul 11, 2017 at 9:30 AM, Joern Kottmann wrote:
> I propose we have the following setup:
> - One jar contains one or multiple model packages (that's the zip container)
> - A model name itself should be reasonably unique, e.g. eng-ud-token.bin
> - A user loads the model via: new
>   SentenceModel(getClass().getResource("eng-ud-sent.bin")) <- the stream
>   then gets closed properly
Re: Releasing a Language Detection Model
Hello,

right, very good point. I also think that it is very important to be able
to load a model from the classpath.

I propose we have the following setup:
- One jar contains one or multiple model packages (that's the zip container)
- A model name itself should be reasonably unique, e.g. eng-ud-token.bin
- A user loads the model via: new
  SentenceModel(getClass().getResource("eng-ud-sent.bin")) <- the stream
  then gets closed properly

Let's take away three things from this discussion:
1) Store the data in a place where the community can access it
2) Offer models on our download page, similar to how it is done today on
the SourceForge page
3) Release models packed inside a jar file via maven central

Jörn

On Tue, Jul 11, 2017 at 3:00 PM, Aliaksandr Autayeu wrote:
> To clarify on models and jars.
>
> Putting a model inside a jar might not be a good idea. I mean here things
> like bla-bla.jar/en-sent.bin. Our models are already zipped, so they are
> "jars" already in a sense.
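Jörn's classpath-loading pattern can be sketched with plain JDK calls. The sketch below is hedged: `SentenceModel` and the model name `eng-ud-sent.bin` are OpenNLP specifics that are not assumed here, so it demonstrates only the `getResource`/stream mechanics, using a resource that is guaranteed to be reachable from any JVM (a `.class` file):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class ClasspathLoadSketch {
    public static void main(String[] args) throws IOException {
        // In OpenNLP this would be e.g. getClass().getResource("eng-ud-sent.bin")
        // with the model jar on the classpath. Here we resolve a resource that
        // is always present (.class files are unconditionally accessible even
        // under the module system) just to show the mechanics.
        URL url = String.class.getResource("String.class");
        if (url == null) {
            throw new IllegalStateException("resource not found on classpath");
        }
        // A model constructor taking a URL can open the stream itself and then
        // close it properly; try-with-resources is the idiomatic way.
        try (InputStream in = url.openStream()) {
            System.out.println("first byte read: " + in.read());
        }
    }
}
```

Passing the URL (rather than an already-open stream) is what lets the model class own the stream's lifecycle, which is the "gets then closed properly" point in the proposal above.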
Re: Releasing a Language Detection Model
To clarify on models and jars.

Putting a model inside a jar might not be a good idea. I mean here things
like bla-bla.jar/en-sent.bin. Our models are already zipped, so they are
"jars" already in a sense. We're good. However, the current packaging and
metadata might not be very classpath friendly.

The use case I have in mind is being able to add the needed models as
dependencies and load them by writing a line of code. For this case, having
all models in the root with the same name might not be very convenient. The
same goes for the manifest. The name "manifest.properties" is quite generic
and it's not too far-fetched to see some clashes because some other lib
also manifests something. It might be better to allow for some flexibility
and to adhere to classpath conventions. For example, having manifests in
something like org/apache/opennlp/models/manifest.properties. Or
opennlp/tools/manifest.properties. And perhaps even allowing a model to be
referenced from the manifest, so the model can be put elsewhere. Just in
case there are several custom models of the same kind for different
pipelines in the same app. For example, processing queries with one
pipeline (one set of models) and processing documents with another pipeline
(another set of models). In this case, allowing for different classpaths is
needed.

Perhaps to illustrate my thinking, something like this (which still keeps a
lot of possibilities open):

en-sent.bin/opennlp/tools/sentdetect/manifest.properties (perhaps contains
a line with something like model = /opennlp/tools/sentdetect/model/sent.model)
en-sent.bin/opennlp/tools/sentdetect/model/sent.model

This allows including en-sent.bin as a dependency, and then doing something
like:

SentenceModel sdm = SentenceModel.getDefaultResourceModel(); // if we want
default models in this way. Seems verbose enough to allow for some safety
through explicitness. That's if we want any defaults at all.

Or something like:

SentenceModel sdm =
SentenceModel.getResourceModel("/opennlp/tools/sentdetect/manifest.properties");

Or:

SentenceModel sdm =
SentenceModel.getResourceModel("/opennlp/tools/sentdetect/model/sent.model");

Or more in line with the current style:

SentenceModel sdm = new
SentenceModel("/opennlp/tools/sentdetect/model/sent.model"); // though here
we commit to interpreting the String as a classpath reference. That's why
I'd prefer more explicit method names.

Or leave dealing with resources to the users, leave the current code
intact, and provide only packaging and distribution:

SentenceModel sdm = new
SentenceModel(this.getClass().getResourceAsStream("/.../.../manifest or
model"));

And also add to the model metadata the F1/accuracy (at least CV-based, for
example 10-fold) for quick reference, or a quick understanding of what that
model is capable of. Could be helpful for those with a bunch of models
around, and for others as well, to get better insight into the model in
question.

On 11 July 2017 at 06:37, Chris Mattmann wrote:
> FWIW, I've seen CLI tools – lots in my day – that can load from the CLI
> to override an internal classpath dependency.
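Aliaksandr's manifest-reference idea is straightforward to prototype. The following is a minimal, JDK-only sketch (the `model` key and the paths are his illustrative examples from above, not an existing OpenNLP API) that parses a manifest in the proposed format and resolves the model's classpath location:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class ManifestSketch {
    /** Resolve the model path referenced by a packaging manifest. */
    static String modelPathFrom(String manifestText) throws IOException {
        Properties manifest = new Properties();
        manifest.load(new StringReader(manifestText));
        // "model" is the proposed key pointing at the model's classpath location.
        return manifest.getProperty("model");
    }

    public static void main(String[] args) throws IOException {
        // Contents of e.g. /opennlp/tools/sentdetect/manifest.properties
        String manifest = "model = /opennlp/tools/sentdetect/model/sent.model\n";
        System.out.println(modelPathFrom(manifest));
        // prints /opennlp/tools/sentdetect/model/sent.model
    }
}
```

In a real implementation the manifest would be read via `getResourceAsStream` and the returned path fed back to the classloader, which is what allows the model file itself to live at any classpath location the manifest names.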
Re: Releasing a Language Detection Model
Hi,

FWIW, I've seen CLI tools – lots in my day – that can load from the CLI to
override an internal classpath dependency. This is for people in
environments who want a sensible, delivered internal classpath default,
plus the ability to override it at run time without zipping up or messing
with the JAR file. Think about people who are using OpenNLP in both Java
and Python environments as an example.

Cheers,
Chris

On 7/11/17, 3:25 AM, "Joern Kottmann" wrote:
> I would not change the CLI to load models from jar files. I never used
> or saw a command line tool that expects a file as an input and would
> then also load it from inside a jar file.
Re: Releasing a Language Detection Model
I am also not for default models. We are a library and people use it inside other software products; that is the place where meaningful defaults can be defined. Maybe our lang model works very well, you take it, hard code it, and forget about it for the next couple of years; or it doesn't work, and you train your own set of models and swap them depending on your input data source. And then there are solutions out there people can use to define configuration for their software projects, such as Spring or Typesafe. And probably something new one day. I am +1 to ensure that OpenNLP is easy to use with the most common ones and to accept PRs that increase ease of use. Jörn

On Tue, Jul 11, 2017 at 3:45 AM, wrote: > +1 for releasing models > > As for the rest, I am not sure how I feel. Is there just one model for the > Language Detector? I don’t want this to become a versioning issue: > langDect.bin version 1 goes with 1.8.1, but version 2 goes with 1.8.2. Can anyone > download the Leipzig corpus? Being able to reproduce the model is very > powerful, because if you have additional data you can add it to the Leipzig > corpus to improve your model. > > I am not a big fan of default models, because it is frustrating as a user > when unexpected things happen (like if you think you are telling it to use > your model, but it uses the default). However, if the code is verbose > enough, this is really not a valid concern. I would want to see the use case > develop. > Daniel > > >> On Jul 10, 2017, at 8:58 PM, Aliaksandr Autayeu >> wrote: >> >> Great idea! >> >> +1 for releasing models. >> >> +1 to publish models in jars on Maven Central. This is the fastest way to >> get somebody started. Moreover, having an extensible mechanism for others >> to do it on their own is really helpful. I did this with extJWNL for >> packaging WordNet data files. It is also convenient for packaging one's own >> custom dictionaries and providing them via repositories. 
It reuses existing >> infrastructure for things like versioning and distribution. Model metadata >> has to be thought through though. Oh, what a mouthful... >> >> +1 for a separate download ("no dependency manager" cases) >> >> +1 to publish data/scripts/provenance. The more reproducible it is, the >> better. >> >> +1 for some mechanism of loading models from the classpath. >> >> ~ +1 to maybe explore the classpath for a "default" model for API (code) use >> cases. Perhaps similarly to Dictionary.getDefaultResourceInstance() from >> extJWNL. But this has to be well thought through, as design mistakes here >> might release some demons from jar hell. I didn't face it, but I'm not sure >> the extJWNL design is best, as I didn't do much research on alternatives. >> And I'd think twice before adding model jars to the main binary distribution. >> >> +1 to store only the model-building code in the SCM repo. I would not bloat the >> SCM with binaries. Maven repositories, though not ideal, are better for this >> than SCM (and there are specialized tools like JFrog). >> >> ~ -1 on changing the CLI to use models from the classpath. There was no >> proposal, but my understanding is that it would be some sort of classpath:// >> URL - please correct or clarify. I'd like to see the proposal and use cases >> where it is more convenient than the current way of just pointing to the file. >> Perhaps it depends. Our models are already zips with manifests. Jars are >> zips too. Perhaps we could change the model packaging layout to make it more >> "jar-like", or augment it with metadata for finding default models on the >> classpath for the above cases of distribution through Maven repositories >> and loading from code, while leaving the CLI as is - even if your model >> is technically on the classpath, in most cases you can point to a jar in >> the file system. It seems that dealing >> with the classpath is more suitable (convenient, safer, customary, ...) 
for >> developers fiddling with code than for users fiddling with the command line. >> >> +1 for mirroring source corpora. The more reproducible things are, the >> better. But costs (infrastructure) and licenses (this looks like >> redistribution, which is not always allowed) might be an issue. >> >> I'd also propose to augment model metadata with (optional) information >> about source corpora, provenance, and as much reproduction information as >> possible, mostly for easier reproduction and provenance tracking. In >> my experience I had trouble recalling what y-d-u-en.bin was trained on, >> which revision of that corpus, which part or subset, which language, and >> whether it also had other annotations (and respective models) for >> connecting all the possible models from that corpus (e.g. >> sent-tok-pos-chunk-...). >> >> Aliaksandr >> >> On 10 July 2017 at 17:41, Jeff Zemerick wrote: >> >>> +1 to an opennlp-models jar on Maven Central that contains the models. >>> +1 to having the models
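[Editor's note] The "opennlp-models jar on Maven Central" idea discussed throughout the thread would let a build pull a model like any other dependency. A sketch of what that could look like in a POM; the coordinates and version are hypothetical, as no such artifact exists yet:

```xml
<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-models-langdetect</artifactId>
  <version>1.8.1</version>
</dependency>
```

Updating a model would then be a version bump in the POM, which is the workflow Jörn describes for users with dependency management.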