Re: Releasing a Language Detection Model

2017-07-11 Thread William Colen
+1



Re: Releasing a Language Detection Model

2017-07-11 Thread Chris Mattmann
Sounds good to me…




Re: Releasing a Language Detection Model

2017-07-11 Thread Joern Kottmann
1) This is already included today by default in the model; it is also
possible to place more data in it, e.g. a file which contains eval results,
a LICENSE and NOTICE file, etc.

2) I would take a "best effort" approach and only publish one model
per task and data set, unless there are really good reasons to publish
multiple. In the case of langdetect, the perceptron and maxent models
perform almost identically, so there is no need to publish both. We should
probably pick the perceptron model because it is slightly faster. And if
a user disagrees with us - that is totally fine - they can always train
a model themselves with their personal preferences.

All the knowledge on how to train a model should be accessible via
git, and then it is just a matter of running the right command to
start it.

Jörn
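
Since a model package is a zip container, the "extra data" idea in point 1)
can be sketched with nothing but the JDK. The entry names below
(langdetect.model, eval-results.txt) are illustrative only, not the actual
OpenNLP package layout:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ModelPackageDemo {

    /** Writes a model package containing the model plus auxiliary entries. */
    static void writePackage(Path zip, Map<String, String> entries) throws IOException {
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            for (Map.Entry<String, String> e : entries.entrySet()) {
                out.putNextEntry(new ZipEntry(e.getKey()));
                out.write(e.getValue().getBytes());
                out.closeEntry();
            }
        }
    }

    /** Lists the entry names, so a user can see what shipped with the model. */
    static List<String> listEntries(Path zip) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            for (Enumeration<? extends ZipEntry> en = zf.entries(); en.hasMoreElements(); ) {
                names.add(en.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path zip = Files.createTempFile("langdetect", ".bin");
        writePackage(zip, Map.of(
                "langdetect.model", "model-bytes",
                "eval-results.txt", "accuracy measured with 10-fold CV",
                "LICENSE", "Apache License 2.0"));
        System.out.println(listEntries(zip));
    }
}
```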


Re: Releasing a Language Detection Model

2017-07-11 Thread Suneel Marthi
...one last point before wrapping up this discussion. Is it possible that
you could have more than one lang detect model, each trained with a
different algorithm - say 'MaxEnt', 'Naive Bayes', 'Perceptron'?

Questions:

1. Do we just publish one model trained on a specific algorithm? If so,
the metadata would have the algorithm information?

2. Do we publish multiple models for the same task, each trained on a
different algorithm?




Re: Releasing a Language Detection Model

2017-07-11 Thread Joern Kottmann
Hello,

right, very good point, I also think that it is very important to be able
to load a model in one line from the classpath.

I propose we have the following setup:
- One jar contains one or multiple model packages (that's the zip container)
- A model name itself should be unique, e.g. eng-ud-token.bin
- A user loads the model via: new
SentenceModel(getClass().getResource("eng-ud-sent.bin")) <- the stream
then gets closed properly


Let's take away three things from this discussion:
1) Store the data in a place where the community can access it
2) Offer models on our download page, similar to what is done today on
the SourceForge page
3) Release models packed inside a jar file via Maven Central

Jörn
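
The mechanics of the proposal - a model file packed at the root of a jar and
loaded back through the classpath - can be sketched with plain JDK classes.
This is only a sketch of the packaging/loading round trip; the model name
"eng-ud-sent.bin" is the example from above, and the dummy bytes stand in
for a real model:

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

public class ModelJarDemo {

    /** Writes a jar containing a single model entry at the jar root. */
    static Path packModel(Path jar, String entryName, byte[] modelBytes) throws IOException {
        try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jar))) {
            out.putNextEntry(new JarEntry(entryName));
            out.write(modelBytes);
            out.closeEntry();
        }
        return jar;
    }

    /** Reads the model bytes back via a classpath lookup, the way
        new SentenceModel(getClass().getResource("eng-ud-sent.bin")) would. */
    static byte[] loadFromClasspath(Path jar, String entryName) throws IOException {
        try (URLClassLoader cl = new URLClassLoader(new URL[]{jar.toUri().toURL()}, null)) {
            URL res = cl.getResource(entryName);
            if (res == null) throw new FileNotFoundException(entryName);
            try (InputStream in = res.openStream()) {
                return in.readAllBytes();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path jar = Files.createTempFile("opennlp-models", ".jar");
        packModel(jar, "eng-ud-sent.bin", "dummy-model-bytes".getBytes());
        byte[] loaded = loadFromClasspath(jar, "eng-ud-sent.bin");
        System.out.println(loaded.length + " bytes loaded from the classpath");
    }
}
```

In a real build, the jar would simply be a Maven dependency and the
application's own class loader would resolve the resource.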








Re: Releasing a Language Detection Model

2017-07-11 Thread Aliaksandr Autayeu
To clarify on models and jars.

Putting a model inside a jar might not be a good idea. I mean here things like
bla-bla.jar/en-sent.bin. Our models are already zipped, so they are "jars"
already in a sense. We're good. However, the current packaging and metadata
might not be very classpath friendly.

The use case I have in mind is being able to add the needed models as
dependencies and load them by writing a line of code. For this case, having
all models in the root with the same name might not be very convenient. The
same goes for the manifest. The name "manifest.properties" is quite generic,
and it's not too far-fetched to see clashes because some other lib also
manifests something. It might be better to allow for some flexibility and
to adhere to classpath conventions. For example, having manifests in
something like org/apache/opennlp/models/manifest.properties. Or
opennlp/tools/manifest.properties. And perhaps even allowing the manifest to
reference a model, so the model can be put elsewhere. Just in case
there are several custom models of the same kind for different pipelines in
the same app. For example, processing queries with one pipeline - one set
of models - and processing documents with another pipeline - another set of
models. In this case allowing for different classpaths is needed.

Perhaps to illustrate my thinking, something like this (which still keeps a
lot of possibilities open):
en-sent.bin/opennlp/tools/sentdetect/manifest.properties (perhaps contains
a line with something like model =
/opennlp/tools/sentdetect/model/sent.model)
en-sent.bin/opennlp/tools/sentdetect/model/sent.model

This allows including en-sent.bin as dependency. And then doing something
like
SentenceModel sdm = SentenceModel.getDefaultResourceModel(); // if we want
default models in this way. Seems verbose enough to allow for some safety
through explicitness. That's if we want any defaults at all.
Or something like:
SentenceModel sdm =
SentenceModel.getResourceModel("/opennlp/tools/sentdetect/manifest.properties");
Or
SentenceModel sdm =
SentenceModel.getResourceModel("/opennlp/tools/sentdetect/model/sent.model");
Or more in-line with a current style:
SentenceModel sdm = new
SentenceModel("/opennlp/tools/sentdetect/model/sent.model"); // though here
we commit to interpreting String as classpath reference. That's why I'd
prefer more explicit method names.
Or leave dealing with resources to the users, leave current code intact and
provide only packaging and distribution:
SentenceModel sdm = new
SentenceModel(this.getClass().getResourceAsStream("/.../.../manifest or
model"));
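
The manifest-indirection idea above can be sketched with java.util.Properties.
Everything here is hypothetical: the "model" key and the
/opennlp/tools/sentdetect/... paths are the proposal's examples, not an
existing OpenNLP API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class ManifestDemo {

    /** Reads a per-component manifest and returns the model path it references. */
    static String resolveModelPath(InputStream manifest) throws IOException {
        Properties props = new Properties();
        props.load(manifest);
        String model = props.getProperty("model");
        if (model == null) {
            throw new IOException("manifest has no 'model' entry");
        }
        return model;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for /opennlp/tools/sentdetect/manifest.properties on the classpath.
        String manifest = "model = /opennlp/tools/sentdetect/model/sent.model\n";
        InputStream in = new ByteArrayInputStream(manifest.getBytes());
        System.out.println(resolveModelPath(in));
    }
}
```

A loader like the proposed SentenceModel.getResourceModel(...) could resolve
the manifest first and then open the referenced model resource, letting
different pipelines keep their models at different classpath locations.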


And we could also add to the model metadata F1/accuracy (at least CV-based,
for example 10-fold) for quick reference, or a quick understanding of what
that model is capable of. Could be helpful for those with a bunch of models
around, and for others as well, to have better insight into the model in
question.




Re: Releasing a Language Detection Model

2017-07-11 Thread Chris Mattmann
Hi,

FWIW, I’ve seen CLI tools – lots in my day – that can load a model from the
CLI to override an internal classpath dependency. This is for people in
environments who want a sensible, delivered internal classpath default plus
the ability to override it at run time without messing with zipped-up JAR
files. Think about people who are using OpenNLP in both Java and Python
environments as an example.

Cheers,
Chris
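
The override pattern described here is small to sketch: an explicit CLI
argument wins, otherwise the tool falls back to a bundled classpath default.
The resource name "eng-ud-sent.bin" is just the example used elsewhere in
this thread:

```java
public class ModelSourceDemo {

    static final String DEFAULT_RESOURCE = "eng-ud-sent.bin";

    /** Decides which model source a CLI tool would use. */
    static String modelSource(String[] args) {
        if (args.length > 0) {
            return "file:" + args[0];           // explicit model path wins
        }
        return "classpath:" + DEFAULT_RESOURCE; // bundled default otherwise
    }

    public static void main(String[] args) {
        System.out.println(modelSource(args));
    }
}
```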




On 7/11/17, 3:25 AM, "Joern Kottmann"  wrote:


On Mon, Jul 10, 2017 at 8:50 PM, William Colen  wrote:
> We need to address things such as sharing the evaluation results and how
> to reproduce the training.
>
> There are several possibilities for that, but there are points to consider:
>
> Will we store the model itself in an SCM repository, or only the code that
> can build it?
> Will we deploy the models to the Maven Central repository? It is good for
> people using the Java API but not for the command line interface; should we
> change the CLI to handle models in the classpath?
> Should we keep a copy of the training corpus, or always download it from
> the original provider? We can't guarantee that the corpus will be there
> forever, not only because it changed license, but simply because the
> provider is not keeping the server up anymore.
>
> William
>
>
>
> 2017-07-10 14:52 GMT-03:00 Joern Kottmann :
>
>> Hello all,
>>
> >> since Apache OpenNLP 1.8.1 we have a new language detection component
> >> which, like all our components, has to be trained. I think we should
> >> release a pre-built model for it, trained on the Leipzig corpus. This
> >> will allow the majority of our users to get started very quickly with
> >> language detection without the need to figure out how to train it.
>>
>> How should this project release models?
>>
>> Jörn
>>





Re: Releasing a Language Detection Model

2017-07-11 Thread Joern Kottmann
I would not change the CLI to load models from jar files. I have never
used or seen a command line tool that expects a file as input and
then also loads it from inside a jar file. It would be hard to
communicate precisely how that works in the CLI usage texts, and it
is not a feature anyone would expect to be there. The intention of the
CLI is to give users the ability to quickly test OpenNLP before they
integrate it into their software, and to train and evaluate models.

Users who for some reason have a jar file with a model inside can just
write "unzip model.jar".

All in all, I think this is quite a bit of complexity we would need to
add, and it would have very limited use.

The use case for publishing jar files is to make the models easily
available to people who have a build system with dependency
management: they won't have to download models manually, and when they
update OpenNLP they can also update the models with a version string
change.
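
As an editor's aside, the classpath-loading mechanism this relies on can be sketched with plain JDK resource resolution. The model name `eng-ud-sent.bin` is the naming proposal from earlier in this thread and is hypothetical; in real code the stream would be handed to a model constructor such as `new SentenceModel(...)`, which this self-contained sketch deliberately avoids:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Sketch: resolving a model that ships inside a jar on the classpath.
// "eng-ud-sent.bin" is the naming proposal from this thread, not a
// published artifact.
public class ClasspathModelLoader {

    // Returns the raw bytes of a classpath resource, or throws if absent.
    // try-with-resources ensures the stream is closed properly.
    static byte[] readResource(String name) {
        try (InputStream in = ClasspathModelLoader.class.getResourceAsStream(name)) {
            if (in == null) {
                throw new IllegalArgumentException("resource not found on classpath: " + name);
            }
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "/java/lang/Object.class" exists on every JRE, so the demo is
        // self-contained; a released model jar would expose a resource
        // like "/eng-ud-sent.bin" in exactly the same way.
        byte[] bytes = readResource("/java/lang/Object.class");
        System.out.println("loaded " + bytes.length + " bytes from the classpath");
    }
}
```

A model published this way is picked up automatically once the jar is on the classpath, which is what makes the dependency-management route convenient.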

For the command line "quick start" use case we should offer the models
on a download page as we do today. This page could list both the
download link and the Maven dependency.
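
Such a download-page entry might pair the link with a snippet like the one below. The coordinates are purely hypothetical, since the thread had not yet settled on a groupId, artifactId, or versioning scheme:

```xml
<!-- Hypothetical coordinates; the actual groupId/artifactId/version
     were still under discussion in this thread. -->
<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-models-langdetect</artifactId>
  <version>1.8.1</version>
</dependency>
```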

Jörn

On Mon, Jul 10, 2017 at 8:50 PM, William Colen  wrote:
> We need to address things such as sharing the evaluation results and how to
> reproduce the training.
>
> There are several possibilities for that, but there are points to consider:
>
> Will we store the model itself in a SCM repository or only the code that
> can build it?
> Will we deploy the models to the Maven Central repository? It is good for
> people using the Java API but not for the command line interface; should we
> change the CLI to handle models in the classpath?
> Should we keep a copy of the training data or always download it from the
> original provider? We can't guarantee that the corpus will be there
> forever, not only because its license changed, but simply because the
> provider is no longer keeping the server up.
>
> William
>
>
>
> 2017-07-10 14:52 GMT-03:00 Joern Kottmann :
>
>> Hello all,
>>
>> since Apache OpenNLP 1.8.1 we have a new language detection component
>> which like all our components has to be trained. I think we should
>> release a pre-built model for it trained on the Leipzig corpus. This
>> will allow the majority of our users to get started very quickly with
>> language detection without the need to figure out how to train it.
>>
>> How should this project release models?
>>
>> Jörn
>>


Re: Releasing a Language Detection Model

2017-07-11 Thread Joern Kottmann
I am also not for default models. We are a library, and people use it
inside other software products; that is the place where meaningful
defaults can be defined. Maybe our language model works very well: you
take it, hard-code it, and forget about it for the next couple of
years. Or it doesn't work, and you train your own set of models and
swap them depending on your input data source.

And then there are solutions out there that people can use to define
configuration for their software projects, such as Spring or Typesafe
Config, and probably something new one day. I am +1 for ensuring that
OpenNLP is easy to use with the most common ones and for accepting PRs
that increase ease of use.

Jörn

On Tue, Jul 11, 2017 at 3:45 AM,   wrote:
> +1 for releasing models
>
> as for the rest, not sure how I feel. Is there just one model for the
> Language Detector? I don’t want this to become a versioning issue:
> langDect.bin version 1 goes with 1.8.1, but 2 goes with 1.8.2. Can anyone
> download the Leipzig corpus? Being able to reproduce the model is very
> powerful, because if you have additional data you can add it to the Leipzig
> corpus to improve your model.
>
> I am not a big fan of default models, because it is frustrating as a user
> when unexpected things happen (like if you think you are telling it to use
> your model, but it uses the default). However, if the code is verbose
> enough, this is really not a valid concern. I would want to see the use
> case develop.
> Daniel
>
>
>> On Jul 10, 2017, at 8:58 PM, Aliaksandr Autayeu  
>> wrote:
>>
>> Great idea!
>>
>> +1 for releasing models.
>>
>> +1 to publish models in jars on Maven Central. This is the fastest way to
>> have somebody started. Moreover, having an extensible mechanism for others
>> to do it on their own is really helpful. I did this with extJWNL for
>> packaging WordNet data files. It is also convenient for packaging own
>> custom dictionaries and providing them via repositories. It reuses existing
>> infrastructure for things like versioning and distribution. Model metadata
>> has to be thought through though. Oh, what a mouthful...
>>
>> +1 for separate download ("no dependency manager" cases)
>>
>> +1 to publish data/scripts/provenance. The more reproducible it is, the
>> better.
>>
>> +1 for some mechanism of loading models from classpath.
>>
>> ~ +1 to maybe explore classpath for a "default" model for API (code) use
>> cases. Perhaps similarly to Dictionary.getDefaultResourceInstance() from
>> extJWNL. But this has to be well thought through as design mistakes here
>> might release some demons from jar hell. I didn't face it, but I'm not sure
>> the extJWNL design is best as I didn't do much research on alternatives.
>> And I'd think twice before adding model jars to main binary distribution.
>>
>> +1 to store only the model-building-code in SCM repo. I would not bloat the
>> SCM with binaries. Maven repositories though not ideal, are better for this
>> than SCM (and there specialized tools like jFrog).
>>
>> ~ -1 about changing the CLI to use models from the classpath. There was no
>> proposal, but my understanding is that it would be some sort of
>> classpath:// URL - please correct or clarify. I'd like to see the proposal
>> and use cases where it is more convenient than the current way of just
>> pointing to the file. Perhaps it depends. Our models are already zips with
>> manifests, and jars are zips too. Perhaps we could change the model
>> packaging layout to make it more "jar-like", or augment it with metadata
>> for finding default models on the classpath for the above cases of
>> distributing through Maven repositories and loading from code, while
>> leaving the CLI as is: even if your model is technically on the classpath,
>> in most cases you can point to a jar in the file system, so the CLI can
>> stay like it is now. It seems that dealing with the classpath is more
>> suitable (convenient, safer, customary, ...) for developers fiddling with
>> code than for users fiddling with the command line.
>>
>> +1 for mirroring source corpora. The more reproducible things are, the
>> better. But costs (infrastructure) and licenses (this looks like
>> redistribution, which is not always allowed) might be an issue.
>>
>> I'd also propose to augment model metadata with (optional) information
>> about source corpora, provenance, as much reproduction information as
>> possible, etc. Mostly for easier reproduction and provenance tracking. In
>> my experience I had challenges recalling what y-d-u-en.bin was trained on,
>> on which revision of that corpus, which part or subset, which language,
>> and whether it also had other annotations (and respective models), for
>> connecting all the possible models from that corpus (e.g.
>> sent-tok-pos-chunk-...).
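
[Editor's illustration] The kind of provenance metadata described here could look like the fragment below, assuming a properties-style manifest as OpenNLP models already use. Every key and value is invented for illustration; none is an existing OpenNLP manifest field:

```properties
# Hypothetical provenance entries for a model manifest.
Corpus.Name=Leipzig Corpora Collection
Corpus.Revision=2016-snapshot
Corpus.Subset=news, 1M sentences per language
Training.OpenNLP-Version=1.8.1
Training.Command=opennlp LanguageDetectorTrainer ...
Training.Languages=eng, deu, fra, ...
```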
>>
>> Aliaksandr
>>
>> On 10 July 2017 at 17:41, Jeff Zemerick  wrote:
>>
>>> +1 to an opennlp-models jar on Maven Central that contains the models.
>>> +1 to having the models