Re: OpenNLP UD models

2021-01-23 Thread Jeff Zemerick
Thanks!

I think the evaluation results are also in that Dropbox folder. I will
double-check to be sure.

I don't think the Named Entity Finder currently supports the CoNLL-U format
that UD uses. I think we need to add CoNLL-U support to the Named Entity
Finder, and then we can train those models.
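
For readers unfamiliar with the format: a CoNLL-U file has one token per line with ten tab-separated fields, comment lines starting with "#", and a blank line between sentences. The sketch below is only an illustration of that layout, not OpenNLP's own reader. It pulls out the FORM and UPOS columns and prints them in the word_TAG, one-sentence-per-line style that OpenNLP's POS training data uses. The function names and the sample sentence are made up for this example.

```python
from typing import List, Tuple

# The ten CoNLL-U columns, in order, as defined by Universal Dependencies.
CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# Hypothetical sample sentence, written by hand for this sketch.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
)


def read_sentences(text: str) -> List[List[Tuple[str, str]]]:
    """Return one list of (FORM, UPOS) pairs per sentence."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            continue  # not a well-formed token line
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        current.append((cols[1], cols[3]))
    if current:
        sentences.append(current)
    return sentences


def to_pos_line(sentence: List[Tuple[str, str]]) -> str:
    """Format one sentence as 'word_TAG word_TAG ...'."""
    return " ".join(f"{form}_{tag}" for form, tag in sentence)


if __name__ == "__main__":
    for sentence in read_sentences(SAMPLE):
        print(to_pos_line(sentence))
```

Nothing in these ten columns is a dedicated named-entity field, which is part of why the Named Entity Finder needs its own handling.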

I picked those languages mostly at random. I wanted languages that I
thought might appeal to the most users for a first release. We can
certainly expand the models to each of the other languages in the future,
assuming those languages have sufficient training data to make a decent
model.

Thanks,
Jeff




Re: OpenNLP UD models

2021-01-18 Thread William Colen
Hello Jeff! Nice work!!

Did you store the evaluation results somewhere?

Does UD have Named Entity annotation? Do you have any reference to share?

Why did you select only these languages? Any restrictions?

Thank you
William



Re: OpenNLP UD models

2021-01-17 Thread Jeff Zemerick
Thanks, Bruno.

If there aren't any major concerns I will kick off a VOTE thread for
releasing these models.

The overall plan is to:

1. Release these models by making them available for download on the
website.
2. Submit the pull request to enable automatic downloading for the
tokenizer, sentence, and POS tagger models.
3. Update the user's guide and release a new version.
4. Get NameFinder models trained and available.
5. Establish a more automated and documented process for training the
models.

Always open to suggestions and comments! Otherwise watch for a VOTE
thread over the next few days.

Thanks,
Jeff




Re: OpenNLP UD models

2021-01-06 Thread Bruno P. Kinoshita
 Hi Jeff,

Cannot comment much on the process or direction, except that it looks good to 
me.

> While decent performance is always beneficial, the primary purpose
> of this task is to provide working OpenNLP models the project can
> distribute. Having these models will help reduce the barrier to entry for
> users new to OpenNLP.

+1! I had a read of the UD page; it looks well maintained and even includes a
pt-br dataset.

Thanks!
Bruno






OpenNLP UD models

2021-01-05 Thread Jeff Zemerick
Hi all,

I have created a script [1] that trains OpenNLP models from Universal
Dependencies [2] data, giving the project models that can be distributed
under the Apache license.

The script automates the training of tokenizer, sentence, and POS models
for English, Dutch, French, German, and Italian. (The NameFinder does not
currently support the input annotation format so those models will come
later.) While decent performance is always beneficial, the primary purpose
of this task is to provide working OpenNLP models the project can
distribute. Having these models will help reduce the barrier to entry for
users new to OpenNLP.
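
As a rough illustration of where the training data for those five languages lives, the sketch below searches an extracted UD release for training files. It assumes the usual UD layout of UD_<Language>-<Treebank>/<code>-ud-train.conllu; the function name is hypothetical, and which treebank ud-train.sh actually uses for each language is not stated here.

```python
from pathlib import Path
from typing import Dict, List, Sequence

# Languages named in this message.
LANGUAGES = ["English", "Dutch", "French", "German", "Italian"]


def find_training_files(ud_root: Path,
                        languages: Sequence[str] = LANGUAGES) -> Dict[str, List[Path]]:
    """Map each language to the *-ud-train.conllu files found under ud_root.

    Assumes the conventional UD directory naming (UD_<Language>-<Treebank>).
    A language with no matching treebank maps to an empty list.
    """
    found: Dict[str, List[Path]] = {}
    for language in languages:
        pattern = f"UD_{language}-*/*-ud-train.conllu"
        found[language] = sorted(ud_root.glob(pattern))
    return found


if __name__ == "__main__":
    # Hypothetical path; point this at wherever the UD release was extracted.
    for language, files in find_training_files(Path("ud-treebanks")).items():
        print(language, [str(f) for f in files])
```

A language can have several treebanks, so a list per language is returned and the choice is left to the caller.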

Once voted and approved, the trained models will be pushed to Subversion
alongside the current OpenNLP language detection model. From there, the
models can be made available for download on the OpenNLP website and
programmatically through OPENNLP-1318 [3]. The script to train the models
and instructions will be added to the OpenNLP repository.

To use the script:

1. Download and extract UD.
2. Download and extract OpenNLP.
3. Create a directory to store the trained models.
4. Modify the ud-train.sh script to set the path to those three directories.
5. Execute the ud-train.sh script.
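
The steps above could be wrapped in a small pre-flight check, sketched here under assumptions: the function names and example paths are hypothetical, and the script is assumed to read its paths from edits made inside ud-train.sh rather than from arguments. The check creates the output directory and reports any input directory that is missing before the script is launched.

```python
import subprocess
from pathlib import Path
from typing import List


def prepare(ud_dir: Path, opennlp_dir: Path, output_dir: Path) -> List[Path]:
    """Create the output directory and return the input directories that are missing."""
    output_dir.mkdir(parents=True, exist_ok=True)
    return [d for d in (ud_dir, opennlp_dir) if not d.is_dir()]


def run_training(script: Path) -> int:
    """Run the training script with bash and return its exit code."""
    return subprocess.run(["bash", str(script)], check=False).returncode


if __name__ == "__main__":
    # Hypothetical locations; adjust to match the paths set inside ud-train.sh.
    missing = prepare(Path("ud-treebanks"), Path("apache-opennlp"), Path("models"))
    if missing:
        print("Missing directories:", ", ".join(str(m) for m in missing))
    else:
        raise SystemExit(run_training(Path("ud-train.sh")))
```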

The training log, evaluation output, and model files will be saved to the
$OUTPUT_MODELS directory. The models I trained using this script, along with
their output files, can be viewed on Dropbox [4].

Before calling a vote to release the models, I would like to see if there
is any feedback on the process or direction. If you have any comments,
please feel free to share them.

Thanks,
Jeff

[1] https://github.com/jzonthemtn/opennlp/blob/ud-models/scripts/ud-train.sh
[2] https://universaldependencies.org/
[3] https://issues.apache.org/jira/browse/OPENNLP-1318
[4] https://www.dropbox.com/sh/p8focuz0qwvw84b/AAC6GqO8mqZn_xkAqHZsVAsoa?dl=0