Re: OpenNLP UD models
Thanks!

I think the evaluation results are also in that Dropbox folder. I will double-check to be sure.

I don't think the Named Entity Finder currently supports the CoNLL-U format that UD uses. I think we need to add CoNLL-U support to the Named Entity Finder, and then we can train those models.

I picked those languages mostly at random. I wanted languages that I thought might appeal to the most users for a first release. We can certainly expand the models to the other languages in the future, assuming those languages have sufficient training data to make a decent model.

Thanks,
Jeff

On Mon, Jan 18, 2021 at 12:50 PM William Colen wrote:

> Hello Jeff! Nice work!!
>
> Did you store the evaluation results somewhere?
>
> Does UD have Named Entity annotation? Do you have any reference to share?
>
> Why did you select only these languages? Any restrictions?
>
> Thank you
> William
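As background for the CoNLL-U point in this message: a CoNLL-U token line has ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), and none of them is a dedicated named-entity column. The sketch below is plain Python, not OpenNLP code; the `Token` type, the `parse_token_line` helper, and the sample line are made up for illustration only.

```python
from typing import NamedTuple, Optional

# The ten tab-separated CoNLL-U columns, in order.
CONLLU_FIELDS = (
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
)


class Token(NamedTuple):
    """Hypothetical container for the columns a POS model would care about."""
    token_id: str
    form: str
    upos: str


def parse_token_line(line: str) -> Optional[Token]:
    """Parse one CoNLL-U line; return None for blank lines and '#' comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != len(CONLLU_FIELDS):
        raise ValueError(
            f"expected {len(CONLLU_FIELDS)} columns, got {len(cols)}"
        )
    # Column 0 is ID, column 1 is FORM, column 3 is UPOS.
    return Token(token_id=cols[0], form=cols[1], upos=cols[3])


# Illustrative sample line (not taken from any treebank).
SAMPLE = "1\tJeff\tJeff\tPROPN\tNNP\tNumber=Sing\t2\tnsubj\t_\t_"
```

Calling `parse_token_line(SAMPLE)` yields the token form and its universal POS tag, which is the information the tokenizer and POS trainers need. Name Finder training would need entity spans from some other source.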
Re: OpenNLP UD models
Hello Jeff! Nice work!!

Did you store the evaluation results somewhere?

Does UD have Named Entity annotation? Do you have any reference to share?

Why did you select only these languages? Any restrictions?

Thank you
William

On Sun, Jan 17, 2021 at 21:15, Jeff Zemerick wrote:

> Thanks, Bruno.
>
> If there aren't any major concerns I will kick off a VOTE thread for
> releasing these models.
>
> The overall plan is to:
>
> 1. Release these models by making them available for download on the
>    website.
> 2. Submit the pull request to enable automatic downloading for the
>    tokenizer, sentence, and POS tagger models.
> 3. Update the user's guide and release a new version.
> 4. Get NameFinder models trained and available.
> 5. Establish a more automated and documented process for training the
>    models.
>
> Always open to suggestions and comments! Otherwise watch for a VOTE
> thread over the next few days.
>
> Thanks,
> Jeff
Re: OpenNLP UD models
Thanks, Bruno.

If there aren't any major concerns I will kick off a VOTE thread for releasing these models.

The overall plan is to:

1. Release these models by making them available for download on the website.
2. Submit the pull request to enable automatic downloading for the tokenizer, sentence, and POS tagger models.
3. Update the user's guide and release a new version.
4. Get NameFinder models trained and available.
5. Establish a more automated and documented process for training the models.

Always open to suggestions and comments! Otherwise watch for a VOTE thread over the next few days.

Thanks,
Jeff

On Wed, Jan 6, 2021 at 7:24 PM Bruno P. Kinoshita wrote:

> Hi Jeff,
>
> Cannot comment much on the process or direction, except that it looks good
> to me.
>
> > While decent performance is always beneficial, the primary purpose
> > of this task is to provide working OpenNLP models the project can
> > distribute. Having these models will help reduce the barrier to entry for
> > users new to OpenNLP.
>
> +1! Had a read on the UD page, and looks well maintained, and even
> includes a pt-br dataset.
>
> Thanks!
> Bruno
Re: OpenNLP UD models
Hi Jeff,

Cannot comment much on the process or direction, except that it looks good to me.

> While decent performance is always beneficial, the primary purpose of this task is to provide working OpenNLP models the project can distribute. Having these models will help reduce the barrier to entry for users new to OpenNLP.

+1! Had a read on the UD page, and looks well maintained, and even includes a pt-br dataset.

Thanks!
Bruno

On Wednesday, 6 January 2021, 11:31:32 am NZDT, Jeff Zemerick wrote:

> Hi all,
>
> I have created a script [1] to train OpenNLP models from Universal
> Dependencies [2] data to give OpenNLP models that can be distributed under
> the Apache license.
>
> The script automates the training of tokenizer, sentence, and POS models
> for English, Dutch, French, German, and Italian. (The NameFinder does not
> currently support the input annotation format so those models will come
> later.) While decent performance is always beneficial, the primary purpose
> of this task is to provide working OpenNLP models the project can
> distribute. Having these models will help reduce the barrier to entry for
> users new to OpenNLP.
>
> Once voted and approved, the trained models will be pushed to Subversion
> alongside the current OpenNLP language detection model. From there, the
> models can be made available for download on the OpenNLP website and
> programmatically through OPENNLP-1318 [3]. The script to train the models
> and instructions will be added to the OpenNLP repository.
>
> To use the script:
>
> 1. Download and extract UD.
> 2. Download and extract OpenNLP.
> 3. Create a directory to store the trained models.
> 4. Modify the ud-train.sh script to set the paths to those three
>    directories.
> 5. Execute the ud-train.sh script.
>
> The training log, evaluation output, and model files will be saved to the
> $OUTPUT_MODELS directory. Models and the output files I trained using this
> script can be viewed on Dropbox [4].
>
> Before calling a vote to release the models, I would like to see if there
> is any feedback on the process or direction. If you have any comments
> please feel free to share them.
>
> Thanks,
> Jeff
>
> [1] https://github.com/jzonthemtn/opennlp/blob/ud-models/scripts/ud-train.sh
> [2] https://universaldependencies.org/
> [3] https://issues.apache.org/jira/browse/OPENNLP-1318
> [4] https://www.dropbox.com/sh/p8focuz0qwvw84b/AAC6GqO8mqZn_xkAqHZsVAsoa?dl=0
OpenNLP UD models
Hi all,

I have created a script [1] to train OpenNLP models from Universal Dependencies [2] data to give OpenNLP models that can be distributed under the Apache license.

The script automates the training of tokenizer, sentence, and POS models for English, Dutch, French, German, and Italian. (The NameFinder does not currently support the input annotation format so those models will come later.) While decent performance is always beneficial, the primary purpose of this task is to provide working OpenNLP models the project can distribute. Having these models will help reduce the barrier to entry for users new to OpenNLP.

Once voted and approved, the trained models will be pushed to Subversion alongside the current OpenNLP language detection model. From there, the models can be made available for download on the OpenNLP website and programmatically through OPENNLP-1318 [3]. The script to train the models and instructions will be added to the OpenNLP repository.

To use the script:

1. Download and extract UD.
2. Download and extract OpenNLP.
3. Create a directory to store the trained models.
4. Modify the ud-train.sh script to set the paths to those three directories.
5. Execute the ud-train.sh script.

The training log, evaluation output, and model files will be saved to the $OUTPUT_MODELS directory. Models and the output files I trained using this script can be viewed on Dropbox [4].

Before calling a vote to release the models, I would like to see if there is any feedback on the process or direction. If you have any comments please feel free to share them.

Thanks,
Jeff

[1] https://github.com/jzonthemtn/opennlp/blob/ud-models/scripts/ud-train.sh
[2] https://universaldependencies.org/
[3] https://issues.apache.org/jira/browse/OPENNLP-1318
[4] https://www.dropbox.com/sh/p8focuz0qwvw84b/AAC6GqO8mqZn_xkAqHZsVAsoa?dl=0
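The setup steps in this message can be checked before launching a long training run. The following is a minimal sketch, assuming only what the message states (three directories, five languages, three model types). It does not call ud-train.sh or any OpenNLP command, and the function name `preflight` and the `language/type` labels are hypothetical.

```python
from pathlib import Path
from typing import List

# Languages and model types named in the message above.
LANGUAGES = ["English", "Dutch", "French", "German", "Italian"]
MODEL_TYPES = ["tokenizer", "sentence", "pos"]


def preflight(ud_dir: Path, opennlp_dir: Path, output_dir: Path) -> List[str]:
    """Check the three directories and list the models a full run would train.

    Raises FileNotFoundError if the extracted UD or OpenNLP directory is
    missing. Creates the output directory if it does not exist yet.
    """
    for name, path in (("UD", ud_dir), ("OpenNLP", opennlp_dir)):
        if not path.is_dir():
            raise FileNotFoundError(f"{name} directory not found: {path}")
    # The directory that will hold the trained models, logs, and evaluations.
    output_dir.mkdir(parents=True, exist_ok=True)
    # One label per (language, model type) pair: 5 x 3 = 15 entries.
    return [f"{lang}/{kind}" for lang in LANGUAGES for kind in MODEL_TYPES]
```

For example, `preflight(Path("ud"), Path("apache-opennlp"), Path("models"))` (all three paths are placeholders) either raises for a missing input directory or returns the fifteen language/model combinations to expect in the output directory.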