Re: Re: English 300k sentences Leipzig Corpus for test
Hi,

I could not find a way to convert from Leipzig to formats other than the DocCat sample. Is it possible to convert from Leipzig to SentenceSample using the OpenNLP tools?

Thank you,
William

On Thu, Mar 14, 2013 at 9:51 AM, Jörn Kottmann kottm...@gmail.com wrote:
> -------- Original Message --------
> Subject: Re: English 300k sentences Leipzig Corpus for test
> Date: Thu, 14 Mar 2013 09:48:21 -0300
> From: William Colen william.co...@gmail.com
> To: Jörn Kottmann kottm...@gmail.com
>
> Yes, you can forward. It is not clear to me how to convert it. I could only find converters from Leipzig to DocCat.
>
> On Thu, Mar 14, 2013 at 6:09 AM, Jörn Kottmann kottm...@gmail.com wrote:
>> Do you mind if I forward this to the dev list?
>>
>> Yes, you need to convert the data into input data. The idea is that we process the data with 1.5.2 and 1.5.3 and see if the output is still identical; if it's not identical, it's either a change in our code or a bug.
>>
>> It doesn't really matter which file you download as long as it has enough sentences. It would be nice if you could note in the test plan which one you used.
>>
>> Hopefully I will have some time over the weekend to do the tests on the private data I have.
>>
>> Jörn
>>
>> On 03/13/2013 11:38 PM, William Colen wrote:
>>> Hi, Jörn,
>>>
>>> I would like to start testing with the Leipzig Corpus. Do you know the steps to do it? I downloaded the file named eng_news_2010_300K-text.tar.gz, and now I would use the converter to extract documents from it. After that, I would try to use the output of one module as input to the next. Is that correct?
>>>
>>> Thank you,
>>> William
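The regression check Jörn describes (run the same pipeline under 1.5.2 and 1.5.3, then compare the outputs byte for byte) can be sketched with plain `diff`. All file names here are made up, and the two output files are simulated so the snippet is self-contained; in the real test they would come from running the same OpenNLP tool from each distribution:

```shell
# Simulate two pipeline outputs (in the real test these would be produced
# by the 1.5.2 and 1.5.3 distributions processing the same input file).
printf 'The dog barked .\n' > out-1.5.2.txt
printf 'The dog barked .\n' > out-1.5.3.txt

# Byte-for-byte comparison; any difference means a code change or a bug.
if diff -q out-1.5.2.txt out-1.5.3.txt > /dev/null; then
    echo "identical"
else
    echo "outputs differ: investigate"
fi
```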
Re: English 300k sentences Leipzig Corpus for test
If I remember correctly the file already has one sentence per line. I used the tokenizer to tokenize it, and the POS Tagger to POS tag it. After you have done that, you have input files for all the tools.

Maybe you need to remove the sentence id at the beginning, e.g. with sed. Anyway, you can also leave it there; it doesn't really matter for this test.

Jörn

On 03/14/2013 03:45 PM, William Colen wrote:
> Hi,
>
> I could not find a way to convert from Leipzig to formats other than the DocCat sample. Is it possible to convert from Leipzig to SentenceSample using the OpenNLP tools?
>
> Thank you,
> William
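The id-stripping step can be sketched like this. The Leipzig sentences files put a numeric sentence id, a tab, then the text on each line; the sample data and file names below are made up, and the OpenNLP commands in the trailing comments assume the 1.5.x CLI layout and standard English models:

```shell
# Fake two-line Leipzig-style sample: "id<TAB>sentence" per line.
printf '1\tThe dog barked .\n2\tIt rained today .\n' > leipzig-sample.txt

# Strip the leading numeric id with sed (GNU sed; \t matches the tab).
sed 's/^[0-9]*\t//' leipzig-sample.txt > sentences.txt
cat sentences.txt

# The cleaned file could then feed the tools one after the other, e.g.
# (hypothetical paths and model names, not from the thread):
#   bin/opennlp TokenizerME en-token.bin < sentences.txt > tokens.txt
#   bin/opennlp POSTagger en-pos-maxent.bin < tokens.txt > tagged.txt
```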
Re: OpenNLP 1.5.3 RC 2 ready for testing
Hi William,

No, I think it will be fine. The problem only shows up in data where back-to-back names are tagged in a sentence. The unfixed prior models could tag them with the wrong type, i.e. both could end up with the same type, such as person, instead of the different types: one person and the other maybe miscellaneous.

Some of the models, especially the combined Name Finder models that contain all the tags, were affected most, since the likelihood of back-to-back tags is higher there. In the English models there were 3 sentences that had improper tags before and now have the correct tags with the fixes. This improved the scores a bit.

It should produce identical models, since the problem was with the output tagging and not with the training of the models.

James

On 3/14/2013 11:00 PM, William Colen wrote:
> Hi, James,
>
> Thank you for the warning. It didn't affect the test with the Leipzig corpus: the outputs from 1.5.2 and 1.5.3 are identical. Do you think we should manually check the output as well?
>
> Thank you,
> William
>
> On Thu, Mar 14, 2013 at 12:09 AM, James Kosin james.ko...@gmail.com wrote:
>> Hi all,
>>
>> Note that we will have some discrepancies in the model performance for some of the tests in the NameFinder models due to OPENNLP-417, which fixes the back-to-back name tags. It should really be limited to the combined name tags, but could also affect others.
>>
>> James
>>
>> On 3/8/2013 9:11 AM, William Colen wrote:
>>> Hi all,
>>>
>>> Our second release candidate is ready for testing. RC1 failed to pass the initial quality check.
>>>
>>> The RC 2 can be downloaded from here:
>>> http://people.apache.org/~colen/releases/opennlp-1.5.3/rc2/
>>>
>>> To use it in a Maven build, set the version for opennlp-tools or opennlp-uima to 1.5.3, and for opennlp-maxent to 3.0.3, and add this URL to your settings.xml file:
>>> https://repository.apache.org/content/repositories/orgapacheopennlp-005/
>>>
>>> The current test plan can be found here:
>>> https://cwiki.apache.org/OPENNLP/testplan153.html
>>> Please sign up for tasks in the test plan.
>>>
>>> The release plan can be found here:
>>> https://cwiki.apache.org/OPENNLP/releaseplanandtasks153.html
>>>
>>> The RC contains quite a few changes; please refer to the contained issue list for details.
>>>
>>> William
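The Maven setup in the announcement can be sketched as below. Only the coordinates (opennlp-tools 1.5.3, opennlp-maxent 3.0.3) and the staging URL come from the thread; the profile and repository ids are made-up placeholders:

```xml
<!-- pom.xml fragment: point the OpenNLP artifacts at the RC versions -->
<dependencies>
  <dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.5.3</version>
  </dependency>
  <dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-maxent</artifactId>
    <version>3.0.3</version>
  </dependency>
</dependencies>

<!-- settings.xml fragment: add the staging repository
     (profile and repository ids here are hypothetical) -->
<profiles>
  <profile>
    <id>opennlp-rc</id>
    <repositories>
      <repository>
        <id>opennlp-staging</id>
        <url>https://repository.apache.org/content/repositories/orgapacheopennlp-005/</url>
      </repository>
    </repositories>
  </profile>
</profiles>
```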