Re: Automated testing with public data
Since OpenNLP is cross-platform and Java-based, something that is itself cross-platform and Java-based might be a better fit than wget. I'm using Ant scripts for such tasks.

-- Richard

On 29.04.2015, at 17:11, William Colen william.co...@gmail.com wrote: [...]
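An Ant-based bootstrap along these lines could use Ant's built-in get task. The following buildfile fragment is a minimal sketch; the corpus URL and file names are illustrative placeholders, not real corpus locations, and skipexisting provides the simple caching discussed elsewhere in this thread.

```xml
<!-- Minimal sketch: bootstrap a corpus folder with Ant's built-in <get> task.
     The URL and file names are illustrative placeholders. -->
<project name="corpus-bootstrap" default="fetch-corpora">
  <property name="corpus.dir" location="${user.home}/opennlp-data"/>
  <target name="fetch-corpora">
    <mkdir dir="${corpus.dir}"/>
    <!-- skipexisting gives simple caching: existing files are not re-downloaded -->
    <get src="http://example.org/corpora/conll06.tar.gz"
         dest="${corpus.dir}/conll06.tar.gz"
         skipexisting="true"/>
  </target>
</project>
```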
Re: Automated testing with public data
Well, Ant is still an extra dependency, though better than wget. Something like Wagon in Maven?

On 30 April 2015 at 11:02, Richard Eckart de Castilho richard.eck...@gmail.com wrote: [...]
Re: Automated testing with public data
Or we just make a download script which bootstraps the user's corpus folder. Could be a couple of wget lines or so ...

Jörn

On Wed, Apr 29, 2015 at 6:17 AM, William Colen william.co...@gmail.com wrote: [...]
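Such a "couple of wget lines" might look like the sketch below. The corpus URLs and the environment variable name are illustrative placeholders; the existence check before each download gives the caching William asks for, so nothing is fetched twice.

```shell
#!/bin/sh
# Minimal sketch of a corpus bootstrap script. The URLs are placeholders,
# not real corpus locations; OPENNLP_CORPUS_DIR mirrors the property name
# proposed for the tests.
CORPUS_DIR="${OPENNLP_CORPUS_DIR:-$HOME/opennlp-data}"
mkdir -p "$CORPUS_DIR"

for url in \
    http://example.org/corpora/conll06.tar.gz \
    http://example.org/corpora/leipzig-sentences.txt ; do
  f="$CORPUS_DIR/$(basename "$url")"
  if [ -f "$f" ]; then
    # Simple caching: a previously downloaded file is not fetched again.
    echo "cached: $f"
  else
    wget -q -O "$f" "$url" || { echo "download failed: $url"; rm -f "$f"; }
  fi
done
echo "corpus dir ready: $CORPUS_DIR"
```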
Re: Automated testing with public data
+1 The script would also be great for documentation.

2015-04-29 11:15 GMT-03:00 Joern Kottmann kottm...@gmail.com: [...]
Re: Automated testing with public data
Automating the download would be fine as long as we cache it, as Richard suggested. Maybe it could be done by a script that prepares the environment, rather than being part of the unit test itself. In any case, it would be a good idea to save the data somewhere, because we never know whether some of these websites will become unavailable in the future.

2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho richard.eck...@gmail.com: [...]
Re: Automated testing with public data
On 15.04.2015, at 09:39, Joern Kottmann kottm...@gmail.com wrote:
> Some data sets are publicly available but protected by copyright and just can't be redistributed in any way. For this data we could get/buy a license and maybe restrict access to it among the committers.

That's what I'm saying ;) If you automatically download the data to a personal workstation during tests, you do not redistribute the data.

For Jenkins builds, I just checked the Apache Jenkins and the workspace does not seem to be publicly accessible. So data downloaded during tests there is also not made publicly available (redistributed); it is only accessible to Apache developers who are logged in.

IMHO only truly proprietary data that is not publicly accessible should be a problem, no?

-- Richard
Re: Automated testing with public data
On 15.04.2015, at 10:23, Joern Kottmann kottm...@gmail.com wrote:
> With publicly accessible data I mean a corpus you can somehow acquire, as opposed to the data you create on your own for a project. All the corpora we support in the formats package are publicly accessible. Maybe some you have to buy, and for others you just have to sign some agreement.
> A very interesting corpus for testing (and training models on) is OntoNotes. Here is a link to the LDC entry: https://catalog.ldc.upenn.edu/LDC2011T03
> You can get it for free (or for a small distribution fee), but you can't just download it. It would be great if the ASF could acquire this data set so we can share it among the committers. Is that what you mean with proprietary data?

Yes, that is what I mean. E.g. the TIGER corpus requires clicking through some pages and forms to reach a download page, but in principle it appears as if the corpus were simply downloadable via a deep-link URL. The license terms state that the corpus must not be redistributed.

Some tools are also publicly accessible and downloadable but not redistributable. For example, anybody can download TreeTagger and its models, but only from the original homepage. It is not permitted to redistribute it, i.e. to publish it to a repository or offer it on an alternative homepage.

So there is a (small) class of resources between being redistributable and proprietary (for a fee), namely being in principle publicly accessible (for free) but not redistributable.

Cheers,

-- Richard
Automated testing with public data
Hi all,

this time the progress with the testing for 1.6.0 is rather slow. Most tests are done now and I believe we are in good shape to build RC3. Anyway, it would have been better to be at that stage a month ago.

To improve the situation in the future, I would like to propose automating all tests which can be run against data that is publicly available. These tests are all set up following the same pattern: they train a component on a corpus and afterwards evaluate against it. If the results match the results of the previous release, we hope the code doesn't contain any regressions. In some cases we have changes which influence the performance (e.g. bug fixes); in that case we adjust the expected performance score and carefully verify that a particular change caused it. We sometimes have changes which shouldn't influence the performance of a component but still do, due to some mistake. These we need to identify during testing.

The big issue we have with testing against public data is that we usually can't include the data in the OpenNLP release because of its license. And today we just do all the work manually by training on a corpus and afterwards running the built-in evaluation against the model.

I suggest we write JUnit tests which do this in case the user has the right corpus for the test. Those tests will be disabled by default and can be run by providing the -Dtest property and the location of the data directory. For example:

mvn test -Dtest=Conll06* -DOPENNLP_CORPUS_DIR=/home/admin/opennlp-data

The tests will do all the work and fail if the expected results don't match. Automating those tests has the great advantage that we can run them much more frequently during the development phase and hopefully identify bugs before we even start with the release process. Additionally, we might be able to run them on our build server.

Any opinions?

Jörn
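A test following this pattern could guard itself on the corpus-directory property, so it is skipped rather than failed when no corpus is configured. The sketch below shows only that guard logic as plain Java: in a real JUnit test the guard would be JUnit's Assume mechanism, and trainAndEvaluate would call the component's actual trainer and evaluator. The property name OPENNLP_CORPUS_DIR comes from the proposal; the class name, score, and helper are illustrative placeholders.

```java
// Sketch of the guard an automated corpus test could use. Only the property
// name OPENNLP_CORPUS_DIR is from the proposal; everything else is illustrative.
import java.io.File;

public class Conll06TrainEvalSketch {

    // Expected score from the previous release; deliberately adjusted when a
    // change is known to affect performance (illustrative value).
    static final double EXPECTED_ACCURACY = 0.85;
    static final double TOLERANCE = 0.0001;

    public static void main(String[] args) {
        String corpusDir = System.getProperty("OPENNLP_CORPUS_DIR");
        if (corpusDir == null || !new File(corpusDir).isDirectory()) {
            // In JUnit this would be Assume.assumeTrue(...), so the test is
            // reported as skipped, not failed, when no corpus is configured.
            System.out.println("SKIPPED: no corpus directory configured");
            return;
        }
        // A real test would train the component on the corpus here and run
        // the built-in evaluator against the resulting model.
        double accuracy = trainAndEvaluate(corpusDir);  // hypothetical helper
        if (Math.abs(accuracy - EXPECTED_ACCURACY) > TOLERANCE) {
            throw new AssertionError("possible regression: got " + accuracy);
        }
        System.out.println("PASSED");
    }

    // Placeholder for the train/evaluate cycle described in the proposal;
    // stubbed so the sketch runs standalone.
    static double trainAndEvaluate(String corpusDir) {
        return EXPECTED_ACCURACY;
    }
}
```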