Thanks for sharing. This probably applies to a subset of the data we can use. And all the data we can somehow get into Subversion we should definitely check in.
Some data sets are publicly available but protected by copyright and just
can't be redistributed in any way. For this data we could get/buy a license
and maybe restrict access to it among the committers.

Jörn

On Tue, Apr 14, 2015 at 11:53 PM, Richard Eckart de Castilho <
richard.eck...@gmail.com> wrote:

> If the unit tests automatically download publicly accessible test data,
> run the tests, and optionally delete the data afterwards, then the
> test data does not have to be redistributed. Instead of deleting,
> it might even be a good idea to cache the data to a) avoid hammering
> the remote source and b) still have a local copy in case the source
> fails.
>
> I believe several cases have been discussed on the legal mailing list
> where non-essential or test-only resources that were not part of the
> release could be under licenses that would not be deemed compatible
> with the Apache license. My understanding is that the release needs
> to be untainted and the downstream users must be able to trust that
> they incur no license restrictions beyond the ASL.
>
> Cheers,
>
> -- Richard
>
> On 14.04.2015, at 23:47, Joern Kottmann <kottm...@gmail.com> wrote:
>
> > Hi all,
> >
> > this time the progress with the testing for 1.6.0 is rather slow.
> > Most tests are done now and I believe we are in good shape to build
> > RC3. Anyway, it would have been better to be at that stage months ago.
> >
> > To improve the situation in the future I would like to propose to
> > automate all tests which can be run against data which is publicly
> > available. These tests are all set up following the same pattern:
> > they train a component on a corpus and afterwards evaluate against
> > it. If the results match the results of the previous release, we hope
> > the code doesn't contain any regressions. In some cases we have
> > changes which influence the performance (e.g. bug fixes); in that
> > case we adjust the expected performance score and carefully test that
> > a particular change caused it.
> >
> > We sometimes have changes which shouldn't influence the performance
> > of a component but still do, due to some mistake. These we need to
> > identify during testing.
> >
> > The big issue we have with testing against public data is that we
> > usually can't include the data in the OpenNLP release because of its
> > license. And today we just do all the work manually by training on a
> > corpus and afterwards running the built-in evaluation against the
> > model.
> >
> > I suggest we write JUnit tests which do this in case the user has the
> > right corpus for the test. Those tests will be disabled by default
> > and can be run by providing the -Dtest property and the location of
> > the data directory.
> >
> > For example:
> >
> > mvn test -Dtest=Conll06* -DOPENNLP_CORPUS_DIR=/home/admin/opennlp-data
> >
> > The tests will do all the work and fail if the expected results don't
> > match.
> >
> > Automating those tests has the great advantage that we can run them
> > much more frequently during the development phase and hopefully
> > identify bugs before we even start with the release process.
> > Additionally we might be able to run them on our build server.
> >
> > Any opinions?
> >
> > Jörn
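A minimal sketch of what such an opt-in test could look like, assuming
JUnit 4; the test class name, the corpus file layout, and the expected
score are made up for illustration, and trainAndEvaluate() stands in for
the component's existing training and evaluation code:

    import java.io.File;

    import org.junit.Assume;
    import org.junit.Test;

    import static org.junit.Assert.assertEquals;

    public class Conll06POSTaggerEvalTest {

        @Test
        public void evalPOSTagger() throws Exception {
            String dir = System.getProperty("OPENNLP_CORPUS_DIR");
            // Skip (rather than fail) when no corpus directory is given,
            // so the test stays disabled in a default build.
            Assume.assumeNotNull(dir);

            // Hypothetical corpus layout inside the data directory.
            File trainFile = new File(dir, "conll06/train.txt");

            double accuracy = trainAndEvaluate(trainFile);

            // Expected score from the previous release (the value here is
            // invented); adjusted only after verifying that an intentional
            // change, e.g. a bug fix, caused the difference.
            assertEquals(0.9663, accuracy, 0.0001);
        }

        private double trainAndEvaluate(File trainFile) {
            // Placeholder for the component's existing train/evaluate code.
            return 0.9663;
        }
    }

Using Assume instead of a failing assertion means the test is silently
skipped when OPENNLP_CORPUS_DIR is not set, which is exactly the
disabled-by-default behavior the proposal asks for.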
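And a small sketch of Richard's download-and-cache idea for data that may
be fetched automatically; the cache location under the user's home
directory is a made-up choice, and the URL and file name come from the
caller:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public final class CorpusCache {

        // Hypothetical cache location; any stable local directory works.
        private static final Path CACHE_DIR =
                Paths.get(System.getProperty("user.home"), ".opennlp-test-data");

        // Downloads the file once and reuses the cached copy afterwards,
        // which a) avoids hammering the remote source and b) keeps the
        // tests running if the source becomes unavailable.
        public static Path fetch(String url, String fileName) throws IOException {
            Files.createDirectories(CACHE_DIR);
            Path target = CACHE_DIR.resolve(fileName);
            if (!Files.exists(target)) {
                try (InputStream in = new URL(url).openStream()) {
                    Files.copy(in, target);
                }
            }
            return target;
        }
    }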