Thanks for sharing. This probably applies to a subset of the data we can
use.
All data we can somehow get into Subversion we should definitely check in.

Some data sets are publicly available but protected by copyright and just
can't be redistributed in any way. For this data we could get/buy a license
and maybe restrict access to the committers.

Jörn


On Tue, Apr 14, 2015 at 11:53 PM, Richard Eckart de Castilho <
richard.eck...@gmail.com> wrote:

> If the unit tests automatically download publicly accessible test data,
> run against it, and optionally delete the data afterwards, then the
> test data does not have to be redistributed. Instead of deleting it,
> it might even be a good idea to cache the data to a) avoid hammering
> the remote source and b) still have a local copy in case the source
> fails.
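>
> Something like this minimal sketch could cover the download-and-cache
> step (the class name and cache location are placeholders of mine, not
> existing OpenNLP code):
>
>   import java.io.InputStream;
>   import java.net.URL;
>   import java.nio.file.Files;
>   import java.nio.file.Path;
>   import java.nio.file.Paths;
>   import java.nio.file.StandardCopyOption;
>
>   public class TestDataCache {
>
>     // Downloads the file at the given URL into a local cache directory,
>     // unless a cached copy is already there, and returns the local path.
>     public static Path fetch(String url, String fileName) throws Exception {
>       Path cacheDir =
>           Paths.get(System.getProperty("user.home"), ".opennlp-test-data");
>       Files.createDirectories(cacheDir);
>       Path target = cacheDir.resolve(fileName);
>       if (!Files.exists(target)) {
>         try (InputStream in = new URL(url).openStream()) {
>           Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
>         }
>       }
>       return target;
>     }
>   }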
>
> I believe several cases have been discussed on the legal mailing list
> where non-essential or test-only resources that were not part of the
> release could be under licenses that would not be deemed compatible
> with the Apache license. My understanding is that the release needs
> to be untainted and downstream users must be able to trust that
> they incur no license restrictions beyond the ASL.
>
> Cheers,
>
> -- Richard
>
> On 14.04.2015, at 23:47, Joern Kottmann <kottm...@gmail.com> wrote:
>
> > Hi all,
> >
> > this time the progress with the testing for 1.6.0 is rather slow. Most
> > tests are done now and I believe we are in good shape to build RC3.
> > Still, it would have been better to be at that stage a month ago.
> >
> > To improve the situation in the future I would like to propose automating
> > all tests which can be run against publicly available data. These tests
> > all follow the same pattern: they train a component on a corpus and
> > afterwards evaluate against it. If the results match the results of the
> > previous release we assume the code doesn't contain any regressions. In
> > some cases we have changes which do influence the performance (e.g. bug
> > fixes); in those cases we adjust the expected performance score and
> > carefully check that the particular change caused it.
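> >
> > A rough sketch of what such a test could look like (the helper methods,
> > file names, and expected score below are placeholders of mine, not real
> > OpenNLP API):
> >
> >   import static org.junit.Assert.assertEquals;
> >
> >   import java.io.File;
> >   import org.junit.Test;
> >
> >   public class Conll06EvalTest {
> >
> >     @Test
> >     public void trainAndEvaluate() throws Exception {
> >       File corpusDir = new File(System.getProperty("OPENNLP_CORPUS_DIR"));
> >       Object model = trainModel(new File(corpusDir, "conll06.train"));
> >       double score = evaluateModel(model, new File(corpusDir, "conll06.eval"));
> >       // The expected score is the one measured with the previous release;
> >       // it is only adjusted when a change is known to affect performance.
> >       assertEquals(0.85, score, 0.0001);
> >     }
> >
> >     // Hypothetical stand-ins for the component-specific training and
> >     // evaluation code (e.g. name finder, POS tagger, ...).
> >     private Object trainModel(File trainingData) { return new Object(); }
> >     private double evaluateModel(Object model, File evalData) { return 0.85; }
> >   }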
> >
> > We also sometimes have changes which shouldn't influence the performance
> > of a component but still do because of a mistake. These are the changes
> > we need to identify during testing.
> >
> > The big issue we have with testing against public data is that we usually
> > can't include the data in the OpenNLP release because of its license.
> > Today we do all of this work manually, by training on a corpus and
> > afterwards running the built-in evaluation against the model.
> >
> > I suggest we write JUnit tests which do this in case the user has the
> > right corpus for the test. Those tests will be disabled by default and
> > can be run by providing the -Dtest property and the location of the data
> > directory.
> >
> > For example.
> > mvn test -Dtest=Conll06* -DOPENNLP_CORPUS_DIR=/home/admin/opennlp-data
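> >
> > Gating could be handled with a JUnit assumption in a shared base class,
> > so the evaluation tests are skipped rather than failed when no corpus
> > directory is given (a sketch, assuming JUnit 4; the class name is made
> > up):
> >
> >   import org.junit.Assume;
> >   import org.junit.Before;
> >
> >   public class EvalTestBase {
> >
> >     @Before
> >     public void requireCorpusDir() {
> >       // Skip (instead of fail) when -DOPENNLP_CORPUS_DIR=... was not
> >       // passed, so the default build stays green for users without
> >       // the corpora.
> >       Assume.assumeNotNull(System.getProperty("OPENNLP_CORPUS_DIR"));
> >     }
> >   }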
> >
> > The tests will do all the work and fail if the expected results don't
> > match.
> >
> > Automating those tests has the great advantage that we can run them much
> > more frequently during the development phase and hopefully identify bugs
> > before we even start with the release process.
> > Additionally we might be able to run them on our build server.
> >
> > Any opinions?
> >
> > Jörn
>
>
