Re: Automated testing with public data
Since OpenNLP is cross-platform and Java-based, something that is itself cross-platform and Java-based might be a better fit than wget. I'm using Ant scripts for such tasks.

-- Richard

On 29.04.2015, at 17:11, William Colen william.co...@gmail.com wrote: [...]
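An Ant-based bootstrap along these lines could use Ant's built-in get task. The following buildfile fragment is a minimal sketch; the corpus URL and file names are illustrative placeholders, not real corpus locations, and skipexisting provides the simple caching discussed elsewhere in this thread.

```xml
<!-- Minimal sketch: bootstrap a corpus folder with Ant's built-in <get> task.
     The URL and file names are illustrative placeholders. -->
<project name="corpus-bootstrap" default="fetch-corpora">
  <property name="corpus.dir" location="${user.home}/opennlp-data"/>
  <target name="fetch-corpora">
    <mkdir dir="${corpus.dir}"/>
    <!-- skipexisting gives simple caching: existing files are not re-downloaded -->
    <get src="http://example.org/corpora/conll06.tar.gz"
         dest="${corpus.dir}/conll06.tar.gz"
         skipexisting="true"/>
  </target>
</project>
```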
Re: Automated testing with public data
Well, Ant is still an extra dependency, though better than wget. Something like Wagon in Maven?

On 30 April 2015 at 11:02, Richard Eckart de Castilho richard.eck...@gmail.com wrote: [...]
Re: Automated testing with public data
Or we just make a download script which bootstraps the user's corpus folder. Could be a couple of wget lines or so ...

Jörn

On Wed, Apr 29, 2015 at 6:17 AM, William Colen william.co...@gmail.com wrote: [...]
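Such a "couple of wget lines" might look like the sketch below. The corpus URLs and the environment variable name are illustrative placeholders; the existence check before each download gives the caching William asks for, so nothing is fetched twice.

```shell
#!/bin/sh
# Minimal sketch of a corpus bootstrap script. The URLs are placeholders,
# not real corpus locations; OPENNLP_CORPUS_DIR mirrors the property name
# proposed for the tests.
CORPUS_DIR="${OPENNLP_CORPUS_DIR:-$HOME/opennlp-data}"
mkdir -p "$CORPUS_DIR"

for url in \
    http://example.org/corpora/conll06.tar.gz \
    http://example.org/corpora/leipzig-sentences.txt ; do
  f="$CORPUS_DIR/$(basename "$url")"
  if [ -f "$f" ]; then
    # Simple caching: a previously downloaded file is not fetched again.
    echo "cached: $f"
  else
    wget -q -O "$f" "$url" || { echo "download failed: $url"; rm -f "$f"; }
  fi
done
echo "corpus dir ready: $CORPUS_DIR"
```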
Re: Automated testing with public data
+1 The script would also be great for documentation.

2015-04-29 11:15 GMT-03:00 Joern Kottmann kottm...@gmail.com: [...]
Re: Automated testing with public data
Automating the download would be fine as long as we cache it, as Richard suggested. Maybe it could be done by a script that prepares the environment, rather than being part of the unit test itself. In any case, it would be a good idea to save the data somewhere, because we never know whether some of these websites will become unavailable in the future.

2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho richard.eck...@gmail.com: [...]
Re: Automated testing with public data
On 15.04.2015, at 09:39, Joern Kottmann kottm...@gmail.com wrote:
> Some data sets are publicly available but protected by copyright and just can't be redistributed in any way. For this data we could get/buy a license and maybe restrict access to it among the committers.

That's what I'm saying ;) If you automatically download the data to a personal workstation during tests, you do not redistribute the data.

For Jenkins builds, I just checked the Apache Jenkins and the workspace does not seem to be publicly accessible. So data downloaded during tests there is also not made publicly available (redistributed); it is only accessible to Apache developers who are logged in.

IMHO only truly proprietary data that is not publicly accessible should be a problem, no?

-- Richard
Re: Automated testing with public data
On 15.04.2015, at 10:23, Joern Kottmann kottm...@gmail.com wrote:
> With publicly accessible data I mean a corpus you can somehow acquire, as opposed to the data you create on your own for a project. All the corpora we support in the formats package are publicly accessible. Maybe some you have to buy, and for others you just have to sign some agreement.
> A very interesting corpus for testing (and training models on) is OntoNotes. Here is a link to the LDC entry: https://catalog.ldc.upenn.edu/LDC2011T03
> You can get it for free (or for a small distribution fee), but you can't just download it. It would be great if the ASF could acquire this data set so we can share it among the committers. Is that what you mean with proprietary data?

Yes, that is what I mean. E.g. the TIGER corpus requires clicking through some pages and forms to reach a download page, but in principle it appears as if the corpus were simply downloadable via a deep-link URL. The license terms state that the corpus must not be redistributed.

Some tools are also publicly accessible and downloadable but not redistributable. For example, anybody can download TreeTagger and its models, but only from the original homepage. It is not permitted to redistribute it, i.e. to publish it to a repository or offer it on an alternative homepage.

So there is a (small) class of resources between being redistributable and proprietary (for a fee), namely being in principle publicly accessible (for free) but not redistributable.

Cheers,

-- Richard
Automated testing with public data
Hi all,

this time the progress with the testing for 1.6.0 is rather slow. Most tests are done now and I believe we are in good shape to build RC3. Anyway, it would have been better to be at that stage a month ago.

To improve the situation in the future, I would like to propose automating all tests which can be run against data that is publicly available. These tests are all set up following the same pattern: they train a component on a corpus and afterwards evaluate against it. If the results match the results of the previous release, we hope the code doesn't contain any regressions. In some cases we have changes which influence the performance (e.g. bug fixes); in that case we adjust the expected performance score and carefully verify that a particular change caused it. We sometimes have changes which shouldn't influence the performance of a component but still do, due to some mistake. These we need to identify during testing.

The big issue we have with testing against public data is that we usually can't include the data in the OpenNLP release because of its license. And today we just do all the work manually by training on a corpus and afterwards running the built-in evaluation against the model.

I suggest we write JUnit tests which do this in case the user has the right corpus for the test. Those tests will be disabled by default and can be run by providing the -Dtest property and the location of the data directory. For example:

mvn test -Dtest=Conll06* -DOPENNLP_CORPUS_DIR=/home/admin/opennlp-data

The tests will do all the work and fail if the expected results don't match. Automating those tests has the great advantage that we can run them much more frequently during the development phase and hopefully identify bugs before we even start with the release process. Additionally, we might be able to run them on our build server.

Any opinions?

Jörn
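A test following this pattern could guard itself on the corpus-directory property, so it is skipped rather than failed when no corpus is configured. The sketch below shows only that guard logic as plain Java: in a real JUnit test the guard would be JUnit's Assume mechanism, and trainAndEvaluate would call the component's actual trainer and evaluator. The property name OPENNLP_CORPUS_DIR comes from the proposal; the class name, score, and helper are illustrative placeholders.

```java
// Sketch of the guard an automated corpus test could use. Only the property
// name OPENNLP_CORPUS_DIR is from the proposal; everything else is illustrative.
import java.io.File;

public class Conll06TrainEvalSketch {

    // Expected score from the previous release; deliberately adjusted when a
    // change is known to affect performance (illustrative value).
    static final double EXPECTED_ACCURACY = 0.85;
    static final double TOLERANCE = 0.0001;

    public static void main(String[] args) {
        String corpusDir = System.getProperty("OPENNLP_CORPUS_DIR");
        if (corpusDir == null || !new File(corpusDir).isDirectory()) {
            // In JUnit this would be Assume.assumeTrue(...), so the test is
            // reported as skipped, not failed, when no corpus is configured.
            System.out.println("SKIPPED: no corpus directory configured");
            return;
        }
        // A real test would train the component on the corpus here and run
        // the built-in evaluator against the resulting model.
        double accuracy = trainAndEvaluate(corpusDir);  // hypothetical helper
        if (Math.abs(accuracy - EXPECTED_ACCURACY) > TOLERANCE) {
            throw new AssertionError("possible regression: got " + accuracy);
        }
        System.out.println("PASSED");
    }

    // Placeholder for the train/evaluate cycle described in the proposal;
    // stubbed so the sketch runs standalone.
    static double trainAndEvaluate(String corpusDir) {
        return EXPECTED_ACCURACY;
    }
}
```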