Re: Automated testing with public data

2015-04-30 Thread Richard Eckart de Castilho
Since OpenNLP is cross-platform and Java-based, a download mechanism that is
also cross-platform and Java-based might be better than wget.
I'm using Ant scripts for such tasks.

-- Richard
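
As a rough illustration of the Java-based alternative to a wget line, the
following is a minimal sketch that uses only the JDK. The class name
CorpusDownloader, the method download, and the ".part" temporary suffix are
hypothetical names chosen for this example. None of them exist in OpenNLP, and
the thread does not settle on any particular implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not part of OpenNLP): a pure-Java stand-in for one wget line. */
final class CorpusDownloader {

    private CorpusDownloader() {
    }

    /**
     * Copies the resource at {@code source} into {@code corpusDir/fileName}
     * and returns the path of the downloaded file.
     */
    static Path download(URI source, Path corpusDir, String fileName) throws IOException {
        // Create the corpus folder (and any missing parents) on any platform.
        Files.createDirectories(corpusDir);

        Path target = corpusDir.resolve(fileName);
        // Write to a temporary name first so an interrupted download
        // never leaves a truncated file under the final name.
        Path partial = corpusDir.resolve(fileName + ".part");

        try (InputStream in = source.toURL().openStream()) {
            Files.copy(in, partial, StandardCopyOption.REPLACE_EXISTING);
        }
        Files.move(partial, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }
}
```

A call such as CorpusDownloader.download(URI.create("https://example.org/corpus.zip"),
Paths.get("corpora"), "corpus.zip") would then play the role of one wget line;
the URL here is a placeholder, not a real corpus location.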

On 29.04.2015, at 17:11, William Colen william.co...@gmail.com wrote:

 +1
 
 The script would also be great for documentation.
 
 2015-04-29 11:15 GMT-03:00 Joern Kottmann kottm...@gmail.com:
 
 Or we just make a download script which bootstraps the user's corpus folder.
 
 Could be a couple of wget lines or so ...
 
 
 Jörn



Re: Automated testing with public data

2015-04-30 Thread Aliaksandr Autayeu
Well, Ant is still an extra dependency, though better than wget. Something
like Wagon in Maven?

On 30 April 2015 at 11:02, Richard Eckart de Castilho 
richard.eck...@gmail.com wrote:

 Since OpenNLP is cross-platform and Java-based, a download mechanism that is
 also cross-platform and Java-based might be better than wget.
 I'm using Ant scripts for such tasks.

 -- Richard





Re: Automated testing with public data

2015-04-29 Thread Joern Kottmann
Or we just make a download script which bootstraps the user's corpus folder.

Could be a couple of wget lines or so ...


Jörn

On Wed, Apr 29, 2015 at 6:17 AM, William Colen william.co...@gmail.com
wrote:

 Automating the download would be fine as long as we cache it, as Richard
 suggested. Maybe it could be done by a script to prepare the environment,
 and not be part of the unit test itself.
 Anyway, it would be a good idea to save the data somewhere because we never
 know if some of the websites will become unavailable in the future.





Re: Automated testing with public data

2015-04-29 Thread William Colen
+1

The script would also be great for documentation.

2015-04-29 11:15 GMT-03:00 Joern Kottmann kottm...@gmail.com:

 Or we just make a download script which bootstraps the user's corpus folder.

 Could be a couple of wget lines or so ...


 Jörn

 



Re: Automated testing with public data

2015-04-28 Thread William Colen
Automating the download would be fine as long as we cache it, as Richard
suggested. Maybe it could be done by a script to prepare the environment,
and not be part of the unit test itself.
Anyway, it would be a good idea to save the data somewhere because we never
know if some of the websites will become unavailable in the future.
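
The caching idea could look roughly like the sketch below: a preparation step
fetches a corpus only when it is missing from a local cache folder, and tests
merely check whether the file is there. CorpusCache, Fetcher, prepare and
isAvailable are hypothetical names for illustration, not OpenNLP API. How the
fetch itself is done (plain HTTP, Ant, Wagon) is left to the caller.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical cache for downloaded corpora (not part of OpenNLP). */
final class CorpusCache {

    /** Hypothetical callback that fetches one resource into the given file. */
    interface Fetcher {
        void fetchTo(Path target) throws IOException;
    }

    private final Path cacheDir;

    CorpusCache(Path cacheDir) {
        this.cacheDir = cacheDir;
    }

    /**
     * Environment preparation step: returns the cached file and calls the
     * fetcher only if the file is not in the cache yet.
     */
    Path prepare(String fileName, Fetcher fetcher) throws IOException {
        Path cached = cacheDir.resolve(fileName);
        if (Files.isRegularFile(cached)) {
            return cached; // cache hit: no network access needed
        }
        Files.createDirectories(cacheDir);
        fetcher.fetchTo(cached);
        if (!Files.isRegularFile(cached)) {
            throw new IOException("Fetcher did not create " + cached);
        }
        return cached;
    }

    /** For tests: reports whether the corpus is present, without downloading. */
    boolean isAvailable(String fileName) {
        return Files.isRegularFile(cacheDir.resolve(fileName));
    }
}
```

A test could call isAvailable(...) and skip itself when the corpus is missing,
so the unit test itself never touches the network.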


2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho 
richard.eck...@gmail.com:

 On 15.04.2015, at 10:23, Joern Kottmann kottm...@gmail.com wrote:

  With publicly accessible data I mean a corpus you can somehow acquire,
  as opposed to the data you create on your own for a project.
 
  All the corpora we support in the formats package are publicly accessible.
  Maybe some you have to buy and for others you just have to sign some
  agreement.
 
  A very interesting corpus for testing (and training models on) is
  OntoNotes.
 
  Here is a link to the LDC entry:
  https://catalog.ldc.upenn.edu/LDC2011T03
 
  You can get it for free (or for a small distribution fee) but you can't
  just download it.
  It would be great if the ASF could acquire this data set so we can share
  it among the committers.
 
  Is that what you mean with proprietary data?

 Yes, that is what I mean.

 E.g. the TIGER corpus requires clicking through some pages and forms to
 reach a download page, but in principle, it appears as if the corpus was
 simply downloadable by a deep-link URL. The license terms state that the
 corpus must not be redistributed.

 Some tools are also publicly accessible and downloadable but not
 redistributable. For example anybody can download TreeTagger and its
 models, but only from the original homepage. It is not permitted to
 redistribute it, i.e. to publish it to a repository or offer it on an
 alternative homepage.

 So there is a (small) class of resources between being redistributable and
 proprietary (for fee), namely being in principle publicly accessible (for
 free) but not redistributable.

 Cheers,

 -- Richard


Re: Automated testing with public data

2015-04-15 Thread Richard Eckart de Castilho
On 15.04.2015, at 09:39, Joern Kottmann kottm...@gmail.com wrote:

 Some data sets are publicly available but protected by copyright and just
 can't be redistributed in any way. For this data we could get/buy a license
 and maybe restrict access to it among the committers.

That's what I'm saying ;) If you automatically download the data to a personal
workstation during tests, you do not redistribute the data.

For Jenkins builds, I just checked the Apache Jenkins and the Workspace does
not seem to be publicly accessible. So stuff downloaded during tests there is
also not made publicly available (redistributed) - it is only accessible to
Apache developers that are logged in. 

IMHO only truly proprietary data that is not publicly accessible should be
a problem, no?

-- Richard

Re: Automated testing with public data

2015-04-15 Thread Richard Eckart de Castilho
On 15.04.2015, at 10:23, Joern Kottmann kottm...@gmail.com wrote:

 With publicly accessible data I mean a corpus you can somehow acquire,
 as opposed to the data you create on your own for a project.
 
 All the corpora we support in the formats package are publicly accessible.
 Maybe some you have to buy and for others you just have to sign some
 agreement.
 
 A very interesting corpus for testing (and training models on) is OntoNotes.
 
 Here is a link to the LDC entry:
 https://catalog.ldc.upenn.edu/LDC2011T03
 
 You can get it for free (or for a small distribution fee) but you can't
 just download it.
 It would be great if the ASF could acquire this data set so we can share it
 among the committers.
 
 Is that what you mean with proprietary data?

Yes, that is what I mean.

E.g. the TIGER corpus requires clicking through some pages and forms to reach a
download page, but in principle, it appears as if the corpus was simply
downloadable by a deep-link URL. The license terms state that the corpus must
not be redistributed.

Some tools are also publicly accessible and downloadable but not 
redistributable. For example anybody can download TreeTagger and its models, 
but only from the original homepage. It is not permitted to redistribute it, 
i.e. to publish it to a repository or offer it on an alternative homepage.

So there is a (small) class of resources between being redistributable and 
proprietary (for fee), namely being in principle publicly accessible (for free) 
but not redistributable.

Cheers,

-- Richard