Re: Automated testing with public data

2015-04-29 Thread William Colen
+1

The script would also be great for documentation.

2015-04-29 11:15 GMT-03:00 Joern Kottmann:

> Or we just make a download script which bootstraps the user's corpus folder.
>
> Could be a couple of wget lines or so ...
>
>
> Jörn
>
> On Wed, Apr 29, 2015 at 6:17 AM, William Colen wrote:
>
> > Automating the download would be fine as long as we cache it, as Richard
> > suggested. Maybe it could be done by a script to prepare the environment,
> > and not be part of the unit test itself.
> > Anyway, it would be a good idea to save the data somewhere because we never
> > know if some of the websites will become unavailable in the future.
> >
> >
> > 2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho <
> > richard.eck...@gmail.com>:
> >
> > > On 15.04.2015, at 10:23, Joern Kottmann wrote:
> > >
> > > > By publicly accessible data I mean a corpus you can somehow acquire,
> > > > as opposed to the data you create on your own for a project.
> > > >
> > > > All the corpora we support in the formats package are publicly accessible.
> > > > Maybe some you have to buy and for others you just have to sign some
> > > > agreement.
> > > >
> > > > A very interesting corpus for testing (and training models on) is OntoNotes.
> > > >
> > > > Here is a link to the LDC entry:
> > > > https://catalog.ldc.upenn.edu/LDC2011T03
> > > >
> > > > You can get it for free (or for a small distribution fee) but you can't
> > > > just download it.
> > > > It would be great if the ASF could acquire this data set so we can share
> > > > it among the committers.
> > > >
> > > > Is that what you mean by proprietary data?
> > >
> > > Yes, that is what I mean.
> > >
> > > E.g. the TIGER corpus requires clicking through some pages and forms to
> > > reach a download page, but in principle it appears that the corpus is
> > > simply downloadable via a deep-link URL. The license terms state that the
> > > corpus must not be redistributed.
> > >
> > > Some tools are also publicly accessible and downloadable but not
> > > redistributable. For example anybody can download TreeTagger and its
> > > models, but only from the original homepage. It is not permitted to
> > > redistribute it, i.e. to publish it to a repository or offer it on an
> > > alternative homepage.
> > >
> > > So there is a (small) class of resources between redistributable and
> > > proprietary (for a fee), namely those that are in principle publicly
> > > accessible (for free) but not redistributable.
> > >
> > > Cheers,
> > >
> > > -- Richard
> >
>


Re: Automated testing with public data

2015-04-29 Thread Joern Kottmann
Or we just make a download script which bootstraps the user's corpus folder.

Could be a couple of wget lines or so ...
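
As an illustration of what such a bootstrap script might look like, here is a
minimal sketch in Python rather than wget. It follows the caching idea raised
earlier in the thread: a file is fetched from its original location only if it
is not already in the local corpus folder. The function names (fetch_cached,
bootstrap), the folder name and the example URL are all hypothetical, and the
sketch only covers resources that can be fetched from a plain URL without
forms or a login.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_cached(url: str, corpus_dir: Path, filename: str) -> Path:
    """Download url to corpus_dir/filename unless a cached copy already exists."""
    corpus_dir.mkdir(parents=True, exist_ok=True)
    target = corpus_dir / filename
    if target.exists():
        # Cached copy found: no network access is needed.
        return target
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete corpus file.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(target)
    return target


def bootstrap(corpora: dict, corpus_dir: Path) -> list:
    """Fetch every (filename -> url) entry into the corpus folder."""
    return [fetch_cached(url, corpus_dir, name) for name, url in corpora.items()]


if __name__ == "__main__":
    # Hypothetical entry; replace with the original download locations
    # of the corpora that the tests actually need.
    corpora = {"sample.conll": "https://example.org/corpora/sample.conll"}
    for path in bootstrap(corpora, Path("corpus")):
        print("available:", path)
```

Because each file is pulled from its original source into the user's own
folder, nothing is republished, which seems to fit the "publicly accessible
but not redistributable" case discussed below. Running it again is cheap,
since existing files are skipped.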


Jörn

On Wed, Apr 29, 2015 at 6:17 AM, William Colen wrote:

> Automating the download would be fine as long as we cache it, as Richard
> suggested. Maybe it could be done by a script to prepare the environment,
> and not be part of the unit test itself.
> Anyway, it would be a good idea to save the data somewhere because we never
> know if some of the websites will become unavailable in the future.
>
>
> 2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho <
> richard.eck...@gmail.com>:
>
> > On 15.04.2015, at 10:23, Joern Kottmann wrote:
> >
> > > By publicly accessible data I mean a corpus you can somehow acquire,
> > > as opposed to the data you create on your own for a project.
> > >
> > > All the corpora we support in the formats package are publicly accessible.
> > > Maybe some you have to buy and for others you just have to sign some
> > > agreement.
> > >
> > > A very interesting corpus for testing (and training models on) is OntoNotes.
> > >
> > > Here is a link to the LDC entry:
> > > https://catalog.ldc.upenn.edu/LDC2011T03
> > >
> > > You can get it for free (or for a small distribution fee) but you can't
> > > just download it.
> > > It would be great if the ASF could acquire this data set so we can share
> > > it among the committers.
> > >
> > > Is that what you mean by proprietary data?
> >
> > Yes, that is what I mean.
> >
> > E.g. the TIGER corpus requires clicking through some pages and forms to
> > reach a download page, but in principle it appears that the corpus is
> > simply downloadable via a deep-link URL. The license terms state that the
> > corpus must not be redistributed.
> >
> > Some tools are also publicly accessible and downloadable but not
> > redistributable. For example anybody can download TreeTagger and its
> > models, but only from the original homepage. It is not permitted to
> > redistribute it, i.e. to publish it to a repository or offer it on an
> > alternative homepage.
> >
> > So there is a (small) class of resources between redistributable and
> > proprietary (for a fee), namely those that are in principle publicly
> > accessible (for free) but not redistributable.
> >
> > Cheers,
> >
> > -- Richard
>