Automating the download would be fine as long as we cache it, as Richard
suggested. Maybe it could be done by a script to prepare the environment,
and not be part of the unit test itself.
Anyway, it would be a good idea to save the data somewhere because we never
know if some of the websites will become unavailable in the future.
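
To make the idea concrete, here is a minimal sketch of such a preparation script in Python. It is not existing project code: the names ensure_cached and fetch_to_file, the cache directory, and the choice to name cache entries after a hash of the URL are all assumptions made for illustration. The point is only that the download happens once, outside the unit test, and later runs reuse the local copy.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def fetch_to_file(url, target):
    """Default fetcher (hypothetical helper): stream the URL's bytes to target."""
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)


def ensure_cached(url, cache_dir, fetch=fetch_to_file):
    """Return a local path for url, downloading only if it is not cached yet."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: cache entries are named after a hash of the URL,
    # so any URL maps to a safe file name.
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not target.exists():
        # Download under a temporary name first, so an interrupted run
        # never leaves a half-written file that looks like a valid entry.
        partial = target.with_name(target.name + ".part")
        fetch(url, partial)
        partial.replace(target)
    return target
```

The tests would then only read the path returned by ensure_cached. This does not settle the licensing question: for data that must not be redistributed, the cache would have to stay on the machine that did the download.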


2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho <richard.eck...@gmail.com>:

> On 15.04.2015, at 10:23, Joern Kottmann <kottm...@gmail.com> wrote:
>
> > With publicly accessible data I mean a corpus you can somehow acquire,
> > opposed to the data you create on your own for a project.
> >
> > All the corpora we support in the formats package are publicly accessible.
> > Maybe some you have to buy and for others you just have to sign some
> > agreement.
> >
> > A very interesting corpus for testing (and training models on) is
> > OntoNotes.
> >
> > Here is a link to the LDC entry:
> > https://catalog.ldc.upenn.edu/LDC2011T03
> >
> > You can get it for free (or for a small distribution fee) but you can't
> > just download it.
> > It would be great if the ASF could acquire this data set so we can share
> > it among the committers.
> >
> > Is that what you mean with proprietary data?
>
> Yes, that is what I mean.
>
> E.g. the TIGER corpus requires clicking through some pages and forms to
> reach a download page, but in principle, it appears as if the corpus was
> simply downloadable by a deep-link URL. The license terms state that the
> corpus must not be redistributed.
>
> Some tools are also publicly accessible and downloadable but not
> redistributable. For example anybody can download TreeTagger and its
> models, but only from the original homepage. It is not permitted to
> redistribute it, i.e. to publish it to a repository or offer it on an
> alternative homepage.
>
> So there is a (small) class of resources between being redistributable and
> proprietary (for fee), namely being in principle publicly accessible (for
> free) but not redistributable.
>
> Cheers,
>
> -- Richard
