Automating the download would be fine as long as we cache it, as Richard
suggested. Maybe it could be done by a script to prepare the environment,
and not be part of the unit test itself.
Anyway, it would be a good idea to archive the data somewhere, because we never
know whether some of these websites will become unavailable in the future.
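To make that concrete, a minimal sketch of such a preparation step might look
like the following (in Java; the corpus URL, file name, and cache directory are
placeholders, not a real corpus location):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Fetches a test corpus into a local cache directory unless it is already
// there, so the unit tests themselves never need to touch the network.
public class FetchTestData {

    public static void main(String[] args) throws IOException {
        // Placeholder URL; replace with the actual corpus download location.
        URL corpusUrl = new URL("https://example.org/corpora/sample-corpus.zip");
        Path cacheDir = Paths.get(System.getProperty("user.home"),
                ".cache", "opennlp-test-data");
        Path target = cacheDir.resolve("sample-corpus.zip");

        if (Files.exists(target)) {
            System.out.println("Using cached corpus: " + target);
            return;
        }

        Files.createDirectories(cacheDir);
        System.out.println("Downloading corpus to " + target);
        try (InputStream in = corpusUrl.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}

Such a script could run as a separate preparation step before the tests, which
would then only read from the cache directory.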
2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho
richard.eck...@gmail.com:
On 15.04.2015, at 10:23, Joern Kottmann kottm...@gmail.com wrote:
With publicly accessible data I mean a corpus you can somehow acquire, as
opposed to data you create on your own for a project.
All the corpora we support in the formats package are publicly accessible.
Some you may have to buy, and for others you just have to sign an agreement.
A very interesting corpus for testing (and training models on) is
OntoNotes.
Here is a link to the LDC entry:
https://catalog.ldc.upenn.edu/LDC2011T03
You can get it for free (or for a small distribution fee), but you can't
just download it.
It would be great if the ASF could acquire this data set so we can share it
among the committers.
Is that what you mean by proprietary data?
Yes, that is what I mean.
E.g. the TIGER corpus requires clicking through some pages and forms to
reach a download page, but in principle the corpus appears to be directly
downloadable via a deep-link URL. The license terms state that the corpus
must not be redistributed.
Some tools are also publicly accessible and downloadable, but not
redistributable. For example, anybody can download TreeTagger and its
models, but only from the original homepage. It is not permitted to
redistribute it, i.e. to publish it to a repository or to offer it on an
alternative homepage.
So there is a (small) class of resources between the redistributable ones and
the proprietary (for-fee) ones: resources that are in principle publicly
accessible (for free) but not redistributable.
Cheers,
-- Richard