Automating the download would be fine as long as we cache it, as Richard suggested. Maybe it could be done by a script that prepares the environment, rather than as part of the unit test itself. Anyway, it would be a good idea to save the data somewhere, because we never know whether some of the websites will become unavailable in the future.
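To make the idea concrete, here is a rough sketch of what such a preparation script could look like. It is only an illustration, not something that exists in the project: the function name `fetch_cached` and the cache layout are made up for this example, and any corpus URL passed to it would have to be one whose license allows automated download.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def fetch_cached(url: str, cache_dir: Path) -> Path:
    """Download url into cache_dir once; later calls reuse the cached copy.

    Hypothetical helper for illustration only.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache entry on a hash of the URL so any URL maps to a safe file name.
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if target.exists():
        return target  # cache hit: the original website is not contacted
    # Write to a temporary name first, so an interrupted download never
    # leaves a truncated file that looks like a valid cache entry.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(target)
    return target
```

The tests would then only read from the cache directory, so they keep working if a website disappears after the first successful download.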
2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho <richard.eck...@gmail.com>:

> On 15.04.2015, at 10:23, Joern Kottmann <kottm...@gmail.com> wrote:
>
> > With publicly accessible data I mean a corpus you can somehow acquire,
> > opposed to the data you create on your own for a project.
> >
> > All the corpora we support in the formats package are publicly accessible.
> > Maybe some you have to buy and for others you just have to sign some
> > agreement.
> >
> > A very interesting corpus for testing (and training models on) is OntoNotes.
> >
> > Here is a link to the LDC entry:
> > https://catalog.ldc.upenn.edu/LDC2011T03
> >
> > You can get it for free (or for a small distribution fee) but you can't
> > just download it.
> > It would be great if the ASF could acquire this data set so we can share it
> > among the committers.
> >
> > Is that what you mean with proprietary data?
>
> Yes, that is what I mean.
>
> E.g. the TIGER corpus requires clicking through some pages and forms to
> reach a download page, but in principle, it appears as if the corpus was
> simply downloadable by a deep-link URL. The license terms state, that the
> corpus must not be redistributed.
>
> Some tools are also publicly accessible and downloadable but not
> redistributable. For example anybody can download TreeTagger and its
> models, but only from the original homepage. It is not permitted to
> redistribute it, i.e. to publish it to a repository or offer it on an
> alternative homepage.
>
> So there is a (small) class of resources between being redistributable and
> proprietary (for fee), namely being in principle publicly accessible (for
> free) but not redistributable.
>
> Cheers,
>
> -- Richard