+1. The script would also be useful as documentation.
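
A rough sketch of what such a bootstrap script could look like, with the caching discussed in the thread below. It is written in Python rather than as plain wget lines only so that the cache check is explicit. The corpus name, the URL, and the function names (`fetch_url`, `bootstrap`) are placeholders for illustration, not an agreed design or real download locations.

```python
import shutil
import urllib.request
from pathlib import Path

# Hypothetical corpus list: the file name and URL are placeholders only.
CORPORA = {
    "example-corpus.txt": "https://example.org/corpora/example-corpus.txt",
}


def fetch_url(url, target):
    """Download url to target via a temporary file, so an interrupted
    download is never mistaken for a cached copy."""
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)


def bootstrap(corpus_dir, corpora=CORPORA, fetch=fetch_url):
    """Populate corpus_dir, skipping files that are already present.

    Returns the names of the files that were actually fetched.
    """
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name, url in corpora.items():
        target = corpus_dir / name
        if target.exists():
            # Cached from an earlier run: do not hit the website again.
            continue
        fetch(url, target)
        fetched.append(name)
    return fetched
```

Running `bootstrap("corpora")` before the tests would fill the user's corpus folder once; later runs find the files already there and download nothing. The data is fetched from the original site into a local folder only, so nothing is republished.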
2015-04-29 11:15 GMT-03:00 Joern Kottmann <kottm...@gmail.com>:

> Or we just make a download script which bootstraps the users corpus folder.
>
> Could be a couple of wget lines or so ...
>
> Jörn
>
> On Wed, Apr 29, 2015 at 6:17 AM, William Colen <william.co...@gmail.com> wrote:
>
> > Automating the download would be fine as long as we cache it, as Richard
> > suggested. Maybe it could be done by a script to prepare the environment,
> > and not be part of the unit test itself.
> > Anyway, it would be a good idea to save the data somewhere because we never
> > know if some of the websites will become unavailable in the future.
> >
> > 2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho <richard.eck...@gmail.com>:
> >
> > > On 15.04.2015, at 10:23, Joern Kottmann <kottm...@gmail.com> wrote:
> > >
> > > > With publicly accessible data I mean a corpus you can somehow acquire,
> > > > opposed to the data you create on your own for a project.
> > > >
> > > > All the corpora we support in the formats package are publicly accessible.
> > > > Maybe some you have to buy and for others you just have to sign some agreement.
> > > >
> > > > A very interesting corpus for testing (and training models on) is OntoNotes.
> > > >
> > > > Here is a link to the LDC entry:
> > > > https://catalog.ldc.upenn.edu/LDC2011T03
> > > >
> > > > You can get it for free (or for a small distribution fee) but you can't
> > > > just download it.
> > > > It would be great if the ASF could acquire this data set so we can share it
> > > > among the committers.
> > > >
> > > > Is that what you mean with proprietary data?
> > >
> > > Yes, that is what I mean.
> > >
> > > E.g. the TIGER corpus requires clicking through some pages and forms to
> > > reach a download page, but in principle, it appears as if the corpus was
> > > simply downloadable by a deep-link URL. The license terms state, that the
> > > corpus must not be redistributed.
> > >
> > > Some tools are also publicly accessible and downloadable but not
> > > redistributable. For example anybody can download TreeTagger and its
> > > models, but only from the original homepage. It is not permitted to
> > > redistribute it, i.e. to publish it to a repository or offer it on an
> > > alternative homepage.
> > >
> > > So there is a (small) class of resources between being redistributable and
> > > proprietary (for fee), namely being in principle publicly accessible (for
> > > free) but not redistributable.
> > >
> > > Cheers,
> > >
> > > -- Richard