+1

The script would also be great for documentation.
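
A minimal sketch of what such a bootstrap script could look like, assuming Python rather than plain wget lines. The CORPORA list, the function names and the folder name in the usage note are hypothetical placeholders, not anything agreed on in this thread. It only makes sense for resources that may be fetched freely from their original site, and it skips files that are already in the local corpus folder, so the download is cached.

```python
import shutil
import urllib.request
from pathlib import Path

# Hypothetical placeholder list: (file name in the corpus folder, source URL).
# Fill in real, freely downloadable corpora; nothing here comes from the thread.
CORPORA = [
    # ("example-corpus.zip", "https://example.org/example-corpus.zip"),
]


def fetch(url, target):
    """Download url to target unless a cached copy already exists.

    Returns True if a download happened, False if the cache was used.
    """
    target = Path(target)
    if target.exists():
        return False  # cached copy found, no network access needed
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)  # rename only after the download completed
    return True


def bootstrap(corpus_dir, corpora=CORPORA):
    """Fetch every listed corpus into corpus_dir; return the names downloaded."""
    corpus_dir = Path(corpus_dir)
    return [name for name, url in corpora if fetch(url, corpus_dir / name)]
```

Calling bootstrap(Path.home() / "opennlp-corpora") once would prepare the environment; the tests could then read from that folder without touching the network.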

2015-04-29 11:15 GMT-03:00 Joern Kottmann <kottm...@gmail.com>:

> Or we just make a download script which bootstraps the user's corpus folder.
>
> Could be a couple of wget lines or so ...
>
>
> Jörn
>
> On Wed, Apr 29, 2015 at 6:17 AM, William Colen <william.co...@gmail.com>
> wrote:
>
> > Automating the download would be fine as long as we cache it, as Richard
> > suggested. Maybe it could be done by a script to prepare the environment,
> > and not be part of the unit test itself.
> > Anyway, it would be a good idea to save the data somewhere because we never
> > know if some of the websites will become unavailable in the future.
> >
> >
> > 2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho <
> > richard.eck...@gmail.com>:
> >
> > > On 15.04.2015, at 10:23, Joern Kottmann <kottm...@gmail.com> wrote:
> > >
> > > > By publicly accessible data I mean a corpus you can somehow acquire,
> > > > as opposed to the data you create on your own for a project.
> > > >
> > > > All the corpora we support in the formats package are publicly
> > > > accessible. Maybe some you have to buy and for others you just have to
> > > > sign some agreement.
> > > >
> > > > A very interesting corpus for testing (and training models on) is
> > > > OntoNotes.
> > > >
> > > > Here is a link to the LDC entry:
> > > > https://catalog.ldc.upenn.edu/LDC2011T03
> > > >
> > > > You can get it for free (or for a small distribution fee) but you can't
> > > > just download it.
> > > > It would be great if the ASF could acquire this data set so we can share
> > > > it among the committers.
> > > >
> > > > Is that what you mean by proprietary data?
> > >
> > > Yes, that is what I mean.
> > >
> > > E.g. the TIGER corpus requires clicking through some pages and forms to
> > > reach a download page, but in principle it appears that the corpus is
> > > simply downloadable via a deep-link URL. The license terms state that the
> > > corpus must not be redistributed.
> > >
> > > Some tools are also publicly accessible and downloadable but not
> > > redistributable. For example anybody can download TreeTagger and its
> > > models, but only from the original homepage. It is not permitted to
> > > redistribute it, i.e. to publish it to a repository or offer it on an
> > > alternative homepage.
> > >
> > > So there is a (small) class of resources between being redistributable
> > > and proprietary (for fee), namely being in principle publicly accessible
> > > (for free) but not redistributable.
> > >
> > > Cheers,
> > >
> > > -- Richard
> >
>
