Maybe we can then even distribute these models from Apache.
But in any case we should implement format support for the corpus,
so that training OpenNLP on it is easy.

Jörn

On 08/09/2012 03:45 AM, Jason Baldridge wrote:
There is a link to a pre-release of the MASC data that I have but am not
sure I can share. I believe they are planning to have a finalized version
out in September.

AFAIK, the MASC data is unencumbered -- Nancy Ide is very committed to
having truly open data and annotations. It would be great if the community
can give back to the OANC with further annotations, tools, and such -- some
of the annotation stuff being discussed here could be great for this.

On Wed, Aug 8, 2012 at 7:47 PM, James Kosin <james.ko...@gmail.com> wrote:

http://www.anc.org/

... but, this suggests the data they collect is only for research and
education.

On 8/8/2012 10:31 AM, Jason Baldridge wrote:
Sorry if I missed something along the way -- who did the annotation of
the Wikipedia data?

BTW, the OANC will soon come out with their 3.0 release of MASC (the
Manually Annotated Sub-Corpus), with about 800k tokens of English text
(multiple domains, including Twitter, blogs, transcribed speech, and more)
labeled with several different levels of analysis, including chunks (noun
and verb), entities, tokens, POS tags, sentence boundaries, and logical
forms.

http://www.americannationalcorpus.org/MASC/Home.html

On Wed, Aug 8, 2012 at 2:47 AM, Jörn Kottmann <kottm...@gmail.com> wrote:
On 08/08/2012 06:16 AM, Michael Schmitz wrote:

Hi, here are some models trained on Wikipedia data.  They have similar
performance.  Is this useful?

Yes, people who do not have access to our MUC-based training
data can just use the wiki data instead and combine it with their data.

Thanks for sharing.

Now all we need is a way to get label corrections from the community :-)

Jörn




