On 8/8/2012 10:31 AM, Jason Baldridge wrote:
> Sorry if I missed something along the way -- who did the annotation of the
> Wikipedia data?
>
> BTW, the OANC will soon come out with their 3.0 release of MASC (the
> Manually Annotated Sub-Corpus), with about 800k tokens of English text
> (multiple domains, including twitter, blogs, transcribed spoken, and more)
> labeled with several different levels of analysis, including chunks (noun
> and verb), entities, tokens, POS tags, sentence boundaries, and logical
> forms.
>
> http://www.americannationalcorpus.org/MASC/Home.html
>
>
Jason,

It looks interesting, but right now they only provide annotations for
about 80K words of the data; for the rest they offer the raw data sets
only.  :-(

They do, however, provide a 40K-word subset in CoNLL 2008 format.

With our architecture the exact format doesn't matter much; it is just
a matter of writing a converter to extract the data we need.  It looks
like we could even train the tokenizer and sentence detector on the
structure they provide; a rough sketch follows below.
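For example, here is a minimal sketch of training a sentence detector
from such extracted data, assuming the OpenNLP 1.5.x API
(SentenceDetectorME) and a hypothetical masc-sentences.txt, one
sentence per line, produced by the converter:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class MascSentenceTrainer {

    public static void main(String[] args) throws Exception {
        // One sentence per line, as emitted by the (hypothetical) converter.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("masc-sentences.txt"), "UTF-8");
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

        // Default maxent training, no abbreviation dictionary.
        SentenceModel model = SentenceDetectorME.train(
                "en", samples, true, null, TrainingParameters.defaultParams());
        samples.close();

        // Persist the trained model for later use.
        OutputStream out = new FileOutputStream("en-masc-sent.bin");
        model.serialize(out);
        out.close();
    }
}

A converter for the tokenizer would work the same way, just feeding
TokenSample objects into TokenizerME.train instead.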
