On 8/8/2012 10:31 AM, Jason Baldridge wrote:
> Sorry if I missed something along the way -- who did the annotation of the
> Wikipedia data?
>
> BTW, the OANC will soon come out with their 3.0 release of MASC (the
> Manually Annotated Sub-Corpus), with about 800k tokens of English text
> (multiple domains, including twitter, blogs, transcribed spoken, and more)
> labeled with several different levels of analysis, including chunks (noun
> and verb), entities, tokens, POS tags, sentence boundaries, and logical
> forms.
>
> http://www.americannationalcorpus.org/MASC/Home.html
>
> Jason,
It looks interesting, but right now they only provide annotations for about 80K words of the data; for the rest, only the raw data sets are available. :-( They do, however, provide a 40K-word subset in CoNLL-08 format. With our architecture the exact format doesn't matter much; it's mostly a matter of writing a converter to extract the data we need. It looks like we could even train the tokenizer and sentence detector on the structure they provide.
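Just to sketch what I mean by a converter (this is only a rough illustration, assuming the CoNLL-08 layout of one tab-separated token per line with the surface form in the second column and blank lines between sentences; the class and file names are placeholders): something like this would turn the column data into one-sentence-per-line text, which is roughly the shape a sentence-detector trainer wants, and a tokenizer converter would follow the same pattern.

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    // Rough sketch: read CoNLL-08 style columns and emit one sentence per line.
    public class Conll08ToSentences {

        public static void main(String[] args) throws IOException {
            if (args.length != 2) {
                System.err.println("Usage: Conll08ToSentences <conll-input> <sentences-output>");
                return;
            }

            try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(new FileInputStream(args[0]), StandardCharsets.UTF_8));
                 PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(new FileOutputStream(args[1]), StandardCharsets.UTF_8))) {

                List<String> tokens = new ArrayList<>();
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.trim().isEmpty()) {
                        // A blank line ends the current sentence.
                        if (!tokens.isEmpty()) {
                            out.println(String.join(" ", tokens));
                            tokens.clear();
                        }
                    } else {
                        // Assumes the surface form (FORM) is the second tab-separated column.
                        String[] cols = line.split("\t");
                        if (cols.length > 1) {
                            tokens.add(cols[1]);
                        }
                    }
                }
                if (!tokens.isEmpty()) {
                    out.println(String.join(" ", tokens));
                }
            }
        }
    }

Joining tokens with spaces is obviously not real detokenization, so for tokenizer training we would need a slightly smarter pass, but the plumbing would look the same.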