Jason & Jorn,

They say on the web-site that support will be back up in October.  The
CoNLL 2008 format looks promising, but any of the others would probably
work.  They seem to have problems with the Penn Treebank format; they
have several patches against that format.

James

On 8/9/2012 3:38 AM, Jörn Kottmann wrote:
> Maybe we can then even distribute these models from Apache.
> But in any case we should implement format support for the corpus,
> so that training OpenNLP on it is easy.
>
> Jörn
>
> On 08/09/2012 03:45 AM, Jason Baldridge wrote:
>> There is a link to a pre-release of the MASC data that I have but am not
>> sure I can share. I believe they are planning to have a finalized
>> version
>> out in September.
>>
>> AFAIK, the MASC data is unencumbered -- Nancy Ide is very committed to
>> having truly open data and annotations. It would be great if the
>> community
>> can give back to the OANC with further annotations, tools, and such
>> -- some
>> of the annotation stuff being discussed here could be great for
>> this.
>>
>> On Wed, Aug 8, 2012 at 7:47 PM, James Kosin <james.ko...@gmail.com>
>> wrote:
>>
>>> http://www.anc.org/
>>>
>>> ... but, this suggests the data they collect is only for research and
>>> education.
>>>
>>> On 8/8/2012 10:31 AM, Jason Baldridge wrote:
>>>> Sorry if I missed something along the way -- who did the annotation of
>>> the
>>>> Wikipedia data?
>>>>
>>>> BTW, the OANC will soon come out with their 3.0 release of MASC (the
>>>> Manually Annotated Sub-Corpus), with about 800k tokens of English text
>>>> (multiple domains, including twitter, blogs, transcribed spoken, and
>>> more)
>>>> labeled with several different levels of analysis, including chunks
>>>> (noun
>>>> and verb), entities, tokens, POS tags, sentence boundaries, and
>>>> logical
>>>> forms.
>>>>
>>>> http://www.americannationalcorpus.org/MASC/Home.html
>>>>
>>>> On Wed, Aug 8, 2012 at 2:47 AM, Jörn Kottmann <kottm...@gmail.com>
>>> wrote:
>>>>> On 08/08/2012 06:16 AM, Michael Schmitz wrote:
>>>>>
>>>>>> Hi, here are some models trained on Wikipedia data.  They have
>>>>>> similar
>>>>>> performance.  Is this useful?
>>>>>>
>>>>> Yes, people who do not have access to our MUC-based training
>>>>> data can just use the wiki data instead and combine it with their
>>>>> data.
>>>>>
>>>>> Thanks for sharing.
>>>>>
>>>>> Now all we need is a way to get label corrections from the
>>>>> community :-)
>>>>>
>>>>> Jörn
>>>>>
>>>>
>>>
>>
>
