You could use the MASC annotations. I have a walkthrough for converting the data to formats suitable for Chalk (and compatible with OpenNLP) here: https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial
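In case it helps to see what the conversion is targeting, here's a rough Scala sketch of writing out OpenNLP-style sentence detector and tokenizer training data from sentences and token spans that have already been pulled out of the MASC annotations. The types and method names are just illustrative (they aren't from Chalk or the tutorial); the wiki page above has the actual steps:

import java.io.{File, PrintWriter}

// Hypothetical input: sentences already extracted from the MASC annotations,
// with tokens given as (start, end) character offsets into the sentence text.
case class Token(start: Int, end: Int)
case class AnnotatedSentence(text: String, tokens: Seq[Token])

object MascToTraining {

  // Sentence detector training data: one sentence per line.
  def writeSentenceData(sents: Seq[AnnotatedSentence], out: File): Unit = {
    val w = new PrintWriter(out, "UTF-8")
    try sents.foreach(s => w.println(s.text.replaceAll("\\s+", " ").trim))
    finally w.close()
  }

  // Tokenizer training data: adjacent tokens with no whitespace between them
  // in the raw text get a <SPLIT> marker, which is the convention OpenNLP's
  // TokenizerTrainer expects.
  def writeTokenizerData(sents: Seq[AnnotatedSentence], out: File): Unit = {
    val w = new PrintWriter(out, "UTF-8")
    try {
      for (s <- sents if s.tokens.nonEmpty) {
        val sb = new StringBuilder(s.text.substring(s.tokens.head.start, s.tokens.head.end))
        for ((prev, cur) <- s.tokens.zip(s.tokens.tail)) {
          sb.append(if (cur.start == prev.end) "<SPLIT>" else " ")
          sb.append(s.text.substring(cur.start, cur.end))
        }
        w.println(sb.toString)
      }
    } finally w.close()
  }
}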
There is still some work to be done in terms of how the annotations are extracted, options for training, and so on, but it does serve as a benchmark.

BTW, I've just recently finished integrating Liblinear into Nak (which is an adaptation of the maxent portion of OpenNLP). I'm still rounding some things out, but so far it is producing more accurate models that train in less time and without using cutoffs. Here's the code:

https://github.com/scalanlp/nak

It is still mostly Java, but the liblinear adaptors are in Scala. I've kept things such that liblinear retrofits to the interfaces that were in opennlp.maxent (a rough sketch of the adaptor shape is at the end of this message), though given how well it is working, I'll be stripping those out and going with liblinear for everything in upcoming versions.

Happy to answer any questions or help out with any of the above if it might be useful!

-Jason

On Fri, Mar 22, 2013 at 8:08 AM, Jörn Kottmann <[email protected]> wrote:

> On 03/22/2013 01:05 PM, William Colen wrote:
>
>> We could do it with the Leipzig corpus, or CONLL. We can prepare the
>> corpus by detokenizing it and creating documents from it.
>>
>> If it is OK to do it with another language, the AD corpus has paragraph
>> and text annotations, as well as the original sentences (not tokenized).
>>
>
> For English we should be able to use some of the CONLL data, yes, and we
> should definitely test with other languages too. Leipzig might be suited
> for sentence detector training, but not for tokenizer training, since the
> data is not tokenized as far as I know.
>
> +1 to use AD and CONLL for testing the tokenizer and sentence detector.
>
> Jörn

--
Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
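For a sense of what the retrofit mentioned above roughly looks like, here is a sketch of a liblinear-backed classifier sitting behind a maxent-style eval(context) interface. This is not Nak's actual code: the class name, the predicate/outcome bookkeeping, and the use of the de.bwaldvogel liblinear-java port are all illustrative.

import de.bwaldvogel.liblinear.{Feature, FeatureNode, Linear, Model}

// Illustrative adaptor: keep the opennlp.maxent-style contract of
// eval(context) => outcome probabilities, delegating the scoring to a
// trained liblinear model. Assumes `outcomes` is ordered to match
// model.getLabels() so probabilities line up with outcome names.
class LiblinearMaxentAdaptor(
    model: Model,
    predIndex: Map[String, Int],  // predicate string -> 1-based liblinear feature index
    outcomes: Array[String]) {

  // Probability distribution over outcomes for the given context predicates.
  def eval(context: Array[String]): Array[Double] = {
    // Map known predicates to feature nodes; unseen predicates are dropped.
    val feats: Array[Feature] = context
      .flatMap(predIndex.get)
      .distinct
      .sorted                                 // liblinear expects ascending indices
      .map(i => new FeatureNode(i, 1.0): Feature)

    val probs = new Array[Double](outcomes.length)
    Linear.predictProbability(model, feats, probs)  // fills probs (LR solvers only)
    probs
  }

  def getBestOutcome(probs: Array[Double]): String =
    outcomes(probs.indexOf(probs.max))
}

The point of keeping an eval() like this around is that downstream OpenNLP-style components don't have to change while the training backend gets swapped out underneath them.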
