+1 to adding format support for MASC directly to OpenNLP; I will open a JIRA issue for it.
Looks like there is data to train most of our components.

Jörn

On 03/22/2013 03:08 PM, Jason Baldridge wrote:
You could use the MASC annotations. I have a walkthrough for converting
the data to formats suitable for Chalk (and compatible with OpenNLP) here:
https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial
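
Once the sentences are in OpenNLP's one-per-line training format, training a
sentence detector through the API is only a few lines. Here is a sketch
against the 1.5.x API; the input file name is made up:

    import java.io.FileInputStream;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainSentenceDetector {
        public static void main(String[] args) throws Exception {
            // One sentence per line, as produced by the conversion.
            ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("masc-sentences.train"), "UTF-8");
            ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

            SentenceModel model = SentenceDetectorME.train(
                "en", samples, true, null, TrainingParameters.defaultParams());
            samples.close();
            // model.serialize(out) would persist it to an en-sent.bin file.
        }
    }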

There is still some work to be done in terms of how the annotations are
extracted, options for training, and so on, but it does serve as a benchmark.

BTW, I've just recently finished integrating Liblinear into Nak (which is
an adaptation of the maxent portion of OpenNLP). I'm still rounding some
things out, but so far it is producing more accurate models that are
trained in less time and without using cutoffs. Here's the code:
https://github.com/scalanlp/nak
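
To give a flavor of the liblinear side, here is a minimal, self-contained
sketch against the Java port of liblinear (de.bwaldvogel.liblinear); the toy
data is of course made up:

    import de.bwaldvogel.liblinear.*;

    public class LiblinearSketch {
        public static void main(String[] args) {
            // Two toy training instances; liblinear feature indices are
            // 1-based and must be in ascending order.
            Feature[][] x = {
                { new FeatureNode(1, 1.0), new FeatureNode(3, 1.0) },
                { new FeatureNode(2, 1.0) }
            };
            double[] y = { 0.0, 1.0 }; // numeric outcome ids

            Problem problem = new Problem();
            problem.l = x.length; // number of training instances
            problem.n = 3;        // highest feature index
            problem.x = x;
            problem.y = y;
            problem.bias = -1;    // no bias feature

            // L2-regularized logistic regression, C = 1.0, eps = 0.01;
            // note that no feature count cutoff is involved.
            Parameter param = new Parameter(SolverType.L2R_LR, 1.0, 0.01);
            Model model = Linear.train(problem, param);

            Feature[] instance = { new FeatureNode(2, 1.0) };
            double[] probs = new double[model.getNrClass()];
            Linear.predictProbability(model, instance, probs);
            System.out.println(java.util.Arrays.toString(probs));
        }
    }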

It is still mostly Java, but the liblinear adaptors are in Scala. I've kept
things such that liblinear retrofits to the interfaces that were in
opennlp.maxent, though given how well it is working, I'll be stripping
those out and going with liblinear for everything in upcoming versions.
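
The retrofit is roughly this shape; a hypothetical, simplified adapter (the
real opennlp.maxent interface has a few more methods):

    import java.util.Map;
    import java.util.SortedSet;
    import java.util.TreeSet;

    import de.bwaldvogel.liblinear.*;

    // Hypothetical adapter exposing a liblinear model through the
    // eval(String[] context) style of opennlp.maxent's MaxentModel.
    public class LiblinearMaxentAdapter {

        private final Model model;
        private final Map<String, Integer> predicateIndex; // predicate -> 1-based index
        private final String[] outcomes;                   // class id -> outcome name

        public LiblinearMaxentAdapter(Model model,
                Map<String, Integer> predicateIndex, String[] outcomes) {
            this.model = model;
            this.predicateIndex = predicateIndex;
            this.outcomes = outcomes;
        }

        // Maps string predicates to a sparse feature vector and returns one
        // probability per outcome; unseen predicates are silently dropped.
        public double[] eval(String[] context) {
            SortedSet<Integer> indices = new TreeSet<Integer>();
            for (String predicate : context) {
                Integer index = predicateIndex.get(predicate);
                if (index != null) {
                    indices.add(index);
                }
            }
            Feature[] x = new Feature[indices.size()];
            int i = 0;
            for (int index : indices) {
                x[i++] = new FeatureNode(index, 1.0);
            }
            // Assumes training used class ids 0..n-1 in outcome order; a
            // real adapter would remap via model.getLabels().
            double[] probs = new double[outcomes.length];
            Linear.predictProbability(model, x, probs);
            return probs;
        }

        public String getBestOutcome(double[] probs) {
            int best = 0;
            for (int j = 1; j < probs.length; j++) {
                if (probs[j] > probs[best]) {
                    best = j;
                }
            }
            return outcomes[best];
        }
    }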

Happy to answer any questions or help out with any of the above if it might
be useful!

-Jason

On Fri, Mar 22, 2013 at 8:08 AM, Jörn Kottmann <kottm...@gmail.com> wrote:

On 03/22/2013 01:05 PM, William Colen wrote:

We could do it with the Leipzig corpus or CONLL. We can prepare the corpus
by detokenizing it and creating documents from it (a rough sketch of such a
detokenizer is below).

If it is OK to do it with another language, the AD corpus has paragraph and
text annotations, as well as the original sentences (not tokenized).
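
Something like this could work as a first cut for the detokenization (a
naive sketch; OpenNLP also ships a dictionary-based Detokenizer that would
do a better job):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Naive detokenizer: joins tokens back into raw text with simple
    // punctuation rules, so a tokenized corpus can be turned into raw
    // documents for sentence detector training.
    public class NaiveDetokenizer {

        private static final Set<String> NO_SPACE_BEFORE = new HashSet<String>(
            Arrays.asList(".", ",", ";", ":", "!", "?", ")", "''"));
        private static final Set<String> NO_SPACE_AFTER = new HashSet<String>(
            Arrays.asList("(", "``"));

        public static String detokenize(String[] tokens) {
            StringBuilder text = new StringBuilder();
            for (int i = 0; i < tokens.length; i++) {
                if (i > 0 && !NO_SPACE_BEFORE.contains(tokens[i])
                        && !NO_SPACE_AFTER.contains(tokens[i - 1])) {
                    text.append(' ');
                }
                text.append(tokens[i]);
            }
            return text.toString();
        }
    }

For example, detokenize(new String[] { "(", "Hello", ",", "world", "!", ")" })
gives "(Hello, world!)".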

For English we should be able to use some of the CONLL data, yes, and we
should definitely test with other languages too. Leipzig might be suited for
sentence detector training, but not for tokenizer training, since the data
is not tokenized as far as I know.

+1 to use AD and CONLL for testing the tokenizer and sentence detector.

Jörn



