Re: OpenNLP 1.5.3 RC 2 ready for testing

2013-03-22 Thread Jörn Kottmann

Hello,

do we have any public data we can test the sentence detector and 
tokenizer on?

It would be nice to remove the private data test for these at some point.

Jörn

On 03/08/2013 03:11 PM, William Colen wrote:

Hi all,

Our second release candidate is ready for testing. RC1 failed to pass the
initial quality check.

The RC 2 can be downloaded from here:
http://people.apache.org/~colen/releases/opennlp-1.5.3/rc2/

To use it in a Maven build, set the version for opennlp-tools or
opennlp-uima to 1.5.3 and for opennlp-maxent to 3.0.3, and add this URL to
your settings.xml file:
https://repository.apache.org/content/repositories/orgapacheopennlp-005/
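
For reference, a minimal settings.xml profile that adds the staging
repository could look like the following (the profile and repository ids
are arbitrary placeholders):

    <settings>
      <profiles>
        <profile>
          <id>opennlp-staging</id>
          <repositories>
            <repository>
              <id>apache-opennlp-staging</id>
              <url>https://repository.apache.org/content/repositories/orgapacheopennlp-005/</url>
            </repository>
          </repositories>
        </profile>
      </profiles>
      <activeProfiles>
        <activeProfile>opennlp-staging</activeProfile>
      </activeProfiles>
    </settings>

The dependency itself is then declared as usual in the pom.xml:

    <dependency>
      <groupId>org.apache.opennlp</groupId>
      <artifactId>opennlp-tools</artifactId>
      <version>1.5.3</version>
    </dependency>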

The current test plan can be found here:
https://cwiki.apache.org/OPENNLP/testplan153.html

Please sign up for tasks in the test plan.

The release plan can be found here:
https://cwiki.apache.org/OPENNLP/releaseplanandtasks153.html

The RC contains quite a few changes; please refer to the included issue
list for details.

William


Re: OpenNLP 1.5.3 RC 2 ready for testing

2013-03-22 Thread William Colen
We could do it with the Leipzig corpus or CONLL. We can prepare the corpus
by detokenizing it and creating documents from it (a rough sketch of that
step follows below).

If it is OK to do it with another language, the AD corpus has paragraph and
text annotations, as well as the original sentences (not tokenized).
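
For illustration, the detokenization step could be approximated with a
naive joiner like the sketch below. This is not OpenNLP's dictionary-based
detokenizer, just hard-coded punctuation rules to show the idea; real
corpus preparation would need language-specific rules (quotes,
contractions, and so on).

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Naive detokenizer sketch: joins tokens with spaces, but attaches
    // common trailing punctuation directly to the preceding token.
    public class NaiveDetokenizer {

        private static final Set<String> ATTACH_LEFT =
                new HashSet<String>(Arrays.asList(".", ",", ";", ":", "!", "?", ")"));

        public static String detokenize(String[] tokens) {
            StringBuilder sb = new StringBuilder();
            for (String token : tokens) {
                if (sb.length() > 0 && !ATTACH_LEFT.contains(token)) {
                    sb.append(' ');
                }
                sb.append(token);
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            String[] tokens = {"Hello", ",", "world", "!"};
            System.out.println(detokenize(tokens)); // prints: Hello, world!
        }
    }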


On Fri, Mar 22, 2013 at 8:41 AM, Jörn Kottmann wrote:

> Hello,
>
> do we have any public data we can test the sentence detector and tokenizer
> on?
> It would be nice to remove the private data test for these at some point.
>
> Jörn
>
>
> On 03/08/2013 03:11 PM, William Colen wrote:
>
>> Hi all,
>>
>> Our second release candidate is ready for testing. RC1 failed to pass the
>> initial quality check.
>>
>> The RC 2 can be downloaded from here:
>> http://people.apache.org/~colen/releases/opennlp-1.5.3/rc2/
>>
>> To use it in a Maven build, set the version for opennlp-tools or
>> opennlp-uima to 1.5.3 and for opennlp-maxent to 3.0.3, and add this URL
>> to your settings.xml file:
>> https://repository.apache.org/content/repositories/orgapacheopennlp-005/
>>
>> The current test plan can be found here:
>> https://cwiki.apache.org/OPENNLP/testplan153.html
>>
>> Please sign up for tasks in the test plan.
>>
>> The release plan can be found here:
>> https://cwiki.apache.org/OPENNLP/releaseplanandtasks153.html
>>
>> The RC contains quite a few changes; please refer to the included issue
>> list for details.
>>
>> William
>>
>>
>


Re: OpenNLP 1.5.3 RC 2 ready for testing

2013-03-22 Thread Jörn Kottmann

On 03/22/2013 01:05 PM, William Colen wrote:

We could do it with the Leipzig corpus or CONLL. We can prepare the corpus
by detokenizing it and creating documents from it.

If it is OK to do it with another language, the AD corpus has paragraph and
text annotations, as well as the original sentences (not tokenized).


For English we should be able to use some of the CONLL data, and yes, we
should definitely test with other languages too. Leipzig might be suited
for sentence detector training, but not for tokenizer training, since that
data is not tokenized as far as I know.


+1 to use AD and CONLL for testing the tokenizer and sentence detector.
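
For reference, training a sentence detector from one-sentence-per-line
data (e.g. prepared from Leipzig) would only take a few lines. A minimal
sketch against the 1.5-era API follows; the method signatures are from
memory, so they are worth double-checking against the RC's Javadoc:

    import java.io.FileOutputStream;
    import java.io.FileReader;
    import java.io.OutputStream;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    // Sketch: train a sentence detector from a file holding one sentence
    // per line, then serialize the resulting model to disk.
    public class SentDetectTrainer {
        public static void main(String[] args) throws Exception {
            ObjectStream<SentenceSample> samples = new SentenceSampleStream(
                    new PlainTextByLineStream(new FileReader("sentences.txt")));

            // "en" language code, use-token-end flag, no abbreviation
            // dictionary; default training parameters.
            SentenceModel model =
                    SentenceDetectorME.train("en", samples, true, null);
            samples.close();

            OutputStream out = new FileOutputStream("en-sent.bin");
            model.serialize(out);
            out.close();
        }
    }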

Jörn


Re: OpenNLP 1.5.3 RC 2 ready for testing

2013-03-22 Thread Jason Baldridge
You could use the MASC annotations. I have a walkthrough for converting
the data to formats suitable for Chalk (and compatible with OpenNLP) here:
https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial

There is still some work to be done in terms of how the annotations are
extracted, training options, and so on, but it does serve as a benchmark.

BTW, I've just recently finished integrating Liblinear into Nak (which is
an adaptation of the maxent portion of OpenNLP). I'm still rounding some
things out, but so far it is producing more accurate models that are
trained in less time and without using cutoffs. Here's the code:
https://github.com/scalanlp/nak

It is still mostly Java, but the liblinear adaptors are in Scala. I've kept
things such that liblinear retrofits to the interfaces that were in
opennlp.maxent, though given how well it is working, I'll be stripping
those out and going with liblinear for everything in upcoming versions.

Happy to answer any questions or help out with any of the above if it might
be useful!

-Jason

On Fri, Mar 22, 2013 at 8:08 AM, Jörn Kottmann wrote:

> On 03/22/2013 01:05 PM, William Colen wrote:
>
>> We could do it with the Leipzig corpus or CONLL. We can prepare the
>> corpus by detokenizing it and creating documents from it.
>>
>> If it is OK to do it with another language, the AD corpus has paragraph
>> and text annotations, as well as the original sentences (not tokenized).
>>
>
> For English we should be able to use some of the CONLL data, and yes, we
> should definitely test with other languages too. Leipzig might be suited
> for sentence detector training, but not for tokenizer training, since
> that data is not tokenized as far as I know.
>
> +1 to use AD and CONLL for testing the tokenizer and sentence detector.
>
> Jörn
>



-- 
Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge


Re: OpenNLP 1.5.3 RC 2 ready for testing

2013-03-22 Thread Jörn Kottmann
+1 to add format support for MASC directly to OpenNLP; I will open a
JIRA issue for it.

Looks like there is data to train most of our components.

Jörn

On 03/22/2013 03:08 PM, Jason Baldridge wrote:

You could use the MASC annotations. I have a walkthrough for converting
the data to formats suitable for Chalk (and compatible with OpenNLP) here:
https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial

There is still some work to be done in terms of how the annotations are
extracted, training options, and so on, but it does serve as a benchmark.

BTW, I've just recently finished integrating Liblinear into Nak (which is
an adaptation of the maxent portion of OpenNLP). I'm still rounding some
things out, but so far it is producing more accurate models that are
trained in less time and without using cutoffs. Here's the code:
https://github.com/scalanlp/nak

It is still mostly Java, but the liblinear adaptors are in Scala. I've kept
things such that liblinear retrofits to the interfaces that were in
opennlp.maxent, though given how well it is working, I'll be stripping
those out and going with liblinear for everything in upcoming versions.

Happy to answer any questions or help out with any of the above if it might
be useful!

-Jason

On Fri, Mar 22, 2013 at 8:08 AM, Jörn Kottmann wrote:


On 03/22/2013 01:05 PM, William Colen wrote:


We could do it with the Leipzig corpus or CONLL. We can prepare the corpus
by detokenizing it and creating documents from it.

If it is OK to do it with another language, the AD corpus has paragraph and
text annotations, as well as the original sentences (not tokenized).


For English we should be able to use some of the CONLL data, and yes, we
should definitely test with other languages too. Leipzig might be suited
for sentence detector training, but not for tokenizer training, since that
data is not tokenized as far as I know.

+1 to use AD and CONLL for testing the tokenizer and sentence detector.

Jörn


Re: OpenNLP 1.5.3 RC 2 ready for testing

2013-03-22 Thread Jörn Kottmann

The issue is here:
https://issues.apache.org/jira/browse/OPENNLP-565

Jörn

On 03/22/2013 03:17 PM, Jörn Kottmann wrote:
+1 to add format support for MASC directly to OpenNLP; I will open a
JIRA issue for it.

Looks like there is data to train most of our components.

Jörn

On 03/22/2013 03:08 PM, Jason Baldridge wrote:

You could use the MASC annotations. I have a walkthrough for converting
the data to formats suitable for Chalk (and compatible with OpenNLP) here:
https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial

There is still some work to be done in terms of how the annotations are
extracted, training options, and so on, but it does serve as a benchmark.


BTW, I've just recently finished integrating Liblinear into Nak (which is
an adaptation of the maxent portion of OpenNLP). I'm still rounding some
things out, but so far it is producing more accurate models that are
trained in less time and without using cutoffs. Here's the code:
https://github.com/scalanlp/nak

It is still mostly Java, but the liblinear adaptors are in Scala. I've
kept things such that liblinear retrofits to the interfaces that were in
opennlp.maxent, though given how well it is working, I'll be stripping
those out and going with liblinear for everything in upcoming versions.

Happy to answer any questions or help out with any of the above if it
might be useful!

-Jason

On Fri, Mar 22, 2013 at 8:08 AM, Jörn Kottmann wrote:



On 03/22/2013 01:05 PM, William Colen wrote:

We could do it with the Leipzig corpus or CONLL. We can prepare the
corpus by detokenizing it and creating documents from it.

If it is OK to do it with another language, the AD corpus has paragraph
and text annotations, as well as the original sentences (not tokenized).

For English we should be able to use some of the CONLL data, and yes, we
should definitely test with other languages too. Leipzig might be suited
for sentence detector training, but not for tokenizer training, since
that data is not tokenized as far as I know.

+1 to use AD and CONLL for testing the tokenizer and sentence detector.

Jörn


Liblinear (was: OpenNLP 1.5.3 RC 2 ready for testing)

2013-03-22 Thread Jörn Kottmann

Sounds interesting. I hope we will find the time to do that in OpenNLP
after the 1.5.3 release too. We already discussed this, and I think we had
consensus on making the machine learning pluggable and then offering a few
add-ons for existing libraries.
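
To illustrate the idea, pluggability could come down to a small trainer
and classifier pair that add-ons implement. The names below are purely
hypothetical, not an agreed design:

    // Hypothetical sketch of a pluggable machine learning layer; the
    // interface and class names are illustrative only.
    interface Classifier {
        double[] eval(String[] context); // distribution over outcomes
        String getOutcome(int index);    // outcome name for an index
    }

    interface EventTrainer {
        // An add-on (maxent, perceptron, a liblinear bridge, ...) would
        // implement this and be selected via training parameters.
        Classifier train(Iterable<Event> events);
    }

    // One training event: an outcome plus its feature context.
    class Event {
        final String outcome;
        final String[] context;

        Event(String outcome, String[] context) {
            this.outcome = outcome;
            this.context = context;
        }
    }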

Good to know that liblinear works well. As far as I know it's written in
C/C++; did you use the Java port of it, or did you write a JNI interface?

Jörn

On 03/22/2013 03:08 PM, Jason Baldridge wrote:

BTW, I've just recently finished integrating Liblinear into Nak (which is
an adaptation of the maxent portion of OpenNLP). I'm still rounding some
things out, but so far it is producing more accurate models that are
trained in less time and without using cutoffs. Here's the code:
https://github.com/scalanlp/nak

It is still mostly Java, but the liblinear adaptors are in Scala. I've kept
things such that liblinear retrofits to the interfaces that were in
opennlp.maxent, though given how well it is working, I'll be stripping
those out and going with liblinear for everything in upcoming versions.

Happy to answer any questions or help out with any of the above if it might
be useful!




Re: Liblinear (was: OpenNLP 1.5.3 RC 2 ready for testing)

2013-03-22 Thread Jason Baldridge
I used the Java port. I actually pulled it into Nak as nak.liblinear
because the model write/read code did it as text files and I needed access
to the Model member fields in order to do the serialization the way I
wanted. Otherwise it remains as is. With a little bit of adaptation, you
could provide a Java wrapper in OpenNLP that follows the same pattern as
my Scala stuff. You'd just need to make it implement AbstractModel, which
shouldn't be too hard. (I have it implement LinearModel, which is just a
slight modification of MaxentModel, and I changed all uses of
AbstractModel to LinearModel in Chalk [the opennlp.tools portion].) -j
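
As a rough sketch of what such a wrapper could look like in Java (using
the de.bwaldvogel.liblinear port; the feature-index and outcome mappings
here are hypothetical glue, not Nak's actual code):

    import java.util.Map;
    import java.util.SortedSet;
    import java.util.TreeSet;

    import de.bwaldvogel.liblinear.Feature;
    import de.bwaldvogel.liblinear.FeatureNode;
    import de.bwaldvogel.liblinear.Linear;
    import de.bwaldvogel.liblinear.Model;

    // Wraps a liblinear model behind a MaxentModel-style eval(). Assumes
    // the model was trained with a probabilistic solver (e.g. logistic
    // regression), since predictProbability() requires one.
    public class LiblinearWrapper {

        private final Model model;
        private final Map<String, Integer> featureIndex; // predicate -> 1-based id
        private final String[] outcomes; // outcome name per label position

        public LiblinearWrapper(Model model, Map<String, Integer> featureIndex,
                String[] outcomes) {
            this.model = model;
            this.featureIndex = featureIndex;
            this.outcomes = outcomes;
        }

        // Same contract as opennlp.model.MaxentModel.eval(String[]): a
        // probability distribution over all outcomes for one context.
        public double[] eval(String[] context) {
            // liblinear wants sorted, 1-based feature ids; unknown
            // predicates are simply dropped.
            SortedSet<Integer> ids = new TreeSet<Integer>();
            for (String predicate : context) {
                Integer id = featureIndex.get(predicate);
                if (id != null) {
                    ids.add(id);
                }
            }
            Feature[] x = new Feature[ids.size()];
            int i = 0;
            for (int id : ids) {
                x[i++] = new FeatureNode(id, 1.0);
            }
            double[] probs = new double[model.getNrClass()];
            Linear.predictProbability(model, x, probs);
            return probs;
        }

        public String getBestOutcome(double[] probs) {
            int best = 0;
            for (int j = 1; j < probs.length; j++) {
                if (probs[j] > probs[best]) {
                    best = j;
                }
            }
            return outcomes[best];
        }
    }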

On Fri, Mar 22, 2013 at 9:32 AM, Jörn Kottmann wrote:

> Sounds interesting. I hope we will find the time to do that in OpenNLP
> after the 1.5.3 release too. We already discussed this, and I think we
> had consensus on making the machine learning pluggable and then offering
> a few add-ons for existing libraries.
>
> Good to know that liblinear works well. As far as I know it's written in
> C/C++; did you use the Java port of it, or did you write a JNI interface?
>
> Jörn
>
> On 03/22/2013 03:08 PM, Jason Baldridge wrote:
>
>> BTW, I've just recently finished integrating Liblinear into Nak (which is
>> an adaptation of the maxent portion of OpenNLP). I'm still rounding some
>> things out, but so far it is producing more accurate models that are
>> trained in less time and without using cutoffs. Here's the code:
>> https://github.com/scalanlp/nak
>>
>> It is still mostly Java, but the liblinear adaptors are in Scala. I've
>> kept
>> things such that liblinear retrofits to the interfaces that were in
>> opennlp.maxent, though given how well it is working, I'll be stripping
>> those out and going with liblinear for everything in upcoming versions.
>>
>> Happy to answer any questions or help out with any of the above if it
>> might
>> be useful!
>>
>
>


-- 
Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge