Re: Re: English 300k sentences Leipzig Corpus for test

2013-03-14 Thread William Colen
Hi,

I could not find a way to convert from Leipzig to any format other than the
DocCat sample. Is it possible to convert from Leipzig to SentenceSample using
the OpenNLP tools?

Thank you,
William


On Thu, Mar 14, 2013 at 9:51 AM, Jörn Kottmann kottm...@gmail.com wrote:




 -------- Original Message --------
 Subject: Re: English 300k sentences Leipzig Corpus for test
 Date:    Thu, 14 Mar 2013 09:48:21 -0300
 From:    William Colen william.co...@gmail.com
 To:      Jörn Kottmann kottm...@gmail.com



 Yes, you can forward.

 It is not clear to me how to convert it. I could only find converters from
 Leipzig to DocCat.


 On Thu, Mar 14, 2013 at 6:09 AM, Jörn Kottmann kottm...@gmail.com wrote:

  Do you mind if I forward this to the dev list?

 Yes, you need to convert the data into input data for the tools. The idea
 is that we process the data with 1.5.2 and 1.5.3 and see if the output
 is still identical; if it's not identical, it's either a change in our code
 or a bug.
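
The comparison described above could be sketched roughly as follows. This is only an illustration, not part of the OpenNLP test plan: it assumes the output of each version has already been written to a plain-text file, and the function names and the `limit` parameter are hypothetical.

```python
from itertools import zip_longest


def first_differences(old_lines, new_lines, limit=5):
    """Return up to `limit` (line_number, old, new) tuples where the outputs differ.

    A missing line (one output is shorter than the other) shows up as None.
    """
    diffs = []
    pairs = zip_longest(old_lines, new_lines)
    for number, (old, new) in enumerate(pairs, start=1):
        if old != new:
            diffs.append((number, old, new))
            if len(diffs) >= limit:
                break
    return diffs


def compare_files(old_path, new_path, limit=5):
    """Compare two output files line by line (paths are supplied by the caller)."""
    with open(old_path, encoding="utf-8") as old_file, \
            open(new_path, encoding="utf-8") as new_file:
        return first_differences(old_file, new_file, limit)
```

An empty result means the two outputs are identical; otherwise the returned tuples point at the first lines worth inspecting by hand.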

 It doesn't really matter which file you download as long as it has enough
 sentences; it would be nice if you could note in the test plan which one
 you used.

 Hopefully I will have some time over the weekend to do the tests on the
 private data I have.

 Jörn


 On 03/13/2013 11:38 PM, William Colen wrote:

  Hi, Jörn,

 I would like to start testing with the Leipzig Corpus. Do you know the
 steps to do it?

 I downloaded the file named eng_news_2010_300K-text.tar.gz,
 and now I would use the converter to extract documents from it.

 After that, I would try to use the output of one module as input to the
 next.
 Is that correct?

 Thank you,
 William









Re: English 300k sentences Leipzig Corpus for test

2013-03-14 Thread Jörn Kottmann
If I remember correctly, the file already has one sentence per line. I used
the tokenizer to tokenize it and the POS Tagger to POS-tag it. After you have
done that, you have input files for all the tools.

Maybe you need to remove the sentence id at the beginning of each line, e.g.
with sed. Anyway, you can also leave it there; it doesn't really matter for
this test.
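
As an alternative to sed, removing the sentence id could be sketched in Python. This assumes each line starts with a numeric id followed by a tab and then the sentence; that layout is an assumption about the downloaded file, and the function names are hypothetical.

```python
import re

# Assumption: a line looks like "<numeric id><TAB><sentence>".
_ID_PREFIX = re.compile(r"^\d+\t")


def strip_sentence_id(line):
    """Remove a leading numeric id and tab; leave other lines unchanged."""
    return _ID_PREFIX.sub("", line, count=1)


def strip_ids(lines):
    """Lazily strip the id prefix from every line of an iterable."""
    for line in lines:
        yield strip_sentence_id(line)
```

Because the pattern requires a tab right after the digits, a sentence that merely begins with a number is left untouched when no id is present.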

Jörn


Re: OpenNLP 1.5.3 RC 2 ready for testing

2013-03-14 Thread James Kosin

Hi William,

No, I think it will be fine. The problem only lies in data where back-to-back
names are tagged in the sentences. Before the fix, they would be tagged with
the wrong type: for example, both could get the same type, such as person,
instead of different types, one person and the other perhaps miscellaneous.

The combined Name Finder models that contain all the tags were affected most,
since the likelihood of back-to-back tags is higher there.
In the English models, 3 sentences that previously had improper tags now have
the correct tags with the fixes. This improved the scores a bit.

It should produce identical models, since the problem was with the output
tagging and not with the training of the models.


James

On 3/14/2013 11:00 PM, William Colen wrote:

Hi, James,

Thank you for the warning. It didn't affect the test with the Leipzig
corpus: the outputs from 1.5.2 and 1.5.3 are identical. Do you think we
should still check the output manually?

Thank you,
William


On Thu, Mar 14, 2013 at 12:09 AM, James Kosin james.ko...@gmail.com wrote:


Hi all,

Note that we will have some discrepancies in the model performance for
some of the tests in the NameFinder models due to OPENNLP-417, which fixes
the back-to-back name tags.

It should really be limited to the combined name tags, but it could also
affect others.

James



On 3/8/2013 9:11 AM, William Colen wrote:


Hi all,

Our second release candidate is ready for testing. RC1 failed to pass the
initial quality check.

The RC 2 can be downloaded from here:
http://people.apache.org/~colen/releases/opennlp-1.5.3/rc2/

To use it in a Maven build, set the version for opennlp-tools or
opennlp-uima to 1.5.3 and for opennlp-maxent to 3.0.3, and add this URL
to your settings.xml file:
https://repository.apache.org/content/repositories/orgapacheopennlp-005/

The current test plan can be found here:
https://cwiki.apache.org/OPENNLP/testplan153.html

Please sign up for tasks in the test plan.

The release plan can be found here:
https://cwiki.apache.org/OPENNLP/releaseplanandtasks153.html

The RC contains quite a few changes; please refer to the included issue
list for details.

William