If I remember correctly, the file already has one sentence per line. I used
the tokenizer to tokenize it, and the POS tagger to POS tag it. Once you have
done that, you have input files for all the tools.
You may need to remove the sentence id at the beginning, e.g. with sed.
You can also just leave it there; it doesn't really matter for this test.
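As a sketch of the id-stripping step Jörn mentions (assuming the Leipzig tab-separated layout `<id>\t<sentence>`; the filename in the usage comment is an assumption based on the usual Leipzig naming):

```shell
# Leipzig sentence files are tab-separated: "<id>\t<sentence>".
# strip_ids removes the leading numeric id and the tab, keeping only the sentence.
strip_ids() {
  sed $'s/^[0-9]*\t//' "$1"
}

# Usage (adjust the filename to your download):
#   strip_ids eng_news_2010_300K-sentences.txt > sentences.txt
```

The stripped file could then be fed through the tools in sequence, e.g. `bin/opennlp TokenizerME en-token.bin < sentences.txt | bin/opennlp POSTagger en-pos-maxent.bin > tagged.txt` (the model file names here are assumptions; use whatever models you test with).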
Jörn
On 03/14/2013 03:45 PM, William Colen wrote:
Hi,
I could not find a way to convert from Leipzig to any format other than the
DocCat sample. Is it possible to convert from Leipzig to SentenceSample using
the OpenNLP tools?
Thank you,
William
On Thu, Mar 14, 2013 at 9:51 AM, Jörn Kottmann kottm...@gmail.com wrote:
Original Message
Subject:Re: English 300k sentences Leipzig Corpus for test
Date: Thu, 14 Mar 2013 09:48:21 -0300
From: William Colen william.co...@gmail.com
To: Jörn Kottmann kottm...@gmail.com
Yes, you can forward.
It is not clear to me how to convert it. I could only find converters from
Leipzig to DocCat.
On Thu, Mar 14, 2013 at 6:09 AM, Jörn Kottmann kottm...@gmail.com wrote:
Do you mind if I forward this to the dev list?
Yes, you need to convert the data into input data. The idea
is that we process the data with 1.5.2 and 1.5.3 and check whether the output
is still identical. If it's not identical, it's either a change in our code
or a bug.
It doesn't really matter which file you download as long as it has enough
sentences; it would be nice if you could note in the test plan which one you used.
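The comparison described above could be scripted roughly like this (a sketch; the installation paths and model name in the usage comment are assumptions, adjust them to your local setup):

```shell
# compare_outputs: report whether two tool runs produced byte-identical output.
compare_outputs() {
  if diff -q "$1" "$2" > /dev/null; then
    echo "identical"
  else
    echo "outputs differ: either a change in our code or a bug"
  fi
}

# Hypothetical usage: run the same input through both versions first, e.g.
#   apache-opennlp-1.5.2/bin/opennlp TokenizerME en-token.bin < sentences.txt > out-1.5.2.txt
#   apache-opennlp-1.5.3/bin/opennlp TokenizerME en-token.bin < sentences.txt > out-1.5.3.txt
#   compare_outputs out-1.5.2.txt out-1.5.3.txt
```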
Hopefully I will have some time over the weekend to run the tests on the
private data I have.
Jörn
On 03/13/2013 11:38 PM, William Colen wrote:
Hi, Jörn,
I would like to start testing with the Leipzig Corpus. Do you know the
steps to do it?
I downloaded the file named eng_news_2010_300K-text.tar.gz,
and now I would use the converter to extract documents from it.
After that, I would try to use the output of each module as input to the
next.
Is that correct?
Thank you,
William