On 02/08/2013 06:46 PM, Surendra wrote:
Hi, I am a postgraduate student in computer science. I am working on sentence boundary detection for a local Indian language. Could you please provide the format of the training file and a sample file like en-sent.train, which would be helpful for me in creating a model?
The training data used to build the en-sent.bin sentence detector model is not open source. The easiest way to get training data is to take a corpus and extract its sentences for training; there are a couple of freely or cheaply available corpora which could be used. Some of them are already supported by OpenNLP, so have a look at the manual.
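As far as I know, the sentence detector training format described in the manual is plain text with one sentence per line, and an empty line marks a document boundary. Below is a minimal sketch of producing such a file from sentences you have already extracted from a corpus. The function names and the output file name are hypothetical, and the script is only an illustration, not part of OpenNLP; please check the format against the manual for your version.

```python
from pathlib import Path


def format_training_data(documents):
    """Return training text: one sentence per line, blank line between documents.

    `documents` is a list of documents, each a list of sentence strings.
    The layout is an assumption based on the OpenNLP manual.
    """
    blocks = []
    for sentences in documents:
        # Collapse internal whitespace so every sentence stays on a single line.
        cleaned = [" ".join(s.split()) for s in sentences]
        # Drop sentences that are empty after cleaning.
        cleaned = [s for s in cleaned if s]
        if cleaned:
            blocks.append("\n".join(cleaned))
    # Separate documents with one empty line and end with a newline.
    return "\n\n".join(blocks) + "\n"


def write_training_file(documents, path):
    """Write the formatted training text to `path` as UTF-8."""
    Path(path).write_text(format_training_data(documents), encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical sample data; replace with sentences from your own corpus.
    docs = [
        ["This is the first sentence.", "This is the second one."],
        ["A sentence from another document."],
    ]
    write_training_file(docs, "my-sent.train")
```

Writing the file as UTF-8 matters for Indian-language text; the same encoding would then have to be given to the trainer.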
Jörn
