Ah, I was wondering about this, but I misinterpreted what is written on the wiki page! Thanks Jim.
I was trying to find the original opennlp corpora used for training, but could not get anything apart from the binary model... Anyone has any idea on whether it is possible to get this and how?

Thanks
Danica

On 3 September 2013 17:16, Jim - FooBar(); <[email protected]> wrote:

> You can't do that...there is no way to get the original corpus from the
> model! You seem to think that the trained model is just a bunch of
> sentences whereas the model is a bunch of binary features. If you want to
> improve the already trained sentence-detector you need to get your hands on
> the original corpus Jorn and the rest of the dev-team used...
>
> make sure you understand what the models consist of...
>
> HTH,
> Jim
>
> On 03/09/13 16:27, Danica Damljanovic wrote:
>
>> Jork: Yes, but that's if I want to train the model using input.txt and then
>> read it again into output.txt.
>>
>> What I am trying to do is read out the corpus of sentences that are used to
>> create en-sent.bin,
>> so that I can improve it. Any hints on how can I do that?
>>
>> In other words, I am trying to read
>> http://opennlp.sourceforge.net/models-1.5/en-sent.bin into en-sent.train
>> txt file that I can open and update (and then retrain the sentence
>> splitter).
>>
>> On 3 September 2013 16:19, Jörn Kottmann <[email protected]> wrote:
>>
>>> On 09/03/2013 04:57 PM, Danica Damljanovic wrote:
>>>
>>>> If I may ask one more question about retraining the sentence detector. I
>>>> created a corpus that I want to use for training, but I would rather
>>>> improve on the existing sentence splitter, so this is what I did to get the
>>>> initial corpus:
>>>>
>>>> opennlp SentenceDetector en-sent.bin > output.txt
>>>>
>>>> However, although I gave the process 4GB of memory, it seems to be running
>>>> for a while, and the only output I see is:
>>>>
>>>> Loading Sentence Detector model ... done (0.031s)
>>>>
>>>> What I expect is to see the list of sentences used for training, so that I
>>>> can merge output.txt with my corpus, and retrain the parser. But after more
>>>> than one hour, it still did not start to write into output.txt.
>>>
>>> You need to provide some input text to the Sentence Detector, otherwise it
>>> just waits forever for it,
>>> have a look at the manual for a sample.
>>>
>>> Jörn
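
Since the thread concludes that the training sentences cannot be read back out of en-sent.bin, retraining has to start from a corpus you already hold. Below is a minimal sketch of preparing such a corpus, assuming the plain-text training format described in the OpenNLP manual (one sentence per line, an empty line between documents). The function name `write_sentence_training_file` and the file name `my-sent.train` are hypothetical and not part of OpenNLP.

```python
from pathlib import Path


def write_sentence_training_file(documents, path, encoding="utf-8"):
    """Write documents (each a list of sentence strings) one sentence per line.

    A blank line is written after each non-empty document to mark a
    document boundary. Returns the number of sentences written.
    """
    lines = []
    count = 0
    for doc in documents:
        wrote_any = False
        for sentence in doc:
            # Collapse internal whitespace (including newlines) so that
            # every sentence occupies exactly one line.
            cleaned = " ".join(sentence.split())
            if cleaned:
                lines.append(cleaned)
                count += 1
                wrote_any = True
        if wrote_any:
            lines.append("")  # blank line = document boundary
    Path(path).write_text("\n".join(lines), encoding=encoding)
    return count


if __name__ == "__main__":
    # Hypothetical corpus: two short documents.
    docs = [
        ["This is the first sentence.", "Here is a second one."],
        ["A new document starts here."],
    ]
    n = write_sentence_training_file(docs, "my-sent.train")
    print(f"wrote {n} sentences")
```

The resulting file could then be passed to the command-line trainer, which in the 1.5 manual looks like `opennlp SentenceDetectorTrainer -model my-sent.bin -lang en -data my-sent.train -encoding UTF-8`; check the flags against the manual for your OpenNLP version. Note that this produces a new model trained only on your data; it does not extend en-sent.bin.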
