Hi William,

I am currently using opennlp-1.5.2 and am trying to use it as an API, i.e. not modifying its code but writing my own code around it. However, what I described below (with the SDEventStream) amounts to the same thing you are describing: I am changing the set of EOS characters.

I am just wondering why adding ":" as an EOS character makes the results worse (dropping from ~80F to 45F in sentence splitting, even though ":" is always a sentence boundary symbol in my data!).

It looks like I need to debug a little more to see what's happening in the DefaultSDContextGenerator.


Best
Katrin


On 02/09/2012 02:11 AM, [email protected] wrote:
On Wed, Feb 8, 2012 at 3:16 PM, Katrin Tomanek <[email protected]> wrote:

Hi Jörn,

Good, I'll have a look at the dev list tomorrow.

But still a question on the EOS symbols:

For some testing, I just overwrote the SentenceDetectorME.train(...)
method, where I basically changed the way the EventStream is set up to:

EventStream eventStream = new SDEventStream(sampleStreamTrain,
        new DefaultSDContextGenerator(new char[]{'.', '!', '?', ':'}),
        new DefaultEndOfSentenceScanner(new char[]{'.', '!', '?', ':'}));


-->  I thought that by doing so I would have added ":" as a possible sentence
boundary. However, this did not really help; the model actually gets worse.
Maybe I have still misunderstood something about how the EOS symbols work?


Maybe your issue is that you also need to set the EOS symbols in other places.
The SentenceDetectorME gets them from the Factory class. Maybe you will need
to create a subclass of it.

http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/Factory.java?view=markup
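
To illustrate why the EOS symbols may need to be set in more than one place, here is a minimal, self-contained Java sketch. It does not use OpenNLP: the class EosScanDemo and the method candidatePositions are hypothetical names, and the scanner is only a simplified stand-in for what an end-of-sentence scanner does. Under that assumption, it shows that a component built with the default set never proposes ":" as a candidate boundary, regardless of which set was used when the training events were generated.

```java
import java.util.ArrayList;
import java.util.List;

public class EosScanDemo {

    // Hypothetical stand-in for an end-of-sentence scanner (not the OpenNLP class):
    // returns the index of every character in text that is in the given EOS set.
    public static List<Integer> candidatePositions(String text, char[] eosChars) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (char eos : eosChars) {
                if (text.charAt(i) == eos) {
                    positions.add(i);
                    break; // one match is enough for this position
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "Note: see below. Done!";

        // EOS set used when generating the training events (includes ':').
        char[] trainingSet = {'.', '!', '?', ':'};
        // EOS set a detector would see if it were built from defaults (assumption).
        char[] defaultSet = {'.', '!', '?'};

        // With ':' in the set, index 4 is a candidate; with the default set it is not.
        System.out.println("training candidates: " + candidatePositions(text, trainingSet));
        System.out.println("default candidates:  " + candidatePositions(text, defaultSet));
    }
}
```

If the two sets differ like this between training and detection, the model is trained on candidates that the detector never evaluates, so it is worth checking that both sides are configured with the same characters.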

We are planning to make this process easier for the next release.

William



--
Dr. Katrin Tomanek
Averbis GmbH
Tennenbacher Strasse 11
D-79106 Freiburg

Fon: +49 (0) 761 - 203 97696
Fax: +49 (0) 761 - 203 97694
E-Mail: [email protected]

Geschäftsführer: Dr. med. Philipp Daumke, Dr. Kornél Markó
Sitz der Gesellschaft: Freiburg i. Br.
AG Freiburg i. Br., HRB 701080
