Re: OpenNLP Sentence Detector: EOS Characters

Joern Kottmann Thu, 09 Feb 2012 01:20:23 -0800

Yes, we should store the class name of the Factory in the model,
because storing the class itself there is a security problem.
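
To make the class-name idea concrete, here is a minimal sketch in plain Java.
It is not OpenNLP code: EosFactory, ColonEosFactory and the "factory" property
key are hypothetical names invented for this illustration. It assumes the
named class is on the classpath and has a no-arg constructor.

```java
import java.util.Properties;

// Hypothetical stand-in for a sentence detector factory; not an OpenNLP class.
interface EosFactory {
    char[] eosCharacters();
}

// Hypothetical custom factory that also treats ':' as an EOS candidate.
class ColonEosFactory implements EosFactory {
    public char[] eosCharacters() {
        return new char[] {'.', '!', '?', ':'};
    }
}

class FactoryNameDemo {
    // Hypothetical property key; a real model manifest may use another name.
    static final String FACTORY_KEY = "factory";

    // Training side: record only the class name, never a serialized object.
    static Properties describe(EosFactory factory) {
        Properties manifest = new Properties();
        manifest.setProperty(FACTORY_KEY, factory.getClass().getName());
        return manifest;
    }

    // Loading side: look the class up by name and call its no-arg constructor.
    static EosFactory restore(Properties manifest) {
        String className = manifest.getProperty(FACTORY_KEY);
        try {
            // asSubclass rejects any class that is not an EosFactory.
            Class<? extends EosFactory> type =
                Class.forName(className).asSubclass(EosFactory.class);
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create factory " + className, e);
        }
    }

    public static void main(String[] args) {
        EosFactory restored = restore(describe(new ColonEosFactory()));
        System.out.println(new String(restored.eosCharacters()));
    }
}
```

With this approach the model only carries a string, so a custom factory still
has to be shipped in a jar on the classpath.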


Anyway, in my opinion you don't want to add an extra jar file to the
classpath just for a custom EOS character configuration.

So we should do both.
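
For the simpler case, a rough sketch of keeping the EOS characters in a
properties entry could look like the following. Again this is plain Java and
not the OpenNLP API: the "eosCharacters" key and the helper names are
assumptions made for this example, and ".!?" is used as the assumed default
set. The candidate scan shows why ":" can only ever be classified if it is in
the set that is used at detection time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class EosPropertyDemo {
    // Hypothetical property key, chosen for this sketch only.
    static final String EOS_KEY = "eosCharacters";

    // Training side: write the characters used for training into the properties.
    static void storeEos(Properties props, char[] eosChars) {
        props.setProperty(EOS_KEY, new String(eosChars));
    }

    // Detection side: read them back, falling back to the assumed defaults.
    static char[] loadEos(Properties props) {
        return props.getProperty(EOS_KEY, ".!?").toCharArray();
    }

    // Collect the offsets of all candidate end-of-sentence characters.
    static List<Integer> candidatePositions(String text, char[] eosChars) {
        String eos = new String(eosChars);
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (eos.indexOf(text.charAt(i)) >= 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        storeEos(props, new char[] {'.', '!', '?', ':'});
        String text = "Note: this works. Done!";
        // With the stored set, the ':' at offset 4 is a candidate.
        System.out.println(candidatePositions(text, loadEos(props)));
        // Without the property, only '.' and '!' are candidates.
        System.out.println(candidatePositions(text, loadEos(new Properties())));
    }
}
```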

Jörn

On Thu, Feb 9, 2012 at 10:15 AM, Katrin Tomanek
<[email protected]> wrote:

> Hi Jörn,
>
> but I think one should even go a step further and store the factory in the
> model.
>
> At the moment, when instantiating a new Sentence Detector this happens:
>
>  public SentenceDetectorME(SentenceModel model) {
>    this(model, new Factory());
>  }
>
> This means that the factory is not stored in the model. Thus, if you use a
> specific factory (because, e.g., you want a special way to generate the
> features/context), you currently have no way to store this in the model.
>
> This could become a problem if you trained a model with one kind of
> context generator and apply this model to events which come from another
> context generator. Since the features are different, applying the model
> would not make much sense...
>
> Best
> Katrin
>
>
> On 02/09/2012 10:10 AM, Joern Kottmann wrote:
>
>> We already have a properties file inside the model. It wouldn't be a
>> difficult fix to add a property to it which stores the EOS characters
>> which have been used during training.
>>
>> Jörn
>>
>> On Thu, Feb 9, 2012 at 10:06 AM, Katrin Tomanek
>> <[email protected]> wrote:
>>
>>  Hi Jörn,
>>>
>>> thanks for this explanation.
>>> What you are saying means that the context generator and the EOS scanner
>>> are not stored in the model, right?
>>>
>>> I had assumed they were... other ML toolkits, e.g. Mallet (which uses
>>> the "Pipe" logic where OpenNLP uses event streams), actually do this.
>>>
>>> Maybe this would also be a good improvement...
>>>
>>> Best
>>> Katrin
>>>
>>> On 02/09/2012 09:56 AM, Joern Kottmann wrote:
>>>
>>>  When you only do it during training then it will not consider ":" as
>>>> a possible split during detection. That explains your drop in accuracy.
>>>>
>>>> It looks like it is not possible to modify the EOS characters properly
>>>> with the current version. I suggest that you check out the source code
>>>> and then change the defaultEosCharacters array in
>>>> opennlp.tools.sentdetect.Factory.
>>>> With that you are able to do your test and get it working for now.
>>>>
>>>> Anyway we should have an easy way to specify the EOS characters without
>>>> implementing a custom Factory class.
>>>>
>>>> Please open a jira to improve this.
>>>>
>>>> Jörn
>>>>
>>>> On Thu, Feb 9, 2012 at 9:21 AM, Katrin Tomanek
>>>> <[email protected]> wrote:
>>>>
>>>>  Hi Jörn,
>>>>
>>>>>
>>>>> I only modified the training process.
>>>>>
>>>>> However, when I check the predictions it turns out that the model never
>>>>> learns to split at ":" positions.
>>>>>
>>>>> Shouldn't it be enough to modify the DefaultSDContextGenerator and the
>>>>> DefaultEndOfSentenceScanner so that these know about ":" as an EOS
>>>>> character?
>>>>> Or are there other places where ":" should be added?
>>>>>
>>>>> Best
>>>>> Katrin
>>>>>
>>>>>
>>>>>
>>>>> On 02/09/2012 09:18 AM, Joern Kottmann wrote:
>>>>>
>>>>>> Did you modify the evaluation as well? If you just do it during
>>>>>> training, the evaluator will not be able to consider ":" as an EOS
>>>>>> character.
>>>>>>
>>>>>> To me it sounds like it fails to split on the ":" somewhere.
>>>>>>
>>>>>> The sentence detector uses a maxent model to classify every EOS
>>>>>> character as either SPLIT or NO_SPLIT.
>>>>>>
>>>>>> Jörn
>>>>>>
>>>>>> On Thu, Feb 9, 2012 at 8:59 AM, Katrin Tomanek
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>  Hi Willian,
>>>>>>
>>>>>>
>>>>>>> I am currently using opennlp-1.5.2 and try to use it as an API, i.e.
>>>>>>> not to modify its code but to write my own code around it. However,
>>>>>>> what I described below (with the SDEventStream) results in the same as
>>>>>>> you are describing: I am changing the set of EOS characters.
>>>>>>>
>>>>>>> I am just wondering why adding ":" as an EOS character decreases the
>>>>>>> results (dropping from ~80F to 45F in sentence splitting, and ":" is
>>>>>>> always a sentence boundary symbol in my data!)
>>>>>>>
>>>>>>> Looks like I need to debug a little bit more what's happening in the
>>>>>>> DefaultSDContextGenerator.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>  --
>>>>> Dr. Katrin Tomanek
>>>>> Averbis GmbH
>>>>> Tennenbacher Strasse 11
>>>>> D-79106 Freiburg
>>>>>
>>>>> Fon: +49 (0) 761 - 203 97696
>>>>> Fax: +49 (0) 761 - 203 97694
>>>>> E-Mail: [email protected]
>>>>>
>>>>> Geschäftsführer: Dr. med. Philipp Daumke, Dr. Kornél Markó
>>>>> Sitz der Gesellschaft: Freiburg i. Br.
>>>>> AG Freiburg i. Br., HRB 701080
>>>>>
>>>>>
>>>>>
>>>>
>>
>
