Yes, we should store the class name of the Factory in the model, because storing the class itself there is a security problem.
Anyway, in my opinion you don't want to add an extra jar file to the classpath
just for a custom EOS character configuration. So we should do both.

Jörn

On Thu, Feb 9, 2012 at 10:15 AM, Katrin Tomanek <[email protected]> wrote:

> Hi Jörn,
>
> but I think one should even go a step further and store the factory in the
> model.
>
> At the moment, when instantiating a new Sentence Detector this happens:
>
> public SentenceDetectorME(SentenceModel model) {
>   this(model, new Factory());
> }
>
> This means that the factory is not stored in the model. Thus, if you use a
> specific factory (because, e.g., you want a special way to generate the
> features/context), you currently have no way to store this in the model.
>
> This could become a problem if you trained a model with one kind of
> context generator and apply this model to events which come from another
> context generator. Since the features are different, applying the model
> would not make much sense...
>
> Best
> Katrin
>
> On 02/09/2012 10:10 AM, Joern Kottmann wrote:
>
>> We already have a properties file inside the model. It wouldn't be a
>> difficult fix to add a property to it which stores the EOS characters
>> which have been used during training.
>>
>> Jörn
>>
>> On Thu, Feb 9, 2012 at 10:06 AM, Katrin Tomanek <[email protected]> wrote:
>>
>>> Hi Jörn,
>>>
>>> thanks for this explanation.
>>> What you are saying means that the context generator and the EOS scanner
>>> are not stored in the model, right?
>>>
>>> I had assumed this... other ML toolkits, such as Mallet (which uses
>>> the "Pipe" logic where OpenNLP uses event streams), actually do this.
>>>
>>> Maybe this would also be a good improvement...
>>>
>>> Best
>>> Katrin
>>>
>>> On 02/09/2012 09:56 AM, Joern Kottmann wrote:
>>>
>>>> When you only do it during training, it will not consider ":" as
>>>> a possible split during detection. That explains your drop in accuracy.
>>>>
>>>> It looks like it is not possible to modify the EOS characters properly
>>>> with the current version. I suggest that you check out the source code
>>>> and then change the defaultEosCharacters array in
>>>> opennlp.tools.sentdetect.Factory.
>>>> With that you are able to do your test and get it working for now.
>>>>
>>>> Anyway, we should have an easy way to specify the EOS characters without
>>>> implementing a custom Factory class.
>>>>
>>>> Please open a jira to improve this.
>>>>
>>>> Jörn
>>>>
>>>> On Thu, Feb 9, 2012 at 9:21 AM, Katrin Tomanek <[email protected]> wrote:
>>>>
>>>>> Hi Jörn,
>>>>>
>>>>> I only modified the training process.
>>>>>
>>>>> However, when I check the predictions it turns out that the model never
>>>>> learns to split at ":" positions.
>>>>>
>>>>> Shouldn't it be enough to modify the DefaultSDContextGenerator and the
>>>>> DefaultEndOfSentenceScanner so that these know about ":" as an EOS?
>>>>> Or are there other places where ":" should be added?
>>>>>
>>>>> Best
>>>>> Katrin
>>>>>
>>>>> On 02/09/2012 09:18 AM, Joern Kottmann wrote:
>>>>>
>>>>>> Did you modify the evaluation as well? If you just do it during
>>>>>> training, the evaluator will not be able to consider ":" as an EOS
>>>>>> character.
>>>>>>
>>>>>> To me it sounds like it fails to split on the ":" in some place.
>>>>>>
>>>>>> The sentence detector uses a maxent model to classify every EOS
>>>>>> character as either a SPLIT or NO_SPLIT.
>>>>>>
>>>>>> Jörn
>>>>>>
>>>>>> On Thu, Feb 9, 2012 at 8:59 AM, Katrin Tomanek <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Willian,
>>>>>>>
>>>>>>> I am currently using opennlp-1.5.2 and try to use it as an API, i.e.
>>>>>>> not to modify this code but to write my own code around it.
>>>>>>> However, what I described below (with the SDEventStream) results in
>>>>>>> the same as you are describing: I am changing the set of EOS
>>>>>>> characters.
>>>>>>>
>>>>>>> I am just wondering why adding ":" as an EOS character decreases the
>>>>>>> results (dropping from ~80F to 45F in sentence splitting, and ":" is
>>>>>>> always a sentence boundary symbol in my data!)
>>>>>>>
>>>>>>> Looks like I need to debug a little bit more what's happening in the
>>>>>>> DefaultSDContextGenerator.
>>>>>
>>>>> --
>>>>> Dr. Katrin Tomanek
>>>>> Averbis GmbH
>>>>> Tennenbacher Strasse 11
>>>>> D-79106 Freiburg
>>>>>
>>>>> Phone: +49 (0) 761 - 203 97696
>>>>> Fax: +49 (0) 761 - 203 97694
>>>>> E-Mail: [email protected]
>>>>>
>>>>> Managing directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
>>>>> Registered office: Freiburg i. Br.
>>>>> AG Freiburg i. Br., HRB 701080
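
To make the properties-file idea from this thread concrete, here is a minimal sketch in plain Java. It does not use the OpenNLP API: the class EosConfigSketch, the property key "sentdetect.eosCharacters", and the default character set are all assumptions made up for illustration. It only shows why the EOS characters used in training have to be available again at detection time: a position that the scanner never proposes can never be classified as SPLIT.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative only: none of these names come from OpenNLP.
final class EosConfigSketch {

    // Hypothetical property key, not an actual OpenNLP model property.
    static final String EOS_KEY = "sentdetect.eosCharacters";

    // Default EOS characters assumed for this sketch.
    static final char[] DEFAULT_EOS = {'.', '!', '?'};

    /** Training side: record the EOS characters alongside the model. */
    static Properties storeEosChars(char[] eosChars) {
        Properties props = new Properties();
        props.setProperty(EOS_KEY, new String(eosChars));
        return props;
    }

    /** Detection side: read the characters back, falling back to the defaults. */
    static char[] loadEosChars(Properties props) {
        String stored = props.getProperty(EOS_KEY);
        if (stored == null || stored.isEmpty()) {
            return DEFAULT_EOS.clone();
        }
        return stored.toCharArray();
    }

    /** Returns the offsets that a classifier would label SPLIT or NO_SPLIT. */
    static List<Integer> scan(CharSequence text, char[] eosChars) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (char eos : eosChars) {
                if (text.charAt(i) == eos) {
                    positions.add(i);
                    break;
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "Diagnosis: none. Next";

        // Trained with ":" and the setting travels with the model.
        Properties trained = storeEosChars(new char[] {'.', '!', '?', ':'});
        System.out.println(scan(text, loadEosChars(trained)));          // [9, 15]

        // No stored setting: detection falls back to the defaults and
        // never even considers the ":" at offset 9.
        System.out.println(scan(text, loadEosChars(new Properties()))); // [15]
    }
}
```

The second call mirrors the situation described above, where only training was modified: the ":" is never offered as a candidate, so the model cannot split there regardless of what it learned.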
