Re: new tool training

Russ, Daniel (NIH/CIT) [E] Thu, 27 Oct 2016 08:57:21 -0700

Comment 1:
Here is my Stacktrace:

Exception in thread "main" java.lang.IllegalArgumentException: Number of 
threads must 1 or larger
        at 
opennlp.tools.ml.maxent.quasinewton.ParallelNegLogLikelihood.<init>(ParallelNegLogLikelihood.java:50)
        at 
opennlp.tools.ml.maxent.quasinewton.QNTrainer.trainModel(QNTrainer.java:165)
        at 
opennlp.tools.ml.maxent.quasinewton.QNTrainer.doTrain(QNTrainer.java:152)
        at 
opennlp.tools.ml.maxent.quasinewton.QNTrainer.doTrain(QNTrainer.java:1)
        at gov.nih.cit.test.smlm.SOCcerME.train(SOCcerME.java:131)
        at gov.nih.cit.test.smlm.SOCcerME.main(SOCcerME.java:310)


So I threw together a small toy to get the same error.  It worked…  Let me 
track this down some more.


Comment 2:
Do you have a preference where the variable should go?  I think AbstractTrainer 
is the appropriate place for PSF variable dealing with ALL trainers, so 
Threads_(P/D) should be there.  I would remove and refactor out of 
TrainingParams.

Comment 3:
Right I want to change the dataindexer.

So I have multiple models that classify data (Job descriptions) into 
Occupational Codes.  I know what the codes are aprori, and even if they are not 
in the training data, I need to make sure that there is SOME probability for 
the codes.  More importantly for each job description, I need to compare the 
probabilities returned for each output.  By forcing the output indices to have 
the same values, I can quickly compare them without re-mapping the output.

I tried to extend OnePassDataIndex, but the indexing occurs during object 
construction, so I cannot set the known outputs before indexing occurs.  

Of course I would not need the getDataIndexer() method,  but it is defined in 
the Abstract class, why not in the Interface


On 10/27/16, 11:15 AM, "Joern Kottmann" <[email protected]> wrote:

    On Thu, Oct 27, 2016 at 4:41 PM, Russ, Daniel (NIH/CIT) [E] <
    [email protected]> wrote:
    
    > Hello,
    >
    > Background:
    >    I am developing a tool that uses OpenNLP.  I have a model that extends
    > BaseModel, and several AbstractModels.  I allow the user (myself) to
    > specify the TrainerType (GIS/QN) for each model by using a list of
    > TrainingParameters.
    >
    > Potential Bugs:
    >
    > 1)    Whenever I use QNTrainer, I get an error (number of threads <1).  I
    > think the problem is that the parameters are initialized in the isValid()
    > method instead of the init() method.  This works for GIS because in the
    > doTrain(DataIndexer) method, the number of threads is a local variable
    > taken from the TrainingParameters not a field in GIS.  This leads to
    > another question. When it the isValid() method supposed to be called?  I 
am
    > surprised that the TrainerFactory does not call it.
    >
    >
    It should be be called from the factory I think. Currently it is only
    called when the training starts in
    AbstractEventTrainer.train(ObjectStream<Event>). It is always better to
    make things fail as early as possible.
    
    Can you share the exception stack trace? I don't really understand yet why
    you get this error with the QNTrainer. I would like to investigate that.
    
   


    >
    > 2)    The psf (public static final) String variables used by the
    > TrainingParameters are all over the place.  The variables
    > THEADS_(PARAM/DEFAULT) are defined in both QNTrainer and
    > TrainingParameters.  It should be defined in one of the places. I am not
    > sure that AbstractTrainer isn’t the best place to put THREADS_(P/D).  It
    > isn’t just the variables Threads_(P/D), All the Training psf String
    > variables from TrainingParameters are duplicated in AbstractTrainer.
    >
    >
    I agree, the commonly used variables should only be in one place. Some
    trainer have specific variables which are not shared. It would be nice to
    get this re-factored.
    
    
    
    >
    > 3)    Should the Interface EventTrainer have a doTrainDataIndexer and a
    > getDataIndexer method?  This is important to me because I extended
    > OnePassDataIndexer to pre-assign the outputs.  I know the outputs aprori,
    > and I want to quick combine the results of the multiple models.  Since the
    > getEventTrainer returns an EventTrainer instead of an 
AbstractEventTrainer,
    > I cannot call doTrain(DataIndexer).  I cannot use the
    > doTrain(ObjectStream<Event>); it creates a new OnePassIndexer.
    >
    
    
    I think it would be fine to add a second train method to EventTrainer which
    takes a DataIndexer. The current train method should probably be changed a
    bit, and not do the init things.
    It would like to understand your usage here a bit better.
    
    So you want to have control over the DataIndexer which is used for
    training, right?
    Another option could be to have a second train(ObjectStream<Event>,
    DataIndexer) method.
    
    And why would you need a getDataIndexer method if you can pass in your own
    instance?
    
    
    Jörn

Re: new tool training

Reply via email to