Comment 1:
Here is my Stacktrace:
Exception in thread "main" java.lang.IllegalArgumentException: Number of
threads must 1 or larger
at
opennlp.tools.ml.maxent.quasinewton.ParallelNegLogLikelihood.<init>(ParallelNegLogLikelihood.java:50)
at
opennlp.tools.ml.maxent.quasinewton.QNTrainer.trainModel(QNTrainer.java:165)
at
opennlp.tools.ml.maxent.quasinewton.QNTrainer.doTrain(QNTrainer.java:152)
at
opennlp.tools.ml.maxent.quasinewton.QNTrainer.doTrain(QNTrainer.java:1)
at gov.nih.cit.test.smlm.SOCcerME.train(SOCcerME.java:131)
at gov.nih.cit.test.smlm.SOCcerME.main(SOCcerME.java:310)
So I threw together a small toy to get the same error. It worked… Let me
track this down some more.
Comment 2:
Do you have a preference where the variable should go? I think AbstractTrainer
is the appropriate place for PSF variable dealing with ALL trainers, so
Threads_(P/D) should be there. I would remove and refactor out of
TrainingParams.
Comment 3:
Right I want to change the dataindexer.
So I have multiple models that classify data (Job descriptions) into
Occupational Codes. I know what the codes are aprori, and even if they are not
in the training data, I need to make sure that there is SOME probability for
the codes. More importantly for each job description, I need to compare the
probabilities returned for each output. By forcing the output indices to have
the same values, I can quickly compare them without re-mapping the output.
I tried to extend OnePassDataIndex, but the indexing occurs during object
construction, so I cannot set the known outputs before indexing occurs.
Of course I would not need the getDataIndexer() method, but it is defined in
the Abstract class, why not in the Interface
On 10/27/16, 11:15 AM, "Joern Kottmann" <[email protected]> wrote:
On Thu, Oct 27, 2016 at 4:41 PM, Russ, Daniel (NIH/CIT) [E] <
[email protected]> wrote:
> Hello,
>
> Background:
> I am developing a tool that uses OpenNLP. I have a model that extends
> BaseModel, and several AbstractModels. I allow the user (myself) to
> specify the TrainerType (GIS/QN) for each model by using a list of
> TrainingParameters.
>
> Potential Bugs:
>
> 1) Whenever I use QNTrainer, I get an error (number of threads <1). I
> think the problem is that the parameters are initialized in the isValid()
> method instead of the init() method. This works for GIS because in the
> doTrain(DataIndexer) method, the number of threads is a local variable
> taken from the TrainingParameters not a field in GIS. This leads to
> another question. When it the isValid() method supposed to be called? I
am
> surprised that the TrainerFactory does not call it.
>
>
It should be be called from the factory I think. Currently it is only
called when the training starts in
AbstractEventTrainer.train(ObjectStream<Event>). It is always better to
make things fail as early as possible.
Can you share the exception stack trace? I don't really understand yet why
you get this error with the QNTrainer. I would like to investigate that.
>
> 2) The psf (public static final) String variables used by the
> TrainingParameters are all over the place. The variables
> THEADS_(PARAM/DEFAULT) are defined in both QNTrainer and
> TrainingParameters. It should be defined in one of the places. I am not
> sure that AbstractTrainer isn’t the best place to put THREADS_(P/D). It
> isn’t just the variables Threads_(P/D), All the Training psf String
> variables from TrainingParameters are duplicated in AbstractTrainer.
>
>
I agree, the commonly used variables should only be in one place. Some
trainer have specific variables which are not shared. It would be nice to
get this re-factored.
>
> 3) Should the Interface EventTrainer have a doTrainDataIndexer and a
> getDataIndexer method? This is important to me because I extended
> OnePassDataIndexer to pre-assign the outputs. I know the outputs aprori,
> and I want to quick combine the results of the multiple models. Since the
> getEventTrainer returns an EventTrainer instead of an
AbstractEventTrainer,
> I cannot call doTrain(DataIndexer). I cannot use the
> doTrain(ObjectStream<Event>); it creates a new OnePassIndexer.
>
I think it would be fine to add a second train method to EventTrainer which
takes a DataIndexer. The current train method should probably be changed a
bit, and not do the init things.
It would like to understand your usage here a bit better.
So you want to have control over the DataIndexer which is used for
training, right?
Another option could be to have a second train(ObjectStream<Event>,
DataIndexer) method.
And why would you need a getDataIndexer method if you can pass in your own
instance?
Jörn