Usually you don't need a huge training data set to build an effective model. You can measure the trade-off between training-set size, the cutoff, and the algorithm using the 10-fold cross-validation tool included in the OpenNLP command-line interface; you would need to run separate experiments varying these parameters. In your case not only the F-measure matters, but also the model size.
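For reference, a cross-validation run might look roughly like this (the file names below are placeholders for your own corpus, and the exact option set may differ between OpenNLP versions — check the CLI help for your release):

```shell
# params.txt is a standard OpenNLP training-parameters file; the
# algorithm and cutoff are controlled there rather than on the
# command line, e.g.:
#
#   Algorithm=PERCEPTRON
#   Cutoff=5
#   Iterations=300
#
# en-ner.train is a placeholder for your training data in the
# OpenNLP name-finder format.
opennlp TokenNameFinderCrossValidator -lang en -data en-ner.train \
    -folds 10 -params params.txt
```

Rerunning this with different Cutoff values (and with Algorithm=MAXENT vs. PERCEPTRON) lets you compare the resulting F-measures, and training a model with the same parameters shows you the corresponding model size.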
2014-05-27 18:59 GMT-03:00 Jeffrey Zemerick <[email protected]>:

> I do not, William. I assumed it was due to the large training data set.
> I will look into the things you mentioned. Thanks!
>
>
> On Tue, May 27, 2014 at 3:35 PM, William Colen <[email protected]>
> wrote:
>
> > Do you know why your model is so big?
> >
> > You can reduce its size by using a higher cutoff, or trying Perceptron.
> > You can also try using an entity dictionary, which will avoid the
> > algorithm storing the entities in the form of features.
> >
> > I am not aware of a way to avoid loading it into memory.
> >
> > Regards,
> > William
> >
> > 2014-05-27 16:11 GMT-03:00 Jeffrey Zemerick <[email protected]>:
> >
> > > Hi Users,
> > >
> > > Is anyone aware of a way to load a TokenNameFinder model and use it
> > > without storing the entire model in memory? My models take up about
> > > 6 GB of memory. I see in the code that the model files are unzipped
> > > and put into a HashMap. Is it possible to store the data structure
> > > off-heap somewhere?
> > >
> > > Thanks,
> > > Jeff
> > >
