Yes. Both the POS Tagger and the Chunker were trained on Bosque, which includes 4,212 sentences.
http://www.linguateca.pt/floresta/info_floresta_English.html

Regarding memory usage: for the POS Tagger, the default dictionary is loaded
into a hash table. For my application I implemented a way to store the lexeme
dictionary somewhere else, such as a database. I contributed to OpenNLP the
code that allows extending the default dictionary, but I have not yet been
able to contribute a different implementation.

Also, I ran a lot of experiments comparing model effectiveness and model size.
Most of them were not included in the text I pointed to, but a few chapters
discuss that a little, like the Sentence Detector (6.2).


2013/10/10 Michael Schmitz <[email protected]>

> Hi William--thanks for the pointer.  Do you know the size of your
> training sets?  I did not see that in the chapters you pointed me to.
>
> On Mon, Oct 7, 2013 at 3:48 PM, William Colen <[email protected]>
> wrote:
> > Actually, I measured the model effectiveness, not the memory x
> > performance.
> >
> > 2013/10/7 Michael Schmitz <[email protected]>
> >
> >> Hi Jorn, let me be more precise.  Do you have a notion of how the
> >> precision-recall curve (AUC) changes as a function of the number of
> >> annotations?  I'm curious how many annotations are needed for a model
> >> with reasonable precision-recall AUC and reasonable performance
> >> (memory and speed).
> >>
> >> Peace.  Michael
> >>
> >> On Mon, Oct 7, 2013 at 3:29 PM, Jörn Kottmann <[email protected]> wrote:
> >> > On 10/07/2013 11:00 PM, Michael Schmitz wrote:
> >> >>
> >> >> Do you know how many sentences/tokens were annotated for the OpenNLP
> >> >> POS and CHUNK models?  Do you have an idea of the "sweet spot" for
> >> >> number of annotations vs performance?
> >> >
> >> > If the model gets bigger the computations get more complex, but as
> >> > far as I know the effect of the model not fitting anymore in the CPU
> >> > cache is much more significant than that. I am using hash based int
> >> > features to reduce the memory footprint in the name finder.
> >> >
> >> > I don't have much experience with the Chunker or Pos Tagger in
> >> > regards to performance, but it should be easy to do a series of
> >> > tests, the command line tools have built in performance monitoring.
> >> >
> >> > Jörn
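
For readers unfamiliar with the "hash based int features" Jörn mentions: the
thread does not show how the name finder implements them, so the following is
only a generic sketch of the feature hashing idea, not OpenNLP code. The class
HashedFeatures, its methods, the feature strings, and the choice of FNV-1a as
the hash function are all assumptions made for illustration. The point is that
each feature string is mapped to an int bucket, so the model can index a
fixed-size array instead of keeping every feature string in a hash table.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of the OpenNLP API. */
final class HashedFeatures {

    private HashedFeatures() {}

    /** Maps a feature string to a bucket in [0, numBuckets) with 32-bit FNV-1a. */
    static int bucket(String feature, int numBuckets) {
        int hash = 0x811C9DC5;              // FNV-1a offset basis
        for (byte b : feature.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);             // mix in one byte
            hash *= 0x01000193;             // multiply by the FNV prime
        }
        // floorMod keeps the result non-negative even when hash is negative.
        return Math.floorMod(hash, numBuckets);
    }

    /** Replaces each feature string by its int bucket id. */
    static int[] encode(String[] features, int numBuckets) {
        int[] ids = new int[features.length];
        for (int i = 0; i < features.length; i++) {
            ids[i] = bucket(features[i], numBuckets);
        }
        return ids;
    }

    public static void main(String[] args) {
        int numBuckets = 1 << 18;           // assumed table size; tune per model
        String[] features = {"w=Lisboa", "suf3=boa", "prev=em"};
        System.out.println(Arrays.toString(encode(features, numBuckets)));
    }
}
```

The trade-off is that two different features can collide in the same bucket,
which costs some accuracy in exchange for a memory footprint that is fixed by
numBuckets rather than by the number of distinct features in the training data.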
