Yes. Both the POS Tagger and the Chunker were trained on Bosque, which
contains 4,212 sentences.

http://www.linguateca.pt/floresta/info_floresta_English.html


Regarding memory usage:
For the POS Tagger, the default dictionary is loaded into a hash table. For
my application I implemented a way to store the lexeme dictionary somewhere
else, such as a database. I contributed to OpenNLP the code that allows
extending the default dictionary, but I have not yet been able to contribute
a different implementation.
I also ran many experiments comparing model effectiveness and model size.
Most of them were not included in the text I pointed to, but a few chapters
discuss this briefly, such as the one on the Sentence Detector (6.2).
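
To make the dictionary idea concrete, here is a minimal sketch of a tag
lookup that can be backed either by a hash table or by an external store.
All names in it (TagLookup, InMemoryTagLookup, ExternalTagLookup) are
hypothetical and are not OpenNLP's API or my actual implementation; the
"store" function simply stands in for a database query.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical interface: maps a word to the tags it is allowed to take.
interface TagLookup {
    String[] getTags(String word);
}

// Default-style behaviour: every entry is held in an in-memory hash table.
final class InMemoryTagLookup implements TagLookup {
    private final Map<String, String[]> entries = new HashMap<>();

    void put(String word, String... tags) {
        entries.put(word, tags);
    }

    @Override
    public String[] getTags(String word) {
        return entries.get(word); // null when the word is unknown
    }
}

// Alternative: delegate to an external store and keep only a small
// least-recently-used cache in memory.
final class ExternalTagLookup implements TagLookup {
    private final Function<String, String[]> store; // stand-in for a database query
    private final Map<String, String[]> cache;

    ExternalTagLookup(Function<String, String[]> store, final int cacheSize) {
        this.store = store;
        // An access-ordered LinkedHashMap evicts the least recently used entry.
        this.cache = new LinkedHashMap<String, String[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String[]> eldest) {
                return size() > cacheSize;
            }
        };
    }

    @Override
    public String[] getTags(String word) {
        String[] tags = cache.get(word);
        if (tags == null) {
            tags = store.apply(word); // only hit the store on a cache miss
            if (tags != null) {
                cache.put(word, tags);
            }
        }
        return tags;
    }
}
```

With this shape the memory used by the dictionary is bounded by the cache
size instead of the lexicon size, at the cost of a store lookup on each
cache miss.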



2013/10/10 Michael Schmitz <[email protected]>

> Hi William--thanks for the pointer.  Do you know the size of your
> training sets?  I did not see that in the chapters you pointed me to.
>
> On Mon, Oct 7, 2013 at 3:48 PM, William Colen <[email protected]>
> wrote:
> > Actually, I measured the model effectiveness, not the memory x performance.
> >
> >
> >
> >
> > 2013/10/7 Michael Schmitz <[email protected]>
> >
> >> Hi Jorn, let me be more precise.  Do you have a notion of how the
> >> precision-recall curve (AUC) changes as a function of the number of
> >> annotations?  I'm curious how many annotations are needed for a model
> >> with reasonable precision-recall AUC and reasonable performance
> >> (memory and speed).
> >>
> >> Peace.  Michael
> >>
> >> On Mon, Oct 7, 2013 at 3:29 PM, Jörn Kottmann <[email protected]> wrote:
> >> > On 10/07/2013 11:00 PM, Michael Schmitz wrote:
> >> >>
> >> >> Do you know how many sentences/tokens were annotated for the OpenNLP
> >> >> POS and CHUNK models?  Do you have an idea of the "sweet spot" for
> >> >> number of annotations vs performance?
> >> >
> >> >
> >> > If the model gets bigger the computations get more complex, but as far
> >> > as I know the effect of the model no longer fitting in the CPU cache is
> >> > much more significant than that. I am using hash-based int features to
> >> > reduce the memory footprint in the name finder.
> >> >
> >> > I don't have much experience with the Chunker or POS Tagger with regard
> >> > to performance, but it should be easy to run a series of tests; the
> >> > command line tools have built-in performance monitoring.
> >> >
> >> > Jörn
> >>
>
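
On the hash-based int features mentioned in the quoted thread: the general
technique is to map each feature string to an integer index in a fixed
number of buckets, so the model does not need to keep the strings. The
sketch below assumes an FNV-1a hash and a hypothetical FeatureHasher class;
it illustrates the idea only and is not the name finder's actual code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: maps feature strings to indices in [0, buckets).
final class FeatureHasher {
    private final int buckets;

    FeatureHasher(int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        this.buckets = buckets;
    }

    // 32-bit FNV-1a over the UTF-8 bytes of the feature, reduced to a bucket.
    int index(String feature) {
        int hash = 0x811C9DC5;               // FNV offset basis
        for (byte b : feature.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;              // FNV prime
        }
        return Math.floorMod(hash, buckets); // always non-negative
    }
}
```

Distinct features can collide in the same bucket, so the number of buckets
trades memory against accuracy.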
