Chris,
    Joern is correct. However, allow me to disagree slightly on a few minor
points.

1) I use the old SourceForge models. I find that the sources of error in my
analysis are usually not due to mistakes in sentence detection or POS tagging.
I don't have the annotated data or the time/money to build custom models. Yes,
the text I analyze is quite different from the corpus used to build the models
(the WSJ? or whatever corpus it was), but they are good enough.

2) MaxEnt is still a good classifier for NLP, and L-BFGS is just an algorithm
to calculate the weights for the features. It is an improvement on GIS, not a
different classifier. I am not familiar enough with CRFs to comment, but the
seminal paper by Della Pietra et al. (IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 19, no. 4, 1997) makes them look like an extension of MaxEnt. The Stanford
NLP group has lecture slides online explaining why discriminative
classification methods (e.g., MaxEnt) work better than generative models
(e.g., Naive Bayes) (see
https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf;
particularly the example with traffic lights).
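To make the point above concrete, here is a minimal sketch (toy code, not any library's API) of a binary MaxEnt model, i.e. logistic regression over features: the classifier is just p(y=1|x) = sigmoid(w·x), and GIS, L-BFGS, or the plain batch gradient ascent used here are interchangeable ways of fitting the same weights w. The "light is green → cars go" data is a stand-in for the traffic-light example in the slides.

```java
// Toy binary MaxEnt (logistic regression): p(y=1|x) = sigmoid(w.x).
// The optimizer (GIS, L-BFGS, or this plain gradient ascent) only
// determines how the weights are found, not what the classifier is.
public class ToyMaxent {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    static double predict(double[] w, double[] x) {
        double z = 0;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        return sigmoid(z);
    }

    // One batch gradient-ascent step on the log-likelihood.
    static void step(double[] w, double[][] xs, int[] ys, double lr) {
        double[] grad = new double[w.length];
        for (int n = 0; n < xs.length; n++) {
            double err = ys[n] - predict(w, xs[n]);   // observed - expected
            for (int i = 0; i < w.length; i++) grad[i] += err * xs[n][i];
        }
        for (int i = 0; i < w.length; i++) w[i] += lr * grad[i];
    }

    public static void main(String[] args) {
        // Features: [bias, lightIsGreen]; label: 1 = cars go.
        double[][] xs = {{1, 1}, {1, 1}, {1, 0}, {1, 0}};
        int[] ys = {1, 1, 0, 0};
        double[] w = new double[2];
        for (int epoch = 0; epoch < 2000; epoch++) step(w, xs, ys, 0.5);
        System.out.printf("p(go|green)=%.3f  p(go|red)=%.3f%n",
                predict(w, new double[]{1, 1}), predict(w, new double[]{1, 0}));
    }
}
```

Swapping the optimizer changes convergence speed, not the model family; that is why "GIS vs. L-BFGS" is not a choice between classifiers.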

As I briefly mentioned earlier, OpenNLP is a mature product. It has undergone
some MAJOR upgrades. It is not obsolete. As for the other tools/libraries,
they are also fine products. I use the Stanford parser to get dependency
information; OpenNLP just does not provide it. I don't use NLTK because I
don't need to. If the need arises, I will. I assume that you don't have the
time and money to learn every new NLP product, so I would say play to your
strengths. If you know a package, use it. Don't change just because something
is trendy.



Daniel Russ, Ph.D.
Staff Scientist, Division of Computational Bioscience
Center for Information Technology
National Institutes of Health
U.S. Department of Health and Human Services
12 South Drive
Bethesda,  MD 20892-5624

On Nov 11, 2015, at 4:41 PM, Joern Kottmann <kottm...@gmail.com> wrote:

Hello,



It is definitely true that OpenNLP has existed for a long time (more than 10
years), but that doesn't mean it wasn't improved. Actually, it has changed a
lot in that period.

The core strength of OpenNLP has always been that it can be used really
easily to perform any of the supported NLP tasks.

This was further improved with the 1.5 release, which added model packages
that ensure the components are always instantiated correctly across
different runtime environments.

The problem is that the system used to train a model and the system used to
run it can be quite different. Prior to 1.5 it was possible to get that
wrong, which resulted in hard-to-notice performance problems.
I suspect that is an issue many of the competing solutions still have
today.

An example is the usage of String.toLowerCase(): its output depends on the
platform locale.
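The classic case is the Turkish dotted/dotless I: Java's String.toLowerCase() is locale-sensitive, so the same token can produce different features at training time and at run time if the JVMs have different default locales. A small demonstration with plain JDK code:

```java
import java.util.Locale;

// Locale-sensitive lowercasing breaks feature extraction: under a Turkish
// locale, uppercase "I" lowercases to the dotless "ı" (U+0131), so a model
// trained on one machine sees different strings on another.
public class LocaleLowercase {
    public static void main(String[] args) {
        String token = "ISTANBUL";
        String rootLower = token.toLowerCase(Locale.ROOT);           // "istanbul"
        String turkishLower = token.toLowerCase(new Locale("tr"));   // "ıstanbul"
        System.out.println(rootLower.equals(turkishLower));          // prints false
    }
}
```

Pinning the locale (e.g. Locale.ROOT) at both training and run time avoids the mismatch, which is the kind of thing the 1.5 model packages were meant to make impossible to get wrong.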

One of the things that got a bit dated was the machine learning part of
OpenNLP. This was addressed by adding more algorithms (e.g., perceptron
and L-BFGS maxent). In addition, the machine learning part is now
pluggable and can easily be swapped for a different implementation
for testing or production use. The sandbox contains an experimental
Mallet integration which offers all the Mallet classifiers; even CRFs can
be used.

On Fri, 2015-11-06 at 16:54 +0000, Mattmann, Chris A (3980) wrote:
Hi Everyone,

Hope you're well! I'm new to the list, however I just wanted to
state I'm really happy with what I've seen with OpenNLP so far.
My team and I have built a Tika Parser [1] that uses OpenNLP's
location NER model, along with a Lucene GeoNames Gazetteer [2]
to create a "GeoTopicParser". We are improving it day to day.
OpenNLP has definitely come a long way from when I looked at it
years ago in its nascence.

Do you use the old SourceForge location model?

That said, I keep hearing from people I talk to in the NLP community
that, for example, OpenNLP is "old", and that I should be looking
at e.g., Stanford's NER, and NLTK, etc. Besides obvious license
issues (NER is GPL as an example and I am only interested in ALv2
or permissive-license code), I don't have a great answer to whether
or not OpenNLP is old, or not active, or not as good, etc. Can
devs on this list help me answer that question? I'd like to be
able to tell these NLP people the next time I talk to them that
no, in fact, OpenNLP isn't *old*, and it's active, and there are
these X and Y lines of development, and here's where they are
going, etc.

One of the big issues is that OpenNLP only works well if the model is
suitable for the use case of the user. The pre-trained models usually
are not, and people tend to judge us by the performance of those old
models.

People who benchmark NLP software often put OpenNLP in a bad light by
comparing different solutions based on their pre-trained models. In my
opinion, a fair comparison of statistical NLP software is only possible
if all systems use exactly the same training data, as is done in the
many shared tasks; otherwise it is like comparing the lap times of two
cars driving on two different race tracks.

I personally use OpenNLP a lot at work and it runs in many of our
customized NLP pipelines to analyze incoming texts. In those pipelines
it is often used in a highly customized setup and trained on the data
that is processed by that pipeline.

I think that must be its single biggest strength: it can be easily used
and customized, and almost everything can be done without much hassle.
All components have a custom factory that can be used to fundamentally
change how components are put together, including custom code at
almost any place.

OpenNLP used to be maxent-only, but that changed quite some time ago:
perceptron was added years back, we just received a Naive Bayes
contribution, and since 1.6.0 there is pluggable machine learning
support. Existing machine learning libraries such as Mallet can be
integrated and used by all components; e.g., training a Mallet CRF- or
maxent-based model is now possible (the Mallet integration can be found
in the sandbox).

Also, small things are happening all the time. The name finder now has
support for word cluster dictionaries, which can make all the difference
between a badly and a very well performing model.
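The idea behind a word-cluster dictionary can be sketched in a few lines (toy code with made-up cluster ids, not OpenNLP's implementation): each word is mapped to a cluster id, e.g. a Brown-cluster bit string, and a prefix of that id is used as a feature. An unseen word then shares features with seen words from the same cluster, which is where the accuracy gain comes from.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of a word-cluster feature (cluster ids are made up here):
// "Oslo" was never seen in training, but it shares a 4-bit cluster
// prefix with "Paris", so the model can still treat it like a city name.
public class ClusterFeature {
    static final Map<String, String> CLUSTERS = new HashMap<>();
    static {
        CLUSTERS.put("Paris", "110100");
        CLUSTERS.put("Oslo",  "110101");
        CLUSTERS.put("eats",  "001011");
    }

    // Feature string: a 4-bit cluster-id prefix, or UNK for unknown words.
    static String clusterFeature(String word) {
        String id = CLUSTERS.get(word);
        return id == null ? "cluster=UNK" : "cluster=" + id.substring(0, 4);
    }

    public static void main(String[] args) {
        System.out.println(clusterFeature("Paris")); // cluster=1101
        System.out.println(clusterFeature("Oslo"));  // cluster=1101, same as Paris
        System.out.println(clusterFeature("eats"));  // cluster=0010
    }
}
```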

Other NLP libraries, such as the mentioned Stanford NER, are missing
features. OpenNLP comes with built-in evaluation support and can
benchmark itself on a dataset using cross-validation; this includes
tools which output detailed information about recognition mistakes. It
has built-in support for many different corpora, such as OntoNotes,
CoNLL, or user-created brat corpora.

The main difference between OpenNLP and Stanford NER / NLTK is probably
that OpenNLP is developed for use in production systems and not for
research or teaching purposes.

Jörn


