As one of the people who got OpenNLP started in the late 1990s (for
research, but hoping it could be used by industry), it makes me smile to
know that lots of people use it happily to this day. :)

There are lots of new kids in town, but the licensing is often conflicted,
and the biggest benefits often come, as Joern mentions, from having the
right data to train your classifier.

Having said that, there is a lot of activity in the deep learning space,
where old techniques (neural nets) are now viable in ways they weren't
previously, and they are outperforming linear classifiers in task after
task. I'm currently looking at Deeplearning4J, and it would be great to
have OpenNLP or a project like it make solid NLP models available based on
deep learning methods, especially LSTMs and Convolutional Neural Nets.
Deeplearning4J is Java/Scala-friendly and it is ASL-licensed, so it's at
least setting off on the right foot.

http://deeplearning4j.org/

The ND4J library (a NumPy-like n-dimensional array library for the JVM)
that was built to support DL4J is also likely to be useful for other Java
projects that use machine learning.
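
For anyone curious, here is a minimal sketch of what ND4J usage looks like
(class and method names from memory, so treat it as illustrative rather
than definitive):

    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;

    public class Nd4jDemo {
        public static void main(String[] args) {
            // Roughly numpy.array / numpy.ones / dot, but on the JVM.
            INDArray a = Nd4j.create(new double[] {1, 2, 3, 4}, new int[] {2, 2});
            INDArray ones = Nd4j.ones(2, 2);
            System.out.println(a.mmul(ones)); // 2x2 matrix product
        }
    }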

-Jason

On Thu, 12 Nov 2015 at 09:44 Russ, Daniel (NIH/CIT) [E] <dr...@mail.nih.gov>
wrote:

> Chris,
>     Joern is correct.  However, if I may, I'd like to disagree on a few
> minor points.
>
> 1) I use the old sourceforge models.  I find that the sources of error in
> my analysis are usually not due to mistakes in sentence detection or POS
> tagging.  I don’t have the annotated data or the time/money to build custom
> models.  Yes, the text I analyze is quite different from the (WSJ? or whatever
> corpus was used to build the models), but it is good enough.
>
> 2)  MaxEnt is still a good classifier for NLP, and L-BFGS is just an
> algorithm for calculating the weights of the features.  It is an improvement
> on GIS, not a different classifier.  I am not familiar enough with CRFs to
> comment, but the seminal paper by Della Pietra (IEEE Trans. Pattern Anal.
> Mach. Intell., vol. 19, no. 4, 1997) makes it appear to be an extension of
> MaxEnt.  The Stanford NLP group has lecture slides online explaining why
> discriminative classification methods (e.g. MaxEnt) work better than
> generative (Naive Bayes) models (see
> https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf;
> particularly the example with traffic lights).
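>
> To make the second point concrete (this is just the standard maxent form,
> nothing specific to any toolkit): the classifier is always
>
>     p(y|x) = exp( sum_i w_i * f_i(x, y) ) / Z(x)
>
> where the f_i are the features and Z(x) is the normalizer; GIS and L-BFGS
> are simply two different procedures for estimating the weights w_i, so the
> model family itself does not change.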
>
> As I briefly mentioned earlier, OpenNLP is a mature product.  It has
> undergone some MAJOR upgrades.  It is not obsolete.  As for the other
> tools/libraries, they are also fine products.  I use the Stanford parser to
> get dependency information; OpenNLP just does not do that.  I don’t use NLTK
> because I don’t need to.  If the need arises, I will.  I assume that you don’t
> have the time and money to learn every new NLP product.  I would say play
> to your strengths: if you know the package, use it.  Don’t change just
> because it’s trendy.
>
>
>
> Daniel Russ, Ph.D.
> Staff Scientist, Division of Computational Bioscience
> Center for Information Technology
> National Institutes of Health
> U.S. Department of Health and Human Services
> 12 South Drive
> Bethesda,  MD 20892-5624
>
> On Nov 11, 2015, at 4:41 PM, Joern Kottmann <kottm...@gmail.com> wrote:
>
> Hello,
>
>
>
> It is definitely true that OpenNLP has existed for a long time (more than
> 10 years), but that doesn't mean it hasn't been improved. Actually, it has
> changed a lot in that period.
>
> The core strength of OpenNLP has always been that it is really easy to use
> to perform one of the supported NLP tasks.
>
> This was further improved with the 1.5 release, which added model packages
> that ensure that the components are always instantiated correctly across
> different runtime environments.
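>
> As a rough illustration of how little code that takes (the model file name
> is only an example, and the exact constructor signatures may vary a bit by
> version):
>
>     import java.io.FileInputStream;
>     import java.io.InputStream;
>     import opennlp.tools.namefind.NameFinderME;
>     import opennlp.tools.namefind.TokenNameFinderModel;
>     import opennlp.tools.util.Span;
>
>     try (InputStream in = new FileInputStream("en-ner-person.bin")) {
>         // The model package carries everything needed to instantiate the
>         // component the same way on every runtime environment.
>         TokenNameFinderModel model = new TokenNameFinderModel(in);
>         NameFinderME finder = new NameFinderME(model);
>         Span[] names = finder.find(
>                 new String[] {"Pierre", "Vinken", "is", "61", "years", "old"});
>     }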
>
> The problem is that the system used to perform the training of a model
> and the system used to run it can be quite different. Prior to 1.5 it was
> possible to get that wrong, which resulted in hard-to-notice performance
> problems.
> I suspect that is an issue many of the competing solutions still have
> today.
>
> An example is the usage of String.toLowerCase(): its output depends on the
> platform locale.
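>
> A tiny self-contained example of that pitfall (the Turkish locale is the
> classic case):
>
>     import java.util.Locale;
>
>     public class LowerCaseDemo {
>         public static void main(String[] args) {
>             String s = "III";
>             System.out.println(s.toLowerCase(Locale.ENGLISH));         // "iii"
>             System.out.println(s.toLowerCase(new Locale("tr", "TR"))); // dotless i
>             System.out.println(s.toLowerCase()); // depends on the JVM default locale
>         }
>     }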
>
> One of the things that had gotten a bit dated was the machine learning part
> of OpenNLP; this was addressed by adding more algorithms (e.g. perceptron
> and L-BFGS maxent). In addition, the machine learning part is now
> pluggable and can easily be swapped for a different implementation
> for testing or production use. The sandbox contains an experimental
> Mallet integration which offers all the Mallet classifiers; even CRFs can
> be used.
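>
> As a sketch of how the trainer gets selected (parameter names from memory
> and may differ slightly between versions; the samples stream is assumed to
> already exist):
>
>     import opennlp.tools.namefind.NameFinderME;
>     import opennlp.tools.namefind.NameSample;
>     import opennlp.tools.namefind.TokenNameFinderFactory;
>     import opennlp.tools.namefind.TokenNameFinderModel;
>     import opennlp.tools.util.ObjectStream;
>     import opennlp.tools.util.TrainingParameters;
>
>     ObjectStream<NameSample> samples = ...; // your annotated training data
>
>     TrainingParameters params = TrainingParameters.defaultParams();
>     // Swap the underlying trainer without touching the rest of the pipeline,
>     // e.g. "MAXENT" (GIS), "MAXENT_QN" (L-BFGS) or "PERCEPTRON".
>     params.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON");
>
>     TokenNameFinderModel model = NameFinderME.train("en", "person", samples,
>             params, new TokenNameFinderFactory());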
>
> On Fri, 2015-11-06 at 16:54 +0000, Mattmann, Chris A (3980) wrote:
> Hi Everyone,
>
> Hope you're well! I'm new to the list, however I just wanted to
> state I'm really happy with what I've seen with OpenNLP so far.
> My team and I have built a Tika Parser [1] that uses OpenNLP's
> location NER model, along with a Lucene Geo Names Gazetteer [2]
> to create a "GeoTopicParser". We are improving it day to day.
> OpenNLP has definitely come a long way from when I looked at it
> years ago in its nascence.
>
> Do you use the old sourceforge location model?
>
> That said, I keep hearing from people I talk to in the NLP community
> that for example OpenNLP is "old", and that I should be looking
> at e.g., Stanford's NER, and NLTK, etc. Besides obvious license
> issues (NER is GPL as an example and I am only interested in ALv2
> or permissive license code), I don't have a great answer to whether
> or not OpenNLP is old or not active, or not as good, etc. Can
> devs on this list help me answer that question? I'd like to be
> able to tell these NLP people the next time I talk to them that
> no, in fact, OpenNLP isn't *old*, and it's active, and there are
> these X and Y lines of development, and here's where they are
> going, etc.
>
> One of the big issues is that OpenNLP only works well if the model is
> suitable for the use case of the user. The pre-trained models usually are
> not, and people tend to judge us by the performance of those old
> models.
>
> People who benchmark NLP software often put OpenNLP in a bad light by
> comparing different solutions based on their pre-trained models. In my
> opinion a fair comparison of statistical NLP software is only possible
> if they all use exactly the same training data, as is done in one of the
> many shared tasks; otherwise it is like comparing the lap times of two cars
> driving on two different race tracks.
>
> I personally use OpenNLP a lot at work and it runs in many of our
> customized NLP pipelines to analyze incoming texts. In those pipelines
> it is often used in a highly customized setup and trained on the data
> that is processed by that pipeline.
>
> I think that must be its single biggest strength. It can be easily used
> and customized; almost everything can be done without much hassle.
> All components have a custom factory that can be used to fundamentally
> change how they are put together, including custom code at
> almost any place.
>
> OpenNLP used to be maxent-only, but that changed quite some time ago:
> perceptron was added years back, we just received a Naive Bayes
> contribution, and since 1.6.0 there is pluggable machine learning
> support. Existing machine learning libraries such as Mallet can be
> integrated and used by all components, e.g. training a Mallet CRF or
> maxent-based model is now possible (the Mallet integration can be found in
> the sandbox).
>
> Also, small things are happening all the time. The name finder now has
> support for word cluster dictionaries, which can make all the difference
> between a badly and a very well performing model.
>
> Other NLP libraries such as the mentioned Stanford NER are missing
> features. OpenNLP comes with built-in evaluation support and can
> benchmark itself on a dataset using cross validation. This includes
> tools that output detailed information about recognition mistakes, and it
> has built-in support for many different corpora, such as OntoNotes,
> CoNLL or user-created brat corpora.
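>
> For example, cross validating a name finder model is a single command
> (flag names from memory, the tool prints its own usage message if they
> are off):
>
>     $ bin/opennlp TokenNameFinderCrossValidator -lang en -data person-train.txt -folds 10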
>
> The main difference between OpenNLP and Stanford NER / NLTK is probably
> that OpenNLP is developed for usage in production systems rather than for
> research or teaching purposes.
>
> Jörn
>
>
>
>
