On Thu, Nov 12, 2015 at 8:50 PM Jason Baldridge <[email protected]> wrote:
> As one of the people who got OpenNLP started in the late 1990s (for
> research, but hoping it could be used by industry), it makes me smile to
> know that lots of people use it happily to this day. :)
>
> There are lots of new kids in town, but the licensing is often
> conflicted, and the biggest benefits often come---as Joern mentions---from
> having the right data to train your classifier.
>
> Having said that, there is a lot of activity in the deep learning space,
> where old techniques (neural nets) are now viable in ways they weren't
> previously, and they are outperforming linear classifiers in task after
> task. I'm currently looking at Deeplearning4J, and it would be great to
> have OpenNLP or a project like it make solid NLP models available based
> on deep learning methods, especially LSTMs and Convolutional Neural
> Nets. Deeplearning4J is Java/Scala friendly and it is ASL, so that's at
> least setting off on the right foot.
>
> http://deeplearning4j.org/
>
> The ND4J library (based on NumPy) that was built to support DL4J is also
> likely to be useful for other Java projects that use machine learning.

+1, thanks Jason, it's indeed an interesting field we should look into.
Another interesting technique based on neural networks is the one related
to word vectors (aka word embeddings) [1]. I agree with Joern it'd be
interesting to see if we can provide an integration with DL4J.
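To make the word embedding idea concrete, here is a minimal sketch of
training vectors with DL4J's Word2Vec, assuming the builder API shown in
the DL4J examples; corpus.txt is a hypothetical file with one sentence of
raw text per line.

    import java.util.Collection;

    import org.deeplearning4j.models.word2vec.Word2Vec;
    import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
    import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

    public class WordVectorsSketch {
        public static void main(String[] args) throws Exception {
            // corpus.txt is a placeholder: raw text, one sentence per line.
            SentenceIterator sentences = new BasicLineIterator("corpus.txt");
            TokenizerFactory tokenizer = new DefaultTokenizerFactory();

            Word2Vec vec = new Word2Vec.Builder()
                    .minWordFrequency(5)  // drop very rare words
                    .layerSize(100)       // dimensionality of the word vectors
                    .windowSize(5)        // context window for co-occurrence
                    .iterate(sentences)
                    .tokenizerFactory(tokenizer)
                    .build();
            vec.fit();  // pass over the corpus and learn the embeddings

            // Words that occur in similar contexts end up close together.
            Collection<String> nearest = vec.wordsNearest("money", 10);
            System.out.println(nearest);
        }
    }

An OpenNLP integration could then, for example, feed such vectors into
the name finder as features, much like the word cluster dictionaries
Joern mentions below.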
Regards,
Tommaso

[1] : https://en.wikipedia.org/wiki/Word_embedding

>
> -Jason
>
> On Thu, 12 Nov 2015 at 09:44 Russ, Daniel (NIH/CIT) [E]
> <[email protected]> wrote:
>
> > Chris,
> > Joern is correct. However, if I can slightly disagree on a few minor
> > points:
> >
> > 1) I use the old SourceForge models. I find that the sources of error
> > in my analysis are usually not due to mistakes in sentence detection
> > or POS tagging. I don't have the annotated data or the time/money to
> > build custom models. Yes, the text I analyze is quite different from
> > the corpus used to build the models (WSJ? or whatever it was), but it
> > is good enough.
> >
> > 2) MaxEnt is still a good classifier for NLP, and L-BFGS is just an
> > algorithm to calculate the weights for the features. It is an
> > improvement on GIS, not a different classifier. I am not familiar
> > enough with CRFs to comment, but the seminal paper by Della Pietra
> > (IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 4, 1997) makes
> > it appear to be an extension of MaxEnt. The Stanford NLP group has
> > lecture slides online explaining why discriminative classification
> > methods (e.g. MaxEnt) work better than generative ones (Naive Bayes)
> > (see
> > https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf;
> > particularly the example with traffic lights).
> >
> > As I briefly mentioned earlier, OpenNLP is a mature product. It has
> > undergone some MAJOR upgrades. It is not obsolete. As for the other
> > tools/libraries, they are also fine products. I use the Stanford
> > parser to get dependency information; OpenNLP just does not do it. I
> > don't use NLTK because I don't need to; if the need arises, I will. I
> > assume that you don't have the time and money to learn every new NLP
> > product. I would say play to your strengths: if you know the package,
> > use it. Don't change because it's trendy.
> >
> > Daniel Russ, Ph.D.
> > Staff Scientist, Division of Computational Bioscience
> > Center for Information Technology
> > National Institutes of Health
> > U.S. Department of Health and Human Services
> > 12 South Drive
> > Bethesda, MD 20892-5624
> >
> > On Nov 11, 2015, at 4:41 PM, Joern Kottmann <[email protected]> wrote:
> >
> > Hello,
> >
> > It is definitely true that OpenNLP has existed for a long time (more
> > than 10 years), but that doesn't mean it wasn't improved. Actually it
> > changed a lot in that period.
> >
> > The core strength of OpenNLP was always that it can be used really
> > easily to perform one of the supported NLP tasks.
> >
> > This was further improved with the 1.5 release, which added model
> > packages that ensure that the components are always instantiated
> > correctly across different runtime environments.
> >
> > The problem is that the system used to train a model and the system
> > used to run it can be quite different. Prior to 1.5 it was possible to
> > get that wrong, which resulted in hard-to-notice performance problems.
> > I suspect that is an issue many of the competing solutions still have
> > today.
> >
> > An example is the usage of String.toLowerCase(): its output depends on
> > the platform's default locale.
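A self-contained illustration of that pitfall; the strings are chosen for
illustration, but the behaviour is standard Java:

    import java.util.Locale;

    public class LocaleCasing {
        public static void main(String[] args) {
            String token = "ISTANBUL";

            // English casing rules: prints "istanbul"
            System.out.println(token.toLowerCase(Locale.ENGLISH));

            // Turkish rules map 'I' to dotless 'ı': prints "ıstanbul"
            System.out.println(token.toLowerCase(new Locale("tr", "TR")));

            // The no-argument form uses the platform default locale, so
            // the same code can generate different features on the
            // training machine and the production machine.
            System.out.println(token.toLowerCase());
        }
    }

Packaging a model together with its exact feature-generation setup, as
the 1.5 model packages do, avoids this class of train/runtime mismatch.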
> > One of the things that had become a bit dated was the machine
> > learning part of OpenNLP; this was addressed by adding more
> > algorithms (e.g. perceptron and L-BFGS maxent). In addition, the
> > machine learning part is now pluggable and can easily be swapped for
> > a different implementation for testing or production use. The sandbox
> > contains an experimental Mallet integration which offers all the
> > Mallet classifiers; even CRFs can be used.
> >
> > On Fri, 2015-11-06 at 16:54 +0000, Mattmann, Chris A (3980) wrote:
> > > Hi Everyone,
> > >
> > > Hope you're well! I'm new to the list, however I just wanted to
> > > state I'm really happy with what I've seen with OpenNLP so far. My
> > > team and I have built a Tika Parser [1] that uses OpenNLP's
> > > location NER model, along with a Lucene GeoNames gazetteer [2], to
> > > create a "GeoTopicParser". We are improving it day to day. OpenNLP
> > > has definitely come a long way from when I looked at it years ago
> > > in its nascence.
> >
> > Do you use the old SourceForge location model?
> >
> > > That said, I keep hearing from people I talk to in the NLP
> > > community that, for example, OpenNLP is "old", and that I should be
> > > looking at e.g. Stanford's NER, NLTK, etc. Besides obvious license
> > > issues (Stanford NER is GPL, as an example, and I am only
> > > interested in ALv2 or permissively licensed code), I don't have a
> > > great answer to whether or not OpenNLP is old, or not active, or
> > > not as good, etc. Can devs on this list help me answer that
> > > question? I'd like to be able to tell these NLP people the next
> > > time I talk to them that no, in fact, OpenNLP isn't *old*, it's
> > > active, and there are these X and Y lines of development, and
> > > here's where they are going, etc.
> >
> > One of the big issues is that OpenNLP only works well if the model is
> > suitable for the use case of the user. The pre-trained models usually
> > are not, and people tend to judge us by the performance of those old
> > models.
> >
> > People who benchmark NLP software often put OpenNLP in a bad light by
> > comparing different solutions based on their pre-trained models. In
> > my opinion a fair comparison of statistical NLP software is only
> > possible if they all use exactly the same training data, as is done
> > in one of the many shared tasks; otherwise it is like comparing the
> > lap times of two cars driving on two different race tracks.
> >
> > I personally use OpenNLP a lot at work and it runs in many of our
> > customized NLP pipelines to analyze incoming texts. In those
> > pipelines it is often used in a highly customized setup and trained
> > on the data that is processed by that pipeline.
> >
> > I think that must be its single biggest strength: it can be easily
> > used and customized, and almost everything can be done without much
> > hassle. All components have a custom factory that can be used to
> > fundamentally change how components are put together, including
> > custom code at almost any place.
> >
> > OpenNLP used to be maxent-only, but that changed quite some time ago:
> > perceptron was added years back, we just received a Naive Bayes
> > contribution, and since 1.6.0 there is pluggable machine learning
> > support. Existing machine learning libraries such as Mallet can be
> > integrated and used by all components, e.g. training a Mallet CRF or
> > maxent based model is now possible (the Mallet integration can be
> > found in the sandbox; see the sketch at the end of this thread).
> >
> > Also, small things are happening all the time. The name finder now
> > has support for word cluster dictionaries, which can make all the
> > difference between a badly and a very well performing model.
> >
> > Other NLP libraries such as the mentioned Stanford NER are missing
> > features. OpenNLP comes with built-in evaluation support and can
> > benchmark itself on a dataset by using cross validation; this
> > includes tools which output detailed information about recognition
> > mistakes. It has built-in support for many different corpora, such as
> > OntoNotes, CoNLL, or user-created brat corpora.
> >
> > The main difference between OpenNLP and Stanford NER / NLTK is
> > probably that OpenNLP is developed for usage in production systems
> > and not for research or teaching purposes.
> >
> > Jörn
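To make the pluggable machine learning support concrete, here is a hedged
sketch against the OpenNLP 1.6 API; the file name train.txt, the
parameter values, and the choice of the name finder are illustrative
assumptions, not the only way to set this up.

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderFactory;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class PluggableTrainingSketch {
        public static void main(String[] args) throws Exception {
            // train.txt is a placeholder in the name finder training
            // format, e.g. "<START:person> Pierre Vinken <END> is ..."
            ObjectStream<NameSample> samples = new NameSampleDataStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(
                                    new File("train.txt")), "UTF-8"));

            // The trainer is selected by a parameter rather than being
            // hard-wired: MAXENT_QN is the L-BFGS maxent trainer;
            // PERCEPTRON or NAIVEBAYES can be dropped in the same way.
            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT_QN");
            params.put(TrainingParameters.ITERATIONS_PARAM, "300");
            params.put(TrainingParameters.CUTOFF_PARAM, "1");

            TokenNameFinderModel model = NameFinderME.train("en", null,
                    samples, params, new TokenNameFinderFactory());

            // The resulting model package carries everything needed to
            // run it consistently on another machine.
            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream("ner-custom.bin"))) {
                model.serialize(out);
            }
        }
    }

The same TrainingParameters drive the evaluation tooling Joern mentions,
e.g. TokenNameFinderCrossValidator, so a swapped-in trainer can be
benchmarked by cross validation on exactly the same data.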
