On Thu, Nov 12, 2015 at 8:50 PM Jason Baldridge <[email protected]> wrote:
> As one of the people who got OpenNLP started in the late 1990s (for
> research, but hoping it could be used by industry), it makes me smile to
> know that lots of people use it happily to this day. :)
>
> There are lots of new kids in town, but the licensing is often
> conflicted, and the biggest benefits often come---as Joern mentions---from
> having the right data to train your classifier.
>
> Having said that, there is a lot of activity in the deep learning space,
> where old techniques (neural nets) are now viable in ways they weren't
> previously, and they are outperforming linear classifiers in task after
> task. I'm currently looking at Deeplearning4J, and it would be great to
> have OpenNLP or a project like it make solid NLP models available based
> on deep learning methods, especially LSTMs and Convolutional Neural
> Nets. Deeplearning4J is Java/Scala friendly and it is ASL, so that's at
> least setting off on the right foot.
>
> http://deeplearning4j.org/
>
> The ND4J library (based on NumPy) that was built to support DL4J is also
> likely to be useful for other Java projects that use machine learning.

+1, thanks Jason, it's indeed an interesting field we should look into.
Another interesting technique based on neural networks is the one related
to word vectors (aka word embeddings) [1]. I agree with Joern it'd be
interesting to see if we can provide an integration with DL4J.
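To make the word embedding idea concrete, here is a minimal sketch of
training vectors with DL4J's Word2Vec, assuming the builder API shown in
the DL4J examples; corpus.txt is a hypothetical file with one sentence of
raw text per line.

    import java.util.Collection;

    import org.deeplearning4j.models.word2vec.Word2Vec;
    import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
    import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

    public class WordVectorsSketch {
        public static void main(String[] args) throws Exception {
            // corpus.txt is a placeholder: raw text, one sentence per line.
            SentenceIterator sentences = new BasicLineIterator("corpus.txt");
            TokenizerFactory tokenizer = new DefaultTokenizerFactory();

            Word2Vec vec = new Word2Vec.Builder()
                    .minWordFrequency(5)  // drop very rare words
                    .layerSize(100)       // dimensionality of the word vectors
                    .windowSize(5)        // context window for co-occurrence
                    .iterate(sentences)
                    .tokenizerFactory(tokenizer)
                    .build();
            vec.fit();  // pass over the corpus and learn the embeddings

            // Words that occur in similar contexts end up close together.
            Collection<String> nearest = vec.wordsNearest("money", 10);
            System.out.println(nearest);
        }
    }

An OpenNLP integration could then, for example, feed such vectors into
the name finder as features, much like the word cluster dictionaries
Joern mentions below.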
Regards,
Tommaso

[1] : https://en.wikipedia.org/wiki/Word_embedding

>
> -Jason
>
> On Thu, 12 Nov 2015 at 09:44 Russ, Daniel (NIH/CIT) [E]
> <[email protected]> wrote:
>
> > Chris,
> > Joern is correct. However, if I can slightly disagree on a few minor
> > points:
> >
> > 1) I use the old SourceForge models. I find that the sources of error
> > in my analysis are usually not due to mistakes in sentence detection
> > or POS tagging. I don't have the annotated data or the time/money to
> > build custom models. Yes, the text I analyze is quite different from
> > the corpus used to build the models (WSJ? or whatever it was), but it
> > is good enough.
> >
> > 2) MaxEnt is still a good classifier for NLP, and L-BFGS is just an
> > algorithm to calculate the weights for the features. It is an
> > improvement on GIS, not a different classifier. I am not familiar
> > enough with CRFs to comment, but the seminal paper by Della Pietra
> > (IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 4, 1997) makes
> > it appear to be an extension of MaxEnt. The Stanford NLP group has
> > lecture slides online explaining why discriminative classification
> > methods (e.g. MaxEnt) work better than generative ones (Naive Bayes)
> > (see
> > https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf;
> > particularly the example with traffic lights).
> >
> > As I briefly mentioned earlier, OpenNLP is a mature product. It has
> > undergone some MAJOR upgrades. It is not obsolete. As for the other
> > tools/libraries, they are also fine products. I use the Stanford
> > parser to get dependency information; OpenNLP just does not do it. I
> > don't use NLTK because I don't need to; if the need arises, I will. I
> > assume that you don't have the time and money to learn every new NLP
> > product. I would say play to your strengths: if you know the package,
> > use it. Don't change because it's trendy.
> >
> > Daniel Russ, Ph.D.
> > Staff Scientist, Division of Computational Bioscience
> > Center for Information Technology
> > National Institutes of Health
> > U.S. Department of Health and Human Services
> > 12 South Drive
> > Bethesda, MD 20892-5624
> >
> > On Nov 11, 2015, at 4:41 PM, Joern Kottmann <[email protected]> wrote:
> >
> > Hello,
> >
> > It is definitely true that OpenNLP has existed for a long time (more
> > than 10 years), but that doesn't mean it wasn't improved. Actually it
> > changed a lot in that period.
> >
> > The core strength of OpenNLP was always that it can be used really
> > easily to perform one of the supported NLP tasks.
> >
> > This was further improved with the 1.5 release, which added model
> > packages that ensure that the components are always instantiated
> > correctly across different runtime environments.
> >
> > The problem is that the system used to train a model and the system
> > used to run it can be quite different. Prior to 1.5 it was possible to
> > get that wrong, which resulted in hard-to-notice performance problems.
> > I suspect that is an issue many of the competing solutions still have
> > today.
> >
> > An example is the usage of String.toLowerCase(): its output depends on
> > the platform's default locale.
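A self-contained illustration of that pitfall; the strings are chosen for
illustration, but the behaviour is standard Java:

    import java.util.Locale;

    public class LocaleCasing {
        public static void main(String[] args) {
            String token = "ISTANBUL";

            // English casing rules: prints "istanbul"
            System.out.println(token.toLowerCase(Locale.ENGLISH));

            // Turkish rules map 'I' to dotless 'ı': prints "ıstanbul"
            System.out.println(token.toLowerCase(new Locale("tr", "TR")));

            // The no-argument form uses the platform default locale, so
            // the same code can generate different features on the
            // training machine and the production machine.
            System.out.println(token.toLowerCase());
        }
    }

Packaging a model together with its exact feature-generation setup, as
the 1.5 model packages do, avoids this class of train/runtime mismatch.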
> > One of the things that had become a bit dated was the machine
> > learning part of OpenNLP; this was addressed by adding more
> > algorithms (e.g. perceptron and L-BFGS maxent). In addition, the
> > machine learning part is now pluggable and can easily be swapped for
> > a different implementation for testing or production use. The sandbox
> > contains an experimental Mallet integration which offers all the
> > Mallet classifiers; even CRFs can be used.
> >
> > On Fri, 2015-11-06 at 16:54 +0000, Mattmann, Chris A (3980) wrote:
> > > Hi Everyone,
> > >
> > > Hope you're well! I'm new to the list, however I just wanted to
> > > state I'm really happy with what I've seen with OpenNLP so far. My
> > > team and I have built a Tika Parser [1] that uses OpenNLP's
> > > location NER model, along with a Lucene GeoNames gazetteer [2], to
> > > create a "GeoTopicParser". We are improving it day to day. OpenNLP
> > > has definitely come a long way from when I looked at it years ago
> > > in its nascence.
> >
> > Do you use the old SourceForge location model?
> >
> > > That said, I keep hearing from people I talk to in the NLP
> > > community that, for example, OpenNLP is "old", and that I should be
> > > looking at e.g. Stanford's NER, NLTK, etc. Besides obvious license
> > > issues (Stanford NER is GPL, as an example, and I am only
> > > interested in ALv2 or permissively licensed code), I don't have a
> > > great answer to whether or not OpenNLP is old, or not active, or
> > > not as good, etc. Can devs on this list help me answer that
> > > question? I'd like to be able to tell these NLP people the next
> > > time I talk to them that no, in fact, OpenNLP isn't *old*, it's
> > > active, and there are these X and Y lines of development, and
> > > here's where they are going, etc.
> >
> > One of the big issues is that OpenNLP only works well if the model is
> > suitable for the use case of the user. The pre-trained models usually
> > are not, and people tend to judge us by the performance of those old
> > models.
> >
> > People who benchmark NLP software often put OpenNLP in a bad light by
> > comparing different solutions based on their pre-trained models. In
> > my opinion a fair comparison of statistical NLP software is only
> > possible if they all use exactly the same training data, as is done
> > in one of the many shared tasks; otherwise it is like comparing the
> > lap times of two cars driving on two different race tracks.
> >
> > I personally use OpenNLP a lot at work and it runs in many of our
> > customized NLP pipelines to analyze incoming texts. In those
> > pipelines it is often used in a highly customized setup and trained
> > on the data that is processed by that pipeline.
> >
> > I think that must be its single biggest strength: it can be easily
> > used and customized, and almost everything can be done without much
> > hassle. All components have a custom factory that can be used to
> > fundamentally change how components are put together, including
> > custom code at almost any place.
> >
> > OpenNLP used to be maxent-only, but that changed quite some time ago:
> > perceptron was added years back, we just received a Naive Bayes
> > contribution, and since 1.6.0 there is pluggable machine learning
> > support. Existing machine learning libraries such as Mallet can be
> > integrated and used by all components, e.g. training a Mallet CRF or
> > maxent based model is now possible (the Mallet integration can be
> > found in the sandbox; see the sketch at the end of this thread).
> >
> > Also, small things are happening all the time. The name finder now
> > has support for word cluster dictionaries, which can make all the
> > difference between a badly and a very well performing model.
> >
> > Other NLP libraries such as the mentioned Stanford NER are missing
> > features. OpenNLP comes with built-in evaluation support and can
> > benchmark itself on a dataset by using cross validation; this
> > includes tools which output detailed information about recognition
> > mistakes. It has built-in support for many different corpora, such as
> > OntoNotes, CoNLL, or user-created brat corpora.
> >
> > The main difference between OpenNLP and Stanford NER / NLTK is
> > probably that OpenNLP is developed for usage in production systems
> > and not for research or teaching purposes.
> >
> > Jörn
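To make the pluggable machine learning support concrete, here is a hedged
sketch against the OpenNLP 1.6 API; the file name train.txt, the
parameter values, and the choice of the name finder are illustrative
assumptions, not the only way to set this up.

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderFactory;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class PluggableTrainingSketch {
        public static void main(String[] args) throws Exception {
            // train.txt is a placeholder in the name finder training
            // format, e.g. "<START:person> Pierre Vinken <END> is ..."
            ObjectStream<NameSample> samples = new NameSampleDataStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(
                                    new File("train.txt")), "UTF-8"));

            // The trainer is selected by a parameter rather than being
            // hard-wired: MAXENT_QN is the L-BFGS maxent trainer;
            // PERCEPTRON or NAIVEBAYES can be dropped in the same way.
            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT_QN");
            params.put(TrainingParameters.ITERATIONS_PARAM, "300");
            params.put(TrainingParameters.CUTOFF_PARAM, "1");

            TokenNameFinderModel model = NameFinderME.train("en", null,
                    samples, params, new TokenNameFinderFactory());

            // The resulting model package carries everything needed to
            // run it consistently on another machine.
            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream("ner-custom.bin"))) {
                model.serialize(out);
            }
        }
    }

The same TrainingParameters drive the evaluation tooling Joern mentions,
e.g. TokenNameFinderCrossValidator, so a swapped-in trainer can be
benchmarked by cross validation on exactly the same data.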
