Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.
Yes, I think it's critical that we also distribute models and have things like brew packages and so forth, so they are simple to install. Imagine:

  # brew install opennlp --with-models

I'll start working on that.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Joern Kottmann
Reply-To:
Date: Thursday, November 12, 2015 at 5:22 PM
To:
Subject: Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.

> On Thu, 2015-11-12 at 15:43 +, Russ, Daniel (NIH/CIT) [E] wrote:
>> 1) I use the old sourceforge models. I find that the sources of error
>> in my analysis are usually not due to mistakes in sentence detection
>> or POS tagging. I don't have the annotated data or the time/money to
>> build custom models. Yes, the text I analyze is quite different from
>> the (WSJ? or whatever corpus was used to build the models), but it is
>> good enough.
>
> That is interesting; I wasn't aware that those are still useful.
>
> It really depends on the component as well; I was mostly thinking about
> the name finder models when I wrote that.
>
> Do you only use the Sentence Detector, Tokenizer and POS tagger?
>
> You could use OntoNotes (almost for free) to train models. Maybe we
> should look into distributing models trained on OntoNotes.
>
> Jörn
Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.
We should also look at TensorFlow; Continuum just packaged it up as a conda package through our Memex project.

Cheers,
Chris

-----Original Message-----
From: Joern Kottmann
Reply-To:
Date: Thursday, November 12, 2015 at 5:18 PM
To:
Subject: Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.

> On Thu, 2015-11-12 at 19:50 +, Jason Baldridge wrote:
>> Having said that, there is a lot of activity in the deep learning
>> space, where old techniques (neural nets) are now viable in ways they
>> weren't previously, and they are outperforming linear classifiers in
>> task after task. I'm currently looking at Deeplearning4J, and it would
>> be great to have OpenNLP or a project like it make solid NLP models
>> available based on deep learning methods, especially LSTMs and
>> Convolutional Neural Nets. Deeplearning4J is Java/Scala friendly and
>> it is ASL, so that's at least setting off on the right foot.
>>
>> http://deeplearning4j.org/
>
> I hope I can find a bit of time to write an integration for it.
> Thanks for sharing!
>
> Jörn
Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.
On Thu, 2015-11-12 at 15:43 +, Russ, Daniel (NIH/CIT) [E] wrote:
> 1) I use the old sourceforge models. I find that the sources of error
> in my analysis are usually not due to mistakes in sentence detection
> or POS tagging. I don't have the annotated data or the time/money to
> build custom models. Yes, the text I analyze is quite different from
> the (WSJ? or whatever corpus was used to build the models), but it is
> good enough.

That is interesting; I wasn't aware that those are still useful.

It really depends on the component as well; I was mostly thinking about
the name finder models when I wrote that.

Do you only use the Sentence Detector, Tokenizer and POS tagger?

You could use OntoNotes (almost for free) to train models. Maybe we
should look into distributing models trained on OntoNotes.

Jörn
Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.
On Thu, 2015-11-12 at 19:50 +, Jason Baldridge wrote:
> Having said that, there is a lot of activity in the deep learning
> space, where old techniques (neural nets) are now viable in ways they
> weren't previously, and they are outperforming linear classifiers in
> task after task. I'm currently looking at Deeplearning4J, and it would
> be great to have OpenNLP or a project like it make solid NLP models
> available based on deep learning methods, especially LSTMs and
> Convolutional Neural Nets. Deeplearning4J is Java/Scala friendly and
> it is ASL, so that's at least setting off on the right foot.
>
> http://deeplearning4j.org/

I hope I can find a bit of time to write an integration for it.
Thanks for sharing!

Jörn
Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.
As one of the people who got OpenNLP started in the late 1990s (for research, but hoping it could be used by industry), it makes me smile to know that lots of people use it happily to this day. :) There are lots of new kids in town, but the licensing is often conflicted, and the biggest benefits often come, as Joern mentions, from having the right data to train your classifier.

Having said that, there is a lot of activity in the deep learning space, where old techniques (neural nets) are now viable in ways they weren't previously, and they are outperforming linear classifiers in task after task. I'm currently looking at Deeplearning4J, and it would be great to have OpenNLP or a project like it make solid NLP models available based on deep learning methods, especially LSTMs and Convolutional Neural Nets. Deeplearning4J is Java/Scala friendly and it is ASL, so that's at least setting off on the right foot.

http://deeplearning4j.org/

The ND4J library (based on NumPy) that was built to support DL4J is also likely to be useful for other Java projects that use machine learning.

-Jason

On Thu, 12 Nov 2015 at 09:44 Russ, Daniel (NIH/CIT) [E] wrote:

> Chris,
> Joern is correct. However, if I can slightly disagree on a few minor
> points:
>
> 1) I use the old sourceforge models. I find that the sources of error
> in my analysis are usually not due to mistakes in sentence detection or
> POS tagging. I don't have the annotated data or the time/money to build
> custom models. Yes, the text I analyze is quite different from the
> (WSJ? or whatever corpus was used to build the models), but it is good
> enough.
>
> 2) MaxEnt is still a good classifier for NLP, and L-BFGS is just an
> algorithm to calculate the weights for the features. It is an
> improvement on GIS, not a different classifier. I am not familiar
> enough with CRFs to comment, but the seminal paper by Della Pietra
> (IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 4, 1997) makes
> it appear to be an extension of MaxEnt. The Stanford NLP group has ppt
> lectures online explaining why discriminative classification methods
> (e.g. MaxEnt) work better than generative (Naive Bayes) models (see
> https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf;
> particularly the example with traffic lights).
>
> As I briefly mentioned earlier, OpenNLP is a mature product. It has
> undergone some MAJOR upgrades. It is not obsolete. As for the other
> tools/libraries, they are also fine products. I use the Stanford parser
> to get dependency information; OpenNLP just does not do it. I don't use
> NLTK because I don't need to. If the need arises, I will. I assume that
> you don't have the time and money to learn every new NLP product. I
> would say play to your strengths. If you know the package, use it.
> Don't change because it's trendy.
>
> Daniel Russ, Ph.D.
> Staff Scientist, Division of Computational Bioscience
> Center for Information Technology
> National Institutes of Health
> U.S. Department of Health and Human Services
> 12 South Drive
> Bethesda, MD 20892-5624
>
> On Nov 11, 2015, at 4:41 PM, Joern Kottmann <kottm...@gmail.com> wrote:
>
> Hello,
>
> It is definitely true that OpenNLP has existed for a long time (more
> than 10 years), but that doesn't mean it wasn't improved. Actually, it
> changed a lot in that period.
>
> The core strength of OpenNLP was always that it can be used really
> easily to perform one of the supported NLP tasks.
>
> This was further improved with the 1.5 release, which added model
> packages that ensure that the components are always instantiated
> correctly across different runtime environments.
>
> The problem is that the system used to perform the training of a model
> and the system used to run it can be quite different. Prior to 1.5 it
> was possible to get that wrong, which resulted in hard-to-notice
> performance problems. I suspect that is an issue many of the competing
> solutions still have today.
>
> An example is the usage of String.toLowerCase(): its output depends on
> the platform locale.
>
> One of the things that got a bit dated was the machine learning part of
> OpenNLP; this was addressed by adding more algorithms (e.g. perceptron
> and L-BFGS maxent). In addition, the machine learning part is now
> pluggable and can easily be switched to a different implementation for
> testing or production use. The sandbox contains an experimental Mallet
> integration which offers all the Mallet classifiers; even CRFs can be
> used.
>
> On Fri, 2015-11-06 at 16:54 +, Mattmann, Chris A (3980) wrote:
>> Hi Everyone,
>>
>> Hope you're well! I'm new to the list, however I just wanted to state
>> I'm really happy with what I've seen with OpenNLP so far. My team and
>> I have built a Tika Parser [1] that uses OpenNLP's location NER model,
>> along with a Lucene GeoNames Gazetteer [2] to create a
>> "GeoTopicParser". We are improving it day to day. OpenNLP has
>> definitely come a long way from when I looked at it years ago in its
>> nascence.
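The String.toLowerCase() pitfall quoted above is easy to reproduce with nothing but the JDK; the classic trigger is the Turkish locale, where uppercase 'I' lowercases to a dotless 'ı' (U+0131). A minimal sketch (the class name and the example feature string are illustrative, not from OpenNLP):

```java
import java.util.Locale;

// Demonstrates why locale-sensitive lowercasing can silently change
// the feature strings a model sees at training time vs. run time.
public class LocaleLowercaseDemo {
    public static void main(String[] args) {
        String feature = "TITLE";

        // Locale.ROOT gives stable, platform-independent case mapping.
        System.out.println(feature.toLowerCase(Locale.ROOT));            // title

        // Under a Turkish locale, 'I' maps to dotless 'ı' (U+0131).
        System.out.println(feature.toLowerCase(new Locale("tr", "TR"))); // tıtle
    }
}
```

If a model is trained on a machine whose default locale lowercases differently from the machine it runs on, the two feature strings never match and accuracy degrades with no error message, which is exactly the kind of environment mismatch the 1.5 model packages were meant to guard against.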
Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.
Chris,
Joern is correct. However, if I can slightly disagree on a few minor points:

1) I use the old sourceforge models. I find that the sources of error in my analysis are usually not due to mistakes in sentence detection or POS tagging. I don't have the annotated data or the time/money to build custom models. Yes, the text I analyze is quite different from the (WSJ? or whatever corpus was used to build the models), but it is good enough.

2) MaxEnt is still a good classifier for NLP, and L-BFGS is just an algorithm to calculate the weights for the features. It is an improvement on GIS, not a different classifier. I am not familiar enough with CRFs to comment, but the seminal paper by Della Pietra (IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 4, 1997) makes it appear to be an extension of MaxEnt. The Stanford NLP group has ppt lectures online explaining why discriminative classification methods (e.g. MaxEnt) work better than generative (Naive Bayes) models (see https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf; particularly the example with traffic lights).

As I briefly mentioned earlier, OpenNLP is a mature product. It has undergone some MAJOR upgrades. It is not obsolete. As for the other tools/libraries, they are also fine products. I use the Stanford parser to get dependency information; OpenNLP just does not do it. I don't use NLTK because I don't need to. If the need arises, I will. I assume that you don't have the time and money to learn every new NLP product. I would say play to your strengths. If you know the package, use it. Don't change because it's trendy.

Daniel Russ, Ph.D.
Staff Scientist, Division of Computational Bioscience
Center for Information Technology
National Institutes of Health
U.S. Department of Health and Human Services
12 South Drive
Bethesda, MD 20892-5624

On Nov 11, 2015, at 4:41 PM, Joern Kottmann <kottm...@gmail.com> wrote:

> Hello,
>
> It is definitely true that OpenNLP has existed for a long time (more
> than 10 years), but that doesn't mean it wasn't improved. Actually, it
> changed a lot in that period.
>
> The core strength of OpenNLP was always that it can be used really
> easily to perform one of the supported NLP tasks.
>
> This was further improved with the 1.5 release, which added model
> packages that ensure that the components are always instantiated
> correctly across different runtime environments.
>
> The problem is that the system used to perform the training of a model
> and the system used to run it can be quite different. Prior to 1.5 it
> was possible to get that wrong, which resulted in hard-to-notice
> performance problems. I suspect that is an issue many of the competing
> solutions still have today.
>
> An example is the usage of String.toLowerCase(): its output depends on
> the platform locale.
>
> One of the things that got a bit dated was the machine learning part of
> OpenNLP; this was addressed by adding more algorithms (e.g. perceptron
> and L-BFGS maxent). In addition, the machine learning part is now
> pluggable and can easily be switched to a different implementation for
> testing or production use. The sandbox contains an experimental Mallet
> integration which offers all the Mallet classifiers; even CRFs can be
> used.
>
> On Fri, 2015-11-06 at 16:54 +, Mattmann, Chris A (3980) wrote:
>> Hi Everyone,
>>
>> Hope you're well! I'm new to the list, however I just wanted to state
>> I'm really happy with what I've seen with OpenNLP so far. My team and
>> I have built a Tika Parser [1] that uses OpenNLP's location NER model,
>> along with a Lucene GeoNames Gazetteer [2] to create a
>> "GeoTopicParser". We are improving it day to day. OpenNLP has
>> definitely come a long way from when I looked at it years ago in its
>> nascence.
>
> Do you use the old sourceforge location model?
>
>> That said, I keep hearing from people I talk to in the NLP community
>> that for example OpenNLP is "old", and that I should be looking at
>> e.g., Stanford's NER, and NLTK, etc. Besides obvious license issues
>> (NER is GPL as an example and I am only interested in ALv2 or
>> permissive license code), I don't have a great answer to whether or
>> not OpenNLP is old or not active, or not as good, etc. Can devs on
>> this list help me answer that question? I'd like to be able to tell
>> these NLP people the next time I talk to them that no, in fact,
>> OpenNLP isn't *old*, and it's active, and there are these X and Y
>> lines of development, and here's where they are going, etc.
>
> One of the big issues is that OpenNLP only works well if the model is
> suitable for the use case of the user. The pre-trained models usually
> are not, and people tend to judge us by the performance of those old
> models. People who benchmark NLP software often put OpenNLP in a bad
> light by comparing different solutions based on their pre-trained
> models. In my opinion a fair comparison of statistical NLP software is
> only possible if they all use exactly the same training data, as done
> in one of the many shared tasks; otherwise it is like comparing lap
> times of