Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.

2015-11-12 Thread Joern Kottmann
On Thu, 2015-11-12 at 15:43, Russ, Daniel (NIH/CIT) [E] wrote:
> 1) I use the old SourceForge models.  I find that the sources of error
> in my analysis are usually not due to mistakes in sentence detection or
> POS tagging.  I don’t have the annotated data or the time/money to
> build custom models.  Yes, the text I analyze is quite different from
> the (WSJ? or whatever corpus was used to build the models), but it is
> good enough.

That is interesting; I wasn't aware that those are still useful.

It really depends on the component as well, I was mostly thinking about
the name finder models when I wrote that.

Do you only use the Sentence Detector, Tokenizer and POS tagger?

You could use OntoNotes (almost for free) to train models. Maybe we
should look into distributing models trained on OntoNotes.

Jörn





Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.

2015-11-12 Thread Jason Baldridge
As one of the people who got OpenNLP started in the late 1990s (for
research, but hoping it could be used by industry), it makes me smile to
know that lots of people use it happily to this day. :)

There are lots of new kids in town, but the licensing is often conflicted,
and the biggest benefits often come---as Joern mentions---by having the
right data to train your classifier.

Having said that, there is a lot of activity in the deep learning space,
where old techniques (neural nets) are now viable in ways they weren't
previously, and they are outperforming linear classifiers in task after
task. I'm currently looking at Deeplearning4J, and it would be great to
have OpenNLP or a project like it make solid NLP models available based on
deep learning methods, especially LSTMs and Convolutional Neural Nets.
Deeplearning4J is Java/Scala friendly and it is ASL, so that's at least
setting off on the right foot.

http://deeplearning4j.org/

The ND4J library (modeled on NumPy) that was built to support DL4J is also
likely to be useful for other Java projects that use machine learning.

-Jason


Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.

2015-11-12 Thread Mattmann, Chris A (3980)
Also, we should look at TensorFlow; Continuum just packaged it up
as a conda package through our Memex project.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++








Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.

2015-11-12 Thread Joern Kottmann
On Thu, 2015-11-12 at 19:50, Jason Baldridge wrote:
> Having said that, there is a lot of activity in the deep learning space,
> where old techniques (neural nets) are now viable in ways they weren't
> previously, and they are outperforming linear classifiers in task after
> task. I'm currently looking at Deeplearning4J, and it would be great to
> have OpenNLP or a project like it make solid NLP models available based
> on deep learning methods, especially LSTMs and Convolutional Neural Nets.
> Deeplearning4J is Java/Scala friendly and it is ASL, so that's at least
> setting off on the right foot.
> 
> http://deeplearning4j.org/


I hope I can find a bit of time to write an integration for it.
Thanks for sharing!

Jörn





Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.

2015-11-12 Thread Mattmann, Chris A (3980)
Yes, I think it’s critical that we also distribute models and have,
e.g., things like brew packages and so forth, so they are simple
to install. Imagine:

# brew install opennlp --with-models

I’ll start working on that.

Cheers,
Chris






-Original Message-
From: Joern Kottmann <kottm...@gmail.com>
Reply-To: <dev@opennlp.apache.org>
Date: Thursday, November 12, 2015 at 5:22 PM
To: <dev@opennlp.apache.org>
Subject: Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford
NER, etc.

>You could use OntoNotes (almost for free) to train models. Maybe we
>should look into distributing models trained on OntoNotes.
>
>Jörn
>



Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.

2015-11-12 Thread Russ, Daniel (NIH/CIT) [E]
Chris,
Joern is correct.  However, if I may, I’d like to disagree slightly on a
few minor points.

1) I use the old SourceForge models.  I find that the sources of error in my
analysis are usually not due to mistakes in sentence detection or POS tagging.
I don’t have the annotated data or the time/money to build custom models.  Yes,
the text I analyze is quite different from the (WSJ? or whatever corpus was
used to build the models), but it is good enough.

2) MaxEnt is still a good classifier for NLP, and L-BFGS is just an algorithm
to calculate the weights for the features.  It is an improvement on GIS, not a
different classifier.  I am not familiar enough with CRFs to comment, but the
seminal paper by Della Pietra et al. (IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 19, no. 4, 1997) makes them appear to be an extension of MaxEnt.  The
Stanford NLP group has lecture slides online explaining why discriminative
classification methods (e.g. MaxEnt) work better than generative (Naive Bayes)
models (see
https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf;
particularly the example with traffic lights).
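For reference, the model form that both optimizers fit is the standard
conditional maxent model (written here in generic notation, not tied to any
particular implementation):

```latex
p(c \mid x) \;=\; \frac{\exp\!\left(\sum_i \lambda_i f_i(x, c)\right)}
                       {\sum_{c'} \exp\!\left(\sum_i \lambda_i f_i(x, c')\right)}
```

GIS and L-BFGS are just two ways of finding the weights \lambda_i that
maximize the training-data likelihood of this same model; the classifier
itself is unchanged.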

As I briefly mentioned earlier, OpenNLP is a mature product.  It has undergone
some MAJOR upgrades.  It is not obsolete.  As for the other tools/libraries,
they are also fine products.  I use the Stanford parser to get dependency
information; OpenNLP just does not do it.  I don’t use NLTK because I haven’t
needed to.  If the need arises, I will.  I assume that you don’t have the time
and money to learn every new NLP product.  I would say play to your strengths:
if you know a package, use it.  Don’t change just because it’s trendy.



Daniel Russ, Ph.D.
Staff Scientist, Division of Computational Bioscience
Center for Information Technology
National Institutes of Health
U.S. Department of Health and Human Services
12 South Drive
Bethesda,  MD 20892-5624

On Nov 11, 2015, at 4:41 PM, Joern Kottmann wrote:

Hello,



It is definitely true that OpenNLP has existed for a long time (more than 10
years), but that doesn't mean it hasn't improved. Actually it has changed a
lot in that period.

The core strength of OpenNLP has always been that it can be used really
easily to perform one of the supported NLP tasks.

This was further improved with the 1.5 release adding model packages
that ensure that the components are always instantiated correctly across
different runtime environments.

The problem is that the system used to perform the training of a model
and the system used to run it can be quite different. Prior to 1.5 it was
possible to get that wrong, which resulted in hard-to-notice performance
problems.
I suspect that is an issue many of the competing solutions still have
today.

An example is the usage of String.toLowerCase(): its output depends on
the platform locale.
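A minimal, self-contained sketch of that pitfall (plain JDK, no OpenNLP code;
the string is just an example):

```java
import java.util.Locale;

public class LocaleLowercase {
    public static void main(String[] args) {
        String token = "ISTANBUL";

        // Depends on the JVM's default locale: on a Turkish system the
        // dotted/dotless-i rules apply, so 'I' lowercases to U+0131.
        String platformDependent = token.toLowerCase();

        // Forcing the Turkish locale reproduces the surprise everywhere:
        String turkish = token.toLowerCase(new Locale("tr", "TR"));

        // Locale.ROOT gives a stable, locale-independent result, the safe
        // choice when the lowercased string feeds a model at train and run time.
        String stable = token.toLowerCase(Locale.ROOT);

        System.out.println(turkish); // prints "ıstanbul" (dotless i)
        System.out.println(stable);  // prints "istanbul"
    }
}
```

If a model was trained under one default locale and run under another, the
generated feature strings simply stop matching, which is exactly the kind of
hard-to-notice degradation at issue here.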

One of the things that had become a bit dated was the machine learning part
of OpenNLP; this was addressed by adding more algorithms (e.g. perceptron
and L-BFGS maxent). In addition, the machine learning part is now pluggable
and can easily be swapped for a different implementation for testing or
production use. The sandbox contains an experimental Mallet integration
which offers all the Mallet classifiers; even CRFs can be used.
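As an illustration of that design (the names below are invented for this
sketch and are not OpenNLP's actual classes), the idea is that components
code against a small trainer interface and the concrete algorithm is chosen
by name, so swapping implementations never touches component code:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: invented names, NOT OpenNLP's real API.
public class PluggableTraining {

    /** The only thing a component sees: something that turns counts into weights. */
    interface EventTrainer {
        Map<String, Double> train(Map<String, Integer> featureCounts);
    }

    /** Toy implementation #1: weight = raw count. */
    static class CountTrainer implements EventTrainer {
        public Map<String, Double> train(Map<String, Integer> counts) {
            Map<String, Double> weights = new HashMap<>();
            counts.forEach((f, c) -> weights.put(f, c.doubleValue()));
            return weights;
        }
    }

    /** Toy implementation #2, behind the same interface: log-damped weights. */
    static class LogCountTrainer implements EventTrainer {
        public Map<String, Double> train(Map<String, Integer> counts) {
            Map<String, Double> weights = new HashMap<>();
            counts.forEach((f, c) -> weights.put(f, Math.log(1.0 + c)));
            return weights;
        }
    }

    /** Selection by name, the way a training-parameters map might do it. */
    static EventTrainer select(String algorithm) {
        return "LOG".equals(algorithm) ? new LogCountTrainer() : new CountTrainer();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("w=the", 3);
        // Swapping the algorithm is a one-string change for the caller:
        System.out.println(select("COUNT").train(counts).get("w=the")); // 3.0
        System.out.println(select("LOG").train(counts).get("w=the"));   // ln(4)
    }
}
```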

On Fri, 2015-11-06 at 16:54, Mattmann, Chris A (3980) wrote:
Hi Everyone,

Hope you’re well! I’m new to the list, however I just wanted to
state I’m really happy with what I’ve seen with OpenNLP so far.
My team and I have built a Tika Parser [1] that uses OpenNLP’s
location NER model, along with a Lucene Geo Names Gazetteer [2]
to create a “GeoTopicParser”. We are improving it day to day.
OpenNLP has definitely come a long way from when I looked at it
years ago in its nascence.

Do you use the old SourceForge location model?

That said, I keep hearing from people I talk to in the NLP community
that for example OpenNLP is “old”, and that I should be looking
at e.g., Stanford’s NER, and NLTK, etc. Besides obvious license
issues (NER is GPL as an example and I am only interested in ALv2
or permissive-license code), I don’t have a great answer to whether
or not OpenNLP is old or not active, or not as good, etc. Can
devs on this list help me answer that question? I’d like to be
able to tell these NLP people the next time I talk to them that
no, in fact, OpenNLP isn’t *old*, and it’s active, and there are
these X and Y lines of development, and here’s where they are
going, etc.

One of the big issues is that OpenNLP only works well if the model is
suitable for the use case of the user. The pre-trained models usually
are not, and people tend to judge us by the performance of those old
models.

People who benchmark NLP software often put OpenNLP in a bad light by
comparing different solutions based on their pre-trained models. In my
opinion, a fair comparison of statistical NLP software is only possible
if all systems use exactly the same training data, as is done in the
many shared tasks; otherwise it is like 

Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.

2015-11-06 Thread Mattmann, Chris A (3980)
Hi Everyone,

Hope you’re well! I’m new to the list, however I just wanted to
state I’m really happy with what I’ve seen with OpenNLP so far.
My team and I have built a Tika Parser [1] that uses OpenNLP’s
location NER model, along with a Lucene Geo Names Gazetteer [2]
to create a “GeoTopicParser”. We are improving it day to day.
OpenNLP has definitely come a long way from when I looked at it
years ago in its nascence.

That said, I keep hearing from people I talk to in the NLP community
that for example OpenNLP is “old”, and that I should be looking
at e.g., Stanford’s NER, and NLTK, etc. Besides obvious license
issues (NER is GPL as an example and I am only interested in ALv2
or permissive-license code), I don’t have a great answer to whether
or not OpenNLP is old or not active, or not as good, etc. Can
devs on this list help me answer that question? I’d like to be
able to tell these NLP people the next time I talk to them that
no, in fact, OpenNLP isn’t *old*, and it’s active, and there are
these X and Y lines of development, and here’s where they are
going, etc.

Looking at:

https://whimsy.apache.org/board/minutes/OpenNLP


I see that you have done GSoC, are working on a Naive Bayes
classifier and summarization components, and are making releases
(seemingly more frequently). I also looked at:

https://reporter.apache.org/


And I see that your project health score and activity are excellent.

Thanks and let me know.

Cheers,
Chris






Re: Question about OpenNLP and comparison to e.g., NLTK, Stanford NER, etc.

2015-11-06 Thread Mattmann, Chris A (3980)
Forgot the refs:

[1] http://wiki.apache.org/tika/GeoTopicParser
[2] https://github.com/chrismattmann/lucene-geo-gazetteer.git





