Re: [VOTE] Apache OpenNLP 1.8.1 Release Candidate 3

2017-07-05 Thread Suneel Marthi
+1 binding

1. Ran the complete suite of Eval tests - all passed
2. Built from {source} * {tar, zip} - all unit tests pass
3. verified sigs and hashes
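
The hash half of step 3 can be sketched as follows. This is a minimal illustration, not the release manager's actual procedure: the default algorithm name, the function names, and the assumption that the published checksum is a bare hex string are all hypothetical. Signature checks are a separate step done with an OpenPGP tool and the project's key file.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha512", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, published: str, algorithm: str = "sha512") -> bool:
    """Compare a file against a published checksum (assumed to be bare hex).

    Whitespace and letter case in the published value are ignored.
    """
    expected = "".join(published.split()).lower()
    return file_digest(path, algorithm) == expected
```

Pass whichever algorithm matches the checksum files that ship next to the artifacts.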



On Wed, Jul 5, 2017 at 9:21 AM, Suneel Marthi  wrote:

> The Apache OpenNLP PMC would like to call for a Vote on Apache OpenNLP 1.8.1
> Release Candidate 3.
>
> The Release artifacts can be downloaded from:
>
> https://repository.apache.org/content/repositories/orgapacheopennlp-1016/org/apache/opennlp/opennlp-distr/1.8.1/
>
> The release was made from the Apache OpenNLP 1.8.1 tag at
>
> https://github.com/apache/opennlp/tree/opennlp-1.8.1
>
> To use it in a Maven build, set the version for opennlp-tools or opennlp-uima
> to 1.8.1 and add the following URL to your settings.xml file:
>
> https://repository.apache.org/content/repositories/orgapacheopennlp-1016/
>
> The artifacts have been signed with the Key - D3541808 found at
>
> http://people.apache.org/keys/group/opennlp.asc
>
> Please vote on releasing these packages as Apache OpenNLP 1.8.1. The vote is
> open for the next 72 hours *ending on Saturday, July 8AM EST *.
>
> Only votes from OpenNLP PMC are binding, but folks are welcome to check the
> release candidate and voice their approval or disapproval. The vote passes
> if at least three binding +1 votes are cast.
>
> [ ] +1 Release the packages as Apache OpenNLP 1.8.1
>
> [ ] -1 Do not release the packages because...
>
> Thanks again to all the committers and contributors for their work over
> the past few weeks.
>


Re: Joining the group

2017-07-05 Thread Joern Kottmann
I spent some time on the coref component and updated it to use
OpenNLP 1.6.0; it can also be trained on MUC 6. I am not sure whether the
models are any good, so we need to work on evaluation for that.

We created an OpenNLP Improvement Proposal (NIP) to get it into a
better shape again.

The document is still empty, but it will be located here:
https://cwiki.apache.org/confluence/display/OPENNLP/NIP-3%3A+Revive+the+coreference+component

Jörn

On Thu, Jun 29, 2017 at 7:14 PM, Joern Kottmann  wrote:
> Hello,
>
> there are a few problems we have with it. It would be very good if you
> can help us to solve those.
>
> Basically we would need to get it into the following state:
> - Have a data set it can be trained on
> - Implement evaluation for it
> - Write some documentation
>
> As far as I remember we somehow got stuck with getting it trained correctly.
>
> If we get it into a state where we can train a working model we can
> include it again in our main release.
>
> Jörn
>
>
>
>
> On Thu, Jun 29, 2017 at 7:09 PM, Ashkan Gholamzadeh
>  wrote:
>> Hi,
>>
>> I have been using the coreference resolution package in 1.5.3 recently and
>> have had a good experience with it. My understanding is that it’s no longer
>> supported in recent versions of OpenNLP. I am wondering if I can join the
>> group to work on putting coreference, along with the new WordNet library
>> package, back into the latest version, as I think there are lots of people
>> out there who want to use it. I used it in a multi-threaded environment, and
>> there is some work that needs to be done to make it thread safe. It’s much
>> faster than the Stanford CoreNLP that I used for coreference, and the
>> accuracy is comparable. There are certain functionalities, like finding the
>> most representative entity, that can be added to the current package to
>> enhance it. Training a new model would also be something that can be done to
>> improve the accuracy of the algorithm.
>>
>> Please advise,
>>
>> Ashkan
>>


Re: Document Categorizer based on Glove + LSTM (powered by DL4J)

2017-07-05 Thread Joern Kottmann
It would be really great if you could implement doccat format support
for the Stanford Large Movie Review dataset; that way we can also
easily train the normal doccat component with it. We should open a
Jira issue for that.

Jörn
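
As an illustration of what such a conversion involves, here is a minimal sketch that flattens a review directory into doccat-style training lines (one sample per line: category, a space, then the text). It is not the OpenNLP format support proposed above. The directory layout (`<split>/pos/*.txt`, `<split>/neg/*.txt`) and all function names are assumptions.

```python
from pathlib import Path
from typing import Iterator


def doccat_lines(split_dir: Path, labels=("pos", "neg")) -> Iterator[str]:
    """Yield one 'category text' line per review file under split_dir/<label>/."""
    for label in labels:
        for review in sorted((split_dir / label).glob("*.txt")):
            text = review.read_text(encoding="utf-8")
            # Collapse newlines and whitespace runs: each sample must fit on one line.
            yield f"{label} {' '.join(text.split())}"


def write_doccat_file(split_dir: Path, out_file: Path) -> int:
    """Write all samples to out_file and return how many were written."""
    count = 0
    with out_file.open("w", encoding="utf-8") as out:
        for line in doccat_lines(split_dir):
            out.write(line + "\n")
            count += 1
    return count
```

The resulting file could then be fed to the regular doccat trainer, provided the trainer's expected format matches this assumption.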

On Wed, Jul 5, 2017 at 7:29 PM, Thamme Gowda  wrote:
> Got it, Thanks. We will do it.
>
> On Jul 5, 2017 9:43 AM, "Chris Mattmann"  wrote:
>
> Thanks Thamme.
>
> Please train on the datasets for sentiment analysis described here so we
> can align
> with the standard DocCat training I’m doing for sentiment analysis post
> 1.8.1.
>
> http://irds.usc.edu/SentimentAnalysisParser/datasets.html
>
> Thanks!
>
> Cheers,
> Chris
>
>
>
>
> On 7/5/17, 9:34 AM, "Thamme Gowda"  wrote:
>
> @Tomasso  @Jörn
> Thanks. I will update the PR by making it implement Doccat API.
>
> @Rodrigo
> I have not yet tested on the full Stanford Large Movie Review dataset.
> It
> takes more time to train, perhaps a few days for multiple passes on the
> entire dataset (on my i5 CPU, no GPUs at the moment).
> I had trained models (multiple times) with 3000 examples (1500 pos, 1500
> neg)  for two epochs, the F1 was approximately 0.70.
> I plan to train on the complete dataset sometime down the line and tune
> the
> network with more layers (that is the fun part). This PR is like
> setting up
> the infrastructure for it.
>
> @Chris
> Hi Prof. Thanks for the kind words! Just getting started with my new job
> here - more NLP and Machine Translation stuff to come.
>
> -Thamme
>
> On Wed, Jul 5, 2017 at 8:26 AM, Chris Mattmann 
> wrote:
>
> > Thamme, great job!
> >
> > (proud academic dad)
> >
> > Cheers,
> > Chris
> >
> >
> >
> >
> > On 7/5/17, 12:31 AM, "Joern Kottmann"  wrote:
> >
> > +1 to merge this when it implements the Document Categorizer,
> then we
> > can also use those tools to train and evaluate it
> >
> > Jörn
> >
> > On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri >
> > wrote:
> > > Hello again,
> > >
> > > @Thamme, out of curiosity, do you have evaluation numbers on the
> > > Stanford Large Movie Review dataset?
> > >
> > > Best,
> > >
> > > Rodrigo
> > >
> > > On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri <
> rage...@apache.org>
> > wrote:
> > >> +1 to Tommaso's comment. This would be very nice to have in the
> > project.
> > >>
> > >> R
> > >>
> > >> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
> > >>  wrote:
> > >>> thanks Thamme for bringing this to the list!
> > >>>
> > >>>
> > >>> On Wed, Jul 5, 2017 at 03:49, Thamme Gowda <tgow...@gmail.com> wrote:
> > >>>
> >  Hello OpenNLP Devs,
> > 
> >  I am working with text classification using word embeddings
> like
> >  Gloves/Word2Vec and LSTM networks.
> >  It will be interesting to see if we can use it as document
> > categorizer,
> >  especially for sentiment analysis in OpenNLP.
> > 
> >  I have already raised a PR to the sandbox repo -
> >  https://github.com/apache/opennlp-sandbox/pull/3
> > 
> >  This is first version, and I expect to receive feedback from
> Dev
> > community
> >  to make it work for everyone.
> > 
> >  Here are the design choices I have made for the initial
> version:
> > 
> > - Using pre-trained Gloves - I felt the glove vector
> format is
> > clean,
> > easily customizable in terms of dimensions and vocabulary
> > size, and
> >  (also I
> > have been reading a lot about them from Stanford NLP
> group).
> >    - Training Gloves isnt hard either, we can do it using
> the
> > original C
> >    library as well as by using DL4J.
> >    - Using DL4J's Multi layer networks with LSTM instead
> of
> > reinventing
> > this stuff again on JVM for OpenNLP
> > 
> > 
> >  Please share your feedback here or on the github page
> >  https://github.com/apache/opennlp-sandbox/pull/3 .
> > 
> > 
> > >>> I think the approach outlined here sounds good, I think we
> could
> > >>> incorporate the PR as soon as it implements the Doccat API.
> > >>> Then we may see whether and how it makes sense to adjust it
> to use
> > other
> > >>> types of embeddings (e.g. paragraph vectors) and / or
> different
> > network
> > >>> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
> > >>>
> > >>> Looking forward to see this move forward,
>

Re: Document Categorizer based on Glove + LSTM (powered by DL4J)

2017-07-05 Thread Thamme Gowda
Got it, Thanks. We will do it.

On Jul 5, 2017 9:43 AM, "Chris Mattmann"  wrote:

Thanks Thamme.

Please train on the datasets for sentiment analysis described here so we can
align with the standard DocCat training I’m doing for sentiment analysis post
1.8.1.

http://irds.usc.edu/SentimentAnalysisParser/datasets.html

Thanks!

Cheers,
Chris




On 7/5/17, 9:34 AM, "Thamme Gowda"  wrote:

@Tommaso  @Jörn
Thanks. I will update the PR by making it implement the Doccat API.

@Rodrigo
I have not yet tested on the full Stanford Large Movie Review dataset. It
takes more time to train, perhaps a few days for multiple passes on the
entire dataset (on my i5 CPU, no GPUs at the moment).
I had trained models (multiple times) with 3000 examples (1500 pos, 1500
neg) for two epochs; the F1 was approximately 0.70.
I plan to train on the complete dataset sometime down the line and tune the
network with more layers (that is the fun part). This PR is like setting up
the infrastructure for it.

@Chris
Hi Prof. Thanks for the kind words! Just getting started with my new job
here - more NLP and Machine Translation stuff to come.

-Thamme

On Wed, Jul 5, 2017 at 8:26 AM, Chris Mattmann 
wrote:

> Thamme, great job!
>
> (proud academic dad)
>
> Cheers,
> Chris
>
>
>
>
> On 7/5/17, 12:31 AM, "Joern Kottmann"  wrote:
>
> +1 to merge this when it implements the Document Categorizer,
then we
> can also use those tools to train and evaluate it
>
> Jörn
>
> On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri 
> wrote:
> > Hello again,
> >
> > @Thamme, out of curiosity, do you have evaluation numbers on the
> > Stanford Large Movie Review dataset?
> >
> > Best,
> >
> > Rodrigo
> >
> > On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri <
rage...@apache.org>
> wrote:
> >> +1 to Tommaso's comment. This would be very nice to have in the
> project.
> >>
> >> R
> >>
> >> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
> >>  wrote:
> >>> thanks Thamme for bringing this to the list!
> >>>
> >>>
> >>> On Wed, Jul 5, 2017 at 03:49, Thamme Gowda <tgow...@gmail.com> wrote:
> >>>
>  Hello OpenNLP Devs,
> 
>  I am working with text classification using word embeddings
like
>  Gloves/Word2Vec and LSTM networks.
>  It will be interesting to see if we can use it as document
> categorizer,
>  especially for sentiment analysis in OpenNLP.
> 
>  I have already raised a PR to the sandbox repo -
>  https://github.com/apache/opennlp-sandbox/pull/3
> 
>  This is first version, and I expect to receive feedback from
Dev
> community
>  to make it work for everyone.
> 
>  Here are the design choices I have made for the initial
version:
> 
> - Using pre-trained Gloves - I felt the glove vector
format is
> clean,
> easily customizable in terms of dimensions and vocabulary
> size, and
>  (also I
> have been reading a lot about them from Stanford NLP
group).
>    - Training Gloves isnt hard either, we can do it using
the
> original C
>    library as well as by using DL4J.
>    - Using DL4J's Multi layer networks with LSTM instead
of
> reinventing
> this stuff again on JVM for OpenNLP
> 
> 
>  Please share your feedback here or on the github page
>  https://github.com/apache/opennlp-sandbox/pull/3 .
> 
> 
> >>> I think the approach outlined here sounds good, I think we
could
> >>> incorporate the PR as soon as it implements the Doccat API.
> >>> Then we may see whether and how it makes sense to adjust it
to use
> other
> >>> types of embeddings (e.g. paragraph vectors) and / or
different
> network
> >>> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
> >>>
> >>> Looking forward to see this move forward,
> >>> Regards,
> >>> Tommaso
> >>>
> >>>
> 
>  Thanks,
>  TG
> 
> 
>  --
>  *Thamme Gowda *
>  @thammegowda  |
>  http://scf.usc.edu/~tnarayan/
>  ~Sent via somebody's Webmail server
> 
>
>
>
>


Re: Document Categorizer based on Glove + LSTM (powered by DL4J)

2017-07-05 Thread Chris Mattmann
Thanks Thamme.

Please train on the datasets for sentiment analysis described here so we can
align with the standard DocCat training I’m doing for sentiment analysis post 1.8.1.

http://irds.usc.edu/SentimentAnalysisParser/datasets.html 

Thanks!

Cheers,
Chris




On 7/5/17, 9:34 AM, "Thamme Gowda"  wrote:

@Tomasso  @Jörn
Thanks. I will update the PR by making it implement Doccat API.

@Rodrigo
I have not yet tested on the full Stanford Large Movie Review dataset. It
takes more time to train, perhaps a few days for multiple passes on the
entire dataset (on my i5 CPU, no GPUs at the moment).
I had trained models (multiple times) with 3000 examples (1500 pos, 1500
neg)  for two epochs, the F1 was approximately 0.70.
I plan to train on the complete dataset sometime down the line and tune the
network with more layers (that is the fun part). This PR is like setting up
the infrastructure for it.

@Chris
Hi Prof. Thanks for the kind words! Just getting started with my new job
here - more NLP and Machine Translation stuff to come.

-Thamme

On Wed, Jul 5, 2017 at 8:26 AM, Chris Mattmann  wrote:

> Thamme, great job!
>
> (proud academic dad)
>
> Cheers,
> Chris
>
>
>
>
> On 7/5/17, 12:31 AM, "Joern Kottmann"  wrote:
>
> +1 to merge this when it implements the Document Categorizer, then we
> can also use those tools to train and evaluate it
>
> Jörn
>
> On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri 
> wrote:
> > Hello again,
> >
> > @Thamme, out of curiosity, do you have evaluation numbers on the
> > Stanford Large Movie Review dataset?
> >
> > Best,
> >
> > Rodrigo
> >
> > On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri 
> wrote:
> >> +1 to Tommaso's comment. This would be very nice to have in the
> project.
> >>
> >> R
> >>
> >> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
> >>  wrote:
> >>> thanks Thamme for bringing this to the list!
> >>>
> >>>
> >>> On Wed, Jul 5, 2017 at 03:49, Thamme Gowda <tgow...@gmail.com> wrote:
> >>>
>  Hello OpenNLP Devs,
> 
>  I am working with text classification using word embeddings like
>  Gloves/Word2Vec and LSTM networks.
>  It will be interesting to see if we can use it as document
> categorizer,
>  especially for sentiment analysis in OpenNLP.
> 
>  I have already raised a PR to the sandbox repo -
>  https://github.com/apache/opennlp-sandbox/pull/3
> 
>  This is first version, and I expect to receive feedback from Dev
> community
>  to make it work for everyone.
> 
>  Here are the design choices I have made for the initial version:
> 
> - Using pre-trained Gloves - I felt the glove vector format is
> clean,
> easily customizable in terms of dimensions and vocabulary
> size, and
>  (also I
> have been reading a lot about them from Stanford NLP group).
>    - Training Gloves isnt hard either, we can do it using the
> original C
>    library as well as by using DL4J.
>    - Using DL4J's Multi layer networks with LSTM instead of
> reinventing
> this stuff again on JVM for OpenNLP
> 
> 
>  Please share your feedback here or on the github page
>  https://github.com/apache/opennlp-sandbox/pull/3 .
> 
> 
> >>> I think the approach outlined here sounds good, I think we could
> >>> incorporate the PR as soon as it implements the Doccat API.
> >>> Then we may see whether and how it makes sense to adjust it to use
> other
> >>> types of embeddings (e.g. paragraph vectors) and / or different
> network
> >>> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
> >>>
> >>> Looking forward to see this move forward,
> >>> Regards,
> >>> Tommaso
> >>>
> >>>
> 
>  Thanks,
>  TG
> 
> 
>  --
>  *Thamme Gowda *
>  @thammegowda  |
>  http://scf.usc.edu/~tnarayan/
>  ~Sent via somebody's Webmail server
> 
>
>
>
>





Re: Document Categorizer based on Glove + LSTM (powered by DL4J)

2017-07-05 Thread Thamme Gowda
@Tommaso  @Jörn
Thanks. I will update the PR by making it implement the Doccat API.

@Rodrigo
I have not yet tested on the full Stanford Large Movie Review dataset. It
takes more time to train, perhaps a few days for multiple passes on the
entire dataset (on my i5 CPU, no GPUs at the moment).
I trained models (multiple times) on 3000 examples (1500 positive, 1500
negative) for two epochs; the F1 was approximately 0.70.
I plan to train on the complete dataset sometime down the line and tune the
network with more layers (that is the fun part). This PR mainly sets up
the infrastructure for it.

@Chris
Hi Prof. Thanks for the kind words! Just getting started with my new job
here - more NLP and Machine Translation stuff to come.

-Thamme
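
For readers less familiar with the F1 figure quoted above, a minimal sketch of how it is derived from classification counts. The function name and the counts in the usage note are hypothetical, chosen only to land on 0.70. They are not the actual confusion matrix from the experiment.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when there are no true positives."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

With illustrative counts of 70 true positives, 30 false positives and 30 false negatives, precision and recall are both 0.7, so `f1_score(70, 30, 30)` is 0.7 up to floating-point rounding.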

On Wed, Jul 5, 2017 at 8:26 AM, Chris Mattmann  wrote:

> Thamme, great job!
>
> (proud academic dad)
>
> Cheers,
> Chris
>
>
>
>
> On 7/5/17, 12:31 AM, "Joern Kottmann"  wrote:
>
> +1 to merge this when it implements the Document Categorizer, then we
> can also use those tools to train and evaluate it
>
> Jörn
>
> On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri 
> wrote:
> > Hello again,
> >
> > @Thamme, out of curiosity, do you have evaluation numbers on the
> > Stanford Large Movie Review dataset?
> >
> > Best,
> >
> > Rodrigo
> >
> > On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri 
> wrote:
> >> +1 to Tommaso's comment. This would be very nice to have in the
> project.
> >>
> >> R
> >>
> >> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
> >>  wrote:
> >>> thanks Thamme for bringing this to the list!
> >>>
> >>>
> >>> On Wed, Jul 5, 2017 at 03:49, Thamme Gowda <tgow...@gmail.com> wrote:
> >>>
>  Hello OpenNLP Devs,
> 
>  I am working with text classification using word embeddings like
>  Gloves/Word2Vec and LSTM networks.
>  It will be interesting to see if we can use it as document
> categorizer,
>  especially for sentiment analysis in OpenNLP.
> 
>  I have already raised a PR to the sandbox repo -
>  https://github.com/apache/opennlp-sandbox/pull/3
> 
>  This is first version, and I expect to receive feedback from Dev
> community
>  to make it work for everyone.
> 
>  Here are the design choices I have made for the initial version:
> 
> - Using pre-trained Gloves - I felt the glove vector format is
> clean,
> easily customizable in terms of dimensions and vocabulary
> size, and
>  (also I
> have been reading a lot about them from Stanford NLP group).
>    - Training Gloves isnt hard either, we can do it using the
> original C
>    library as well as by using DL4J.
>    - Using DL4J's Multi layer networks with LSTM instead of
> reinventing
> this stuff again on JVM for OpenNLP
> 
> 
>  Please share your feedback here or on the github page
>  https://github.com/apache/opennlp-sandbox/pull/3 .
> 
> 
> >>> I think the approach outlined here sounds good, I think we could
> >>> incorporate the PR as soon as it implements the Doccat API.
> >>> Then we may see whether and how it makes sense to adjust it to use
> other
> >>> types of embeddings (e.g. paragraph vectors) and / or different
> network
> >>> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
> >>>
> >>> Looking forward to see this move forward,
> >>> Regards,
> >>> Tommaso
> >>>
> >>>
> 
>  Thanks,
>  TG
> 
> 
>  --
>  *Thamme Gowda *
>  @thammegowda  |
>  http://scf.usc.edu/~tnarayan/
>  ~Sent via somebody's Webmail server
> 
>
>
>
>


Re: Document Categorizer based on Glove + LSTM (powered by DL4J)

2017-07-05 Thread Chris Mattmann
Thamme, great job! 

(proud academic dad)

Cheers,
Chris




On 7/5/17, 12:31 AM, "Joern Kottmann"  wrote:

+1 to merge this when it implements the Document Categorizer, then we
can also use those tools to train and evaluate it

Jörn

On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri  wrote:
> Hello again,
>
> @Thamme, out of curiosity, do you have evaluation numbers on the
> Stanford Large Movie Review dataset?
>
> Best,
>
> Rodrigo
>
> On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri  wrote:
>> +1 to Tommaso's comment. This would be very nice to have in the project.
>>
>> R
>>
>> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
>>  wrote:
>>> thanks Thamme for bringing this to the list!
>>>
>>>
>>> On Wed, Jul 5, 2017 at 03:49, Thamme Gowda wrote:
>>>
 Hello OpenNLP Devs,

 I am working with text classification using word embeddings like
 Gloves/Word2Vec and LSTM networks.
 It will be interesting to see if we can use it as document categorizer,
 especially for sentiment analysis in OpenNLP.

 I have already raised a PR to the sandbox repo -
 https://github.com/apache/opennlp-sandbox/pull/3

 This is first version, and I expect to receive feedback from Dev 
community
 to make it work for everyone.

 Here are the design choices I have made for the initial version:

- Using pre-trained Gloves - I felt the glove vector format is 
clean,
easily customizable in terms of dimensions and vocabulary size, and
 (also I
have been reading a lot about them from Stanford NLP group).
   - Training Gloves isnt hard either, we can do it using the 
original C
   library as well as by using DL4J.
   - Using DL4J's Multi layer networks with LSTM instead of 
reinventing
this stuff again on JVM for OpenNLP


 Please share your feedback here or on the github page
 https://github.com/apache/opennlp-sandbox/pull/3 .


>>> I think the approach outlined here sounds good, I think we could
>>> incorporate the PR as soon as it implements the Doccat API.
>>> Then we may see whether and how it makes sense to adjust it to use other
>>> types of embeddings (e.g. paragraph vectors) and / or different network
>>> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
>>>
>>> Looking forward to see this move forward,
>>> Regards,
>>> Tommaso
>>>
>>>

 Thanks,
 TG


 --
 *Thamme Gowda *
 @thammegowda  |
 http://scf.usc.edu/~tnarayan/
 ~Sent via somebody's Webmail server






[VOTE] Apache OpenNLP 1.8.1 Release Candidate 3

2017-07-05 Thread Suneel Marthi
The Apache OpenNLP PMC would like to call for a Vote on Apache OpenNLP 1.8.1
Release Candidate 3.

The Release artifacts can be downloaded from:

https://repository.apache.org/content/repositories/orgapacheopennlp-1016/org/apache/opennlp/opennlp-distr/1.8.1/

The release was made from the Apache OpenNLP 1.8.1 tag at

https://github.com/apache/opennlp/tree/opennlp-1.8.1

To use it in a Maven build, set the version for opennlp-tools or opennlp-uima
to 1.8.1 and add the following URL to your settings.xml file:

https://repository.apache.org/content/repositories/orgapacheopennlp-1016/

The artifacts have been signed with the Key - D3541808 found at

http://people.apache.org/keys/group/opennlp.asc

Please vote on releasing these packages as Apache OpenNLP 1.8.1. The vote is
open for the next 72 hours *ending on Saturday, July 8AM EST *.

Only votes from OpenNLP PMC are binding, but folks are welcome to check the
release candidate and voice their approval or disapproval. The vote passes
if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache OpenNLP 1.8.1

[ ] -1 Do not release the packages because...

Thanks again to all the committers and contributors for their work
over the past few weeks.
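
The passing rule stated in this email (at least three binding +1 votes) can be sketched as a small tally function. This encodes only the rule as written above. The names and the vote representation are hypothetical, and any handling of -1 votes is deliberately left out.

```python
from typing import Iterable, Tuple

# Threshold taken from the vote email: at least three binding +1 votes.
REQUIRED_BINDING_PLUS_ONES = 3


def vote_passes(votes: Iterable[Tuple[int, bool]]) -> bool:
    """votes: (value, is_binding) pairs, where value is +1 or -1.

    Only binding +1 votes are counted; -1 votes are ignored in this sketch.
    """
    binding_plus_ones = sum(1 for value, binding in votes if binding and value == 1)
    return binding_plus_ones >= REQUIRED_BINDING_PLUS_ONES
```

Non-binding votes are accepted as input but never change the outcome, which mirrors the "folks are welcome to check" wording.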


Re: Title: [VOTE] Apache OpenNLP 1.8.1 Release Candidate 2

2017-07-05 Thread Joern Kottmann
Let's cancel this vote. The LanguageDetectorContextGenerator class
should be public but isn't. This will be fixed in RC 3, which will be
out a little later today.

Jörn

On Tue, Jul 4, 2017 at 3:55 PM, Suneel Marthi  wrote:
> +1 binding
>
> 1. Verified hashes and sigs
> 2. clean build from {src} * {tar, zip} and all tests pass
>
>
> On Tue, Jul 4, 2017 at 9:16 AM, Joern Kottmann  wrote:
>
>> Hi Folks,
>>
>>
>> I have posted a 2nd release candidate for the Apache OpenNLP 1.8.1
>> release and it is ready for testing.
>>
>>
>> The RC 2 distributables can be downloaded from here:
>> https://repository.apache.org/content/repositories/orgapacheopennlp-1015/org/apache/opennlp/opennlp-distr/1.8.1/
>>
>>
>> The release was made from the Apache OpenNLP 1.8.1 tag at
>> https://github.com/apache/opennlp/tree/opennlp-1.8.1
>>
>>
>> To use it in a maven build set the version for opennlp-tools or
>> opennlp-uima to 1.8.1 and add the following URL to your settings.xml
>> file:
>> https://repository.apache.org/content/repositories/orgapacheopennlp-1015
>>
>> The release was made using the OpenNLP release process, documented on
>> the Wiki here:
>> https://cwiki.apache.org/confluence/display/OPENNLP/Release+Process
>>
>> The release contains quite some changes, please refer to the contained
>> issue list for details.
>>
>>
>> Please vote on releasing these packages as Apache OpenNLP 1.8.1. The vote is
>> open for at least the next 72 hours.
>>
>>
>> Only votes from OpenNLP PMC are binding, but folks are welcome to check the
>> release candidate and voice their approval or disapproval. The vote passes
>> if at least three binding +1 votes are cast.
>>
>>
>> [ ] +1 Release the packages as Apache OpenNLP 1.8.1
>> [ ] -1 Do not release the packages because...
>>
>>
>> Thanks!
>>
>> Jörn
>>


Re: Document Categorizer based on Glove + LSTM (powered by DL4J)

2017-07-05 Thread Joern Kottmann
+1 to merge this once it implements the Document Categorizer; then we
can also use those tools to train and evaluate it.

Jörn

On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri  wrote:
> Hello again,
>
> @Thamme, out of curiosity, do you have evaluation numbers on the
> Stanford Large Movie Review dataset?
>
> Best,
>
> Rodrigo
>
> On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri  wrote:
>> +1 to Tommaso's comment. This would be very nice to have in the project.
>>
>> R
>>
>> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
>>  wrote:
>>> thanks Thamme for bringing this to the list!
>>>
>>>
>>> On Wed, Jul 5, 2017 at 03:49, Thamme Gowda wrote:
>>>
 Hello OpenNLP Devs,

 I am working with text classification using word embeddings like
 Gloves/Word2Vec and LSTM networks.
 It will be interesting to see if we can use it as document categorizer,
 especially for sentiment analysis in OpenNLP.

 I have already raised a PR to the sandbox repo -
 https://github.com/apache/opennlp-sandbox/pull/3

 This is first version, and I expect to receive feedback from Dev community
 to make it work for everyone.

 Here are the design choices I have made for the initial version:

- Using pre-trained Gloves - I felt the glove vector format is clean,
easily customizable in terms of dimensions and vocabulary size, and
 (also I
have been reading a lot about them from Stanford NLP group).
   - Training Gloves isnt hard either, we can do it using the original C
   library as well as by using DL4J.
   - Using DL4J's Multi layer networks with LSTM instead of reinventing
this stuff again on JVM for OpenNLP


 Please share your feedback here or on the github page
 https://github.com/apache/opennlp-sandbox/pull/3 .


>>> I think the approach outlined here sounds good, I think we could
>>> incorporate the PR as soon as it implements the Doccat API.
>>> Then we may see whether and how it makes sense to adjust it to use other
>>> types of embeddings (e.g. paragraph vectors) and / or different network
>>> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
>>>
>>> Looking forward to see this move forward,
>>> Regards,
>>> Tommaso
>>>
>>>

 Thanks,
 TG


 --
 *Thamme Gowda *
 @thammegowda  |
 http://scf.usc.edu/~tnarayan/
 ~Sent via somebody's Webmail server



Re: Document Categorizer based on Glove + LSTM (powered by DL4J)

2017-07-05 Thread Rodrigo Agerri
Hello again,

@Thamme, out of curiosity, do you have evaluation numbers on the
Stanford Large Movie Review dataset?

Best,

Rodrigo

On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri  wrote:
> +1 to Tommaso's comment. This would be very nice to have in the project.
>
> R
>
> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
>  wrote:
>> thanks Thamme for bringing this to the list!
>>
>>
>> On Wed, Jul 5, 2017 at 03:49, Thamme Gowda wrote:
>>
>>> Hello OpenNLP Devs,
>>>
>>> I am working with text classification using word embeddings like
>>> Gloves/Word2Vec and LSTM networks.
>>> It will be interesting to see if we can use it as document categorizer,
>>> especially for sentiment analysis in OpenNLP.
>>>
>>> I have already raised a PR to the sandbox repo -
>>> https://github.com/apache/opennlp-sandbox/pull/3
>>>
>>> This is first version, and I expect to receive feedback from Dev community
>>> to make it work for everyone.
>>>
>>> Here are the design choices I have made for the initial version:
>>>
>>> - Using pre-trained GloVe vectors: I felt the GloVe vector format is clean
>>>   and easily customizable in terms of dimensions and vocabulary size (also,
>>>   I have been reading a lot about them from the Stanford NLP group).
>>>    - Training GloVe vectors isn't hard either; we can do it using the
>>>      original C library as well as by using DL4J.
>>> - Using DL4J's multi-layer networks with LSTM instead of reinventing
>>>   this stuff again on the JVM for OpenNLP
>>>
>>>
>>> Please share your feedback here or on the github page
>>> https://github.com/apache/opennlp-sandbox/pull/3 .
>>>
>>>
>> I think the approach outlined here sounds good, I think we could
>> incorporate the PR as soon as it implements the Doccat API.
>> Then we may see whether and how it makes sense to adjust it to use other
>> types of embeddings (e.g. paragraph vectors) and / or different network
>> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
>>
>> Looking forward to see this move forward,
>> Regards,
>> Tommaso
>>
>>
>>>
>>> Thanks,
>>> TG
>>>
>>>
>>> --
>>> *Thamme Gowda *
>>> @thammegowda  |
>>> http://scf.usc.edu/~tnarayan/
>>> ~Sent via somebody's Webmail server
>>>


Re: Document Categorizer based on Glove + LSTM (powered by DL4J)

2017-07-05 Thread Rodrigo Agerri
+1 to Tommaso's comment. This would be very nice to have in the project.

R

On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
 wrote:
> thanks Thamme for bringing this to the list!
>
>
> On Wed, Jul 5, 2017 at 03:49, Thamme Gowda wrote:
>
>> Hello OpenNLP Devs,
>>
>> I am working with text classification using word embeddings like
>> Gloves/Word2Vec and LSTM networks.
>> It will be interesting to see if we can use it as document categorizer,
>> especially for sentiment analysis in OpenNLP.
>>
>> I have already raised a PR to the sandbox repo -
>> https://github.com/apache/opennlp-sandbox/pull/3
>>
>> This is first version, and I expect to receive feedback from Dev community
>> to make it work for everyone.
>>
>> Here are the design choices I have made for the initial version:
>>
>> - Using pre-trained GloVe vectors: I felt the GloVe vector format is clean
>>   and easily customizable in terms of dimensions and vocabulary size (also,
>>   I have been reading a lot about them from the Stanford NLP group).
>>    - Training GloVe vectors isn't hard either; we can do it using the
>>      original C library as well as by using DL4J.
>> - Using DL4J's multi-layer networks with LSTM instead of reinventing
>>   this stuff again on the JVM for OpenNLP
>>
>>
>> Please share your feedback here or on the github page
>> https://github.com/apache/opennlp-sandbox/pull/3 .
>>
>>
> I think the approach outlined here sounds good, I think we could
> incorporate the PR as soon as it implements the Doccat API.
> Then we may see whether and how it makes sense to adjust it to use other
> types of embeddings (e.g. paragraph vectors) and / or different network
> setups (e.g. more hidden layers, bidirectionalLSTM, etc.).
>
> Looking forward to see this move forward,
> Regards,
> Tommaso
>
>
>>
>> Thanks,
>> TG
>>
>>
>> --
>> *Thamme Gowda *
>> @thammegowda  |
>> http://scf.usc.edu/~tnarayan/
>> ~Sent via somebody's Webmail server
>>


Re: Document Categorizer based on Glove + LSTM (powered by DL4J)

2017-07-05 Thread Tommaso Teofili
Thanks, Thamme, for bringing this to the list!


On Wed, Jul 5, 2017 at 03:49, Thamme Gowda wrote:

> Hello OpenNLP Devs,
>
> I am working with text classification using word embeddings like
> Gloves/Word2Vec and LSTM networks.
> It will be interesting to see if we can use it as document categorizer,
> especially for sentiment analysis in OpenNLP.
>
> I have already raised a PR to the sandbox repo -
> https://github.com/apache/opennlp-sandbox/pull/3
>
> This is first version, and I expect to receive feedback from Dev community
> to make it work for everyone.
>
> Here are the design choices I have made for the initial version:
>
> - Using pre-trained GloVe vectors: I felt the GloVe vector format is clean
>   and easily customizable in terms of dimensions and vocabulary size (also,
>   I have been reading a lot about them from the Stanford NLP group).
>    - Training GloVe vectors isn't hard either; we can do it using the
>      original C library as well as by using DL4J.
> - Using DL4J's multi-layer networks with LSTM instead of reinventing
>   this stuff again on the JVM for OpenNLP
>
>
> Please share your feedback here or on the github page
> https://github.com/apache/opennlp-sandbox/pull/3 .
>
>
I think the approach outlined here sounds good; we could
incorporate the PR as soon as it implements the Doccat API.
Then we may see whether and how it makes sense to adjust it to use other
types of embeddings (e.g. paragraph vectors) and / or different network
setups (e.g. more hidden layers, bidirectional LSTM, etc.).

Looking forward to seeing this move forward,
Regards,
Tommaso


>
> Thanks,
> TG
>
>
> --
> *Thamme Gowda *
> @thammegowda  |
> http://scf.usc.edu/~tnarayan/
> ~Sent via somebody's Webmail server
>
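
The "clean" vector format mentioned in the design choices above is plain text with one token per line followed by its space-separated components. A minimal reader might look like the sketch below. It is an illustration only, not the code in the PR. The function name is hypothetical, and it assumes tokens contain no whitespace.

```python
from typing import Dict, Iterable, List


def parse_glove(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse GloVe-style text lines of the form 'word v1 v2 ... vd'."""
    vectors: Dict[str, List[float]] = {}
    dimension = None
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], [float(v) for v in parts[1:]]
        if dimension is None:
            dimension = len(values)  # the first vector fixes the dimensionality
        elif len(values) != dimension:
            raise ValueError(f"expected {dimension} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return vectors
```

Because the dimensionality is inferred from the first line, the same reader works for vector files of any size, which is part of what makes the format easy to customize.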