@Tommaso @Jörn Thanks. I will update the PR so that it implements the Doccat API.

@Rodrigo I have not yet tested on the full Stanford Large Movie Review dataset. It takes more time to train, perhaps a few days for multiple passes over the entire dataset (on my i5 CPU, no GPUs at the moment).
I have trained models (multiple times) on 3000 examples (1500 pos, 1500 neg) for two epochs; the F1 was approximately 0.70.
I plan to train on the complete dataset sometime down the line and tune the network with more layers (that is the fun part). This PR is like setting up the infrastructure for it.

@Chris Hi Prof. Thanks for the kind words! Just getting started with my new job here - more NLP and Machine Translation stuff to come.

-Thamme

On Wed, Jul 5, 2017 at 8:26 AM, Chris Mattmann <[email protected]> wrote:

> Thamme, great job!
>
> (proud academic dad)
>
> Cheers,
> Chris
>
>
> On 7/5/17, 12:31 AM, "Joern Kottmann" <[email protected]> wrote:
>
> +1 to merge this when it implements the Document Categorizer, then we
> can also use those tools to train and evaluate it
>
> Jörn
>
> On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri <[email protected]> wrote:
> > Hello again,
> >
> > @Thamme, out of curiosity, do you have evaluation numbers on the
> > Stanford Large Movie Review dataset?
> >
> > Best,
> >
> > Rodrigo
> >
> > On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri <[email protected]> wrote:
> >> +1 to Tommaso's comment. This would be very nice to have in the project.
> >>
> >> R
> >>
> >> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
> >> <[email protected]> wrote:
> >>> thanks Thamme for bringing this to the list!
> >>>
> >>>
> >>> On Wed, Jul 5, 2017 at 03:49, Thamme Gowda <[email protected]> wrote:
> >>>
> >>>> Hello OpenNLP Devs,
> >>>>
> >>>> I am working on text classification using word embeddings like
> >>>> GloVe/Word2Vec and LSTM networks.
> >>>> It will be interesting to see if we can use it as a document categorizer,
> >>>> especially for sentiment analysis in OpenNLP.
> >>>>
> >>>> I have already raised a PR to the sandbox repo -
> >>>> https://github.com/apache/opennlp-sandbox/pull/3
> >>>>
> >>>> This is the first version, and I expect to receive feedback from the Dev
> >>>> community to make it work for everyone.
> >>>>
> >>>> Here are the design choices I have made for the initial version:
> >>>>
> >>>>    - Using pre-trained GloVe vectors - I felt the GloVe vector format is
> >>>>    clean and easily customizable in terms of dimensions and vocabulary
> >>>>    size (also, I have been reading a lot about them from the Stanford
> >>>>    NLP group).
> >>>>    - Training GloVe vectors isn't hard either; we can do it using the
> >>>>    original C library as well as by using DL4J.
> >>>>    - Using DL4J's multi-layer networks with LSTM instead of reinventing
> >>>>    this stuff again on the JVM for OpenNLP.
> >>>>
> >>>>
> >>>> Please share your feedback here or on the GitHub page
> >>>> https://github.com/apache/opennlp-sandbox/pull/3 .
> >>>>
> >>>>
> >>> I think the approach outlined here sounds good, I think we could
> >>> incorporate the PR as soon as it implements the Doccat API.
> >>> Then we may see whether and how it makes sense to adjust it to use other
> >>> types of embeddings (e.g. paragraph vectors) and / or different network
> >>> setups (e.g. more hidden layers, bidirectional LSTM, etc.).
> >>>
> >>> Looking forward to seeing this move forward,
> >>> Regards,
> >>> Tommaso
> >>>
> >>>
> >>>>
> >>>> Thanks,
> >>>> TG
> >>>>
> >>>>
> >>>> --
> >>>> *Thamme Gowda*
> >>>> @thammegowda <https://twitter.com/thammegowda> |
> >>>> http://scf.usc.edu/~tnarayan/
> >>>> ~Sent via somebody's Webmail server
