It would be really great if you could implement doccat format support
for the Stanford Large Movie Review dataset, so that we can also
easily train the normal doccat component with it. We should open a
JIRA for that.
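
In the meantime, a throw-away converter along these lines could get the data
into shape (an untested sketch; the aclImdb train/pos and train/neg folders
are the dataset's standard layout, but the class name and paths here are just
placeholders). It flattens the review files into the one-document-per-line
format that DocumentSampleStream already reads ("<category> <text ...>"):

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Sketch: flatten aclImdb/train/{pos,neg} into a doccat training file.
    public class ImdbToDoccat {

      public static void main(String[] args) throws IOException {
        Path imdbTrainDir = Paths.get(args[0]);   // e.g. /data/aclImdb/train
        try (PrintWriter out = new PrintWriter(args[1], "UTF-8")) {
          for (String category : new String[] {"pos", "neg"}) {
            try (DirectoryStream<Path> reviews =
                Files.newDirectoryStream(imdbTrainDir.resolve(category), "*.txt")) {
              for (Path review : reviews) {
                // one review per line: category first, whitespace collapsed
                String text = new String(Files.readAllBytes(review),
                    StandardCharsets.UTF_8).replaceAll("\\s+", " ").trim();
                out.println(category + " " + text);
              }
            }
          }
        }
      }
    }

The resulting file should then work with the existing command line trainer,
something like:

    bin/opennlp DoccatTrainer -lang en -data imdb-train.txt \
        -encoding UTF-8 -model imdb-doccat.bin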

Jörn

On Wed, Jul 5, 2017 at 7:29 PM, Thamme Gowda <[email protected]> wrote:
> Got it, Thanks. We will do it.
>
> On Jul 5, 2017 9:43 AM, "Chris Mattmann" <[email protected]> wrote:
>
> Thanks Thamme.
>
> Please train on the datasets for sentiment analysis described here so we
> can align
> with the standard DocCat training I’m doing for sentiment analysis post
> 1.8.1.
>
> http://irds.usc.edu/SentimentAnalysisParser/datasets.html
>
> Thanks!
>
> Cheers,
> Chris
>
>
>
>
> On 7/5/17, 9:34 AM, "Thamme Gowda" <[email protected]> wrote:
>
>     @Tommaso @Jörn
>     Thanks. I will update the PR by making it implement the Doccat API.
>
>     @Rodrigo
>     I have not yet tested on the full Stanford Large Movie Review dataset.
>     It takes more time to train, perhaps a few days for multiple passes on
>     the entire dataset (on my i5 CPU, no GPUs at the moment).
>     I had trained models (multiple times) with 3000 examples (1500 pos, 1500
>     neg) for two epochs; the F1 was approximately 0.70.
>     I plan to train on the complete dataset sometime down the line and tune
>     the network with more layers (that is the fun part). This PR is like
>     setting up the infrastructure for it.
>
>     @Chris
>     Hi Prof. Thanks for the kind words! Just getting started with my new job
>     here - more NLP and Machine Translation stuff to come.
>
>     -Thamme
>
>     On Wed, Jul 5, 2017 at 8:26 AM, Chris Mattmann <[email protected]> wrote:
>
>     > Thamme, great job!
>     >
>     > (proud academic dad)
>     >
>     > Cheers,
>     > Chris
>     >
>     >
>     >
>     >
>     > On 7/5/17, 12:31 AM, "Joern Kottmann" <[email protected]> wrote:
>     >
>     >     +1 to merge this when it implements the Document Categorizer,
>     >     then we can also use those tools to train and evaluate it
>     >
>     >     Jörn
>     >
>     >     On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri <[email protected]> wrote:
>     >     > Hello again,
>     >     >
>     >     > @Thamme, out of curiosity, do you have evaluation numbers on the
>     >     > Stanford Large Movie Review dataset?
>     >     >
>     >     > Best,
>     >     >
>     >     > Rodrigo
>     >     >
>     >     > On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri <[email protected]> wrote:
>     >     >> +1 to Tommaso's comment. This would be very nice to have in the
>     > project.
>     >     >>
>     >     >> R
>     >     >>
>     >     >> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
>     >     >> <[email protected]> wrote:
>     >     >>> thanks Thamme for bringing this to the list!
>     >     >>>
>     >     >>>
>     >     >>> On Wed, Jul 5, 2017 at 3:49 AM, Thamme Gowda <[email protected]> wrote:
>     >     >>>
>     >     >>>> Hello OpenNLP Devs,
>     >     >>>>
>     >     >>>> I am working with text classification using word embeddings
>     >     >>>> like GloVe/Word2Vec and LSTM networks.
>     >     >>>> It will be interesting to see if we can use it as a document
>     >     >>>> categorizer, especially for sentiment analysis in OpenNLP.
>     >     >>>>
>     >     >>>> I have already raised a PR to the sandbox repo -
>     >     >>>> https://github.com/apache/opennlp-sandbox/pull/3
>     >     >>>>
>     >     >>>> This is the first version, and I expect to receive feedback
>     >     >>>> from the Dev community to make it work for everyone.
>     >     >>>>
>     >     >>>> Here are the design choices I have made for the initial version:
>     >     >>>>
>     >     >>>>    - Using pre-trained GloVe vectors - I felt the GloVe vector
>     >     >>>>    format is clean, easily customizable in terms of dimensions
>     >     >>>>    and vocabulary size (and also I have been reading a lot about
>     >     >>>>    them from the Stanford NLP group).
>     >     >>>>       - Training GloVe vectors isn't hard either; we can do it
>     >     >>>>       using the original C library as well as by using DL4J.
>     >     >>>>    - Using DL4J's multi-layer networks with LSTM instead of
>     >     >>>>    reinventing this stuff again on the JVM for OpenNLP
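>     >     >>>>
>     >     >>>> A rough, untested sketch of what that combination might look
>     >     >>>> like in DL4J (class name, layer sizes and file names below are
>     >     >>>> placeholder assumptions, not necessarily what the PR does):
>     >     >>>>
>     >     >>>>     import java.io.File;
>     >     >>>>     import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
>     >     >>>>     import org.deeplearning4j.models.embeddings.wordvectors.WordVectors;
>     >     >>>>     import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
>     >     >>>>     import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
>     >     >>>>     import org.deeplearning4j.nn.conf.layers.GravesLSTM;
>     >     >>>>     import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
>     >     >>>>     import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
>     >     >>>>     import org.nd4j.linalg.activations.Activation;
>     >     >>>>     import org.nd4j.linalg.lossfunctions.LossFunctions;
>     >     >>>>
>     >     >>>>     public class GloveLstmSketch {
>     >     >>>>       public static void main(String[] args) throws Exception {
>     >     >>>>         // load pre-trained GloVe vectors in their plain text format
>     >     >>>>         WordVectors glove =
>     >     >>>>             WordVectorSerializer.loadTxtVectors(new File("glove.6B.100d.txt"));
>     >     >>>>         int vectorSize =
>     >     >>>>             glove.getWordVector(glove.vocab().wordAtIndex(0)).length;
>     >     >>>>
>     >     >>>>         // one LSTM layer feeding a 2-class (pos/neg) softmax output
>     >     >>>>         MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
>     >     >>>>             .seed(42)
>     >     >>>>             .list()
>     >     >>>>             .layer(0, new GravesLSTM.Builder()
>     >     >>>>                 .nIn(vectorSize).nOut(256)
>     >     >>>>                 .activation(Activation.TANH).build())
>     >     >>>>             .layer(1, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
>     >     >>>>                 .activation(Activation.SOFTMAX)
>     >     >>>>                 .nIn(256).nOut(2).build())
>     >     >>>>             .build();
>     >     >>>>
>     >     >>>>         MultiLayerNetwork net = new MultiLayerNetwork(conf);
>     >     >>>>         net.init();
>     >     >>>>         // net.fit(...) would then be driven by an iterator that maps each
>     >     >>>>         // review to a sequence of GloVe vectors, padded/masked per batch.
>     >     >>>>       }
>     >     >>>>     }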
>     >     >>>>
>     >     >>>>
>     >     >>>> Please share your feedback here or on the github page
>     >     >>>> https://github.com/apache/opennlp-sandbox/pull/3 .
>     >     >>>>
>     >     >>>>
>     >     >>> I think the approach outlined here sounds good; we could
>     >     >>> incorporate the PR as soon as it implements the Doccat API.
>     >     >>> Then we may see whether and how it makes sense to adjust it
>     >     >>> to use other types of embeddings (e.g. paragraph vectors)
>     >     >>> and/or different network setups (e.g. more hidden layers,
>     >     >>> bidirectional LSTMs, etc.).
>     >     >>>
>     >     >>> Looking forward to seeing this move forward,
>     >     >>> Regards,
>     >     >>> Tommaso
>     >     >>>
>     >     >>>
>     >     >>>>
>     >     >>>> Thanks,
>     >     >>>> TG
>     >     >>>>
>     >     >>>>
>     >     >>>> --
>     >     >>>> Thamme Gowda
>     >     >>>> @thammegowda <https://twitter.com/thammegowda> |
>     >     >>>> http://scf.usc.edu/~tnarayan/
>     >     >>>> ~Sent via somebody's Webmail server
>     >     >>>>
>     >
>     >
>     >
>     >
