Hello Nick,

Thanks for your feedback.

It would be very nice if you could help us improve things. The doccat
component is used by many users I know, and I am sure they would
benefit from your help.

Yes, you are expected to implement your own
ObjectStream<DocumentSample> if the default doesn't fit your data
format for some reason.
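For anyone following along, here is a minimal sketch of what such a stream can look like. The tab-separated "label<TAB>text" input format is just an assumption for illustration, not the JHU format; swap the parsing for whatever your data needs:

```java
import java.io.IOException;

import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;

// Adapts a stream of "label<TAB>text" lines (a made-up format for this
// sketch) into the DocumentSamples the doccat trainer consumes.
public class LabelledLineSampleStream implements ObjectStream<DocumentSample> {

  private final ObjectStream<String> lines;

  public LabelledLineSampleStream(ObjectStream<String> lines) {
    this.lines = lines;
  }

  @Override
  public DocumentSample read() throws IOException {
    String line = lines.read();
    if (line == null) {
      return null; // end of the underlying stream
    }
    String[] parts = line.split("\t", 2);
    String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(parts[1]);
    return new DocumentSample(parts[0], tokens);
  }

  @Override
  public void reset() throws IOException {
    lines.reset();
  }

  @Override
  public void close() throws IOException {
    lines.close();
  }
}
```

The underlying ObjectStream<String> can be a PlainTextByLineStream over the raw file, so only the line parsing is custom.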

And we should extend the manual to explain how to pass
TrainingParameters (and we should check that for every component).
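Until the manual covers it, here is a sketch of what passing them looks like. The cutoff of 2 and 30 iterations are the values from the blog posts Nick mentioned, not a recommendation:

```java
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.TrainingParameters;

public class TrainDoccat {

  // sampleStream can be any ObjectStream<DocumentSample>, e.g. the
  // default DocumentSampleStream or a custom implementation.
  static DoccatModel train(ObjectStream<DocumentSample> sampleStream)
      throws java.io.IOException {
    TrainingParameters params = new TrainingParameters();
    // Values from the blog posts mentioned above; tune for your data.
    params.put(TrainingParameters.CUTOFF_PARAM, "2");
    params.put(TrainingParameters.ITERATIONS_PARAM, "30");
    return DocumentCategorizerME.train("en", sampleStream, params, new DoccatFactory());
  }
}
```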

Glad you found it useful anyway. Let's see if you can address your
points with manual updates; code changes to make training easier are
also always welcome.
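On the "how much sentiment" point below: DocumentCategorizerME can report a score per category rather than just the best label, so with a positive/negative model the gap between the two scores is one rough strength signal. A sketch, assuming a model trained on exactly the categories "positive" and "negative":

```java
import java.util.Map;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

public class ScoreSentiment {

  // model is assumed to be a DoccatModel trained on the two
  // categories "positive" and "negative".
  static double positiveMargin(DoccatModel model, String[] tokens) {
    DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
    Map<String, Double> scores = categorizer.scoreMap(tokens);
    // Margin in [-1, 1]: > 0 leans positive, < 0 leans negative.
    return scores.get("positive") - scores.get("negative");
  }
}
```

Since the two scores sum to 1 for a two-category model, the margin doubles as a crude confidence; near-zero margins are exactly the "very similar scores" case Nick describes.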

Jörn

On Wed, Feb 6, 2019 at 8:09 PM Nick Burch <[email protected]> wrote:
>
> Hi All
>
> Last week, I took part in a hackathon for Alfresco, the open source
> content management system, and as part of that we were having a play with
> integrating Sentiment Analysis [1]. As Stanford CoreNLP has sentiment
> analysis built in, we first used that. Then I tried to use Apache OpenNLP
> instead. This wasn't that easy, but ended up working better for our test
> documents.
>
> I figured it might be good to share my experiences, in case there are
> things I could improve, or in case there's documentation / examples / etc
> that could be improved!
>
>
> So, first up, the approach. I couldn't find anything in the docs on
> sentiment analysis. So, I decided to try using the Document Classifier,
> and feed it two categories to learn/predict on, positive and negative. Is
> that the best route?
>
> (I did find a 2016 GSOC project to add sentiment analysis, but decided to
> stick with just core OpenNLP code)
>
>
> Next, I hit a snag - the code at
> https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.doccat
> doesn't compile against 1.9.1. I've raised
> https://issues.apache.org/jira/browse/OPENNLP-1237 for this.
>
>
> Having guessed at the new API syntax, I then needed to feed in some
> training data. Based on [2], I opted for the JHU Amazon review data [3].
> Not sure if there are better free datasets for English-language sentiment?
>
>
> Next snag - the data format. The JHU data isn't in the same format as the
> training tool or PlainTextByLineStream expects. What's more, I couldn't
> find any examples of an alternate DocumentSampleStream input or
> ObjectStream<DocumentSample> in the manual. Is there one? Is there
> anything else on writing your own? Should there be?
>
> (I ended up writing one [4] in Groovy, which I'm fairly sure is non-ideal,
> and probably could be much improved, suggestions welcome!)
>
>
> Next challenge - TrainingParameters. Several blog posts I found on using
> the DoccatFactory suggested a cutoff of 2 and iterations of 30. I couldn't
> spot anything in the manual under Document Categorizer for parameters,
> though other sections did have them. Did I miss it? Should there be
> something in the manual?
>
>
> Building the model was nice and quick, and getting predictions easy too,
> which was good! However, with my (quite possibly wrong) plan of training
> for two categories, Positive or Negative, I wasn't able to see how to get
> a good "how much sentiment" out. I opted for just returning whichever
> category was reported as best, with no score (since typically the two
> categories came back with very similar scores, though one generally
> slightly higher than the other). Is there a better way?
>
>
> Finally, it did all work, and for our testing it did better than Stanford
> CoreNLP, so thanks everyone for the library :)
>
> Thanks
> Nick
>
> [1] https://github.com/Alfresco/SentimentAnalysis
> [2] 
> https://blog.cambridgespark.com/50-free-machine-learning-datasets-sentiment-analysis-b9388f79c124
> [3] http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
> [4] 
> https://github.com/Alfresco/SentimentAnalysis/blob/master/sentiment-analysis/src/main/groovy/JHUSentimentReader.groovy